How to use Scrapy with Selenium for Advanced Web Scraping

Learn how to integrate Scrapy with Selenium for advanced web scraping. Enhance your scraping skills to handle dynamic content and interactive sites.

Web scraping has become an indispensable tool for data extraction from websites, enabling businesses, researchers, and developers to gather valuable insights from the vast pool of online information. With the right tools, web scraping can be both efficient and powerful, allowing you to automate data collection and analysis processes.

Overview

Web Scraping with Scrapy and Selenium

  • Selenium automates browsers to interact with dynamic websites, while Scrapy is a Python framework designed for efficient web scraping.
  • Combining Scrapy and Selenium enables scraping from JavaScript-heavy websites and handling complex web interactions, like logging in or navigating dynamic content.

Combining Scrapy and Selenium

  • When to use Scrapy alone: Best for static websites with structured data.
  • When to use Selenium alone: Essential for dynamic sites requiring JavaScript execution, form submissions, or session handling.
  • Using both: Leverage Scrapy’s speed for large-scale data collection while using Selenium for handling dynamic elements.

This article will guide you through combining Scrapy and Selenium to perform advanced web scraping, enabling you to handle complex web pages and extract data that would otherwise be challenging to gather with a standard scraper.

What is Web Scraping?

Web scraping is the process of extracting data from websites. It involves making HTTP requests to a website, fetching the HTML content, and then parsing and analyzing the data to gather useful information. Web scraping can be applied to various tasks, such as:

  • Collecting product data from e-commerce websites.
  • Extracting job listings or news articles.
  • Gathering real-time financial or sports data.
  • Mining academic or research content.

While many websites offer data through APIs, web scraping remains essential when APIs are unavailable, or when users need to gather data from multiple sources or navigate complex web pages. Web scraping allows you to automate and streamline data extraction, providing a quick and efficient way to collect large amounts of information from the web.

Web Scraping with Scrapy and Selenium

Selenium is a widely used tool for controlling web browsers automatically. It can click buttons, fill out forms, and navigate websites, making it useful for testing web applications and automating browser tasks.

Scrapy is a Python framework built for web scraping. It helps collect data from websites by sending requests, extracting information, and storing it efficiently. Its design allows for fast and large-scale data extraction from multiple web pages.

The choice between Scrapy alone or Scrapy with Selenium depends on the website’s complexity and the scraping requirements.

When to use Scrapy alone?

Scrapy is best for static websites with structured data.

It works well in these cases:

  • Fast and efficient: Handles multiple pages quickly with minimal resources.
  • Large-scale scraping: Can crawl entire websites and follow links automatically.
  • Low memory usage: Uses significantly less memory than Selenium.
  • Customizable: Supports proxies, retries, and headers for advanced scraping.
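
For example, the proxy, retry, and header options mentioned above can be configured per spider through Scrapy’s standard settings and request metadata. The snippet below is a minimal sketch; the catalog URL and proxy address are placeholders.

import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/catalog"]  # placeholder URL

    # Per-spider settings: retries, a politeness delay, and default headers
    custom_settings = {
        "RETRY_TIMES": 3,
        "DOWNLOAD_DELAY": 0.5,
        "USER_AGENT": "Mozilla/5.0 (compatible; my-crawler/1.0)",
        "DEFAULT_REQUEST_HEADERS": {"Accept-Language": "en-US,en;q=0.9"},
    }

    def start_requests(self):
        for url in self.start_urls:
            # Scrapy's built-in HttpProxyMiddleware honours request.meta["proxy"]
            yield scrapy.Request(url, meta={"proxy": "http://proxy.example.com:8080"})

    def parse(self, response):
        for title in response.css("h2::text").getall():
            yield {"title": title}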

When to use Selenium alone?

Selenium is essential for JavaScript-heavy websites that require interaction.

Use Selenium when:

  • Content loads dynamically (JavaScript, AJAX).
  • Actions like clicking, scrolling, or filling forms are required.
  • Logging in and maintaining sessions is necessary.
  • Handling CAPTCHAs is part of the process.
  • Multiple programming languages are needed (Selenium supports Java, C#, JavaScript, etc.).

Combining Scrapy and Selenium

For projects that involve both structured data extraction and complex web interactions, combining Scrapy and Selenium can provide a powerful solution. This approach allows developers to leverage Scrapy’s efficiency for large-scale data processing while using Selenium to handle JavaScript rendering and user interactions when needed.

Feature | Scrapy Only | Scrapy + Selenium
Static HTML scraping | Yes | Not needed
JavaScript-loaded content | No | Required
Handling logins & forms | No | Required
Pagination (static links) | Yes | Yes
Infinite scrolling / AJAX | No | Required
High-speed data extraction | Yes | Slower due to browser rendering
Interacting with buttons/dropdowns | No | Required

 


Setting Up Your Environment

Here is how to set up the environment for using Scrapy together with Selenium.

Installation Guide

To set up Scrapy with Selenium, follow these steps:

Step 1. Install Scrapy

pip install scrapy

Step 2. Install Selenium

pip install selenium

Step 3. Install scrapy-selenium

pip install scrapy-selenium

Step 4. Download the latest WebDriver for the browser you wish to use, or install webdriver-manager by running the following command:

pip install webdriver-manager

Also, install BeautifulSoup by running the following command:

pip install beautifulsoup4
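
If you installed webdriver-manager, it can download a driver that matches your browser automatically, so you do not have to manage chromedriver.exe by hand. The snippet below is a minimal sketch of that approach, which can replace the hard-coded driver path used in the examples that follow.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# webdriver-manager downloads a matching ChromeDriver and returns its path
service = Service(ChromeDriverManager().install())
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(service=service, options=options)

driver.get("https://bstackdemo.com/")
print(driver.title)
driver.quit()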

Scraping Product Data Using Scrapy and Selenium

Scenario:

Extract product names and prices from the bstackdemo website, which dynamically loads content using JavaScript. Selenium ensures all products are visible before Scrapy and BeautifulSoup extract the data.

Example Code:

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time

class BStackDemoSpider(scrapy.Spider):
    name = "bstack_demo"
    start_urls = ["https://bstackdemo.com/"]  # Target website

    def __init__(self):
        # Set up Selenium WebDriver
        service = Service("chromedriver.exe")  # Ensure ChromeDriver is installed
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")  # Run without opening a browser
        self.driver = webdriver.Chrome(service=service, options=options)

    def parse(self, response):
        self.driver.get(response.url)

        # Ensure JavaScript-loaded content is fully visible by scrolling
        self.driver.find_element(By.TAG_NAME, "body").send_keys(Keys.END)
        time.sleep(2)  # Allow time for content to load

        # Parse page source with BeautifulSoup
        soup = BeautifulSoup(self.driver.page_source, "html.parser")

        # Extract product details
        for product in soup.find_all("div", class_="shelf-item"):
            yield {
                "name": product.find("p", class_="shelf-item__title").text.strip(),
                "price": product.find("div", class_="val").text.strip(),
            }

        self.driver.quit()  # Close browser session

How it works:

Selenium opens the bstackdemo website and ensures all products are visible by scrolling down. Once the content is fully loaded, BeautifulSoup extracts product names and prices from the page source. Scrapy then processes the data and stores it, while Selenium closes to free system resources.
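
As an alternative to creating the WebDriver by hand inside every spider, the scrapy-selenium package installed in Step 3 can manage the browser through a downloader middleware. The sketch below shows the typical wiring described in the package’s documentation; treat the setting names as a starting point and check them against the version you installed.

# settings.py
from shutil import which

SELENIUM_DRIVER_NAME = "chrome"
SELENIUM_DRIVER_EXECUTABLE_PATH = which("chromedriver")
SELENIUM_DRIVER_ARGUMENTS = ["--headless"]
DOWNLOADER_MIDDLEWARES = {"scrapy_selenium.SeleniumMiddleware": 800}

# spider
import scrapy
from scrapy_selenium import SeleniumRequest

class BStackSeleniumRequestSpider(scrapy.Spider):
    name = "bstack_selenium_request"

    def start_requests(self):
        # The middleware renders the page in Selenium before the response reaches parse()
        yield SeleniumRequest(url="https://bstackdemo.com/", callback=self.parse, wait_time=10)

    def parse(self, response):
        for product in response.css("div.shelf-item"):
            yield {
                "name": product.css("p.shelf-item__title::text").get(),
                "price": "".join(product.css("div.val ::text").getall()),
            }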

Running Selenium in Headless Mode

Running Selenium in headless mode allows web scraping without opening a browser window, making it faster and more efficient.

Scenario:

Extract brand names and product availability from the bstackdemo website while running Selenium in headless mode for a faster and resource-efficient scraping process.

Example Code:

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

class BStackDemoHeadlessSpider(scrapy.Spider):
    name = "bstack_headless"
    start_urls = ["https://bstackdemo.com/"]  # Target website

    def __init__(self):
        # Set up Selenium WebDriver in headless mode
        service = Service("chromedriver.exe")  # Ensure ChromeDriver is installed
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")  # Run browser in headless mode
        options.add_argument("--disable-gpu")  # Improve performance on some systems
        options.add_argument("--no-sandbox")  # Bypass OS security restrictions
        self.driver = webdriver.Chrome(service=service, options=options)

    def parse(self, response):
        self.driver.get(response.url)
        time.sleep(2)  # Allow JavaScript to load content

        # Get page source and parse with BeautifulSoup
        soup = BeautifulSoup(self.driver.page_source, "html.parser")

        # Extract brand names and availability status
        for product in soup.find_all("div", class_="shelf-item"):
            yield {
                "brand": product.find("div", class_="shelf-item__brand").text.strip(),
                "availability": product.find("p", class_="shelf-item__buy-btn").text.strip(),
            }

        self.driver.quit()  # Close browser session

How it works:

Selenium runs in headless mode, loading the bstackdemo website without opening a browser window. After waiting for JavaScript to load, BeautifulSoup extracts brand names and availability status from the page source. Scrapy processes and stores the data while Selenium closes, ensuring an efficient and lightweight scraping process.

Pro Tip: Running Selenium tests on BrowserStack allows automated testing on real cloud-based browsers without local setup. This ensures faster execution, better scalability, and access to multiple browser environments. Cloud-based testing reduces system load and improves reliability for large-scale web scraping and automation projects.
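
As a rough illustration, moving the examples in this article onto BrowserStack Automate mostly means swapping the local webdriver.Chrome(...) call for a remote session. The username, access key, and capabilities below are placeholders.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.set_capability("browserName", "Chrome")
options.set_capability("bstack:options", {
    "os": "Windows",
    "osVersion": "11",
    "userName": "YOUR_USERNAME",    # placeholder credentials
    "accessKey": "YOUR_ACCESS_KEY",
})

# Remote session on BrowserStack's Selenium grid instead of a local ChromeDriver
driver = webdriver.Remote(
    command_executor="https://hub-cloud.browserstack.com/wd/hub",
    options=options,
)
driver.get("https://bstackdemo.com/")
print(driver.title)
driver.quit()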

Basic Web Scraping with Scrapy and Selenium

Scrapy and Selenium together make it easy to extract JavaScript-rendered content, handle pagination, and automate logins for scraping data from websites that rely on dynamic elements.

Extracting JavaScript-Rendered Content

Some websites load content dynamically using JavaScript, meaning Scrapy alone cannot extract the data. Selenium loads the page completely before Scrapy processes it.

Example: Extracting Product Names and Prices

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

class JavaScriptContentSpider(scrapy.Spider):
    name = "js_content"
    start_urls = ["https://bstackdemo.com/"]

    def __init__(self):
        service = Service("chromedriver.exe")  # Ensure ChromeDriver is installed
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(service=service, options=options)

    def parse(self, response):
        self.driver.get(response.url)
        time.sleep(2)  # Wait for JavaScript to load

        soup = BeautifulSoup(self.driver.page_source, "html.parser")

        for product in soup.find_all("div", class_="shelf-item"):
            yield {
                "name": product.find("p", class_="shelf-item__title").text.strip(),
                "price": product.find("div", class_="val").text.strip(),
            }

        self.driver.quit()

How it works:

Selenium loads the page, waits for JavaScript to render products, and BeautifulSoup extracts the data before Scrapy processes it.

Handling Basic Pagination with Selenium

Some websites have multiple pages for listings. Selenium can click the “Next” button to navigate through pages while Scrapy extracts data.

Example: Scraping Products from Multiple Pages

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

class PaginationSpider(scrapy.Spider):
    name = "pagination_spider"
    start_urls = ["https://bstackdemo.com/"]

    def __init__(self):
        service = Service("chromedriver.exe")
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(service=service, options=options)

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            time.sleep(2)  # Wait for page to load
            soup = BeautifulSoup(self.driver.page_source, "html.parser")

            for product in soup.find_all("div", class_="shelf-item"):
                yield {
                    "name": product.find("p", class_="shelf-item__title").text.strip(),
                    "price": product.find("div", class_="val").text.strip(),
                }

            try:
                next_button = self.driver.find_element(By.CLASS_NAME, "pagination-next")
                if "disabled" in next_button.get_attribute("class"):
                    break  # Stop if no next page
                next_button.click()
            except:
                break

        self.driver.quit()

How it works:

Selenium clicks the “Next” button, loads the next set of products, and Scrapy extracts the data until all pages are scraped.

Automating Login for Authenticated Pages

Some websites require login before accessing data. Selenium can enter credentials, submit the form, and maintain the session while Scrapy extracts data.

Example: Logging in and Scraping User-Specific Data

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

class LoginSpider(scrapy.Spider):
    name = "login_spider"
    start_urls = ["https://bstackdemo.com/signin"]  # Login page URL

    def __init__(self):
        service = Service("chromedriver.exe")
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(service=service, options=options)

    def parse(self, response):
        self.driver.get(response.url)

        # Enter username and password
        self.driver.find_element(By.ID, "react-select-2-input").send_keys("demouser")
        self.driver.find_element(By.ID, "react-select-3-input").send_keys("testingpassword")
        self.driver.find_element(By.CLASS_NAME, "login-button").click()
        time.sleep(2)

        soup = BeautifulSoup(self.driver.page_source, "html.parser")

        # Extract user-specific data
        user_info = soup.find("div", class_="user-info")
        if user_info:
            yield {"user": user_info.text.strip()}

        self.driver.quit()

How it works:

Selenium fills in login credentials, submits the form, and Scrapy extracts user-specific data after login.

Advanced Scrapy + Selenium Techniques

Here are some of the advanced techniques for using Scrapy along with Selenium:

Handling Dynamic and Interactive Elements

Here are examples of how to handle dynamic and interactive elements:

Scraping Infinite Scrolling Pages

Many modern websites use infinite scrolling and AJAX to load content dynamically instead of traditional pagination. Selenium can scroll the page and wait for new elements to appear before Scrapy extracts data.

Scenario: Some websites, like social media feeds, load new content as the user scrolls down. Selenium’s execute_script() function can simulate scrolling to trigger content loading.

Code Example:

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

class InfiniteScrollSpider(scrapy.Spider):
    name = "infinite_scroll"
    start_urls = ["https://bstackdemo.com/"]

    def __init__(self):
        service = Service("chromedriver.exe")
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(service=service, options=options)

    def parse(self, response):
        self.driver.get(response.url)
        scroll_pause_time = 2

        last_height = self.driver.execute_script("return document.body.scrollHeight")

        while True:
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(scroll_pause_time)

            new_height = self.driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break
            last_height = new_height

        soup = BeautifulSoup(self.driver.page_source, "html.parser")

        for product in soup.find_all("div", class_="shelf-item"):
            yield {
                "name": product.find("p", class_="shelf-item__title").text.strip(),
                "price": product.find("div", class_="val").text.strip(),
            }

        self.driver.quit()

How It Works:

Selenium scrolls down to load more products. It waits briefly and checks if new content appears. Once all products are loaded, Scrapy extracts the data.

Extracting AJAX-Loaded Content

Some websites fetch data dynamically using AJAX requests, meaning content appears after the page initially loads. Selenium waits until the new data is available before Scrapy processes it.

Scenario: Waiting for AJAX to Load Products on bstackdemo

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

class AjaxContentSpider(scrapy.Spider):
    name = "ajax_content"
    start_urls = ["https://bstackdemo.com/"]

    def __init__(self):
        service = Service("chromedriver.exe")
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(service=service, options=options)

    def parse(self, response):
        self.driver.get(response.url)

        # Wait until products are loaded
        WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "shelf-item"))
        )

        soup = BeautifulSoup(self.driver.page_source, "html.parser")

        for product in soup.find_all("div", class_="shelf-item"):
            yield {
                "name": product.find("p", class_="shelf-item__title").text.strip(),
                "price": product.find("div", class_="val").text.strip(),
            }

        self.driver.quit()

How It Works:

Selenium waits for AJAX-loaded content to appear using WebDriverWait(). Once products are available, Scrapy extracts the data. This ensures no data is missed due to slow-loading elements.

Bypassing Anti-Scraping Mechanisms

Websites often implement anti-scraping techniques to block automated bots. To avoid detection and ensure smooth data extraction, several strategies can be used.

Avoiding Detection with Headless Browsers & Randomized User-Agents

Using a headless browser allows scraping without opening a visible window. Randomizing user-agents helps mimic different browsers to prevent blocking.

Scenario: Using Headless Mode and Randomized User-Agents

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from fake_useragent import UserAgent
from bs4 import BeautifulSoup

class AntiDetectionSpider(scrapy.Spider):
    name = "anti_detection"
    start_urls = ["https://bstackdemo.com/"]

    def __init__(self):
        ua = UserAgent()
        user_agent = ua.random  # Randomize user-agent

        service = Service("chromedriver.exe")
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        options.add_argument(f"user-agent={user_agent}")

        self.driver = webdriver.Chrome(service=service, options=options)

    def parse(self, response):
        self.driver.get(response.url)
        soup = BeautifulSoup(self.driver.page_source, "html.parser")

        for product in soup.find_all("div", class_="shelf-item"):
            yield {
                "name": product.find("p", class_="shelf-item__title").text.strip(),
                "price": product.find("div", class_="val").text.strip(),
            }

        self.driver.quit()

How It Works: Headless mode prevents detection by running without a visible browser. Randomized user-agents make requests look like they come from real users.

Managing Cookies & Sessions for Continuity

Websites use cookies and sessions to track users. Managing them helps maintain login states and avoid detection.

Scenario: Preserving Cookies Across Requests

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import pickle
import time

class CookieManagementSpider(scrapy.Spider):
    name = "cookie_management"
    start_urls = ["https://bstackdemo.com/signin"]

    def __init__(self):
        service = Service("chromedriver.exe")
        self.driver = webdriver.Chrome(service=service)

    def parse(self, response):
        self.driver.get(response.url)

        # Login process
        self.driver.find_element(By.ID, "react-select-2-input").send_keys("demouser")
        self.driver.find_element(By.ID, "react-select-3-input").send_keys("testingpassword")
        self.driver.find_element(By.CLASS_NAME, "login-button").click()
        time.sleep(2)

        # Save cookies after login
        pickle.dump(self.driver.get_cookies(), open("cookies.pkl", "wb"))

        # Load cookies for future requests
        self.driver.get("https://bstackdemo.com/")
        for cookie in pickle.load(open("cookies.pkl", "rb")):
            self.driver.add_cookie(cookie)

        self.driver.refresh()
        time.sleep(2)

        # Extract data after maintaining session
        yield {"message": "Logged in and session maintained"}

        self.driver.quit()

How It Works: Saves cookies after login for reuse in future requests. Restores session to avoid repeated logins and detection.

Solving CAPTCHAs Using Third-Party Services

Some websites use CAPTCHAs to block bots. Third-party services like 2Captcha or Anti-Captcha can solve them automatically.

Scenario: Sending CAPTCHA to 2Captcha for Solving

import requests
import time

API_KEY = "your_2captcha_api_key"
captcha_site_key = "site-key-from-bstackdemo"
url = "https://bstackdemo.com/"

# Step 1: Request CAPTCHA solving
response = requests.post(
    "http://2captcha.com/in.php",
    data={"key": API_KEY, "method": "userrecaptcha", "googlekey": captcha_site_key, "pageurl": url},
)

captcha_id = response.text.split("|")[-1]

# Step 2: Wait for CAPTCHA solution
time.sleep(15)
solution_response = requests.get(f"http://2captcha.com/res.php?key={API_KEY}&action=get&id={captcha_id}")

while "CAPCHA_NOT_READY" in solution_response.text:
    time.sleep(5)
    solution_response = requests.get(f"http://2captcha.com/res.php?key={API_KEY}&action=get&id={captcha_id}")

captcha_solution = solution_response.text.split("|")[-1]

# Step 3: Use CAPTCHA solution in web scraping
print("CAPTCHA Solved:", captcha_solution)

How It Works: Sends CAPTCHA to 2Captcha for solving. Waits for the solution and applies it to bypass restrictions.
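
To complete step 3, a common pattern is to inject the solved token into the hidden g-recaptcha-response field that a standard reCAPTCHA v2 widget adds to the page. The sketch below reuses url and captcha_solution from the script above and assumes such a widget protects the target form; the submit-button id is hypothetical.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

service = Service("chromedriver.exe")
driver = webdriver.Chrome(service=service)
driver.get(url)

# reCAPTCHA v2 stores its token in a hidden textarea with the id
# "g-recaptcha-response"; writing the solved token there lets the protected
# form be submitted as if the CAPTCHA had been solved manually.
driver.execute_script(
    "document.getElementById('g-recaptcha-response').innerHTML = arguments[0];",
    captcha_solution,
)

# Then submit the protected form as usual, for example:
# driver.find_element(By.ID, "submit-button").click()  # hypothetical element id

driver.quit()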

Pro Tip : Websites track IPs and browser behavior to detect bots. Running Scrapy + Selenium on BrowserStack allows scrapers to:

  • Use multiple real browsers to mimic organic traffic.
  • Rotate IP addresses to reduce detection risks.
  • Run tests in the cloud for better efficiency and scalability.

Large-Scale Scraping and Performance Optimization

When scraping large volumes of data from websites, performance and scalability are key factors. Combining Scrapy with Selenium can be a powerful solution, but it also requires careful optimization to handle multiple requests efficiently and avoid being blocked.

Efficiently Managing Browser Sessions to Reduce Memory Usage

Running Selenium in headless mode can help reduce memory usage. Additionally, managing browser sessions effectively, such as closing unused sessions, is crucial to avoid memory overload during long scraping operations.

Scenario: Closing Browser Sessions After Use

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

class EfficientScraperSpider(scrapy.Spider):
    name = "efficient_scraper"
    start_urls = ["https://bstackdemo.com/"]

    def __init__(self):
        service = Service("chromedriver.exe")
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")  # Use headless mode
        self.driver = webdriver.Chrome(service=service, options=options)

    def parse(self, response):
        self.driver.get(response.url)
        time.sleep(2)

        soup = BeautifulSoup(self.driver.page_source, "html.parser")

        # Extract data
        for product in soup.find_all("div", class_="shelf-item"):
            yield {
                "name": product.find("p", class_="shelf-item__title").text.strip(),
                "price": product.find("div", class_="val").text.strip(),
            }

        self.driver.quit()  # Properly close the browser session

How It Works: Headless mode reduces memory usage by not rendering the browser UI. Closing the browser session with self.driver.quit() ensures no memory is wasted after use.

Running Multiple Selenium Instances with Scrapy’s CrawlSpider

Scrapy’s CrawlSpider allows for structured crawling, and by running multiple Selenium instances, you can scrape multiple pages at once. This parallelism helps accelerate data collection.

Scenario: Using CrawlSpider with Multiple Selenium Instances

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from bs4 import BeautifulSoup

class MultiSeleniumSpider(CrawlSpider):
    name = "multi_selenium"
    allowed_domains = ["bstackdemo.com"]
    start_urls = ["https://bstackdemo.com/"]

    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)  # Let CrawlSpider compile its rules
        service = Service("chromedriver.exe")
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(service=service, options=options)

    def parse_item(self, response):
        self.driver.get(response.url)
        soup = BeautifulSoup(self.driver.page_source, "html.parser")

        for product in soup.find_all("div", class_="shelf-item"):
            yield {
                "name": product.find("p", class_="shelf-item__title").text.strip(),
                "price": product.find("div", class_="val").text.strip(),
            }

    def closed(self, reason):
        self.driver.quit()  # Quit once the crawl ends, not after the first page

How It Works: CrawlSpider follows extracted links automatically, and each crawled page is rendered in the headless Selenium browser before the data is extracted. The example above shares one driver across pages; the sketch below shows how to give each page its own short-lived instance.
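
If you want one browser per page rather than a single shared session, a simple variation is to create and quit a short-lived headless driver inside parse_item, as in this sketch.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

class PerPageDriverSpider(CrawlSpider):
    name = "per_page_driver"
    allowed_domains = ["bstackdemo.com"]
    start_urls = ["https://bstackdemo.com/"]

    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    def make_driver(self):
        # A fresh headless driver per page keeps sessions isolated;
        # a small driver pool would avoid the startup cost on large crawls
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        return webdriver.Chrome(service=Service("chromedriver.exe"), options=options)

    def parse_item(self, response):
        driver = self.make_driver()
        try:
            driver.get(response.url)
            soup = BeautifulSoup(driver.page_source, "html.parser")
            for product in soup.find_all("div", class_="shelf-item"):
                yield {
                    "name": product.find("p", class_="shelf-item__title").text.strip(),
                    "price": product.find("div", class_="val").text.strip(),
                }
        finally:
            driver.quit()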

Using Proxies and Rotating IPs to Avoid Blocks

Websites often block scrapers after too many requests from a single IP. Using proxies or rotating IPs helps avoid detection and bans.

Scenario: Rotating IPs with Scrapy Middleware

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import random

class ProxyRotatorSpider(scrapy.Spider):
    name = "proxy_rotator"
    start_urls = ["https://bstackdemo.com/"]

    def __init__(self):
        self.proxies = [
            "http://proxy1.com", "http://proxy2.com", "http://proxy3.com"
        ]
        proxy = random.choice(self.proxies)  # Pick a proxy for this browser session

        service = Service("chromedriver.exe")
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        options.add_argument(f"--proxy-server={proxy}")  # Route Chrome's traffic through the proxy
        self.driver = webdriver.Chrome(service=service, options=options)

    def parse(self, response):
        # Disable caching so repeated requests really go out through the proxy
        self.driver.execute_cdp_cmd("Network.enable", {})
        self.driver.execute_cdp_cmd("Network.setCacheDisabled", {"cacheDisabled": True})
        self.driver.get(response.url)

        soup = BeautifulSoup(self.driver.page_source, "html.parser")

        for product in soup.find_all("div", class_="shelf-item"):
            yield {
                "name": product.find("p", class_="shelf-item__title").text.strip(),
                "price": product.find("div", class_="val").text.strip(),
            }

        self.driver.quit()

How It Works: A proxy is picked at random and passed to Chrome through the --proxy-server flag, so the browser's traffic is routed through it. For rotating IPs on Scrapy's own requests, a downloader middleware is the usual approach, as sketched below.
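
For the requests that Scrapy itself sends (as opposed to the Selenium browser), per-request rotation is usually handled with a small downloader middleware. A minimal sketch, with placeholder proxy addresses and a hypothetical project module name:

# middlewares.py
import random

class RotatingProxyMiddleware:
    PROXIES = [
        "http://proxy1.example.com:8080",
        "http://proxy2.example.com:8080",
        "http://proxy3.example.com:8080",
    ]

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honours request.meta["proxy"],
        # so assigning a random proxy here rotates IPs on every request
        request.meta["proxy"] = random.choice(self.PROXIES)

# settings.py
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.RotatingProxyMiddleware": 350,  # hypothetical module path
# }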

Asynchronous Execution: Combining Scrapy’s Concurrency with Selenium’s Actions

Scrapy is built for asynchronous operations, allowing multiple requests to run concurrently. By combining Scrapy’s concurrency with Selenium’s page interactions, large-scale scraping tasks can be performed much faster.

Scenario: Combining Asynchronous Scrapy with Selenium for Faster Scraping

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from scrapy import Request
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import asyncio

class AsyncScraperSpider(scrapy.Spider):
    name = "async_scraper"
    start_urls = ["https://bstackdemo.com/"]

    def __init__(self):
        service = Service("chromedriver.exe")
        options = Options()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(service=service, options=options)

    async def parse(self, response):
        self.driver.get(response.url)
        soup = BeautifulSoup(self.driver.page_source, "html.parser")

        for product in soup.find_all("div", class_="shelf-item"):
            yield {
                "name": product.find("p", class_="shelf-item__title").text.strip(),
                "price": product.find("div", class_="val").text.strip(),
            }

        self.driver.quit()

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

How It Works: Scrapy’s asynchronous model schedules requests concurrently while Selenium performs the page interactions inside the callback, so large crawls complete faster than purely sequential fetching.
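
One caveat: a plain driver.get() call blocks while the page renders, which limits how much concurrency Scrapy can actually achieve. A sketch of one workaround, assuming Python 3.9+ and Scrapy's asyncio reactor (TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"), is to push the Selenium work onto a worker thread:

import asyncio

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

class NonBlockingSeleniumSpider(scrapy.Spider):
    name = "non_blocking_selenium"
    start_urls = ["https://bstackdemo.com/"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        options = Options()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(service=Service("chromedriver.exe"), options=options)

    async def parse(self, response):
        # Run the blocking Selenium call in a worker thread so the event loop
        # stays free to schedule other Scrapy requests in the meantime
        html = await asyncio.to_thread(self._render, response.url)
        soup = BeautifulSoup(html, "html.parser")

        for product in soup.find_all("div", class_="shelf-item"):
            yield {
                "name": product.find("p", class_="shelf-item__title").text.strip(),
                "price": product.find("div", class_="val").text.strip(),
            }

    def _render(self, url):
        self.driver.get(url)
        return self.driver.page_source

    def closed(self, reason):
        self.driver.quit()  # Close the browser when the spider finishes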

Why choose BrowserStack for Selenium Testing?

BrowserStack Automate is a cloud-based platform designed for automated cross-browser testing of web applications. It enables teams to run Selenium, Appium, and other automation frameworks on real devices and browsers in the cloud, without the need for local setups. Key features include:

  • Real Device & Browser Testing: Access 3,500+ real devices and browsers for accurate testing.
  • Scalability: Run parallel tests to speed up and scale your testing.
  • No Setup Required: Instantly test without the need for physical devices or complex setups.
  • Cross-Platform Support: Test on both mobile and desktop environments.
  • CI/CD Integration: Easily integrate with Jenkins, CircleCI, and GitHub Actions.
  • Real-Time Debugging: Access logs, screenshots, and videos for quick issue identification.
  • Wide Browser & OS Coverage: Supports multiple browsers and OS combinations.
  • Global Availability: Test from any location with fast, reliable access.

Conclusion

Scrapy is great for extracting data from simple websites, while Selenium is necessary for handling JavaScript-heavy sites that require interaction. Basic scraping tasks involve straightforward data extraction, while advanced techniques tackle issues like infinite scrolling, dynamic content, and bot detection.

The future of web scraping will see AI-powered tools that adapt to complex sites and bypass anti-scraping measures. Scraping tools will become more automated, offering features like self-adjusting schedules and automatic data cleaning. Ethical and compliant scraping will also grow in importance, ensuring data privacy and legal adherence. Lastly, scraping tools will integrate with big data platforms, enabling real-time decision-making.

Running Selenium tests on a real device cloud gives more accurate results because it simulates real user conditions. BrowserStack Automate offers access to over 3500 real device-browser combinations, allowing for thorough testing of web applications to ensure a smooth and consistent user experience.

 

 

 

 
