Advanced Python Web Scraper - JS rendering, auto popup handling, cookies, crawling, APIs, and more
Requires Python 3.10–3.12
Install:
Code:
pip install requests beautifulsoup4 lxml aiohttp fake-useragent jsonpath-ng pandas pyarrow openpyxl playwright && playwright install chromium
 
What it does:
  • Crawls entire sites (BFS or DFS), follows links recursively to any depth
  • Renders JavaScript/React/SPA pages via Playwright - gets the actual DOM, not the skeleton
  • Auto-handles popups, cookie banners, age gates, overlays - detects and dismisses them before scraping
  • Extracts emails, phone numbers, social links, HTML tables, JSON-LD, microdata, OpenGraph, RSS/Atom feeds
  • Downloads media (images, PDFs, ZIPs, video, audio) to disk
  • Captures cookies - full browser cookie store including httpOnly and sameSite flags
  • Saves raw rendered HTML per page
  • Paginates REST APIs with page, offset, and cursor styles
  • Supports Basic, Bearer, Form login, OAuth2, and API key auth
  • Proxy rotation with per-proxy health tracking - auto-disables dead proxies
  • Token bucket rate limiter, circuit breaker, exponential backoff with jitter
  • Bloom filter URL deduplication (memory efficient at millions of URLs)
  • Change detection - tracks which pages changed between runs via SQLite
  • Outputs to JSON, JSONL, CSV, SQLite, Parquet, Excel, HTML
Example - rip everything off a JS site:
Code:
py a.py https://site.com --crawl --max-links 9999 --max-depth 10 --js --js-scroll --no-robots --emails --phones --social --tables --media --cookies --save-html --format json --out results --out-dir ./output --verbose

TARGETING
  url                         target URL (required)
  --urls url1 url2 ...        multiple URLs

CRAWLING
  --crawl                     walk every link on the site
  --crawl-strategy bfs|dfs    default: bfs
  --max-links 9999            max pages to visit
  --max-depth 10              how many levels deep
  --link-selector "a.post"    only follow matching links
  --link-pattern "/blog/\d+"  filter links by regex
  --cross-domain              follow links to other domains
  --no-dedup                  don't skip already-seen URLs
  --no-bloom                  use plain set instead of bloom filter
  --bloom-capacity 1000000    bloom filter capacity
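
For context on --no-bloom/--bloom-capacity: a Bloom filter answers "have I seen this URL?" in fixed memory, with no false negatives and a tunable false-positive rate. A minimal sketch of the idea, not the script's actual class (the sizing formulas are the standard ones):

```python
import hashlib
import math

class BloomFilter:
    """Probabilistic visited-URL set: fixed memory, rare false positives."""
    def __init__(self, capacity: int = 1_000_000, error_rate: float = 0.01):
        # Standard sizing: m = -n*ln(p)/ln(2)^2 bits, k = (m/n)*ln(2) hashes
        self.size = int(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.hashes = max(1, round(self.size / capacity * math.log(2)))
        self.bits = bytearray(self.size // 8 + 1)

    def _positions(self, url: str):
        # Double hashing: derive k probe positions from one SHA-256 digest
        digest = hashlib.sha256(url.encode()).digest()
        a = int.from_bytes(digest[:8], "big")
        b = int.from_bytes(digest[8:16], "big")
        return [(a + i * b) % self.size for i in range(self.hashes)]

    def add(self, url: str) -> None:
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))
```

At the default 1,000,000 capacity this needs roughly 1.2 MB of bits, versus hundreds of MB for a plain Python set of URL strings - which is why --no-bloom only makes sense for small crawls.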

JAVASCRIPT
  --js                        render via Playwright (use for React/SPA)
  --js-wait networkidle       load | domcontentloaded | networkidle
  --js-scroll                 infinite scroll before extracting
  --js-scroll-count 5         scroll attempts
  --js-screenshot             save full-page PNG per URL
  --js-screenshot-dir ./shots where to save screenshots
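
What --js and --js-scroll do, roughly: drive headless Chromium through Playwright's sync API, wait for the chosen load state, optionally scroll a few times, then hand back the rendered DOM. A sketch under that assumption - `render_page` and `VALID_WAITS` are illustrative names, not the script's API:

```python
VALID_WAITS = {"load", "domcontentloaded", "networkidle"}

def render_page(url: str, wait: str = "networkidle", scroll_count: int = 0) -> str:
    """Return the fully rendered HTML of a JS-heavy page."""
    if wait not in VALID_WAITS:
        raise ValueError(f"--js-wait must be one of {sorted(VALID_WAITS)}")
    # Import deferred so the sketch can be read/imported without Playwright installed
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until=wait)
        for _ in range(scroll_count):
            # Crude infinite-scroll trigger (what --js-scroll-count controls)
            page.mouse.wheel(0, 20000)
            page.wait_for_timeout(1000)
        html = page.content()
        browser.close()
    return html
```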

PAGINATION
  --paginate                  follow next-page links
  --next "li.next > a"        CSS selector for next link
  --max-pages 50              max pages to follow
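
The --paginate loop boils down to: fetch, yield, resolve the next link, stop at --max-pages or when no next link is found. A generic sketch with the fetch and next-link steps injected as callables (the function names are hypothetical, not the script's):

```python
from typing import Callable, Iterator, Optional

def paginate(start_url: str,
             fetch: Callable[[str], str],
             next_url: Callable[[str, str], Optional[str]],
             max_pages: int = 50) -> Iterator[tuple[str, str]]:
    """Yield (url, html) pairs while following next-page links.

    `fetch(url)` returns the page HTML; `next_url(html, current_url)` returns
    the absolute URL of the next page (e.g. resolved from a "li.next > a"
    selector), or None when there is no next link."""
    url, seen = start_url, set()
    for _ in range(max_pages):
        if url is None or url in seen:  # stop at the end or on a pagination loop
            break
        seen.add(url)
        html = fetch(url)
        yield url, html
        url = next_url(html, url)
```

The `seen` set guards against sites whose "next" link eventually circles back to an earlier page.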

API
  --api                       API mode
  --api-style page|offset|cursor
  --api-page-param page
  --api-size-param per_page
  --api-page-size 100
  --api-offset-param offset
  --api-cursor-param cursor
  --api-cursor-path data.next_cursor
  --api-results-path data.items
  --api-max-pages 999
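
Cursor-style pagination (--api-style cursor) reads the results list and the next cursor out of each JSON payload by dotted path, then feeds the cursor back as a query parameter until it comes back empty. A sketch with the HTTP step injected; all names are illustrative:

```python
def paginate_cursor(fetch_json, base_url,
                    cursor_param="cursor",
                    cursor_path="data.next_cursor",
                    results_path="data.items",
                    max_pages=999):
    """Collect results across cursor-paginated API responses.

    `fetch_json(url, params)` performs the GET and returns the parsed JSON."""
    def dig(obj, dotted):
        # Walk a dotted path like "data.next_cursor" into nested dicts
        for key in dotted.split("."):
            if not isinstance(obj, dict) or key not in obj:
                return None
            obj = obj[key]
        return obj

    cursor, results = None, []
    for _ in range(max_pages):
        params = {cursor_param: cursor} if cursor else {}
        payload = fetch_json(base_url, params)
        results.extend(dig(payload, results_path) or [])
        cursor = dig(payload, cursor_path)
        if not cursor:  # API signals the end with a null/missing cursor
            break
    return results
```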

EXTRACTION
  --field name "h1" css       extract single value (css/xpath/regex/json/jsonpath)
  --multi-field tags ".tag" css  extract all matches as list
  --attr fieldname href        extract attribute instead of text
  --emails                    extract all emails
  --phones                    extract all phone numbers
  --social                    extract social media links
  --tables                    extract all HTML tables
  --feeds                     parse RSS/Atom feeds
  --cookies                   capture full cookie store per page
  --save-html                 save raw rendered HTML per page
  --html-dir ./pages          where to save HTML files
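
Extraction in the style of --emails/--phones is usually just permissive regexes run over the page text, tuned for recall over precision (expect some false positives). A minimal sketch - the script's actual patterns may differ:

```python
import re

# Loose, recall-oriented patterns for page-wide contact scraping
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def extract_contacts(text: str) -> dict:
    """Return deduplicated, sorted emails and phone numbers found in text."""
    return {
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "phones": sorted(set(m.strip() for m in PHONE_RE.findall(text))),
    }
```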

OUTPUT
  --format json|jsonl|csv|sqlite|parquet|excel|html
  --out filename              output name without extension
  --out-dir ./folder          output directory
  --encoding utf-8

HTTP
  --header "Name" "Value"     add request header (repeatable)
  --cookie "name" "value"     add cookie (repeatable)
  --proxy http://user:pass@host:port  add proxy (repeatable)
  --timeout 15                read timeout seconds
  --connect-timeout 5         connect timeout seconds
  --retries 3                 retry attempts
  --backoff-base 2.0          exponential backoff base
  --backoff-max 60            max backoff seconds
  --rate 1.0                  requests per second
  --rate-capacity 5.0         burst capacity
  --jitter 0.3                random delay variance
  --no-rotate-ua              disable user-agent rotation
  --mobile                    use mobile user-agents
  --no-ssl-verify             skip SSL verification
  --no-robots                 ignore robots.txt
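
--rate/--rate-capacity describe a token bucket (steady refill at `rate` tokens/sec, burstable up to `capacity`), and --backoff-base/--backoff-max/--jitter describe capped exponential backoff with randomized jitter. A sketch of both, not the script's actual classes:

```python
import random
import time

class TokenBucket:
    """Rate limiter: each request spends one token; tokens refill over time."""
    def __init__(self, rate: float = 1.0, capacity: float = 5.0):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def acquire(self) -> float:
        """Return seconds the caller should sleep before sending the request."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at burst capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= 1
        if self.tokens >= 0:
            return 0.0
        return -self.tokens / self.rate  # time until the deficit refills

def backoff_delay(attempt: int, base: float = 2.0,
                  cap: float = 60.0, jitter: float = 0.3) -> float:
    """Capped exponential backoff: base**attempt seconds, +/- jitter fraction."""
    delay = min(cap, base ** attempt)
    return delay * (1 + random.uniform(-jitter, jitter))
```

Jitter matters when retrying against the same host: it spreads retries out so failures don't resynchronize into bursts.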

AUTH
  --auth-basic user pass
  --auth-bearer TOKEN
  --auth-form https://site/login '{"username":"x","password":"y"}'
  --auth-oauth2 https://auth/token client_id client_secret
  --auth-apikey "X-API-Key" your_key
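
--auth-oauth2 presumably performs an OAuth2 client-credentials grant: POST the client id/secret to the token endpoint, then attach the returned access token as a Bearer header (same as --auth-bearer). A sketch under that assumption; both function names are illustrative:

```python
def oauth2_token_request(token_url: str, client_id: str, client_secret: str):
    """Build the token-endpoint POST body for a client-credentials grant."""
    return token_url, {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }

def fetch_bearer_token(token_url: str, client_id: str, client_secret: str) -> str:
    """Exchange client credentials for an access token."""
    import requests  # deferred so the sketch imports cleanly without requests
    url, data = oauth2_token_request(token_url, client_id, client_secret)
    resp = requests.post(url, data=data, timeout=15)
    resp.raise_for_status()
    return resp.json()["access_token"]
```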

MEDIA
  --media                     download images, PDFs, ZIPs, video, audio
  --media-types image pdf zip video audio
  --media-selector "img[src]" CSS selector for media
  --media-dir ./downloads

MISC
  --sitemap                   auto-discover and scrape from sitemap.xml
  --sitemap-url https://...   explicit sitemap URL
  --change-detection          track changed pages between runs
  --change-db changes.db      change detection database
  --no-captcha                disable captcha detection
  --no-circuit-breaker        disable circuit breaker
  --cb-threshold 5            failures before circuit opens
  --cb-timeout 30             seconds before retry
  --no-metadata               skip og/twitter/canonical
  --no-structured-data        skip JSON-LD and microdata
  --no-stats                  skip scraper_stats.json
  --verbose                   debug logging

THIS IS NOT A PERFECT TOOL, SO BE AWARE IT WILL HAVE BUGS AND DEFECTS