OP 15 March, 2026 - 11:09 PM
Advanced Python Web Scraper - JS rendering, auto popup handling, cookies, crawling, APIs, and more
For Python: 3.10–3.12
Install:
Code:
pip install requests beautifulsoup4 lxml aiohttp fake-useragent jsonpath-ng pandas pyarrow openpyxl playwright && playwright install chromium
What it does:
- Crawls entire sites (BFS or DFS), follows links recursively to any depth
- Renders JavaScript/React/SPA pages via Playwright - gets the actual DOM, not the skeleton
- Auto-handles popups, cookie banners, age gates, overlays - detects and dismisses them before scraping
- Extracts emails, phone numbers, social links, HTML tables, JSON-LD, microdata, OpenGraph, RSS/Atom feeds
- Downloads media (images, PDFs, ZIPs, video, audio) to disk
- Captures cookies - full browser cookie store including httpOnly and sameSite flags
- Saves raw rendered HTML per page
- Paginates REST APIs with page, offset, and cursor styles
- Supports Basic, Bearer, Form login, OAuth2, and API key auth
- Proxy rotation with per-proxy health tracking - auto-disables dead proxies
- Token bucket rate limiter, circuit breaker, exponential backoff with jitter
- Bloom filter URL deduplication (memory efficient at millions of URLs)
- Change detection - tracks which pages changed between runs via SQLite
- Outputs to JSON, JSONL, CSV, SQLite, Parquet, Excel, HTML
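The token-bucket limiter mentioned above works roughly like this (an illustrative Python sketch of the `--rate` / `--rate-capacity` idea, not the tool's actual code; the class and names are made up):

```python
import time

class TokenBucket:
    """Refill at `rate` tokens/sec up to `capacity`; each request costs one token.
    Hypothetical sketch of the --rate / --rate-capacity behavior."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start with a full burst allowance
        self.last = time.monotonic()

    def acquire(self, n=1.0):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True                 # request allowed now
        return False                    # caller should sleep and retry

bucket = TokenBucket(rate=1.0, capacity=5.0)
allowed = [bucket.acquire() for _ in range(7)]
print(allowed.count(True))  # -> 5: the burst capacity passes, the rest are throttled
```

Sustained throughput settles at `rate` requests per second once the burst is spent.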
Code:
py a.py https://site.com --crawl --max-links 9999 --max-depth 10 --js --js-scroll --no-robots --emails --phones --social --tables --media --cookies --save-html --format json --out results --out-dir ./output --verbose
TARGETING
url target URL (required)
--urls url1 url2 ... multiple URLs
CRAWLING
--crawl walk every link on the site
--crawl-strategy bfs|dfs default: bfs
--max-links 9999 max pages to visit
--max-depth 10 how many levels deep
--link-selector "a.post" only follow matching links
--link-pattern "/blog/\d+" filter links by regex
--cross-domain follow links to other domains
--no-dedup don't skip already-seen URLs
--no-bloom use plain set instead of bloom filter
--bloom-capacity 1000000 bloom filter capacity
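The `--crawl-strategy` switch is just the classic frontier trick: pop from the front for BFS, from the back for DFS. A hypothetical sketch (not the tool's code) over a toy link graph:

```python
from collections import deque

def crawl_order(graph, start, strategy="bfs", max_depth=10):
    """Visit order over a link-graph dict. One deque serves both modes:
    popleft() -> breadth-first, pop() -> depth-first."""
    frontier = deque([(start, 0)])
    seen = {start}                      # plain-set dedup, the --no-bloom analogue
    order = []
    while frontier:
        url, depth = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        if depth >= max_depth:          # the --max-depth cutoff
            continue
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return order

site = {"/": ["/a", "/b"], "/a": ["/a1"], "/b": ["/b1"]}
print(crawl_order(site, "/", "bfs"))  # ['/', '/a', '/b', '/a1', '/b1']
print(crawl_order(site, "/", "dfs"))  # ['/', '/b', '/b1', '/a', '/a1']
```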
JAVASCRIPT
--js render via Playwright (use for React/SPA)
--js-wait load|domcontentloaded|networkidle default: networkidle
--js-scroll infinite scroll before extracting
--js-scroll-count 5 scroll attempts
--js-screenshot save full-page PNG per URL
--js-screenshot-dir ./shots where to save screenshots
PAGINATION
--paginate follow next-page links
--next "li.next > a" CSS selector for next link
--max-pages 50 max pages to follow
API
--api API mode
--api-style page|offset|cursor
--api-page-param page
--api-size-param per_page
--api-page-size 100
--api-offset-param offset
--api-cursor-param cursor
--api-cursor-path data.next_cursor
--api-results-path data.items
--api-max-pages 999
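Cursor-style paging (`--api-cursor-path` / `--api-results-path`) boils down to a loop that digs dotted paths out of each JSON response and stops when the cursor runs dry. A sketch with `fetch` standing in for the real HTTP call:

```python
def paginate_cursor(fetch, cursor_path="data.next_cursor",
                    results_path="data.items", max_pages=999):
    """Follow a cursor-paged API. `fetch(cursor)` is a hypothetical stand-in
    for the HTTP request; dotted paths index into the JSON response."""
    def dig(obj, path):
        for key in path.split("."):
            obj = obj.get(key) if isinstance(obj, dict) else None
        return obj
    items, cursor = [], None
    for _ in range(max_pages):          # the --api-max-pages safety cap
        page = fetch(cursor)
        items.extend(dig(page, results_path) or [])
        cursor = dig(page, cursor_path)
        if not cursor:                  # API signals the end with a null cursor
            break
    return items

# Fake three-page API for demonstration.
pages = {None: {"data": {"items": [1, 2], "next_cursor": "c1"}},
         "c1": {"data": {"items": [3], "next_cursor": "c2"}},
         "c2": {"data": {"items": [4], "next_cursor": None}}}
print(paginate_cursor(pages.__getitem__))  # [1, 2, 3, 4]
```

Page- and offset-style paging are the same loop with the stop condition switched to "fewer results than the page size".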
EXTRACTION
--field name "h1" css extract single value (css/xpath/regex/json/jsonpath)
--multi-field tags ".tag" css extract all matches as list
--attr fieldname href extract attribute instead of text
--emails extract all emails
--phones extract all phone numbers
--social extract social media links
--tables extract all HTML tables
--feeds parse RSS/Atom feeds
--cookies capture full cookie store per page
--save-html save raw rendered HTML per page
--html-dir ./pages where to save HTML files
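`--emails` / `--phones` extraction is regex-driven at heart. Rough patterns in that spirit (illustrative only; production patterns handle far more edge cases):

```python
import re

# Deliberately loose patterns -- real-world contact extraction is fuzzier.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def extract_contacts(html):
    """Return (sorted unique emails, phone-number strings) found in raw HTML."""
    emails = sorted(set(EMAIL_RE.findall(html)))
    phones = [p.strip() for p in PHONE_RE.findall(html)]
    return emails, phones

emails, phones = extract_contacts(
    "<p>Mail sales@example.com or ops@example.com, call +1 (555) 010-2030</p>")
print(emails)   # ['ops@example.com', 'sales@example.com']
print(phones)   # ['+1 (555) 010-2030']
```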
OUTPUT
--format json|jsonl|csv|sqlite|parquet|excel|html
--out filename output name without extension
--out-dir ./folder output directory
--encoding utf-8
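On the format choice: JSONL writes one object per line (streams and appends safely), while CSV flattens to columns for spreadsheets. A stdlib sketch of the difference (not the tool's writer):

```python
import csv, io, json

records = [{"url": "https://site.com/a", "title": "A"},
           {"url": "https://site.com/b", "title": "B"}]

# jsonl: one JSON object per line; each line parses independently.
jsonl = "\n".join(json.dumps(r) for r in records)

# csv: header row plus flat columns.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["url", "title"])
writer.writeheader()
writer.writerows(records)

print(len(jsonl.splitlines()))  # -> 2, one line per record
```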
HTTP
--header "Name" "Value" add request header (repeatable)
--cookie "name" "value" add cookie (repeatable)
--proxy http://user:pass@host:port add proxy (repeatable)
--timeout 15 read timeout seconds
--connect-timeout 5 connect timeout seconds
--retries 3 retry attempts
--backoff-base 2.0 exponential backoff base
--backoff-max 60 max backoff seconds
--rate 1.0 requests per second
--rate-capacity 5.0 burst capacity
--jitter 0.3 random delay variance
--no-rotate-ua disable user-agent rotation
--mobile use mobile user-agents
--no-ssl-verify skip SSL verification
--no-robots ignore robots.txt
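The `--backoff-base` / `--backoff-max` / `--jitter` knobs describe standard exponential backoff: delay grows as base^attempt, caps at a maximum, and gets a random fraction added so retries from many workers don't synchronize. A sketch (my own names, not the tool's):

```python
import random

def backoff_delay(attempt, base=2.0, max_delay=60.0, jitter=0.3, rng=random.random):
    """Delay before retry `attempt` (0-based): base**attempt capped at
    max_delay, plus up to `jitter` fraction of random extra."""
    delay = min(base ** attempt, max_delay)
    return delay * (1.0 + jitter * rng())

# With jitter disabled, the schedule is the plain capped exponential.
print([backoff_delay(a, jitter=0.0) for a in range(8)])
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```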
AUTH
--auth-basic user pass
--auth-bearer TOKEN
--auth-form https://site/login '{"username":"x","password":"y"}'
--auth-oauth2 https://auth/token client_id client_secret
--auth-apikey "X-API-Key" your_key
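For reference, `--auth-basic user pass` ultimately just sends a standard Basic auth header, which is the base64 of "user:pass":

```python
import base64

def basic_auth_header(user, password):
    """Build the 'Authorization: Basic <base64(user:pass)>' header
    that HTTP Basic auth sends on every request."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

print(basic_auth_header("user", "pass"))
# {'Authorization': 'Basic dXNlcjpwYXNz'}
```

Note base64 is encoding, not encryption, so Basic auth is only safe over HTTPS.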
MEDIA
--media download images, PDFs, ZIPs etc
--media-types image pdf zip video audio
--media-selector "img[src]" CSS selector for media
--media-dir ./downloads
MISC
--sitemap auto-discover and scrape from sitemap.xml
--sitemap-url https://... explicit sitemap URL
--change-detection track changed pages between runs
--change-db changes.db change detection database
--no-captcha disable captcha detection
--no-circuit-breaker disable circuit breaker
--cb-threshold 5 failures before circuit opens
--cb-timeout 30 seconds before retry
--no-metadata skip og/twitter/canonical
--no-structured-data skip JSON-LD and microdata
--no-stats skip scraper_stats.json
--verbose debug logging
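The `--change-detection` / `--change-db` idea is a content hash per URL stored in SQLite, compared against the previous run. A minimal sketch (the schema and function are made up, not the tool's actual database):

```python
import hashlib, sqlite3

def changed(db, url, html):
    """Store a SHA-256 of the page per URL; report whether it differs
    from what we saw last run (True on first sighting)."""
    db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, hash TEXT)")
    digest = hashlib.sha256(html.encode()).hexdigest()
    row = db.execute("SELECT hash FROM pages WHERE url = ?", (url,)).fetchone()
    db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, digest))
    return row is None or row[0] != digest

db = sqlite3.connect(":memory:")
print(changed(db, "/a", "<h1>v1</h1>"))  # True: first sighting
print(changed(db, "/a", "<h1>v1</h1>"))  # False: unchanged since last run
print(changed(db, "/a", "<h1>v2</h1>"))  # True: content changed
```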
THIS IS NOT A PERFECT TOOL, SO BE AWARE IT WILL HAVE BUGS AND DEFECTS
Virustotal link https://www.virustotal.com/gui/file/8391b1f096a7135b405e864bc514027fc616065e5e2f25760591aca27ad6aef5?nocache=1