Advanced Python Web Scraper - JS rendering, auto popup handling, cookies, crawling, APIs, and more
Requires Python 3.10–3.12
Install:
Code:
pip install requests beautifulsoup4 lxml aiohttp fake-useragent jsonpath-ng pandas pyarrow openpyxl playwright && playwright install chromium
 
What it does:
  • Crawls entire sites (BFS or DFS), follows links recursively to any depth
  • Renders JavaScript/React/SPA pages via Playwright - gets the actual DOM, not the skeleton
  • Auto-handles popups, cookie banners, age gates, overlays - detects and dismisses them before scraping
  • Extracts emails, phone numbers, social links, HTML tables, JSON-LD, microdata, OpenGraph, RSS/Atom feeds
  • Downloads media (images, PDFs, ZIPs, video, audio) to disk
  • Captures cookies - full browser cookie store including httpOnly and sameSite flags
  • Saves raw rendered HTML per page
  • Paginates REST APIs with page, offset, and cursor styles
  • Supports Basic, Bearer, Form login, OAuth2, and API key auth
  • Proxy rotation with per-proxy health tracking - auto-disables dead proxies
  • Token bucket rate limiter, circuit breaker, exponential backoff with jitter
  • Bloom filter URL deduplication (memory efficient at millions of URLs)
  • Change detection - tracks which pages changed between runs via SQLite
  • Outputs to JSON, JSONL, CSV, SQLite, Parquet, Excel, HTML
Example - rip everything off a JS site:
Code:
py a.py https://site.com --crawl --max-links 9999 --max-depth 10 --js --js-scroll --no-robots --emails --phones --social --tables --media --cookies --save-html --format json --out results --out-dir ./output --verbose

TARGETING
  url                         target URL (required)
  --urls url1 url2 ...        multiple URLs

CRAWLING
  --crawl                     walk every link on the site
  --crawl-strategy bfs|dfs    default: bfs
  --max-links 9999            max pages to visit
  --max-depth 10              how many levels deep
  --link-selector "a.post"    only follow matching links
  --link-pattern "/blog/\d+"  filter links by regex
  --cross-domain              follow links to other domains
  --no-dedup                  don't skip already-seen URLs
  --no-bloom                  use plain set instead of bloom filter
  --bloom-capacity 1000000    bloom filter capacity
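
For context on --no-bloom/--bloom-capacity: a Bloom filter answers "have I seen this URL?" in fixed memory, with no false negatives and a tunable false-positive rate. A minimal sketch of the idea, not the script's actual class (the sizing formulas are the standard ones):

```python
import hashlib
import math

class BloomFilter:
    """Probabilistic visited-URL set: fixed memory, rare false positives."""
    def __init__(self, capacity: int = 1_000_000, error_rate: float = 0.01):
        # Standard sizing: m = -n*ln(p)/ln(2)^2 bits, k = (m/n)*ln(2) hashes
        self.size = int(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.hashes = max(1, round(self.size / capacity * math.log(2)))
        self.bits = bytearray(self.size // 8 + 1)

    def _positions(self, url: str):
        # Double hashing: derive k probe positions from one SHA-256 digest
        digest = hashlib.sha256(url.encode()).digest()
        a = int.from_bytes(digest[:8], "big")
        b = int.from_bytes(digest[8:16], "big")
        return [(a + i * b) % self.size for i in range(self.hashes)]

    def add(self, url: str) -> None:
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))
```

At the default 1,000,000 capacity this needs roughly 1.2 MB of bits, versus hundreds of MB for a plain Python set of URL strings - which is why --no-bloom only makes sense for small crawls.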

JAVASCRIPT
  --js                        render via Playwright (use for React/SPA)
  --js-wait networkidle       load | domcontentloaded | networkidle
  --js-scroll                 infinite scroll before extracting
  --js-scroll-count 5         scroll attempts
  --js-screenshot             save full-page PNG per URL
  --js-screenshot-dir ./shots where to save screenshots
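
What --js and --js-scroll do, roughly: drive headless Chromium through Playwright's sync API, wait for the chosen load state, optionally scroll a few times, then hand back the rendered DOM. A sketch under that assumption - `render_page` and `VALID_WAITS` are illustrative names, not the script's API:

```python
VALID_WAITS = {"load", "domcontentloaded", "networkidle"}

def render_page(url: str, wait: str = "networkidle", scroll_count: int = 0) -> str:
    """Return the fully rendered HTML of a JS-heavy page."""
    if wait not in VALID_WAITS:
        raise ValueError(f"--js-wait must be one of {sorted(VALID_WAITS)}")
    # Import deferred so the sketch can be read/imported without Playwright installed
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until=wait)
        for _ in range(scroll_count):
            # Crude infinite-scroll trigger (what --js-scroll-count controls)
            page.mouse.wheel(0, 20000)
            page.wait_for_timeout(1000)
        html = page.content()
        browser.close()
    return html
```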

PAGINATION
  --paginate                  follow next-page links
  --next "li.next > a"        CSS selector for next link
  --max-pages 50              max pages to follow
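
The --paginate loop boils down to: fetch, yield, resolve the next link, stop at --max-pages or when no next link is found. A generic sketch with the fetch and next-link steps injected as callables (the function names are hypothetical, not the script's):

```python
from typing import Callable, Iterator, Optional

def paginate(start_url: str,
             fetch: Callable[[str], str],
             next_url: Callable[[str, str], Optional[str]],
             max_pages: int = 50) -> Iterator[tuple[str, str]]:
    """Yield (url, html) pairs while following next-page links.

    `fetch(url)` returns the page HTML; `next_url(html, current_url)` returns
    the absolute URL of the next page (e.g. resolved from a "li.next > a"
    selector), or None when there is no next link."""
    url, seen = start_url, set()
    for _ in range(max_pages):
        if url is None or url in seen:  # stop at the end or on a pagination loop
            break
        seen.add(url)
        html = fetch(url)
        yield url, html
        url = next_url(html, url)
```

The `seen` set guards against sites whose "next" link eventually circles back to an earlier page.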

API
  --api                       API mode
  --api-style page|offset|cursor
  --api-page-param page
  --api-size-param per_page
  --api-page-size 100
  --api-offset-param offset
  --api-cursor-param cursor
  --api-cursor-path data.next_cursor
  --api-results-path data.items
  --api-max-pages 999
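
Cursor-style pagination (--api-style cursor) reads the results list and the next cursor out of each JSON payload by dotted path, then feeds the cursor back as a query parameter until it comes back empty. A sketch with the HTTP step injected; all names are illustrative:

```python
def paginate_cursor(fetch_json, base_url,
                    cursor_param="cursor",
                    cursor_path="data.next_cursor",
                    results_path="data.items",
                    max_pages=999):
    """Collect results across cursor-paginated API responses.

    `fetch_json(url, params)` performs the GET and returns the parsed JSON."""
    def dig(obj, dotted):
        # Walk a dotted path like "data.next_cursor" into nested dicts
        for key in dotted.split("."):
            if not isinstance(obj, dict) or key not in obj:
                return None
            obj = obj[key]
        return obj

    cursor, results = None, []
    for _ in range(max_pages):
        params = {cursor_param: cursor} if cursor else {}
        payload = fetch_json(base_url, params)
        results.extend(dig(payload, results_path) or [])
        cursor = dig(payload, cursor_path)
        if not cursor:  # API signals the end with a null/missing cursor
            break
    return results
```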

EXTRACTION
  --field name "h1" css       extract single value (css/xpath/regex/json/jsonpath)
  --multi-field tags ".tag" css  extract all matches as list
  --attr fieldname href        extract attribute instead of text
  --emails                    extract all emails
  --phones                    extract all phone numbers
  --social                    extract social media links
  --tables                    extract all HTML tables
  --feeds                     parse RSS/Atom feeds
  --cookies                   capture full cookie store per page
  --save-html                 save raw rendered HTML per page
  --html-dir ./pages          where to save HTML files
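
Extraction in the style of --emails/--phones is usually just permissive regexes run over the page text, tuned for recall over precision (expect some false positives). A minimal sketch - the script's actual patterns may differ:

```python
import re

# Loose, recall-oriented patterns for page-wide contact scraping
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def extract_contacts(text: str) -> dict:
    """Return deduplicated, sorted emails and phone numbers found in text."""
    return {
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "phones": sorted(set(m.strip() for m in PHONE_RE.findall(text))),
    }
```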

OUTPUT
  --format json|jsonl|csv|sqlite|parquet|excel|html
  --out filename              output name without extension
  --out-dir ./folder          output directory
  --encoding utf-8

HTTP
  --header "Name" "Value"     add request header (repeatable)
  --cookie "name" "value"     add cookie (repeatable)
  --proxy http://user:pass@host:port  add proxy (repeatable)
  --timeout 15                read timeout seconds
  --connect-timeout 5         connect timeout seconds
  --retries 3                 retry attempts
  --backoff-base 2.0          exponential backoff base
  --backoff-max 60            max backoff seconds
  --rate 1.0                  requests per second
  --rate-capacity 5.0         burst capacity
  --jitter 0.3                random delay variance
  --no-rotate-ua              disable user-agent rotation
  --mobile                    use mobile user-agents
  --no-ssl-verify             skip SSL verification
  --no-robots                 ignore robots.txt
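
--rate/--rate-capacity describe a token bucket (steady refill at `rate` tokens/sec, burstable up to `capacity`), and --backoff-base/--backoff-max/--jitter describe capped exponential backoff with randomized jitter. A sketch of both, not the script's actual classes:

```python
import random
import time

class TokenBucket:
    """Rate limiter: each request spends one token; tokens refill over time."""
    def __init__(self, rate: float = 1.0, capacity: float = 5.0):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def acquire(self) -> float:
        """Return seconds the caller should sleep before sending the request."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at burst capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= 1
        if self.tokens >= 0:
            return 0.0
        return -self.tokens / self.rate  # time until the deficit refills

def backoff_delay(attempt: int, base: float = 2.0,
                  cap: float = 60.0, jitter: float = 0.3) -> float:
    """Capped exponential backoff: base**attempt seconds, +/- jitter fraction."""
    delay = min(cap, base ** attempt)
    return delay * (1 + random.uniform(-jitter, jitter))
```

Jitter matters when retrying against the same host: it spreads retries out so failures don't resynchronize into bursts.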

AUTH
  --auth-basic user pass
  --auth-bearer TOKEN
  --auth-form https://site/login '{"username":"x","password":"y"}'
  --auth-oauth2 https://auth/token client_id client_secret
  --auth-apikey "X-API-Key" your_key
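
--auth-oauth2 presumably performs an OAuth2 client-credentials grant: POST the client id/secret to the token endpoint, then attach the returned access token as a Bearer header (same as --auth-bearer). A sketch under that assumption; both function names are illustrative:

```python
def oauth2_token_request(token_url: str, client_id: str, client_secret: str):
    """Build the token-endpoint POST body for a client-credentials grant."""
    return token_url, {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }

def fetch_bearer_token(token_url: str, client_id: str, client_secret: str) -> str:
    """Exchange client credentials for an access token."""
    import requests  # deferred so the sketch imports cleanly without requests
    url, data = oauth2_token_request(token_url, client_id, client_secret)
    resp = requests.post(url, data=data, timeout=15)
    resp.raise_for_status()
    return resp.json()["access_token"]
```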

MEDIA
  --media                     download images, PDFs, ZIPs, video, audio
  --media-types image pdf zip video audio
  --media-selector "img[src]" CSS selector for media
  --media-dir ./downloads

MISC
  --sitemap                   auto-discover and scrape from sitemap.xml
  --sitemap-url https://...   explicit sitemap URL
  --change-detection          track changed pages between runs
  --change-db changes.db      change detection database
  --no-captcha                disable captcha detection
  --no-circuit-breaker        disable circuit breaker
  --cb-threshold 5            failures before circuit opens
  --cb-timeout 30             seconds before retry
  --no-metadata               skip og/twitter/canonical
  --no-structured-data        skip JSON-LD and microdata
  --no-stats                  skip scraper_stats.json
  --verbose                   debug logging

THIS IS NOT A PERFECT TOOL, SO BE AWARE IT WILL HAVE BUGS AND DEFECTS