Stop using one tool for every job.
Here’s the mistake most scraping teams make: they pick one scraping service and use it for everything. If they chose Bright Data, they’re paying $25 per 1,000 requests for URLs that could be scraped with a $1 proxy. If they chose a cheap service, they’re getting near-zero success rates on anti-bot-protected sites and wondering why their data is incomplete.
The correct architecture is obvious once you see it: use cheap tools for easy URLs and specialized tools for hard ones. This post gives you the exact blueprint.
The scraping pipeline architecture
```
URLs → Classification → Router → [Easy: Cheap Service]       → Data Store
                               → [Hard: UltraWebScrapingAPI] → Data Store
```
Simple. Three components: classify, route, collect. Let’s build each one.
Step 1: Classify URLs by protection level
Not all URLs are created equal. Before scraping a single page, classify your target URLs into protection tiers:
Tier 1: No protection (70% of the web)
These sites have no anti-bot protection at all. Static sites, small business websites, blogs, government data portals, most news sites.
Detection method: Send a simple HTTP GET with a basic User-Agent header. If you get a 200 with full HTML content, it’s Tier 1.
Cost to scrape: Nearly free. A basic proxy or even direct requests work fine.
Tier 2: Basic protection (20% of the web)
These sites have basic protections: rate limiting, simple bot checks, Cloudflare Free/Pro tier, basic JavaScript rendering requirements.
Detection method: If a simple GET returns a challenge page (Cloudflare “checking your browser” interstitial), a 429 status, or requires JavaScript to render content, it’s Tier 2.
Cost to scrape: Cheap. Bright Data, ScraperAPI, or Oxylabs handle these fine. $2-5 per 1,000 pages.
Tier 3: Advanced anti-bot protection (10% of the web)
These sites use enterprise anti-bot systems: Akamai Bot Manager, DataDome, PerimeterX (HUMAN), Kasada, Cloudflare Enterprise with Bot Management, Imperva Advanced Bot Protection.
Detection method: If the page loads but then presents a challenge, if you see _abck cookies (Akamai), datadome cookies (DataDome), _px cookies (PerimeterX), or if basic rendering returns empty content despite a 200 status — it’s Tier 3.
Cost to scrape with generic services: Effectively infinite. Bright Data charges you $25/1K for requests that fail 90%+ of the time. You’ll burn hundreds of dollars and get no data.
Cost to scrape with UltraWebScrapingAPI: $0.05/request ($50/1K) with 99%+ success.
Automated classification script
Here’s a practical approach to auto-classify URLs:
```python
import requests

# Cookie/body markers set by enterprise anti-bot vendors (Tier 3)
TIER3_MARKERS = ["_abck", "akamai", "datadome", "_px", "kasada", "cf-mitigated"]

def classify_url(url):
    # Tier 1 check: plain HTTP request with a basic User-Agent
    try:
        resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    except requests.RequestException:
        # Unreachable without a real browser or proxy: assume some protection
        return "tier2"

    body = resp.text.lower()
    cookies = resp.headers.get("set-cookie", "").lower()

    # Tier 3 check: enterprise anti-bot markers in cookies or page body
    for marker in TIER3_MARKERS:
        if marker in cookies or marker in body:
            return "tier3"

    # Tier 1: a 200 with substantial HTML and no protection markers
    if resp.status_code == 200 and len(resp.text) > 1000:
        return "tier1"

    # Everything else (challenge pages, 429s, JS-only rendering): basic protection
    return "tier2"
```
Run this against your URL list before scraping. It takes seconds and saves thousands of dollars.
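To get a feel for your tier mix before committing budget, tally the classifier's output over a sample of the list. A minimal sketch — the classifier here is a toy stand-in (in practice you would call `classify_url` from the snippet above), and the URLs are made up for illustration:

```python
from collections import Counter

def toy_classify(url):
    # Toy stand-in for the real classify_url() above: pretend any URL
    # containing "protected" is Tier 3 and everything else is Tier 1.
    return "tier3" if "protected" in url else "tier1"

urls = [
    "https://example-blog.example/post/1",
    "https://protected-airline.example/fares",
    "https://example-gov.example/data.csv",
]

# Count how many URLs land in each tier
tier_counts = Counter(toy_classify(u) for u in urls)
print(tier_counts)
```

The resulting distribution tells you, before spending anything, roughly what share of your budget the Tier 3 service will consume.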
Step 2: Route easy URLs to Bright Data or ScraperAPI
For Tier 1 and Tier 2 URLs, generic services work great and are cost-effective.
Bright Data for Tier 2 URLs
Bright Data’s Web Unlocker handles basic Cloudflare challenges, simple rate limiting, and JavaScript-rendered pages well. At $25/1K requests with 90%+ success on Tier 2 sites, the effective cost is about $25-28 per 1,000 successful pages.
ScraperAPI for budget Tier 2
ScraperAPI is cheaper ($49/month for 100K requests = ~$0.49/1K). Their success rate on Tier 2 sites is 70-80%, so the effective cost is about $0.60-0.70 per 1,000 successful pages. Excellent value for non-critical data.
Direct requests for Tier 1
For Tier 1 URLs, just send direct HTTP requests through a basic proxy. $1/GB residential proxy costs are typical. For most scraping volumes, this is essentially free.
```python
def scrape_easy(url, tier):
    if tier == "tier1":
        # basic_proxy: any cheap residential or datacenter proxy endpoint
        return requests.get(url, proxies={"https": basic_proxy}, timeout=10).text
    elif tier == "tier2":
        # Your Tier 2 vendor's client, e.g. scraperapi_client or bright_data_client
        return scraperapi_client.get(url)
```
Don’t overthink this part. These URLs are the solved problem. Use whatever’s cheapest and reliable enough for your needs.
Step 3: Route anti-bot URLs to UltraWebScrapingAPI
This is where the real engineering challenge is — and where your choice of service determines whether you get data or get blocked.
Why generic services fail on Tier 3
We’ve written extensively about this, but the summary is:
- Bright Data: IP rotation doesn’t beat fingerprinting-based detection. 0-10% success on Akamai/DataDome sites.
- ScraperAPI: No anti-bot bypass capability at all. Near 0% on any Tier 3 site.
- Oxylabs: Same proxy-rotation approach, same failures. The advertised “100% success rate” comes with an asterisk that excludes anti-bot sites.
- ZenRows: Better than ScraperAPI but still uses shared browser pools. 10-30% on Tier 3 at best.
- Apify: Platform, not a solution. You build the bypass yourself. Good luck.
UltraWebScrapingAPI for Tier 3
UltraWebScrapingAPI is built specifically for the URLs that generic services can’t handle. Our per-site custom analysis approach achieves 99%+ success on Akamai, DataDome, PerimeterX, Kasada, Cloudflare Enterprise, and Imperva protected sites.
```python
def scrape_hard(url):
    response = requests.get(
        "https://api.ultrawebscrapingapi.com/v1/scrape",
        params={
            "url": url,
            "api_key": ULTRA_API_KEY,
        },
    )
    return response.json()["html"]
```
Simple API call. No proxy management. No browser configuration. No anti-bot debugging. You send a URL, you get HTML back.
The routing logic
```python
def scrape(url):
    tier = classify_url(url)
    if tier == "tier1":
        return direct_request(url)
    elif tier == "tier2":
        return scraperapi_get(url)
    elif tier == "tier3":
        return ultra_scrape(url)
```
That’s the entire pipeline. Classify, route, collect. Add retry logic and error handling as needed, but the core architecture is this simple.
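A minimal retry sketch, in case it's useful as a starting point — `scrape_fn` stands in for any of the per-tier functions above, and the backoff values are illustrative:

```python
import time

def scrape_with_retries(url, scrape_fn, max_attempts=3, delay=1.0):
    """Call scrape_fn(url), retrying with exponential backoff on failure."""
    for attempt in range(max_attempts):
        try:
            html = scrape_fn(url)
            if html:  # treat empty responses as failures too
                return html
        except Exception:
            pass  # log the error in production; swallowed here for brevity
        if attempt < max_attempts - 1:
            time.sleep(delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return None
```

Returning `None` after the final attempt lets the caller decide whether to escalate the URL to a higher tier or drop it.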
Cost optimization math
Let’s say you need to scrape 100,000 URLs per month. Based on typical distributions:
| Tier | URLs | Service | Cost per 1K | Success Rate | Total Cost |
|---|---|---|---|---|---|
| Tier 1 (70%) | 70,000 | Direct + basic proxy | $0.10 | 99% | $7 |
| Tier 2 (20%) | 20,000 | ScraperAPI | $0.49 | 80% | $12.25 |
| Tier 3 (10%) | 10,000 | UltraWebScrapingAPI | $50 | 99%+ | $500 |
| Total | 100,000 | Mixed | — | — | ~$520 |
Now compare: using Bright Data for everything at $25/1K:
| Tier | URLs | Service | Cost per 1K | Success Rate | Effective Cost |
|---|---|---|---|---|---|
| Tier 1 | 70,000 | Bright Data | $25 | 99% | $1,768 |
| Tier 2 | 20,000 | Bright Data | $25 | 90% | $556 |
| Tier 3 | 10,000 | Bright Data | $25 | 5% | $5,000 |
| Total | 100,000 | Bright Data | — | — | ~$7,324 |
$520 vs. $7,324. Same data. Same 100,000 URLs. The tiered pipeline is 14x cheaper.
And that’s being generous to Bright Data. Their 5% success rate on Tier 3 means you’re probably not getting that data at all — you’re just paying for failed requests until you give up.
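The arithmetic generalizes to any URL mix. A quick sanity-check sketch using effective cost = (successes needed ÷ success rate) × price per request; note the first table quotes raw rather than retry-adjusted cost for Tier 1 and Tier 3, so the totals land a few dollars apart:

```python
def effective_cost(pages_needed, price_per_1k, success_rate):
    # Expected attempts to collect pages_needed successes, times unit price
    return (pages_needed / success_rate) * (price_per_1k / 1000)

tiered = [  # (successful pages needed, price per 1K, success rate)
    (70_000, 0.10, 0.99),   # Tier 1: direct + basic proxy
    (20_000, 0.49, 0.80),   # Tier 2: ScraperAPI
    (10_000, 50.00, 0.99),  # Tier 3: UltraWebScrapingAPI
]
bright_only = [
    (70_000, 25.00, 0.99),
    (20_000, 25.00, 0.90),
    (10_000, 25.00, 0.05),
]

tiered_total = sum(effective_cost(*row) for row in tiered)
bright_total = sum(effective_cost(*row) for row in bright_only)
print(f"tiered: ${tiered_total:,.0f}  bright-only: ${bright_total:,.0f}")
```

Swap in your own URL counts and vendor quotes to see where the crossover sits for your workload.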
Success rate maximization
Cost isn’t the only metric. If you’re scraping for business-critical data, missed pages mean missed revenue.
With the tiered pipeline:
- Tier 1: 99% success = 69,300 successful pages
- Tier 2: 80% success (with retries: ~95%) = 19,000 successful pages
- Tier 3: 99%+ success = 9,900+ successful pages
- Total: ~98,200 successful pages out of 100,000
With Bright Data for everything:
- Tier 1: 99% = 69,300
- Tier 2: 90% = 18,000
- Tier 3: 5% = 500
- Total: ~87,800 successful pages — and 9,500 of your hardest, most valuable URLs are missing
The Tier 3 URLs are often the most valuable — they’re protected because the data behind them is worth protecting. Airline prices, financial data, competitive intelligence from major e-commerce platforms. Missing 95% of those pages isn’t just a technical failure. It’s a business failure.
Implementation checklist
- Build your URL classifier. Start simple — check for anti-bot cookies and challenge pages. Refine over time as you learn which sites are Tier 2 vs. Tier 3.
- Set up Tier 1 scraping. Basic HTTP client + proxy pool. Don’t overthink it.
- Set up Tier 2 scraping. ScraperAPI or Bright Data. Pick based on your volume and budget.
- Set up Tier 3 scraping with UltraWebScrapingAPI. Sign up, get an API key, submit your hardest URLs for custom analysis.
- Build the router. Simple conditional logic based on classification. Add monitoring to track success rates per tier and per service.
- Monitor and reclassify. Sites change their protection. A Tier 2 site might add DataDome and become Tier 3. A Tier 3 site might relax protections. Re-classify monthly.
Stop using a sledgehammer for everything
Bright Data is a great sledgehammer. But you don’t need a sledgehammer for most URLs, and a sledgehammer doesn’t work on the hardest ones. Build a pipeline that uses the right tool for each job, and you’ll get better data at a fraction of the cost.
Ready to handle the hard URLs? Try UltraWebScrapingAPI in our free playground — paste your toughest Tier 3 URLs and see 99%+ success rates in action.