Rule-based detection is dead. ML took over.

Five years ago, anti-bot detection was simple. Block known bot User-Agents. Rate-limit IPs that send too many requests. Check for missing headers. If you could spoof the right headers and rotate proxies, you were invisible.

That era is over.

DataDome, Akamai Bot Manager, and PerimeterX (now HUMAN Security) have replaced rule-based detection with machine learning models that analyze hundreds of signals simultaneously. These models don’t look for a single “gotcha” — they evaluate the probability that a request comes from a human, and they do it in under 2 milliseconds.

If you’re still relying on header spoofing and proxy rotation in 2026, you’re bringing a knife to a machine-gun fight.

How ML-based bot detection actually works

Training data: billions of labeled requests

The foundation of any ML model is training data. DataDome processes over 5 trillion requests per year. Akamai sees roughly 30% of all global web traffic. PerimeterX protects thousands of the world’s largest websites.

Each of these companies has built massive datasets of labeled traffic — requests that are confirmed human and requests that are confirmed bot. When your scraper hits an Akamai-protected site, it’s being evaluated by a model that has seen billions of bot requests before yours. Every proxy service. Every headless browser. Every stealth plugin. The model has seen them all.

Feature extraction: what the models analyze

ML models don’t look at one signal. They extract hundreds of features from each request and session:

Network-level features:

  • TLS fingerprint (JA3/JA4 hash)
  • HTTP/2 settings and frame ordering
  • Header order and capitalization
  • TCP window size and connection behavior
  • IP reputation and ASN classification
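To make the first bullet concrete: a JA3 fingerprint is just the MD5 of a comma-separated string built from fields of the TLS ClientHello (version, cipher suites, extensions, elliptic curves, point formats, each list joined with dashes). The field values below are illustrative, not captured from a real handshake — the point is that cipher *order* alone changes the hash:

```python
# Minimal sketch of how a JA3-style TLS fingerprint is derived.
# Field values are illustrative, not taken from a real ClientHello.
import hashlib

def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string (comma-separated fields, dash-separated
    values) and return its MD5 hex digest, per the JA3 definition."""
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return hashlib.md5(",".join(fields).encode()).hexdigest()

# Two clients offering the same ciphers in a different order produce
# different hashes -- which is why an HTTP library and Chrome are
# trivially distinguishable at the TLS layer.
chrome_like  = ja3_hash(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0])
library_like = ja3_hash(771, [4866, 4865, 4867], [0, 23, 65281], [29, 23, 24], [0])
print(chrome_like != library_like)  # True
```

This is why spoofing a User-Agent string does nothing for you at the network layer: the fingerprint is computed from bytes your TLS stack sends before any HTTP header exists.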

Browser-level features:

  • JavaScript execution environment (navigator properties, window object)
  • Canvas and WebGL rendering output
  • Audio context fingerprint
  • Font enumeration results
  • Screen dimensions, color depth, pixel ratio
  • Timezone, language, platform consistency

Behavioral features:

  • Mouse movement patterns (velocity, acceleration, curvature)
  • Scroll behavior (speed, direction changes, scroll depth)
  • Keystroke dynamics (typing speed, key hold duration)
  • Click patterns (position variance, timing)
  • Page navigation flow (referrer chain, resource loading order)
  • Request timing distribution (human browsing follows a Poisson-like distribution; bots don’t)
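The last bullet is easy to demonstrate. A Poisson-like arrival process has exponentially distributed gaps between requests, so the coefficient of variation (stdev/mean) of the gaps sits near 1; a script sleeping a fixed interval produces a CV near 0. This is a toy illustration of the statistic, not any vendor's actual feature:

```python
# Toy illustration: coefficient of variation of inter-request gaps
# separates Poisson-like human timing (CV near 1) from a bot that
# sleeps a fixed interval between requests (CV near 0).
import random
import statistics

def timing_cv(gaps):
    """Coefficient of variation of inter-request gaps."""
    return statistics.stdev(gaps) / statistics.mean(gaps)

random.seed(0)
human_gaps = [random.expovariate(1 / 3.0) for _ in range(500)]      # Poisson-like
bot_gaps   = [3.0 + random.uniform(-0.01, 0.01) for _ in range(500)]  # scripted sleep

print(timing_cv(human_gaps) > 0.7, timing_cv(bot_gaps) < 0.05)  # True True
```

Adding random jitter to your sleep calls moves the CV a little, but matching the full shape of a human timing distribution (heavy tail, session pauses, burstiness) is much harder than matching one summary statistic.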

Session-level features:

  • Cookie consistency across requests
  • Fingerprint stability over time
  • Session duration and page view count
  • Cross-session fingerprint clustering

The model takes all of these features, weights them, and outputs a bot probability score. If the score exceeds a threshold, you’re blocked. If it’s borderline, you get a CAPTCHA or JavaScript challenge.
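The decision layer described above can be sketched as a weighted combination squashed into a probability, then mapped through two thresholds. Everything here — the weights, the feature names, the bias, the thresholds — is invented for illustration; real systems use learned ensembles, not three hand-picked weights:

```python
# Hedged sketch of the decision layer: weighted features -> bot
# probability -> allow / challenge / block. All weights, feature names,
# and thresholds are invented for illustration.
import math

def bot_probability(features, weights, bias=0.0):
    """Logistic combination of weighted feature values."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1 / (1 + math.exp(-z))

def decide(p, block_threshold=0.9, challenge_threshold=0.5):
    if p >= block_threshold:
        return "block"
    if p >= challenge_threshold:
        return "challenge"  # CAPTCHA or JavaScript challenge
    return "allow"

weights = {"tls_ua_mismatch": 4.0, "ip_reputation": 1.5, "timing_regularity": 2.0}
suspicious = {"tls_ua_mismatch": 1.0, "ip_reputation": 0.2, "timing_regularity": 0.9}

print(decide(bot_probability(suspicious, weights, bias=-3.0)))  # block
```

Note what the sketch makes obvious: fixing a single feature (say, the IP reputation term) barely moves the sum when other high-weight features still scream bot.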

Model architectures behind the big three

DataDome uses real-time inference with sub-2ms latency. Their system evaluates the first request with network-level features (TLS, headers, IP) and progressively enriches the classification as JavaScript signals arrive. They’ve published research showing ensemble models that combine gradient-boosted trees for structured features with neural networks for behavioral sequence data.

Akamai Bot Manager leverages their massive edge network. Their models run at CDN edge nodes, meaning classification happens before the request even reaches the origin server. Akamai’s approach emphasizes device fingerprinting and what they call “behavioral biometrics” — continuous authentication based on how users interact with pages. Their sensor script (the infamous _abck cookie generator) collects over 150 signals per page load.

PerimeterX (HUMAN Security) focuses heavily on behavioral analysis. Their models track mouse movements, keystrokes, and touch events to build behavioral profiles. They’re particularly good at detecting automation that tries to simulate human behavior — their models can distinguish between real mouse movements and programmatically generated ones based on mathematical properties like jerk (the third derivative of position).
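The jerk check is simple to reproduce with finite differences: difference the sampled cursor positions three times. A linearly interpolated move (the classic Selenium-style ActionChains path) has exactly zero jerk; a human trace, with its micro-corrections, does not. This is a sketch of the mathematical idea, not PerimeterX's actual model:

```python
# Sketch of a jerk-based check: finite-difference sampled cursor
# positions three times. With uniform sampling, the third difference is
# proportional to jerk (the third derivative of position).
import random

def third_diff(xs):
    d = xs
    for _ in range(3):
        d = [b - a for a, b in zip(d, d[1:])]
    return d

def mean_abs_jerk(xs):
    j = third_diff(xs)
    return sum(abs(v) for v in j) / len(j)

random.seed(1)
scripted = [i * 2.0 for i in range(100)]                       # perfectly linear path
human    = [i * 2.0 + random.gauss(0, 0.8) for i in range(100)]  # noisy, curved path

print(mean_abs_jerk(scripted))        # 0.0
print(mean_abs_jerk(human) > 1.0)     # True
```

Naively adding Gaussian noise to a scripted path, as above, defeats *this* statistic — but it produces jerk with the wrong distribution (white noise rather than smooth corrective motion), which is exactly the kind of higher-order mismatch a trained model picks up.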

Why rule-based evasion fails against ML

Here’s the fundamental problem: rule-based evasion tries to fix individual signals. ML-based detection evaluates the coherence of all signals together.

Spoofing headers doesn’t work

You can set a perfect Chrome User-Agent. But if your TLS fingerprint says “Python requests library” while your User-Agent says “Chrome 120,” that inconsistency alone flags you. An ML model doesn’t need a rule for this — it learns from the training data that real Chrome browsers have consistent TLS and User-Agent combinations.
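The coherence argument above reduces to a co-occurrence check. The hashes and mappings below are hypothetical placeholders — the point is that detection doesn’t need a rule per bot, only the learned observation that each real browser’s TLS fingerprint co-occurs with specific User-Agents:

```python
# Toy coherence check. The JA3 hashes here are hypothetical placeholders,
# standing in for fingerprints observed in training data.
FINGERPRINT_TO_UA = {
    "aaaa1111": ["Chrome/120"],       # hypothetical Chrome 120 fingerprint
    "bbbb2222": ["python-requests"],  # hypothetical requests-library fingerprint
}

def coherent(ja3, user_agent):
    """True if this User-Agent has been observed with this TLS fingerprint."""
    return any(tag in user_agent for tag in FINGERPRINT_TO_UA.get(ja3, []))

# A requests-library TLS fingerprint claiming to be Chrome is incoherent:
spoofed = coherent("bbbb2222", "Mozilla/5.0 (Windows NT 10.0) Chrome/120 Safari/537.36")
genuine = coherent("aaaa1111", "Mozilla/5.0 (Windows NT 10.0) Chrome/120 Safari/537.36")
print(spoofed, genuine)  # False True
```

A real model learns this joint distribution from billions of examples instead of a lookup table, which is why it also catches pairings no analyst ever wrote a rule for.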

Stealth plugins don’t work

puppeteer-extra-plugin-stealth overrides navigator.webdriver, patches chrome.runtime, and hides other automation markers. DataDome’s ML model has seen millions of requests from stealth-patched Puppeteer. The patches themselves create a detectable pattern — a specific combination of present and absent browser APIs that doesn’t match any real browser distribution.

It’s an arms race you can’t win. Every time the stealth community patches one detection vector, anti-bot companies add it to their training data. The ML model updates. Your “undetectable” setup becomes detectable within weeks.

Residential proxies don’t work

Bright Data has 72 million residential IPs. Impressive number. Completely irrelevant against ML detection.

The ML model might use IP reputation as one feature among hundreds. But when the bulk of the signal comes from browser fingerprinting, behavioral analysis, and session correlation, switching from a datacenter IP to a residential IP barely moves the final probability score.

You went from “definitely a bot” to “definitely a bot with a residential IP.” Congratulations.

Even “undetected” tools get detected

undetected-chromedriver, Playwright stealth, Botright — these tools patch known detection vectors. But ML models detect what’s missing just as well as what’s present. A browser that lacks certain minor API quirks, that has suspiciously perfect timing, or that loads resources in a slightly unnatural order gets flagged — not by a rule, but by statistical deviation from the model’s learned distribution of real browser behavior.

How we approach ML-based detection differently

Every scraping service out there is playing the same losing game: try to look like a real browser, get detected, patch one more thing, get detected again. It’s a hamster wheel.

We don’t play that game.

We use real browsers. Not headless browsers pretending to be real. Not patched Chromium builds. Actual Chrome browsers running with real GPU rendering, real system fonts, real screen dimensions, and authentic browser APIs. When DataDome’s ML model evaluates our fingerprint, it sees a genuine Chrome session — because it is one.

We reverse-engineer each site’s specific ML configuration. Not all DataDome deployments are the same. Each site configures detection sensitivity, challenge thresholds, and blocking rules differently. We analyze the specific anti-bot configuration for each target site and build a bypass strategy tailored to that exact setup.

We generate authentic behavioral signals. Our sessions exhibit natural browsing patterns — realistic mouse movements with proper acceleration curves, natural scroll behavior, human-like timing between actions. Not simulated. Authentic.

We maintain fingerprint consistency. ML models track fingerprint stability across sessions. Our session management ensures consistent, coherent fingerprints that match real user patterns — no cross-session leakage, no impossible fingerprint transitions.
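The “impossible fingerprint transition” idea above is worth making concrete. Field names here are invented for illustration: the check is simply that attributes which never change for a real user mid-session (canvas hash, timezone, platform) must not change across a session’s requests:

```python
# Sketch (invented field names) of a cross-request consistency check:
# fingerprint fields that are stable for real users must not change
# mid-session. Fields that do change get flagged.
def impossible_transitions(session_fingerprints,
                           stable_fields=("canvas", "timezone", "platform")):
    first = session_fingerprints[0]
    changed = set()
    for fp in session_fingerprints[1:]:
        for field in stable_fields:
            if fp.get(field) != first.get(field):
                changed.add(field)
    return changed

session = [
    {"canvas": "hash_a", "timezone": "UTC-5", "platform": "Win32"},
    {"canvas": "hash_a", "timezone": "UTC-5", "platform": "Win32"},
    {"canvas": "hash_b", "timezone": "UTC+2", "platform": "Win32"},  # rotated mid-session
]
print(sorted(impossible_transitions(session)))  # ['canvas', 'timezone']
```

This is exactly what naive fingerprint rotation trips over: rotating per-request gives you a session no real user could produce, which is a stronger bot signal than the fingerprint you started with.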

The result: 99%+ success rates on DataDome, Akamai Bot Manager, and PerimeterX-protected sites. Not because we trick the ML model. Because we present genuine signals that the model correctly classifies as human-like.

The arms race is over — if you’re on the right side

ML-based bot detection will only get better. More training data, more signals, faster inference. If your approach is “spoof more things,” you’ve already lost. The only sustainable approach is authenticity.

Stop wasting money on Bright Data proxy rotation. Stop patching Puppeteer stealth plugins. Stop pretending that the next workaround will be the one that sticks.

Try our playground with any DataDome, Akamai, or PerimeterX-protected URL. See what happens when you stop trying to fool ML models and start presenting real browser sessions instead.