Web Scraping at Scale: Patterns for Crawlers That Survive


Engineering patterns for production web scrapers: decoupled pipelines, bounded concurrency, polite rate limiting, retries with backoff, anti-bot handling, resilient parsing, and idempotent storage.

Introduction

Writing a scraper that pulls data from one page is a fifteen-minute job: fetch, parse, done. Writing one that reliably pulls data from a hundred thousand pages, every day, without getting blocked, melting the target server, or silently corrupting your dataset — that's a different discipline entirely.

The gap between those two is where most scraping projects fall apart. The naive version works on your laptop against ten URLs and then falls over the moment you point it at real volume: timeouts pile up, the target starts returning 429s, your parser breaks on one weird page and takes the whole run down, and three days later you discover half your records are duplicates.

I've built a lot of crawlers — for dictionaries, pricing data, audio resources, business registries — and the same handful of patterns separate the ones that survive from the ones that don't. This post is about those patterns: architecture, concurrency, politeness, retries, anti-bot reality, resilient parsing, and the observability that keeps it all honest. The examples are in TypeScript/Node, but the ideas apply to any stack.

What "At Scale" Actually Means

"Scale" here isn't only about volume. A scraper is operating at scale when it has to deal with:

Volume — thousands to millions of pages, more than fits in one in-memory loop.
Time — runs that take hours or days, where a crash at hour 6 must not mean starting over.
Unreliability — the target will return errors, change its HTML, rate-limit you, and occasionally serve garbage.
Repetition — you re-crawl regularly, so yesterday's work should make today's cheaper.

Design for those four realities up front and the rest follows.

Architecture: Decouple Fetching from Parsing

The single most important structural decision: separate the act of fetching a page from the act of parsing it. Don't write one function that downloads HTML and immediately extracts fields. Split the pipeline into stages connected by a queue:

URL frontier  →  Fetcher  →  raw HTML store  →  Parser  →  structured data
   (queue)       (network)      (cheap)        (CPU)        (database)

Why this matters:

Fetching is I/O-bound and flaky; parsing is CPU-bound and deterministic. They scale differently and fail differently. Keep them apart.
If your parser has a bug, you don't want to re-download everything. Store the raw HTML, fix the parser, and re-run parsing offline against the saved pages. This alone will save you days.
Each stage can be retried independently. A fetch failure and a parse failure are different problems with different fixes.

For modest projects the "queue" can be a database table with a status column (pending / fetched / parsed / failed). For bigger ones, a real queue (Redis/BullMQ, SQS) gives you retries, concurrency control, and back-pressure with far less custom code.
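To make the status-column idea concrete, here is a minimal in-memory sketch. The names (Frontier, claimNext, markFetchFailed) and the extra "fetching" state are my own assumptions, not part of any library; in a real project the Map would be a database table and claimNext would be an atomic UPDATE so two workers can't grab the same row.

```typescript
// Hypothetical in-memory stand-in for a table with a status column.
type Status = 'pending' | 'fetching' | 'fetched' | 'parsed' | 'failed'

interface FrontierRow {
  url: string
  status: Status
  attempts: number
}

class Frontier {
  private rows = new Map<string, FrontierRow>()
  private maxAttempts: number

  constructor(maxAttempts = 4) {
    this.maxAttempts = maxAttempts
  }

  // Adding a URL twice is a no-op, so re-discovered links don't duplicate work.
  add(url: string): void {
    if (!this.rows.has(url)) {
      this.rows.set(url, { url, status: 'pending', attempts: 0 })
    }
  }

  // Hand out the next pending URL and mark it in-flight.
  claimNext(): string | undefined {
    for (const row of this.rows.values()) {
      if (row.status === 'pending') {
        row.status = 'fetching'
        row.attempts++
        return row.url
      }
    }
    return undefined
  }

  markFetched(url: string): void {
    this.setStatus(url, 'fetched')
  }

  markParsed(url: string): void {
    this.setStatus(url, 'parsed')
  }

  // A failed fetch goes back to pending until its attempts run out.
  markFetchFailed(url: string): void {
    const row = this.rows.get(url)
    if (!row) return
    row.status = row.attempts >= this.maxAttempts ? 'failed' : 'pending'
  }

  statusOf(url: string): Status | undefined {
    return this.rows.get(url)?.status
  }

  private setStatus(url: string, status: Status): void {
    const row = this.rows.get(url)
    if (row) row.status = status
  }
}
```

Because the parser only ever looks for rows in the fetched state, a parser bug can be fixed and the parse stage re-run without touching the network.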

Concurrency Without Melting the Target — or Yourself

The naive await Promise.all(urls.map(fetchPage)) fires every request at once. With 10 URLs it's fine; with 10,000 it opens 10,000 sockets, exhausts memory, and hammers the target into rate-limiting you. You need a bounded concurrency limit:

import pLimit from 'p-limit'

const limit = pLimit(5) // at most 5 in-flight requests

// allSettled so one rejected fetch doesn't discard the rest of the batch
const results = await Promise.allSettled(
  urls.map(url => limit(() => fetchPage(url)))
)

Five to ten concurrent requests against a single domain is a sane starting point — high enough to be fast, low enough to be polite. The right number depends on the target; tune it down the moment you see 429s. If you crawl many domains, limit per-domain, not globally, so one slow site doesn't starve the others.
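Per-domain limiting can be sketched as one limiter per host, kept in a Map. To keep this example dependency-free I've hand-rolled a tiny limiter in place of p-limit; createLimiter and limitForHost are hypothetical names, and the default of 5 per host is only the starting point suggested above.

```typescript
type Limiter = <T>(task: () => Promise<T>) => Promise<T>

// Minimal stand-in for p-limit: at most `max` tasks run at once.
function createLimiter(max: number): Limiter {
  let active = 0
  const waiting: Array<() => void> = []

  const release = () => {
    active--
    const next = waiting.shift()
    if (next) next()
  }

  return function limit<T>(task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const run = () => {
        active++
        // Wrapping in a Promise also catches a task that throws synchronously.
        new Promise<T>(res => res(task())).then(resolve, reject).finally(release)
      }
      if (active < max) run()
      else waiting.push(run)
    })
  }
}

const limiters = new Map<string, Limiter>()

// One limiter per host, created on first use.
function limitForHost(url: string, perHost = 5): Limiter {
  const host = new URL(url).host
  let limiter = limiters.get(host)
  if (!limiter) {
    limiter = createLimiter(perHost)
    limiters.set(host, limiter)
  }
  return limiter
}
```

Usage would look like limitForHost(url)(() => fetchPage(url)): a slow host fills only its own slots, and requests to other hosts keep flowing.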

Rate Limiting and Politeness

Concurrency caps how many requests run at once; rate limiting caps how many run per unit time. You want both. A target that tolerates 5 concurrent requests may still ban you if you send 5,000 per minute.

Add a small delay between requests to the same host, and jitter it so your traffic doesn't look like a metronome:

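const sleep = (ms: number) => new Promise(r => setTimeout(r, ms))

// DEFAULT_HEADERS is your browser-like header set (User-Agent, Accept, ...)
async function politeFetch(url: string) {
  const res = await fetch(url, { headers: DEFAULT_HEADERS })
  // 500ms–1500ms gap, randomised; the pause holds this worker's slot,
  // so its next request starts only after the gap
  await sleep(500 + Math.random() * 1000)
  return res
}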

Also: honour robots.txt, and Crawl-delay where a site sets one. Crawl-delay is a non-standard extension that not every crawler obeys, but following it costs little, keeps you on the right side of the site's stated rules, and is the baseline of good-citizen scraping. Cache robots.txt per host so you're not re-fetching it constantly.
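One caveat with a sleep inside each worker: with several workers running, the host can still see small bursts. A shared per-host pacer is one way to space request starts. This is a sketch under my own assumptions: HostPacer and reserve are hypothetical names, and the caller is expected to pass the larger of its default gap and any Crawl-delay it found.

```typescript
class HostPacer {
  // Earliest time (ms since epoch) at which the next request to each host may start.
  private nextSlot = new Map<string, number>()

  // Reserves the next slot for this URL's host and returns how long to sleep first.
  reserve(url: string, minGapMs: number, now: number = Date.now()): number {
    const host = new URL(url).host
    const slot = Math.max(now, this.nextSlot.get(host) ?? 0)
    this.nextSlot.set(host, slot + minGapMs)
    return slot - now
  }
}
```

A worker would sleep for the returned wait, plus a little random jitter, before calling fetch. Because the slot is reserved synchronously, concurrent workers queue up behind one another instead of firing together.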

Retries, Backoff, and Transient Failures

At scale, transient failures are not edge cases — they are the normal case. Network blips, 503s, and 429s happen constantly. The fix is retry with exponential backoff and jitter, but only for failures worth retrying:

async function fetchWithRetry(url: string, maxRetries = 4): Promise<Response> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const res = await fetch(url, { headers: DEFAULT_HEADERS })

      // Only 429 and 5xx are worth retrying; everything else goes back to the caller
      if (res.status !== 429 && res.status < 500) return res
      if (attempt === maxRetries) break // out of retries: don't sleep again

      await res.body?.cancel() // discard the unread body before waiting

      // Respect the server's own backoff signal (Retry-After in seconds)
      const retryAfter = Number(res.headers.get('retry-after')) || 0
      const backoff = retryAfter * 1000 || 2 ** attempt * 1000
      await sleep(backoff + Math.random() * 500)
    } catch (err) {
      if (attempt === maxRetries) throw err
      await sleep(2 ** attempt * 1000 + Math.random() * 500)
    }
  }
  throw new Error(`Failed after ${maxRetries} retries: ${url}`)
}

Two rules that matter:

Don't retry 4xx (except 429). A 404 or 403 won't fix itself; retrying just wastes requests and looks more abusive.
Always cap retries and record permanent failures. A URL that exhausts its retries goes into a failed bucket for later inspection; it does not silently vanish.
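The retry decision and the delay calculation are easier to test when pulled out into pure functions. The sketch below also covers a detail the loop above skips: Retry-After may be an HTTP date instead of a number of seconds. The function names and the 60-second cap on exponential backoff are my assumptions; jitter is left out here so the output is deterministic, and you would add it at the call site.

```typescript
// Only rate limits and server errors are worth another attempt.
function isRetryable(status: number): boolean {
  return status === 429 || status >= 500
}

// Retry-After is either a number of seconds or an HTTP date.
function retryAfterMs(header: string | null, now: number = Date.now()): number | undefined {
  const value = header?.trim()
  if (!value) return undefined
  const seconds = Number(value)
  if (Number.isFinite(seconds)) return Math.max(0, seconds) * 1000
  const date = Date.parse(value)
  return Number.isNaN(date) ? undefined : Math.max(0, date - now)
}

// Prefer the server's hint; otherwise fall back to capped exponential backoff.
function backoffMs(
  attempt: number,
  retryAfter: string | null,
  capMs = 60000,
  now: number = Date.now()
): number {
  const hinted = retryAfterMs(retryAfter, now)
  if (hinted !== undefined) return hinted
  return Math.min(2 ** attempt * 1000, capMs)
}
```

The server's hint is deliberately not capped: if a site asks for a ten-minute pause, retrying sooner is exactly the behaviour that gets a crawler banned. A long hint is a good reason to park the URL and move on to another host.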

Getting Blocked: The Anti-Bot Reality

Eventually a target will try to stop you. The escalation ladder, cheapest to most expensive:

Send realistic headers. A missing or python-requests-style User-Agent is the easiest tell. Send a real browser UA and the usual Accept and Accept-Language headers.
Reuse sessions/cookies. Many sites set a cookie on first visit and expect it back. A persistent cookie jar makes you look like a returning browser, not a fresh bot each time.
Rotate IPs with a proxy pool when a single IP gets rate-limited or banned. Rotate per-request or per-session depending on how the target tracks you.
Use a headless browser (Playwright/Puppeteer) only when you must — i.e. the data is rendered client-side by JavaScript, or there's a challenge that requires a real browser engine. It's 10–50× more expensive in CPU and memory than a plain HTTP fetch, so reach for it last, not first.

A practical rule: always try the cheap HTTP request first. Open the network tab and look — a surprising amount of "JavaScript-rendered" data is actually served by a clean JSON API the page calls. Hitting that endpoint directly is faster, more stable, and far easier to parse than scraping rendered DOM.
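The first two rungs of that ladder fit in a few lines. This is a deliberately simplified sketch: the header values are only examples, CookieJar is a hypothetical class, and it ignores cookie attributes such as Domain, Path, and Expires. For anything serious, a dedicated cookie library (tough-cookie, for example) is the safer choice.

```typescript
// Example values only; copy the headers your own browser sends.
const DEFAULT_HEADERS: Record<string, string> = {
  'User-Agent':
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en-GB,en;q=0.9',
}

class CookieJar {
  private byHost = new Map<string, Map<string, string>>()

  // Keep the name=value part of each Set-Cookie header; attributes are ignored.
  store(url: string, setCookieHeaders: string[]): void {
    const host = new URL(url).host
    const cookies = this.byHost.get(host) ?? new Map<string, string>()
    for (const header of setCookieHeaders) {
      const pair = header.split(';')[0] ?? ''
      const eq = pair.indexOf('=')
      if (eq > 0) cookies.set(pair.slice(0, eq).trim(), pair.slice(eq + 1).trim())
    }
    this.byHost.set(host, cookies)
  }

  // Default headers plus a Cookie header for anything this host has set.
  headersFor(url: string): Record<string, string> {
    const cookies = this.byHost.get(new URL(url).host)
    if (!cookies || cookies.size === 0) return { ...DEFAULT_HEADERS }
    const cookie = Array.from(cookies, ([name, value]) => `${name}=${value}`).join('; ')
    return { ...DEFAULT_HEADERS, Cookie: cookie }
  }
}
```

After each response you would call something like jar.store(url, res.headers.getSetCookie()), assuming a runtime recent enough to have Headers.getSetCookie, and pass jar.headersFor(url) to the next fetch.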

Parsing That Survives HTML Changes

HTML is the most fragile part of any scraper because you don't control it. The site redesigns, a class name changes, and your brittle selector silently returns null for ten thousand records. Defensive parsing:

Prefer stable anchors — id attributes, data-* attributes, semantic tags, microdata/JSON-LD — over deep CSS chains like div > div:nth-child(3) > span.
Validate every extraction. If a field you expect on every page comes back empty, that's a signal the page structure changed, not a value to store. Treat it as a parse failure.
Fail loudly per record, not per run. Wrap each page's parse in a try/catch so one malformed page is logged and skipped — it must not crash the entire job.

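// Assumption: parse() comes from node-html-parser; logger is any structured logger
import { parse } from 'node-html-parser'

function parseProduct(html: string, url: string): Product | null {
  try {
    const doc = parse(html)
    const name = doc.querySelector('[itemprop="name"]')?.text?.trim()
    const rawPrice = doc.querySelector('[itemprop="price"]')?.attributes['content']
    const price = Number(rawPrice)
    // A missing or non-numeric field means the layout changed; don't store it
    if (!name || !rawPrice || !Number.isFinite(price)) {
      logger.warn({ url }, 'missing required fields; layout may have changed')
      return null
    }
    return { name, price, url }
  } catch (err) {
    logger.error({ url, err }, 'parse failed')
    return null
  }
}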
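JSON-LD is often the most stable anchor of all, because it is written for machines and tends to survive redesigns. Here is a rough sketch of pulling it out of a page without a DOM library. The function names are hypothetical, the regex assumes a conventional script tag, and nested @graph structures are not handled.

```typescript
interface JsonLdNode {
  '@type'?: string | string[]
  [key: string]: unknown
}

// Collect every object found in <script type="application/ld+json"> blocks.
function extractJsonLd(html: string): JsonLdNode[] {
  const pattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi
  const nodes: JsonLdNode[] = []
  for (const match of html.matchAll(pattern)) {
    try {
      const data: unknown = JSON.parse(match[1] ?? '')
      const items = Array.isArray(data) ? data : [data]
      for (const item of items) {
        if (item && typeof item === 'object') nodes.push(item as JsonLdNode)
      }
    } catch {
      // Malformed JSON-LD happens; skip the block instead of failing the page.
    }
  }
  return nodes
}

// Find the first node whose @type matches, e.g. "Product".
function findByType(nodes: JsonLdNode[], type: string): JsonLdNode | undefined {
  return nodes.find(node => {
    const t = node['@type']
    return Array.isArray(t) ? t.includes(type) : t === type
  })
}
```

The same validation rule applies here: if findByType returns undefined on a page that should have a product, treat it as a parse failure and log it.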

Deduplication and Idempotency

Re-crawls and overlapping URL paths mean you will see the same item twice. If your pipeline isn't idempotent, you get duplicate rows and inflated counts. Make every record carry a stable, deterministic key — a canonical URL, a product ID, or a content hash — and upsert on it:

await db.product.upsert({
  where: { sourceId: product.sourceId },
  update: product,
  create: product,
})

Now running the scraper twice produces the same dataset as running it once. That property — re-running is safe — is what lets you crash and resume without fear.
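Two common ways to derive that stable key are a canonicalised URL and a content hash. The sketch below shows both; the list of tracking parameters is my assumption and far from complete, and whether query order or a given parameter is safe to normalise depends on the site, so check before applying it blindly.

```typescript
import { createHash } from 'node:crypto'

// Assumed list of parameters that never change the page content.
const TRACKING_PARAMS = new Set([
  'utm_source', 'utm_medium', 'utm_campaign', 'utm_term', 'utm_content',
  'fbclid', 'gclid',
])

// Same page, same string: drop the fragment and tracking params, sort the rest.
function canonicalUrl(raw: string): string {
  const url = new URL(raw)
  url.hash = ''
  for (const key of Array.from(url.searchParams.keys())) {
    if (TRACKING_PARAMS.has(key)) url.searchParams.delete(key)
  }
  url.searchParams.sort()
  return url.toString()
}

// Hash the fields in sorted key order so property order doesn't matter.
function contentHash(fields: Record<string, unknown>): string {
  const stable = JSON.stringify(
    Object.keys(fields).sort().map(key => [key, fields[key]])
  )
  return createHash('sha256').update(stable).digest('hex')
}
```

A canonical URL works well as the key when each item has its own page. A content hash is more useful for detecting whether an already-known record actually changed between crawls.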

Storage and Incremental Crawls

The first full crawl is the expensive one. Every subsequent run should do less work, not the same work:

Persist crawl state (which URLs are done) so a crash at hour 6 resumes at hour 6, not hour 0.
Use conditional requests: store the ETag and Last-Modified values from each response, send them back as If-None-Match / If-Modified-Since on the next crawl, and skip the parse when the server answers 304 Not Modified.
Track a "last seen" timestamp per record so you can prioritise stale data and detect items that disappeared from the source.
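A conditional request needs the validators from the previous response, so they have to be stored next to the page. Here is a minimal sketch, assuming a hypothetical CachedPage record in your crawl state and a runtime with a global fetch; it leaves out retries and error statuses to stay short.

```typescript
// Hypothetical shape of what you keep per URL between crawls.
interface CachedPage {
  etag?: string
  lastModified?: string
}

type FetchResult =
  | { changed: false }
  | { changed: true; body: string; validators: CachedPage }

// Turn stored validators into request headers.
function conditionalHeaders(cached: CachedPage | undefined): Record<string, string> {
  const headers: Record<string, string> = {}
  if (cached?.etag) headers['If-None-Match'] = cached.etag
  if (cached?.lastModified) headers['If-Modified-Since'] = cached.lastModified
  return headers
}

// Pull the validators out of a response so they can be saved for next time.
function validatorsFrom(res: { headers: { get(name: string): string | null } }): CachedPage {
  return {
    etag: res.headers.get('etag') ?? undefined,
    lastModified: res.headers.get('last-modified') ?? undefined,
  }
}

async function fetchIfChanged(url: string, cached?: CachedPage): Promise<FetchResult> {
  const res = await fetch(url, { headers: conditionalHeaders(cached) })
  // 304 means the stored copy is still current: nothing to download or parse.
  if (res.status === 304) return { changed: false }
  return { changed: true, body: await res.text(), validators: validatorsFrom(res) }
}
```

Not every server sends validators or honours them, so treat the 304 path as an optimisation and keep the normal fetch-and-parse path as the default.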

Observability: You Can't Fix What You Can't See

A scraper is a long-running batch job touching a system you don't control, which means it will drift. Without metrics you won't notice until your data is already wrong. Track, at minimum:

Success / failure / retry counts per run and per domain.
HTTP status distribution — a sudden spike in 403s or 429s means you've been detected; a spike in empty parses means the layout changed.
Throughput and queue depth — is the frontier growing faster than you can drain it?

A simple structured log line per page plus an end-of-run summary catches the majority of problems early: "this run, 2% of pages returned 403 where yesterday it was 0%" is the kind of signal that saves a dataset.
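The end-of-run summary doesn't need a metrics stack to be useful. As a rough sketch, with hypothetical names (RunStats, drifted) and an arbitrary one-percentage-point threshold that you would tune per target:

```typescript
interface RunSummary {
  total: number
  byStatus: Record<string, number>
  emptyParses: number
}

class RunStats {
  private byStatus = new Map<number, number>()
  private emptyParses = 0
  private total = 0

  recordFetch(status: number): void {
    this.total++
    this.byStatus.set(status, (this.byStatus.get(status) ?? 0) + 1)
  }

  recordEmptyParse(): void {
    this.emptyParses++
  }

  // Share of fetches that returned this status, between 0 and 1.
  rate(status: number): number {
    return this.total === 0 ? 0 : (this.byStatus.get(status) ?? 0) / this.total
  }

  summary(): RunSummary {
    const byStatus: Record<string, number> = {}
    for (const [status, count] of this.byStatus) byStatus[String(status)] = count
    return { total: this.total, byStatus, emptyParses: this.emptyParses }
  }
}

// True when today's rate exceeds yesterday's by more than the threshold.
function drifted(today: number, yesterday: number, threshold = 0.01): boolean {
  return today - yesterday > threshold
}
```

Persist each run's summary, and a check like drifted(stats.rate(403), yesterdayRate403) becomes a one-line alert condition.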

Respect: Legal and Ethical Boundaries

Scale amplifies impact, so the responsibilities scale too:

Read the Terms of Service and respect robots.txt. Some data is explicitly off-limits.
Don't degrade the target. Your crawl should be invisible in their load graphs. If you can feel your scraper in their response times, you're going too hard — back off.
Be careful with personal data. Just because something is publicly visible doesn't mean aggregating and storing it is lawful in your jurisdiction.
Prefer official APIs when they exist. They're more stable, faster, and unambiguous about what you're allowed to do.

Good scraping is quiet, polite, and leaves no mark.

Conclusion

Scraping at scale is less about clever extraction tricks and more about engineering for failure. The page you're parsing will change, the network will flake, the target will try to block you, and your own job will crash halfway through — and a well-built scraper shrugs all of that off.

The patterns that get you there are consistent: decouple fetching from parsing, bound your concurrency, be polite, retry transient failures with backoff, parse defensively, make every write idempotent, and watch the whole thing with real metrics. Get those right and the difference between scraping ten pages and ten million stops being a matter of luck — and becomes just a matter of time.