Puppeteer and Playwright are the two most popular browser automation libraries. If you've written a web scraper in the last five years, you've probably used one of them. They're excellent tools — well-documented, battle-tested, and supported by Google and Microsoft respectively.
But they were designed for browser testing, not web scraping. When you use them for scraping, you end up building a lot of infrastructure yourself: data storage, deduplication, scheduling, anti-bot evasion, session management, error recovery, and more.
Crawlstack was designed specifically for web scraping. It runs your extraction scripts inside a real browser context and bundles the infrastructure that Puppeteer and Playwright leave to you. Here's how they compare.
Puppeteer is a Node.js library that controls Chrome/Chromium via the Chrome DevTools Protocol (CDP). You launch a browser process, open pages, and interact with them programmatically from your Node script. The script runs outside the browser — every DOM interaction is a remote procedure call over CDP.
Playwright is similar but broader. Built by the team that originally created Puppeteer at Google (now at Microsoft), it supports Chrome, Firefox, and WebKit. It uses a custom protocol layer on top of CDP/WebKit/Firefox debugging protocols. Same model: your script runs outside, controlling the browser remotely.
Crawlstack flips this model. Your scraping script runs inside the browser tab, in the same JavaScript context as the page. There's no remote protocol — when you call document.querySelector(), you're calling it directly on the live DOM. The browser extension (or Docker container) provides the runtime, scheduling, and data pipeline around your script.
Puppeteer/Playwright: Node.js script ──CDP──> Browser Process ──> Page
Crawlstack: Browser ──> Tab loads page ──> Your script runs in-page

| Feature | Puppeteer | Playwright | Crawlstack |
|---|---|---|---|
| Browser support | Chrome/Chromium | Chrome, Firefox, WebKit | Chrome/Chromium |
| Execution model | Headless (external) | Headless (external) | In-browser (native) |
| Language | Node.js | Node.js, Python, Java, C# | JavaScript |
| Stealth | Low (detectable) | Low (detectable) | High (real browser) |
| Data storage | None | None | SQLite/libSQL |
| Deduplication | None | None | Built-in |
| Scheduling | None | None | Built-in |
| Webhook delivery | None | None | Built-in |
| Debugging | Headful mode + slowMo | Trace viewer, inspector | DevTools + flight recorder |
| Distributed | Manual | Manual | Built-in clustering |
| Auth handling | Manual cookies | Manual browser contexts | Native sessions |
| Anti-bot | stealth plugins | stealth plugins | Real browser fingerprint |
| Turnstile solving | Manual / third-party | Manual / third-party | Automatic |
| Setup | npm install + Chrome binary | npm install + browsers | Install extension or Docker |
This is the single biggest reason people move from Puppeteer/Playwright to Crawlstack for scraping.
Headless Chrome is detectable. Even with countermeasures (puppeteer-extra-plugin-stealth, launching Chromium with --disable-blink-features=AutomationControlled), anti-bot systems like Cloudflare, PerimeterX, and DataDome can identify automated browsers through:
- navigator.webdriver is set to true in automated browsers (stealth plugins try to hide this, but it's a cat-and-mouse game)
- window.chrome behaves differently in headless mode

Stealth plugins help, but they're fighting an arms race. Every time they patch a detection vector, bot protection companies find new ones.
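These checks are easy to reproduce. Here's a simplified sketch of the kind of logic a detection script runs in-page (the function name and signal list are illustrative, not any vendor's actual code; the environment is passed as a parameter so the logic can run outside a browser):

```javascript
// Simplified automation check: collects the fingerprint signals that give
// headless browsers away. Real detection scripts probe dozens more.
function detectAutomationSignals(env) {
  const signals = [];
  if (env.webdriver) signals.push('navigator.webdriver is true');
  if (!env.chrome) signals.push('window.chrome is missing');
  if ((env.pluginCount ?? 0) === 0) signals.push('navigator.plugins is empty');
  return signals;
}

// A headless automated browser trips all three signals:
console.log(detectAutomationSignals({ webdriver: true, chrome: undefined, pluginCount: 0 }));
// A real user's browser trips none:
console.log(detectAutomationSignals({ webdriver: false, chrome: {}, pluginCount: 3 }));
```

A real browser with a real user profile simply doesn't produce these signals, which is why running in-browser sidesteps the arms race.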
Crawlstack sidesteps this entirely. In Chrome extension mode, your scraper runs in your real browser — the same browser you use for everything else. Same fingerprint, same behavioral patterns, same everything. There's nothing to detect because there's nothing fake.
In Docker mode, Crawlstack uses Cloakbrowser — a stealth-hardened Chromium that passes fingerprint checks. Combined with the built-in Cloudflare Turnstile solver and human simulation helpers (runner.humanClick() with Bézier mouse curves, runner.humanScrollInView() with realistic scroll physics), it's significantly harder to detect than Puppeteer or Playwright.
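The Bézier-curve idea behind human-like mouse movement is simple to illustrate. This is a toy sketch of curved-path generation, not Crawlstack's actual humanClick() implementation; the control-point offsets are arbitrary values chosen for illustration:

```javascript
// Cubic Bézier interpolation: generates a curved mouse path between two points.
// Randomized control points make every path slightly different, like a real hand.
function bezierPath(start, end, steps = 20) {
  const rand = (a, b) => a + Math.random() * (b - a);
  const c1 = { x: rand(start.x, end.x), y: start.y + rand(-80, 80) };
  const c2 = { x: rand(start.x, end.x), y: end.y + rand(-80, 80) };
  const points = [];
  for (let i = 0; i <= steps; i++) {
    const t = i / steps, u = 1 - t;
    points.push({
      x: u * u * u * start.x + 3 * u * u * t * c1.x + 3 * u * t * t * c2.x + t * t * t * end.x,
      y: u * u * u * start.y + 3 * u * u * t * c1.y + 3 * u * t * t * c2.y + t * t * t * end.y,
    });
  }
  return points; // dispatch a mousemove event per point, with small delays, to simulate travel
}

const path = bezierPath({ x: 0, y: 0 }, { x: 300, y: 120 });
console.log(path.length); // 21 points from start to end
```

A straight, instant jump from point A to point B is a classic bot tell; a curved path sampled over time looks like motor movement.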
This is a common pattern — loading all items from a feed that uses infinite scroll, then extracting the data.
const puppeteer = require('puppeteer');
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com/feed');
let previousHeight;
while (true) {
previousHeight = await page.evaluate('document.body.scrollHeight');
await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
  await new Promise(r => setTimeout(r, 2000)); // waitForTimeout was removed from recent Puppeteer versions
const newHeight = await page.evaluate('document.body.scrollHeight');
if (newHeight === previousHeight) break;
}
const items = await page.evaluate(() =>
[...document.querySelectorAll('.feed-item')].map(el => ({
text: el.innerText,
link: el.querySelector('a')?.href,
}))
);
await browser.close();
// items is a plain array — store it yourself

// Playwright
const { chromium } = require('playwright');
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com/feed');
while (true) {
const prevHeight = await page.evaluate(() => document.body.scrollHeight);
await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
await page.waitForTimeout(2000);
const newHeight = await page.evaluate(() => document.body.scrollHeight);
if (newHeight === prevHeight) break;
}
const items = await page.evaluate(() =>
[...document.querySelectorAll('.feed-item')].map(el => ({
text: el.innerText,
link: el.querySelector('a')?.href,
}))
);
await browser.close();

// Crawlstack
await runner.onLoad();
// Human-like scrolling to load all items
let previousHeight = 0;
while (true) {
const lastItem = document.querySelector('.feed-item:last-child');
if (lastItem) await runner.humanScrollInView(lastItem);
await runner.sleep(1500, 2500);
const newHeight = document.body.scrollHeight;
if (newHeight === previousHeight) break;
previousHeight = newHeight;
}
const items = [...document.querySelectorAll('.feed-item')].map(el => ({
id: el.querySelector('a')?.href || el.innerText.slice(0, 50),
data: {
text: el.innerText,
link: el.querySelector('a')?.href,
}
}));
await runner.publishItems(items);

A few things to notice:
- No launch() or close(). The browser is already running.
- No page.evaluate() wrapper: DOM access is direct. No serialization boundary.
- runner.humanScrollInView() produces realistic scroll behavior. Puppeteer/Playwright's scrollTo is instant and robotic.
- runner.sleep(1500, 2500) picks a random delay in the range, mimicking human behavior.
- runner.publishItems() stores items with deduplication. The Puppeteer/Playwright versions give you a plain array that you need to store yourself.

After extracting data, what happens next? With Puppeteer and Playwright: nothing. They return data to your script, and everything else is your problem.
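Publish-with-deduplication boils down to id-keyed inserts. Here's a toy in-memory sketch of the semantics (Crawlstack persists to SQLite/libSQL; this Map-based version is only an illustration of the behavior):

```javascript
// Minimal id-keyed dedup store, mimicking publish-with-dedup semantics:
// an item is stored only if its id has not been seen before.
class ItemStore {
  constructor() {
    this.items = new Map();
  }
  publish(items) {
    let inserted = 0;
    for (const item of items) {
      if (!this.items.has(item.id)) {
        this.items.set(item.id, item.data);
        inserted++;
      }
    }
    return inserted; // only previously-unseen ids count as new
  }
}

const store = new ItemStore();
console.log(store.publish([{ id: 'a', data: 1 }, { id: 'b', data: 2 }])); // 2 new items
console.log(store.publish([{ id: 'b', data: 2 }, { id: 'c', data: 3 }])); // 1 new item ('b' already stored)
```

This is why the Crawlstack example above assigns each item an id: re-running the scraper on the same feed doesn't produce duplicate rows.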
A production Puppeteer/Playwright scraping setup typically requires data storage, deduplication, scheduling, anti-bot evasion, session management, and error recovery, all built and maintained by you.

That's a lot of infrastructure for what started as "scrape some data."

Crawlstack bundles all of this:
- SQLite/libSQL storage with built-in deduplication
- Webhook delivery to downstream systems
- Scheduling with changefreq and versioning
- Built-in clustering for distributed runs

Both Puppeteer and Playwright excel at multi-step browser interactions — that's what they were built for. Form filling, navigation, waiting for elements, handling dialogs. Playwright especially has excellent auto-wait semantics and locator strategies.
Crawlstack supports similar interactions through its runner API, but the style is different:
// Playwright
await page.goto('https://example.com/search');
await page.fill('#search-input', 'web scraping');
await page.click('#search-button');
await page.waitForSelector('.results');
const results = await page.locator('.result-item').allInnerTexts();

// Crawlstack
await runner.onLoad();
const input = document.querySelector('#search-input');
input.value = 'web scraping';
input.dispatchEvent(new Event('input', { bubbles: true }));
await runner.humanClick(document.querySelector('#search-button'));
await runner.waitFor('.results');
const results = [...document.querySelectorAll('.result-item')]
  .map(el => el.innerText);

Playwright's locator API is arguably more ergonomic for complex interaction chains. Crawlstack's approach is more verbose but gives you full control, and the human simulation makes it much harder to detect.
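A selector wait like the one above reduces to polling. Here's a generic sketch of how such a helper can work (an assumption about runner.waitFor's behavior, not Crawlstack's actual implementation):

```javascript
// Generic polling helper: resolves with check()'s first truthy result,
// rejects if nothing turns up before the timeout.
function waitFor(check, { interval = 100, timeout = 5000 } = {}) {
  return new Promise((resolve, reject) => {
    const start = Date.now();
    const tick = () => {
      const result = check();
      if (result) return resolve(result);
      if (Date.now() - start > timeout) return reject(new Error('waitFor timed out'));
      setTimeout(tick, interval);
    };
    tick();
  });
}

// In-page usage would be: await waitFor(() => document.querySelector('.results'));
// Simulated here with a flag that flips after 250ms:
let ready = false;
setTimeout(() => { ready = true; }, 250);
waitFor(() => ready, { interval: 50 }).then(() => console.log('found'));
```

Because the script runs in-page, the check is a direct DOM read on every tick; there's no round trip to a controlling process.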
Here's the key insight: Puppeteer and Playwright are testing tools that happen to be useful for scraping. Crawlstack is a scraping tool.
If you're building a test suite for your web application, use Playwright. Its auto-wait, locators, trace viewer, and cross-browser support are purpose-built for testing. Crawlstack is not a testing tool.
If you're building a production scraping pipeline, Crawlstack gives you the full infrastructure. Puppeteer and Playwright give you browser control — everything else is DIY.
If you're building a quick one-off script to grab some data, any of the three work. Pick whichever you're most comfortable with.
Puppeteer/Playwright: Scaling means running multiple processes/machines and coordinating them yourself. Libraries like puppeteer-cluster help, but you're still managing browser instances, distributing work, handling failures, and aggregating results across nodes.
Crawlstack: Deploy browser nodes via Docker on multiple machines, connect them to the relay server, and the cluster distributes tasks automatically. Each node runs a real browser with full stealth. The relay server coordinates work distribution, and all nodes feed data into the same pipeline.
Crawlstack is not a replacement for Puppeteer or Playwright in every scenario. Here's where they're still better:
- Browser testing: auto-wait, locators, and trace viewers are purpose-built for it.
- Cross-browser coverage: Playwright drives Firefox and WebKit; Crawlstack is Chromium-only.
- Quick one-off scripts: npm install puppeteer and 10 lines of code is hard to beat.

But for serious scraping — the kind where you need reliability, stealth, scheduling, storage, and scale — Crawlstack provides the platform that Puppeteer and Playwright leave you to build yourself.
Crawlstack is self-hosted scraping infrastructure that runs inside your browser or Docker. Get started for free.