scraper

Web scraping and data extraction with browser automation, JavaScript rendering, and intelligent parsing for structured data collection.

Overview

The scraper primitive extracts data from websites. It combines headless browser automation, CSS/XPath selectors, and automatic pagination, so both static pages and JavaScript-rendered pages can be turned into structured records.

Quick Example

import { scraper } from 'sdk.do'

// Simple scraping
const data = await scraper.scrape({
  url: 'https://example.com/products',
  selectors: {
    title: 'h1.product-title',
    price: '.price',
    description: '.description',
  },
})

// Scrape list of items
const products = await scraper.scrapeList({
  url: 'https://example.com/products',
  itemSelector: '.product',
  fields: {
    name: 'h2',
    price: '.price',
    image: 'img@src',
  },
  pagination: {
    selector: 'a.next-page',
    maxPages: 10,
  },
})

// Browser automation
const result = await scraper.withBrowser(async (page) => {
  await page.goto('https://example.com')
  await page.click('.load-more')
  await page.waitForSelector('.products-loaded')
  return await page.evaluate(() => {
    return Array.from(document.querySelectorAll('.product')).map((el) => el.textContent)
  })
})
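The `img@src` field in the list example appears to combine a selector with an attribute name: match `img`, then read its `src` attribute instead of its text. The sketch below shows one way such a spec could be split. `FieldSpec` and `parseFieldSpec` are hypothetical names used only for illustration, and the rule itself (split on the last `@`) is an assumption, not the SDK's documented behavior.

```typescript
// Hypothetical parsed form of a field spec such as 'img@src'.
interface FieldSpec {
  selector: string
  attribute?: string // undefined means "use the element's text content"
}

// Split on the last '@' so the part after it is read as an attribute name.
// Assumption: '@' does not otherwise appear in the selectors being used.
function parseFieldSpec(spec: string): FieldSpec {
  const at = spec.lastIndexOf('@')
  // No '@', or '@' at either end: treat the whole string as a selector.
  if (at <= 0 || at === spec.length - 1) {
    return { selector: spec.trim() }
  }
  return {
    selector: spec.slice(0, at).trim(),
    attribute: spec.slice(at + 1).trim(),
  }
}
```

Under this reading, `parseFieldSpec('img@src')` yields the selector `img` with attribute `src`, while `parseFieldSpec('.price')` yields only a selector.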

Core Capabilities

Headless Browser - Full Chrome/Firefox automation
CSS/XPath Selectors - Flexible element selection
JavaScript Rendering - Scrape dynamic content
Pagination - Automatic multi-page scraping
Rate Limiting - Respect robots.txt and rate limits
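To make the pagination and rate limiting capabilities concrete, here is a minimal sketch of a "follow the next link" loop with a page cap and an optional delay between requests. It does not use the SDK. `PageResult`, `FetchPage`, and `scrapeAllPages` are hypothetical names, and the page fetcher is injected so the loop can be read and run on its own.

```typescript
// Hypothetical shape of one fetched page: its extracted items and the URL
// behind the "next" link, if the page has one.
interface PageResult<T> {
  items: T[]
  nextUrl?: string
}

type FetchPage<T> = (url: string) => Promise<PageResult<T>>

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Follow "next" links until there are none, maxPages is reached,
// or a URL repeats (which guards against pagination loops).
async function scrapeAllPages<T>(
  startUrl: string,
  fetchPage: FetchPage<T>,
  maxPages = 10,
  delayMs = 0,
): Promise<T[]> {
  const items: T[] = []
  const seen = new Set<string>()
  let url: string | undefined = startUrl

  for (let page = 0; url !== undefined && page < maxPages && !seen.has(url); page++) {
    seen.add(url)
    // Crude rate limiting: wait between consecutive page requests.
    if (page > 0 && delayMs > 0) await sleep(delayMs)
    const result: PageResult<T> = await fetchPage(url)
    items.push(...result.items)
    url = result.nextUrl
  }
  return items
}
```

The `maxPages` default of 10 simply mirrors the value used in the Quick Example. A fixed delay is the simplest form of rate limiting and does not cover robots.txt handling.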

Access Methods

SDK

TypeScript/JavaScript library for scraping

await scraper.scrape({ url: 'https://example.com', selectors: { title: 'h1' } })

→ SDK Documentation

CLI

Command-line tool for web scraping

do scraper scrape https://example.com --selector "title:h1"

→ CLI Documentation
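The `--selector "title:h1"` flag reads as a `name:selector` pair that maps onto the `selectors` object used by the SDK. How the CLI actually parses it is not described here, so the following is only a sketch of one plausible rule. `parseSelectorArgs` is a hypothetical helper. It splits on the first colon, so selectors that contain pseudo-classes are left intact.

```typescript
// Parse 'name:selector' pairs into a selectors object.
// Splitting on the FIRST colon keeps pseudo-classes such as
// 'li:first-child' intact in the selector part.
function parseSelectorArgs(args: string[]): Record<string, string> {
  const selectors: Record<string, string> = {}
  for (const arg of args) {
    const colon = arg.indexOf(':')
    // Reject input with no colon, an empty name, or an empty selector.
    if (colon <= 0 || colon === arg.length - 1) {
      throw new Error(`Expected "name:selector", got "${arg}"`)
    }
    selectors[arg.slice(0, colon).trim()] = arg.slice(colon + 1).trim()
  }
  return selectors
}
```

For example, `parseSelectorArgs(['title:h1', 'first:li:first-child'])` produces `{ title: 'h1', first: 'li:first-child' }`.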

API

REST/RPC endpoints for scraping operations

curl -X POST https://api.do/v1/scraper/scrape -H 'Content-Type: application/json' -d '{"url":"https://example.com"}'

→ API Documentation
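The same request can be assembled in TypeScript. This is a sketch under assumptions: the endpoint is taken from the curl example, but the `selectors` field in the body is assumed to mirror the SDK call, and any authentication the API may require is not shown here. `ScrapeRequestBody` and `buildScrapeRequest` are hypothetical names.

```typescript
interface ScrapeRequestBody {
  url: string
  selectors?: Record<string, string> // assumed field, mirrors the SDK call
}

interface ScrapeRequest {
  endpoint: string
  init: { method: string; headers: Record<string, string>; body: string }
}

// Build the endpoint and request options without sending anything,
// so the payload can be inspected or logged first.
function buildScrapeRequest(body: ScrapeRequestBody): ScrapeRequest {
  return {
    endpoint: 'https://api.do/v1/scraper/scrape',
    init: {
      method: 'POST',
      // JSON bodies need an explicit content type.
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(body),
    },
  }
}
```

The result can be passed to `fetch(request.endpoint, request.init)` in any runtime that provides `fetch`.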

MCP

Model Context Protocol for AI-driven scraping

Scrape https://example.com and extract the title from the h1 element

→ MCP Documentation

Related

browse - Browser automation
fetch - HTTP requests
extract - Data extraction
