scraper

Web scraping and data extraction with browser automation, JavaScript rendering, and intelligent parsing for structured data collection.

Overview

The scraper primitive extracts data from websites using headless browser automation, CSS/XPath selectors, and automatic pagination.

Quick Example

import { scraper } from 'sdk.do'

// Simple scraping
const data = await scraper.scrape({
  url: 'https://example.com/products',
  selectors: {
    title: 'h1.product-title',
    price: '.price',
    description: '.description',
  },
})

// Scrape list of items
const products = await scraper.scrapeList({
  url: 'https://example.com/products',
  itemSelector: '.product',
  fields: {
    name: 'h2',
    price: '.price',
    image: 'img@src',
  },
  pagination: {
    selector: 'a.next-page',
    maxPages: 10,
  },
})

// Browser automation
const result = await scraper.withBrowser(async (page) => {
  await page.goto('https://example.com')
  await page.click('.load-more')
  await page.waitForSelector('.products-loaded')
  return await page.evaluate(() => {
    return Array.from(document.querySelectorAll('.product')).map((el) => el.textContent)
  })
})
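
The `image: 'img@src'` field above appears to pair a CSS selector with an attribute name. The exact rule the SDK applies is not stated here, so the following is only a minimal sketch of one plausible reading: text after the last `@` names an attribute, and a spec without `@` means the element's text. `FieldSpec` and `parseFieldSpec` are hypothetical names, not part of `sdk.do`.

```typescript
// Hypothetical helper, not an sdk.do API: splits a field spec such as
// 'img@src' into a CSS selector and an optional attribute name.
interface FieldSpec {
  selector: string
  attribute: string | null // null: assume the element's text content is wanted
}

function parseFieldSpec(spec: string): FieldSpec {
  const at = spec.lastIndexOf('@')
  // No '@' (or a leading '@') is treated as a plain selector.
  if (at <= 0) return { selector: spec.trim(), attribute: null }
  return {
    selector: spec.slice(0, at).trim(),
    // An empty attribute name ('img@') falls back to null.
    attribute: spec.slice(at + 1).trim() || null,
  }
}
```

Under this assumption, `parseFieldSpec('img@src')` yields the selector `img` with attribute `src`, while `parseFieldSpec('.price')` yields the selector `.price` with no attribute.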

Core Capabilities

  • Headless Browser - Full Chrome/Firefox automation
  • CSS/XPath Selectors - Flexible element selection
  • JavaScript Rendering - Scrape dynamic content
  • Pagination - Automatic multi-page scraping
  • Rate Limiting - Respect robots.txt and rate limits
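
How the scraper checks robots.txt internally is not described here. As an illustration of what respecting robots.txt involves, the sketch below implements a simplified check under stated assumptions: only the `User-agent: *` group is read, rules are plain path prefixes (no `*` or `$` patterns), the longest matching rule wins, and `Allow` wins ties. `isPathAllowed` is a hypothetical name.

```typescript
// Simplified robots.txt check (illustrative only, not the scraper's implementation).
// Reads only groups that include "User-agent: *" and treats rules as path prefixes.
function isPathAllowed(robotsTxt: string, path: string): boolean {
  let groupMatches = false // are we inside a group that applies to '*'?
  let lastWasAgent = false // consecutive User-agent lines share one group
  let best = { length: -1, allow: true } // no matching rule means allowed

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim() // drop comments
    const colon = line.indexOf(':')
    if (colon === -1) continue
    const key = line.slice(0, colon).trim().toLowerCase()
    const value = line.slice(colon + 1).trim()

    if (key === 'user-agent') {
      const isStar = value === '*'
      groupMatches = lastWasAgent ? groupMatches || isStar : isStar
      lastWasAgent = true
      continue
    }
    lastWasAgent = false
    if (!groupMatches || value === '') continue // empty Disallow restricts nothing
    if (key !== 'allow' && key !== 'disallow') continue

    if (path.startsWith(value)) {
      const allow = key === 'allow'
      // Longest matching prefix wins; Allow wins when lengths are equal.
      if (value.length > best.length || (value.length === best.length && allow)) {
        best = { length: value.length, allow }
      }
    }
  }
  return best.allow
}
```

A caller could run this check before each request and skip any URL whose path returns `false`.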

Access Methods

SDK

TypeScript/JavaScript library for scraping

await scraper.scrape({ url: 'https://example.com', selectors: { title: 'h1' } })

SDK Documentation

CLI

Command-line tool for web scraping

do scraper scrape https://example.com --selector "title:h1"
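
The `--selector "title:h1"` flag reads as a `name:selector` pair that corresponds to one entry of the SDK's `selectors` object. The sketch below shows one way such flags could be turned into that object. It assumes the flag may be given more than once and that only the first colon separates the name from the selector, since CSS selectors can themselves contain colons. `parseSelectorFlags` is a hypothetical name, not part of the CLI.

```typescript
// Hypothetical helper: converts "name:selector" strings into a selectors map.
function parseSelectorFlags(flags: string[]): Record<string, string> {
  const selectors: Record<string, string> = {}
  for (const flag of flags) {
    // Split on the FIRST colon only, so 'item:li:nth-child(2)' keeps its pseudo-class.
    const colon = flag.indexOf(':')
    if (colon <= 0 || colon === flag.length - 1) {
      throw new Error(`Expected "name:selector", got "${flag}"`)
    }
    selectors[flag.slice(0, colon).trim()] = flag.slice(colon + 1).trim()
  }
  return selectors
}
```

Under these assumptions, `parseSelectorFlags(['title:h1'])` produces `{ title: 'h1' }`, the same shape passed to `scraper.scrape` in the SDK example.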

CLI Documentation

API

REST/RPC endpoints for scraping operations

API Documentation

MCP

Model Context Protocol for AI-driven scraping

Scrape https://example.com and extract the title from the h1 element

MCP Documentation