extract

Data extraction and transformation

Extract structured data from text, documents, and other sources using pattern matching, schema mapping, and automatic type conversion.

Overview

The extract primitive parses unstructured text, extracts entities from documents, and transforms data between formats with automatic schema inference.

Quick Example

import { extract } from 'sdk.do'

// Extract structured data from text
const data = await extract.fromText({
  text: 'John Smith works at Acme Corp as a Software Engineer',
  schema: {
    name: 'string',
    company: 'string',
    title: 'string',
  },
})

// Extract from documents
const invoices = await extract.fromPDF({
  file: './invoice.pdf',
  type: 'invoice',
  fields: {
    invoiceNumber: 'string',
    amount: 'number',
    date: 'date',
    items: 'array',
  },
})

// Extract with AI
// (`document` stands for the content to analyze, loaded elsewhere)
const entities = await extract.withAI({
  content: document,
  extract: ['people', 'organizations', 'locations', 'dates'],
  model: 'gpt-5',
})
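The schemas above declare each field as 'string', 'number', 'date', or 'array'. As a rough illustration of what automatic type conversion driven by such a schema might involve, here is a minimal local sketch. It does not use sdk.do and is not its implementation: `convertValue`, `applySchema`, the sample invoice values, and the comma-separated treatment of 'array' are all assumptions made for this example.

```typescript
// Hypothetical sketch: none of these names come from sdk.do.
type FieldType = 'string' | 'number' | 'date' | 'array'
type Schema = Record<string, FieldType>
type Converted = string | number | Date | string[]

// Convert one raw string into the declared type; throw if it cannot be parsed.
function convertValue(raw: string, type: FieldType): Converted {
  switch (type) {
    case 'string':
      return raw.trim()
    case 'number': {
      // Drop currency symbols and thousands separators before parsing.
      const cleaned = raw.replace(/[^0-9.\-]/g, '')
      const n = Number(cleaned)
      if (cleaned === '' || isNaN(n)) throw new Error(`Not a number: ${raw}`)
      return n
    }
    case 'date': {
      const d = new Date(raw)
      if (isNaN(d.getTime())) throw new Error(`Not a date: ${raw}`)
      return d
    }
    case 'array':
      // Assumption: arrays arrive as a comma-separated list.
      return raw
        .split(',')
        .map((s) => s.trim())
        .filter((s) => s.length > 0)
  }
}

// Walk the schema and convert every declared field of the raw record.
function applySchema(raw: Record<string, string>, schema: Schema): Record<string, Converted> {
  const out: Record<string, Converted> = {}
  for (const field of Object.keys(schema)) {
    const value = raw[field]
    if (value === undefined) throw new Error(`Missing field: ${field}`)
    out[field] = convertValue(value, schema[field] as FieldType)
  }
  return out
}

// Made-up sample values, shaped like the invoice fields in the example above.
const invoice = applySchema(
  { invoiceNumber: 'INV-001', amount: '$1,250.50', date: '2024-03-15', items: 'Widget, Gadget' },
  { invoiceNumber: 'string', amount: 'number', date: 'date', items: 'array' },
)
```

Failing loudly on a missing or unparseable field is one possible design choice; a real extractor might instead return null or attach a confidence score.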

Core Capabilities

  • Pattern Matching - Regex and semantic pattern extraction
  • Schema Mapping - Transform data between formats
  • Entity Extraction - Identify people, places, organizations
  • Format Conversion - Parse JSON, XML, CSV, PDF, HTML
  • AI-Powered - LLM-based extraction for complex data
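To make the regex side of pattern matching concrete, the following is a minimal sketch using only the built-in `RegExp`. It is not the sdk.do implementation: `extractByPattern` and `employmentPattern` are hypothetical names, and the pattern only covers sentences shaped like the one in the Quick Example.

```typescript
// Hypothetical helper, not part of sdk.do: map numbered capture groups to field names.
function extractByPattern(
  text: string,
  pattern: RegExp,
  fields: string[],
): Record<string, string> | null {
  const match = pattern.exec(text)
  if (match === null) return null
  const result: Record<string, string> = {}
  fields.forEach((field, i) => {
    // Capture groups start at index 1; index 0 is the whole match.
    const value = match[i + 1]
    if (value !== undefined) result[field] = value
  })
  return result
}

// Matches "<name> works at <company> as a/an <title>".
const employmentPattern = /^(.+?) works at (.+?) as an? (.+)$/

const person = extractByPattern(
  'John Smith works at Acme Corp as a Software Engineer',
  employmentPattern,
  ['name', 'company', 'title'],
)
// person → { name: 'John Smith', company: 'Acme Corp', title: 'Software Engineer' }
```

A fixed pattern like this fails as soon as the sentence is phrased differently, which is presumably where the semantic and LLM-based extraction listed above comes in.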

Access Methods

SDK

TypeScript/JavaScript library for data extraction

await extract.fromText({ text: 'John Smith at Acme', schema: { name: 'string', company: 'string' } })

SDK Documentation

CLI

Command-line tool for extraction operations

do extract pdf invoice.pdf --type invoice --output invoice.json

CLI Documentation

API

REST/RPC endpoints for extraction services

curl -X POST https://api.do/v1/extract -H 'Content-Type: application/json' -d '{"text":"John at Acme","schema":{"name":"string"}}'
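The same request could be prepared from TypeScript roughly as follows. This is a sketch under assumptions: the URL and body shape are copied from the curl call above, `buildExtractRequest` is a hypothetical helper, the `Content-Type` header is assumed to be what the endpoint expects for a JSON body, and any authentication the API may require is not shown in the source and is therefore omitted.

```typescript
// Hypothetical helper; endpoint and body shape are taken from the curl example.
interface ExtractRequest {
  url: string
  init: { method: 'POST'; headers: Record<string, string>; body: string }
}

function buildExtractRequest(text: string, schema: Record<string, string>): ExtractRequest {
  return {
    url: 'https://api.do/v1/extract',
    init: {
      method: 'POST',
      // Assumption: the endpoint accepts a JSON request body.
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text, schema }),
    },
  }
}

const request = buildExtractRequest('John at Acme', { name: 'string' })

// To send it in a runtime that provides fetch (for example Node 18+):
// const response = await fetch(request.url, request.init)
```

Building the request separately from sending it keeps the payload easy to inspect and test without network access.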

API Documentation

MCP

Model Context Protocol for AI-driven extraction

Extract name and company from "John Smith works at Acme Corp"

MCP Documentation