Datasets

Comprehensive collection of 115+ datasets across ontologies, infrastructure, government, scientific, business, knowledge graphs, developer, and geographic categories

The .do platform provides access to 115+ curated datasets covering everything from occupational data to government APIs, scientific publications to business processes. All datasets are free, standardized, and updated regularly from authoritative sources.

Dataset Categories

Datasets are organized into 8 major categories: ontologies, infrastructure, government, scientific, business, knowledge graphs, developer, and geographic.

Quick Start

Installation

# Install specific dataset package
pnpm add industries.org.ai

# Or install the complete ontologies collection
pnpm add @dotdo/ontologies

Basic Usage

SDK

import { $ } from 'sdk.do'

// Query occupations
const developers = await $.Occupation.list()
  .where({ name: { contains: 'Software' } })
  .limit(10)

// Get industry details
const software = await $.Industry.get('5112') // Software Publishers

CLI

# List all datasets
do datasets list

# Explore a specific dataset
do datasets explore industries.org.ai

# Search across datasets
do datasets search "software"

MCP (Model Context Protocol)

<!-- Access datasets via MCP in Claude Desktop -->
<use_mcp_tool>
<server_name>datasets</server_name>
<tool_name>search</tool_name>
<arguments>
{
  "query": "software development occupations",
  "datasets": ["occupations.org.ai", "skills.org.ai"]
}
</arguments>
</use_mcp_tool>

Dataset Highlights

  • occupations.org.ai - 923 O*NET occupations with salary, education, and skills data
  • industries.org.ai - 1,170 NAICS industry classifications
  • places.org.ai - 11M+ geographic locations from GeoNames
  • actions.org.ai - 14,116+ Zapier actions from 7,000+ apps
  • services.org.ai - 50,000+ service codes

Recently Added

  • products.org.ai - 6,000+ product categories
  • processes.org.ai - 1,000+ APQC business processes
  • education.org.ai - 6,500+ educational institutions

Coming Soon

  • TLDs, ASNs, IP ranges (Infrastructure)
  • Federal Register, SEC EDGAR (Government)
  • arXiv, PubMed (Scientific)
  • Wikidata, Schema.org (Knowledge Graphs)
  • NPM, GitHub, PyPI (Developer)

Data Quality

All datasets in the platform adhere to strict quality standards:

  • Authoritative Sources - Data from official sources (government agencies, standards bodies, trusted organizations)
  • Regular Updates - Automated updates from source systems (monthly, quarterly, or continuous)
  • Type-Safe - Complete TypeScript definitions for all datasets
  • Versioned - Full git history tracking all changes
  • Documented - Comprehensive documentation with schemas and examples
  • Free & Open - Public domain or open-source licenses
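
To illustrate what "Type-Safe" can mean in practice, here is a minimal sketch of a typed record with a runtime guard. The `OccupationRecord` shape and its field names (`code`, `title`, `medianWage`) are assumptions made for illustration; the published type definitions may look different.

```typescript
// Hypothetical record shape; the real package types may differ.
interface OccupationRecord {
  code: string // e.g. an O*NET-SOC code
  title: string
  medianWage?: number // optional: assumed not to be present on every record
}

// Runtime guard for data that arrives untyped (for example, parsed JSON).
function isOccupationRecord(value: unknown): value is OccupationRecord {
  if (typeof value !== 'object' || value === null) return false
  const v = value as { [key: string]: unknown }
  return (
    typeof v['code'] === 'string' &&
    typeof v['title'] === 'string' &&
    (v['medianWage'] === undefined || typeof v['medianWage'] === 'number')
  )
}

// Sample value for illustration only.
const parsedRecord: unknown = JSON.parse('{"code":"15-1252.00","title":"Software Developers"}')

if (isOccupationRecord(parsedRecord)) {
  // Inside this branch the compiler knows the record's fields.
  console.log(parsedRecord.title)
}
```

A guard like this is useful at the boundary where untyped data enters the program; past that point the compiler can check field access.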

Common Use Cases

Job Matching & Recruitment

Match candidates to jobs based on skills, experience, and education:

// Find jobs requiring JavaScript
const jobs = await $.Occupation.list()
  .related($.requires, $.Skill)
  .where({ 'skill.name': 'JavaScript' })

// Get salary data
console.log(jobs[0].data.medianWage) // $120,730
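
As a rough illustration of skill-based matching, the sketch below ranks occupations by the share of their required skills that a candidate has. It is one possible approach, not a description of how the platform matches. The `OccupationSkills` shape, the function names, and the sample data are all hypothetical.

```typescript
// Hypothetical shape: an occupation with the skills it requires.
interface OccupationSkills {
  title: string
  requiredSkills: string[]
}

interface RankedOccupation {
  title: string
  coverage: number // fraction of required skills the candidate has, 0..1
}

// Fraction of an occupation's required skills covered by the candidate (case-insensitive).
function skillCoverage(candidateSkills: string[], occupation: OccupationSkills): number {
  if (occupation.requiredSkills.length === 0) return 0
  const have = new Set(candidateSkills.map((s) => s.toLowerCase()))
  const matched = occupation.requiredSkills.filter((s) => have.has(s.toLowerCase())).length
  return matched / occupation.requiredSkills.length
}

// Sort occupations from best to worst coverage.
function rankOccupations(candidateSkills: string[], occupations: OccupationSkills[]): RankedOccupation[] {
  return occupations
    .map((o) => ({ title: o.title, coverage: skillCoverage(candidateSkills, o) }))
    .sort((a, b) => b.coverage - a.coverage)
}

// Made-up sample data for illustration.
const rankedSample = rankOccupations(
  ['JavaScript', 'SQL'],
  [
    { title: 'Web Developers', requiredSkills: ['JavaScript', 'HTML'] },
    { title: 'Database Administrators', requiredSkills: ['SQL'] },
  ],
)
console.log(rankedSample)
```

Experience and education could be folded into the score in the same way, as additional weighted terms.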

Industry Analysis

Analyze industries and market segments:

// Get all industries under NAICS subsector 511 (publishing, which includes 5112 Software Publishers)
const industries = await $.Industry.list().where({ naicsCode: { startsWith: '511' } })

// Find companies in industry
const companies = await $.Organization.list().where({ industry: '5112' })
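
The prefix filter works because NAICS codes are hierarchical: each additional digit narrows the classification. The sketch below shows that structure in plain TypeScript. The helper names `naicsLevel` and `isWithinNaics` are hypothetical and are not part of the SDK.

```typescript
type NaicsLevel = 'sector' | 'subsector' | 'industry group' | 'NAICS industry' | 'national industry'

// Map a 2- to 6-digit NAICS code to its level in the hierarchy.
// Note: a few sectors span a range of 2-digit codes (e.g. 31-33 for Manufacturing).
function naicsLevel(code: string): NaicsLevel | undefined {
  if (!/^\d{2,6}$/.test(code)) return undefined
  const levels: NaicsLevel[] = ['sector', 'subsector', 'industry group', 'NAICS industry', 'national industry']
  return levels[code.length - 2]
}

// True when `code` equals `ancestor` or sits below it in the hierarchy.
function isWithinNaics(code: string, ancestor: string): boolean {
  return code.length >= ancestor.length && code.startsWith(ancestor)
}

console.log(naicsLevel('511'), naicsLevel('5112'))
console.log(isWithinNaics('5112', '511'))
```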

Business Process Automation

Map processes to automation tools:

// Get business process
const process = await $.Process.get('10001')

// Find automation actions
const actions = await $.Action.list().where({ category: 'CRM' })

Location-Based Services

Find places and geographic data:

// Search places
const places = await $.Place.search('San Francisco')

// Get nearby locations
const nearby = await $.Place.nearby(37.7749, -122.4194, {
  radius: 10, // miles
  type: 'restaurant',
})
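
For intuition about what a radius query computes, the sketch below filters points by great-circle distance using the haversine formula. This is an assumption about the general technique, not a description of how `$.Place.nearby` is implemented. `GeoPoint`, `haversineMiles`, and `withinRadius` are hypothetical names, and the sample coordinates are approximate.

```typescript
interface GeoPoint {
  label: string
  lat: number
  lon: number
}

const EARTH_RADIUS_MILES = 3958.8 // mean Earth radius

// Great-circle distance between two lat/lon pairs, in miles.
function haversineMiles(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const toRad = (deg: number): number => (deg * Math.PI) / 180
  const dLat = toRad(lat2 - lat1)
  const dLon = toRad(lon2 - lon1)
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2
  return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a))
}

// Keep only the points within `radiusMiles` of the given center.
function withinRadius(points: GeoPoint[], lat: number, lon: number, radiusMiles: number): GeoPoint[] {
  return points.filter((p) => haversineMiles(lat, lon, p.lat, p.lon) <= radiusMiles)
}

// Approximate city-center coordinates, for illustration only.
const samplePoints: GeoPoint[] = [
  { label: 'Oakland', lat: 37.8044, lon: -122.2712 },
  { label: 'Los Angeles', lat: 34.0522, lon: -118.2437 },
]

// Same center and radius as the example above.
const nearbySample = withinRadius(samplePoints, 37.7749, -122.4194, 10)
console.log(nearbySample.map((p) => p.label))
```

A linear scan like this is fine for small lists; a dataset with millions of places would normally rely on a spatial index instead.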

API Reference

Dataset Packages

All datasets are available as npm packages, via the API, and via MCP:

  • Package: dataset-name.org.ai (e.g., industries.org.ai)
  • API: https://apis.do/datasets/{namespace}
  • MCP: mcp://datasets.do/{namespace}
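
The two URL patterns above can be filled in with a small helper. This is a sketch: `datasetEndpoints` is a hypothetical function, and whether `{namespace}` is the full package name (as in the usage line) or a shorter identifier is an assumption to verify.

```typescript
interface DatasetEndpoints {
  apiUrl: string
  mcpUri: string
}

// Substitute a namespace into the documented API and MCP patterns.
function datasetEndpoints(namespace: string): DatasetEndpoints {
  const segment = encodeURIComponent(namespace) // guard against characters that are unsafe in a path
  return {
    apiUrl: `https://apis.do/datasets/${segment}`,
    mcpUri: `mcp://datasets.do/${segment}`,
  }
}

// Assumes the namespace matches the package name.
const industryEndpoints = datasetEndpoints('industries.org.ai')
console.log(industryEndpoints.apiUrl)
console.log(industryEndpoints.mcpUri)
```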

Standard Interface

Every dataset exports the same interface:

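import { getAllTypes, getType, search } from 'industries.org.ai' // any dataset package exposes the same functions

// Get all records
const all = getAllTypes()

// Get by ID (the ID format depends on the dataset)
const record = getType('5112')

// Search (if supported)
const results = search('software')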
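
To make the shared contract concrete, the sketch below writes it as a TypeScript interface and backs it with an in-memory stand-in. The `DatasetModule` type, the `createDataset` helper, and the `IndustryRecord` shape are assumptions for illustration; only the three function names come from the import above.

```typescript
// Assumed shape of the shared interface.
interface DatasetModule<T> {
  getAllTypes(): T[]
  getType(id: string): T | undefined
  search?(query: string): T[] // optional, since search is only available "if supported"
}

// Hypothetical record shape for the stand-in data.
interface IndustryRecord {
  id: string
  title: string
}

// In-memory stand-in for a dataset package.
function createDataset(records: IndustryRecord[]): DatasetModule<IndustryRecord> {
  const byId = new Map<string, IndustryRecord>()
  for (const r of records) byId.set(r.id, r)
  return {
    getAllTypes: () => records.slice(), // return a copy so callers cannot mutate the source
    getType: (id) => byId.get(id),
    search: (query) => {
      const q = query.toLowerCase()
      return records.filter((r) => r.title.toLowerCase().includes(q))
    },
  }
}

// Sample data for illustration.
const sampleIndustries = createDataset([
  { id: '5111', title: 'Newspaper, Periodical, Book, and Directory Publishers' },
  { id: '5112', title: 'Software Publishers' },
])

console.log(sampleIndustries.getType('5112'))
console.log(sampleIndustries.search?.('software'))
```

Marking `search` as optional forces callers to handle datasets that do not support it.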

SDK Integration

Access datasets through the unified SDK:

import { $, db } from 'sdk.do'

// List with filters
const results = await db.list($.Type, {
  where: { category: 'Software' },
  limit: 10,
  orderBy: { name: 'asc' },
})

// Get with relationships
const related = await db.related(entity, $.predicate, $.Object)

// Aggregations
const stats = await db.aggregate($.Occupation, {
  groupBy: ['category'],
  count: true,
  avg: ['medianWage'],
})
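
To show what a `groupBy` with `count` and `avg` conceptually computes, here is an in-memory equivalent in plain TypeScript. It is a sketch of the semantics under assumed field names, not the SDK's implementation. `WageRow`, `GroupStats`, `aggregateByCategory`, and the sample figures are hypothetical.

```typescript
interface WageRow {
  category: string
  medianWage: number
}

interface GroupStats {
  category: string
  count: number
  avgMedianWage: number
}

// Group rows by category, then report the row count and mean wage per group.
function aggregateByCategory(rows: WageRow[]): GroupStats[] {
  const groups = new Map<string, { count: number; total: number }>()
  for (const row of rows) {
    const acc = groups.get(row.category) ?? { count: 0, total: 0 }
    acc.count += 1
    acc.total += row.medianWage
    groups.set(row.category, acc)
  }
  // Map preserves insertion order, so groups appear in first-seen order.
  return Array.from(groups.entries()).map(([category, { count, total }]) => ({
    category,
    count,
    avgMedianWage: total / count,
  }))
}

// Made-up figures for illustration.
const statsSample = aggregateByCategory([
  { category: 'Computer', medianWage: 100000 },
  { category: 'Computer', medianWage: 120000 },
  { category: 'Legal', medianWage: 90000 },
])
console.log(statsSample)
```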

Contributing

We welcome contributions to expand dataset coverage:

  1. Suggest datasets - Open an issue on GitHub proposing a new dataset
  2. Add sources - Contribute data sources in ai/sources/datasets/
  3. Improve documentation - Enhance dataset documentation
  4. Report issues - File bugs or data quality issues on GitHub

Last updated: November 2, 2025
Total datasets: 115+
Total records: 11M+