Datasets
Comprehensive collection of 115+ datasets across ontologies, infrastructure, government, scientific, business, knowledge graphs, developer, and geographic categories
The .do platform provides access to 115+ curated datasets covering everything from occupational data to government APIs, scientific publications to business processes. All datasets are free, standardized, and updated regularly from authoritative sources.
Dataset Categories
Datasets are organized into 8 major categories covering different domains and use cases:
Ontology Datasets
17 core .org.ai vocabularies including industries, occupations, skills, processes, products, and services
Infrastructure
Essential developer datasets: TLDs, ASNs, IP ranges, timezones, countries, currencies, and more
Government & Public
20+ U.S. government datasets including Federal Register, SEC filings, USPTO patents, FDA data, and Congress API
Scientific
15+ scientific datasets including arXiv, PubMed, PubChem, NCBI genes, and protein databases
Business & Finance
10+ business datasets including Crunchbase, FRED economic data, World Bank indicators, and company registries
Knowledge Graphs
10+ linked data sources including Wikidata, Schema.org, DBpedia, MusicBrainz, and IMDb
Developer Ecosystems
15+ package registries and developer platforms: NPM, GitHub, PyPI, Docker Hub, RubyGems, and more
Geographic
5+ geographic datasets including GeoNames places, OpenStreetMap, and transit data
Quick Start
Installation
# Install specific dataset package
pnpm add industries.org.ai
# Or install the complete ontologies collection
pnpm add @dotdo/ontologies
Basic Usage
SDK
import { $ } from 'sdk.do'
// Query occupations
const developers = await $.Occupation.list()
  .where({ name: { contains: 'Software' } })
  .limit(10)
// Get industry details
const software = await $.Industry.get('5112') // Software Publishers
CLI
# List all datasets
do datasets list
# Explore a specific dataset
do datasets explore industries.org.ai
# Search across datasets
do datasets search "software"
MCP (Model Context Protocol)
<!-- Access datasets via MCP in Claude Desktop -->
<use_mcp_tool>
<server_name>datasets</server_name>
<tool_name>search</tool_name>
<arguments>
{
  "query": "software development occupations",
  "datasets": ["occupations.org.ai", "skills.org.ai"]
}
</arguments>
</use_mcp_tool>
Dataset Highlights
Most Popular
- occupations.org.ai - 923 O*NET occupations with salary, education, and skills data
- industries.org.ai - 1,170 NAICS industry classifications
- places.org.ai - 11M+ geographic locations from GeoNames
- actions.org.ai - 14,116+ Zapier actions from 7,000+ apps
- services.org.ai - 50,000+ service codes
Recently Added
- products.org.ai - 6,000+ product categories
- processes.org.ai - 1,000+ APQC business processes
- education.org.ai - 6,500+ educational institutions
Coming Soon
- TLDs, ASNs, IP ranges (Infrastructure)
- Federal Register, SEC EDGAR (Government)
- arXiv, PubMed (Scientific)
- Wikidata, Schema.org (Knowledge Graphs)
- NPM, GitHub, PyPI (Developer)
Data Quality
All datasets in the platform adhere to strict quality standards:
- Authoritative Sources - Data from official sources (government agencies, standards bodies, trusted organizations)
- Regular Updates - Automated updates from source systems (monthly, quarterly, or continuous)
- Type-Safe - Complete TypeScript definitions for all datasets
- Versioned - Full git history tracking all changes
- Documented - Comprehensive documentation with schemas and examples
- Free & Open - Public domain or open-source licenses
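To illustrate the Type-Safe point in the list above, here is a minimal sketch of how typed records let the compiler catch mistakes before a query runs. The `IndustryRecord` shape and the `byNaicsPrefix` helper are hypothetical: the field names `naicsCode` and `name` are borrowed from the query examples on this page, and the typings shipped with the actual packages may differ.

```typescript
// Hypothetical record shape: `naicsCode` and `name` are borrowed from the
// query examples on this page; the real package typings may differ.
interface IndustryRecord {
  naicsCode: string
  name: string
}

// Sample rows for illustration only; they are not loaded from a dataset package.
const sample: IndustryRecord[] = [
  { naicsCode: '5111', name: 'Newspaper, Periodical, Book, and Directory Publishers' },
  { naicsCode: '5112', name: 'Software Publishers' },
  { naicsCode: '5415', name: 'Computer Systems Design and Related Services' },
]

// Typed helper: the compiler rejects records that lack a string `naicsCode`,
// which is the kind of mistake type definitions catch early.
function byNaicsPrefix(records: readonly IndustryRecord[], prefix: string): IndustryRecord[] {
  // NAICS codes are hierarchical, so a shared prefix means a shared parent group.
  return records.filter((record) => record.naicsCode.startsWith(prefix))
}

const publishing = byNaicsPrefix(sample, '511')
console.log(publishing.map((record) => record.name))
```

Passing an object without `naicsCode`, or with a numeric code, fails at compile time under this sketch instead of returning an empty result at runtime.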
Common Use Cases
Job Matching & Recruitment
Match candidates to jobs based on skills, experience, and education:
// Find jobs requiring JavaScript
const jobs = await $.Occupation.list().related($.requires, $.Skill).where({ 'skill.name': 'JavaScript' })
// Get salary data
console.log(jobs[0].data.medianWage) // $120,730
Industry Analysis
Analyze industries and market segments:
// Get all software-related industries
const industries = await $.Industry.list().where({ naicsCode: { startsWith: '511' } })
// Find companies in industry
const companies = await $.Organization.list().where({ industry: '5112' })
Business Process Automation
Map processes to automation tools:
// Get business process
const process = await $.Process.get('10001')
// Find automation actions
const actions = await $.Action.list().where({ category: 'CRM' })
Location-Based Services
Find places and geographic data:
// Search places
const places = await $.Place.search('San Francisco')
// Get nearby locations
const nearby = await $.Place.nearby(37.7749, -122.4194, {
  radius: 10, // miles
  type: 'restaurant',
})
API Reference
Dataset Packages
All datasets are available as npm packages and via API:
- Package: dataset-name.org.ai (e.g., industries.org.ai)
- API: https://apis.do/datasets/{namespace}
- MCP: mcp://datasets.do/{namespace}
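The URL patterns above can be filled in programmatically. The following is a minimal sketch, assuming only the two patterns listed; the `datasetEndpoints` function and `DatasetEndpoints` type are hypothetical names, and this page does not specify how a package name maps to a `{namespace}` value, so the namespace is taken as a plain parameter.

```typescript
// Hypothetical helper: the URL patterns come from the list above; the
// function and type names are illustrative and not part of any published SDK.
interface DatasetEndpoints {
  api: string
  mcp: string
}

function datasetEndpoints(namespace: string): DatasetEndpoints {
  const trimmed = namespace.trim()
  if (trimmed === '') {
    throw new Error('namespace must be a non-empty string')
  }
  // Encode the namespace so unexpected characters cannot alter the path.
  const segment = encodeURIComponent(trimmed)
  return {
    api: `https://apis.do/datasets/${segment}`,
    mcp: `mcp://datasets.do/${segment}`,
  }
}

// Assumption: 'industries' is used as the namespace purely for illustration.
const endpoints = datasetEndpoints('industries')
console.log(endpoints.api)
console.log(endpoints.mcp)
```

The sketch only builds the addresses. The response format of the API is not described here, so any fetching and parsing code should be written against the documentation for the specific dataset.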
Standard Interface
Every dataset exports the same interface:
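// 'dataset.org.ai' is a placeholder; substitute a concrete package such as 'industries.org.ai'
import { getAllTypes, getType, search } from 'dataset.org.ai'
// Get all records
const all = getAllTypes()
// Get by ID (id is supplied by the caller)
const record = getType(id)
// Search (if supported; query is supplied by the caller)
const results = search(query)
SDK Integration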
Access datasets through the unified SDK:
import { $, db } from 'sdk.do'
// List with filters
const results = await db.list($.Type, {
  where: { category: 'Software' },
  limit: 10,
  orderBy: { name: 'asc' },
})
// Get with relationships
const related = await db.related(entity, $.predicate, $.Object)
// Aggregations
const stats = await db.aggregate($.Occupation, {
  groupBy: ['category'],
  count: true,
  avg: ['medianWage'],
})
Contributing
We welcome contributions to expand dataset coverage:
- Suggest datasets - Open an issue on GitHub proposing a new dataset
- Add sources - Contribute data sources in ai/sources/datasets/
- Improve documentation - Enhance dataset documentation
- Report issues - File bugs or data quality issues on GitHub
Support
- Documentation: Full docs for each dataset category
- Community: GitHub Discussions
- Issues: Report bugs
- Updates: Follow @dotdo for dataset releases
Last updated: November 2, 2025
Total datasets: 115+
Total records: 11M+