datasets
Dataset management and distribution
Manage, version, and distribute datasets for AI training, analytics, and data pipelines with automatic versioning and lineage tracking.
Overview
The datasets primitive provides a unified interface for managing large datasets with features like versioning, schema validation, and efficient distribution across training and inference pipelines.
Parent Primitive: database - Universal database interface
SDK Object Mapping
This primitive maps to the db SDK object, with dataset-specific operations exposed through datasets:
import { db, datasets } from 'sdk.do'

// Create dataset
const dataset = await datasets.create({
  name: 'customer-data',
  schema: {
    id: 'string',
    email: 'string',
    createdAt: 'timestamp',
  },
})

// Version control
await dataset.version('v1.0.0', {
  description: 'Initial release',
  metadata: { rows: 10000, features: 5 },
})

// Query dataset with db interface
const data = await db.query(dataset, { limit: 100 })

Quick Example
import { datasets } from 'sdk.do'

// Create dataset
const dataset = await datasets.create({
  name: 'customer-data',
  schema: {
    id: 'string',
    email: 'string',
    createdAt: 'timestamp',
  },
})

// Upload data
await dataset.upload('./data/customers.csv')

// Version dataset
await dataset.version('v1.0.0', {
  description: 'Initial release',
  metadata: { rows: 10000, features: 5 },
})

// Query dataset
const data = await dataset.query({ limit: 100 })

Core Capabilities
- Version Control - Track dataset versions and changes
- Schema Validation - Enforce data quality and consistency
- Lineage Tracking - Track data transformations and sources
- Efficient Storage - Compressed and optimized storage
- Distribution - Fast access for training and inference
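To make the schema validation capability concrete, here is a minimal sketch of checking rows against a schema shaped like the customer-data example. It is not the SDK's implementation: FieldType, Schema, matchesType, and validateRow are hypothetical names, and treating a 'timestamp' as an ISO 8601 string is an assumption made only for this illustration.

```typescript
// Hypothetical sketch: none of these names come from sdk.do.
type FieldType = 'string' | 'timestamp'
type Schema = Record<string, FieldType>

// Same field layout as the customer-data example schema.
const customerSchema: Schema = {
  id: 'string',
  email: 'string',
  createdAt: 'timestamp',
}

// Assumption: a 'timestamp' is an ISO 8601 date-time string.
const ISO_DATE_TIME = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$/

function matchesType(value: unknown, type: FieldType): boolean {
  if (typeof value !== 'string') return false
  if (type === 'string') return true
  return ISO_DATE_TIME.test(value) && !Number.isNaN(Date.parse(value))
}

// Own-property check, so inherited keys such as "toString" are ignored.
function has(obj: object, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(obj, key)
}

// Returns a list of problems; an empty list means the row conforms.
function validateRow(row: Record<string, unknown>, schema: Schema): string[] {
  const errors: string[] = []
  for (const [field, type] of Object.entries(schema)) {
    if (!has(row, field)) {
      errors.push(`missing field: ${field}`)
    } else if (!matchesType(row[field], type)) {
      errors.push(`field ${field} is not a valid ${type}`)
    }
  }
  for (const field of Object.keys(row)) {
    if (!has(schema, field)) errors.push(`unexpected field: ${field}`)
  }
  return errors
}

const good = validateRow(
  { id: 'c1', email: 'a@example.com', createdAt: '2024-01-15T09:30:00Z' },
  customerSchema,
)
const bad = validateRow(
  { id: 'c2', createdAt: 'not-a-date', plan: 'pro' },
  customerSchema,
)

console.log(good)
console.log(bad)
```

Here good is an empty list, while bad reports the missing email, the malformed createdAt, and the unexpected plan field. Collecting errors instead of throwing on the first one lets a caller report every problem in a row at once, which tends to be more useful when checking a file before upload.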
Access Methods
SDK
TypeScript/JavaScript library for dataset operations
await datasets.create({ name: 'customer-data', schema: {...} })
CLI
Command-line tool for dataset management
do dataset create customer-data --schema schema.json
API
REST/RPC endpoints for dataset operations
curl -X POST https://api.do/v1/datasets -d '{"name":"customer-data","schema":{...}}'
MCP
Model Context Protocol for AI-driven dataset operations
Create a dataset named "customer-data" with schema fields: id, email, createdAt
Related Primitives
Parent Primitive
- database - Universal database interface and structured data storage
Sibling Primitives
- databases - Multi-database management
Related
- embeddings - Vector embeddings from datasets
- analytics - Dataset analytics
- llm - AI model training with datasets