datasets

Dataset management and distribution

Manage, version, and distribute datasets for AI training, analytics, and data pipelines with automatic versioning and lineage tracking.

Overview

The datasets primitive provides a unified interface for managing large datasets, with versioning, schema validation, and efficient distribution across training and inference pipelines.

Parent Primitive: database - Universal database interface

SDK Object Mapping

This primitive maps to the db SDK object with dataset-specific operations:

import { db, datasets } from 'sdk.do'

// Create dataset
const dataset = await datasets.create({
  name: 'customer-data',
  schema: {
    id: 'string',
    email: 'string',
    createdAt: 'timestamp',
  },
})

// Version control
await dataset.version('v1.0.0', {
  description: 'Initial release',
  metadata: { rows: 10000, features: 5 },
})

// Query dataset with db interface
const data = await db.query(dataset, { limit: 100 })

Quick Example

import { datasets } from 'sdk.do'

// Create dataset
const dataset = await datasets.create({
  name: 'customer-data',
  schema: {
    id: 'string',
    email: 'string',
    createdAt: 'timestamp',
  },
})

// Upload data
await dataset.upload('./data/customers.csv')

// Version dataset
await dataset.version('v1.0.0', {
  description: 'Initial release',
  metadata: { rows: 10000, features: 5 },
})

// Query dataset
const data = await dataset.query({ limit: 100 })
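
The `v1.0.0` tag passed to `dataset.version` above resembles a semantic version. The source does not say whether the platform interprets or orders tags, so the following is only a local sketch, assuming tags follow a `major.minor.patch` pattern with an optional `v` prefix. The helper names (`parseVersionTag`, `compareVersionTags`, `latestVersionTag`) are hypothetical and not part of the SDK.

```typescript
// Assumption: tags look like 'v1.0.0' (major.minor.patch, optional 'v' prefix).
// These helpers are illustrative only and are not part of sdk.do.
function parseVersionTag(tag: string): [number, number, number] {
  const match = /^v?(\d+)\.(\d+)\.(\d+)$/.exec(tag)
  if (!match) throw new Error(`not a major.minor.patch tag: ${tag}`)
  return [Number(match[1]), Number(match[2]), Number(match[3])]
}

// Negative if a is older than b, positive if newer, 0 if equal.
function compareVersionTags(a: string, b: string): number {
  const pa = parseVersionTag(a)
  const pb = parseVersionTag(b)
  for (let i = 0; i < 3; i++) {
    if (pa[i] !== pb[i]) return pa[i] - pb[i]
  }
  return 0
}

// Returns the newest tag, or undefined for an empty list.
// Sorting a copy keeps the caller's array untouched.
function latestVersionTag(tags: string[]): string | undefined {
  return tags.slice().sort(compareVersionTags).pop()
}

const latest = latestVersionTag(['v1.0.0', 'v1.10.0', 'v1.2.0'])
```

Comparing the numeric parts rather than the raw strings matters once a component reaches two digits: a plain string sort would place `v1.10.0` before `v1.2.0`.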

Core Capabilities

  • Version Control - Track dataset versions and changes
  • Schema Validation - Enforce data quality and consistency
  • Lineage Tracking - Track data transformations and sources
  • Efficient Storage - Compressed and optimized storage
  • Distribution - Fast access for training and inference
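
To make the schema validation capability concrete, here is a minimal local sketch of checking one row against a schema shaped like the one in the examples (`id`, `email`, `createdAt`). It is not the platform's validator. The `validateRow` helper, the rule that a `timestamp` is a `Date` or a string `Date.parse` accepts, and the rejection of unknown fields are all assumptions made for illustration.

```typescript
// Only 'string' and 'timestamp' appear in the documented schema examples;
// other field types may exist but are not assumed here.
type FieldType = 'string' | 'timestamp'
type Schema = Record<string, FieldType>
type Row = Record<string, unknown>

const hasOwn = (obj: object, key: string): boolean =>
  Object.prototype.hasOwnProperty.call(obj, key)

// Assumption: a 'timestamp' is a valid Date or a string that Date.parse accepts.
function matchesType(value: unknown, type: FieldType): boolean {
  if (type === 'string') return typeof value === 'string'
  if (value instanceof Date) return !isNaN(value.getTime())
  return typeof value === 'string' && !isNaN(Date.parse(value))
}

// Hypothetical helper: returns a list of problems; an empty list means the row conforms.
function validateRow(schema: Schema, row: Row): string[] {
  const errors: string[] = []
  for (const field of Object.keys(schema)) {
    if (!hasOwn(row, field)) {
      errors.push(`missing field: ${field}`)
    } else if (!matchesType(row[field], schema[field])) {
      errors.push(`field ${field} is not a valid ${schema[field]}`)
    }
  }
  // Assumption: fields not declared in the schema are treated as errors.
  for (const field of Object.keys(row)) {
    if (!hasOwn(schema, field)) errors.push(`unexpected field: ${field}`)
  }
  return errors
}

const customerSchema: Schema = { id: 'string', email: 'string', createdAt: 'timestamp' }

const good = validateRow(customerSchema, {
  id: 'c_1',
  email: 'a@example.com',
  createdAt: '2024-01-15T10:00:00Z',
})
const bad = validateRow(customerSchema, { id: 42, email: 'a@example.com' })
```

Here `good` is an empty list, while `bad` reports that `id` is not a string and that `createdAt` is missing. Collecting every problem instead of stopping at the first makes it easier to fix a bad row in one pass.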

Access Methods

SDK

TypeScript/JavaScript library for dataset operations

await datasets.create({ name: 'customer-data', schema: {...} })

SDK Documentation

CLI

Command-line tool for dataset management

do dataset create customer-data --schema schema.json

CLI Documentation

API

REST/RPC endpoints for dataset operations

curl -X POST https://api.do/v1/datasets -H 'Content-Type: application/json' -d '{"name":"customer-data","schema":{...}}'

API Documentation
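
For callers who prefer code over curl, the same POST can be prepared in TypeScript. This is a sketch under stated assumptions: the URL comes from the curl example, the `buildCreateDatasetRequest` helper and its types are hypothetical, the body reuses the schema from the SDK examples, and authentication is left out because the source does not show how requests are authenticated.

```typescript
// Endpoint taken from the curl example; everything else here is illustrative.
const DATASETS_URL = 'https://api.do/v1/datasets'

interface CreateDatasetBody {
  name: string
  schema: Record<string, string>
}

interface PreparedRequest {
  url: string
  method: 'POST'
  headers: Record<string, string>
  body: string
}

// Hypothetical helper: builds the request without sending it, so it can be
// inspected or tested offline. Auth headers are omitted (not shown in the source).
function buildCreateDatasetRequest(payload: CreateDatasetBody): PreparedRequest {
  return {
    url: DATASETS_URL,
    method: 'POST',
    // A JSON body normally needs an explicit JSON content type.
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  }
}

const request = buildCreateDatasetRequest({
  name: 'customer-data',
  schema: { id: 'string', email: 'string', createdAt: 'timestamp' },
})
```

To send it, pass the pieces to `fetch`, for example `await fetch(request.url, { method: request.method, headers: request.headers, body: request.body })`. The shape of the response is not documented here, so check the API documentation before relying on any field.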

MCP

Model Context Protocol for AI-driven dataset operations

Create a dataset named "customer-data" with schema fields: id, email, createdAt

MCP Documentation

Parent Primitive

  • database - Universal database interface and structured data storage

Sibling Primitives

  • embeddings - Vector embeddings from datasets
  • analytics - Dataset analytics
  • llm - AI model training with datasets