
Monitoring and Observability Guide

Complete guide to implementing monitoring, logging, tracing, alerting, and dashboards for Services-as-Software

Learn how to build comprehensive monitoring and observability systems for your Services-as-Software, ensuring visibility into service health, performance, and customer experience.

Why Monitoring and Observability Matter

Effective monitoring and observability are critical for Services-as-Software:

  1. Proactive Issue Detection - Catch problems before customers experience them
  2. Performance Optimization - Identify bottlenecks and optimization opportunities
  3. SLA Compliance - Track and enforce service level agreements
  4. Customer Experience - Understand and improve customer satisfaction
  5. Business Intelligence - Drive product and business decisions with data
  6. Incident Response - Quickly diagnose and resolve issues
  7. Cost Management - Optimize resource usage and costs

Service Metrics

Basic Metrics Collection

Collect fundamental service metrics:

import { $, db, on, send } from 'sdk.do'

class MetricsCollector {
  async recordExecution(execution: any) {
    // Record execution metrics
    await db.create($.Metric, {
      serviceId: execution.serviceId,
      customerId: execution.customerId,
      timestamp: new Date(),
      type: 'execution',
      metrics: {
        duration: execution.duration,
        status: execution.status,
        inputSize: execution.inputSize,
        outputSize: execution.outputSize,
        tokensUsed: execution.tokensUsed,
        cost: execution.cost,
      },
      tags: {
        environment: process.env.ENVIRONMENT,
        version: execution.serviceVersion,
        region: execution.region,
      },
    })

    // Update aggregates
    await this.updateAggregates(execution)
  }

  private async updateAggregates(execution: any) {
    const now = new Date()

    // Update hourly aggregates
    await db.increment($.MetricAggregate, {
      where: {
        serviceId: execution.serviceId,
        period: 'hour',
        timestamp: new Date(now.getFullYear(), now.getMonth(), now.getDate(), now.getHours()),
      },
      data: {
        totalExecutions: 1,
        successfulExecutions: execution.status === 'completed' ? 1 : 0,
        failedExecutions: execution.status === 'failed' ? 1 : 0,
        totalDuration: execution.duration,
        totalCost: execution.cost,
      },
    })

    // Update daily aggregates
    await db.increment($.MetricAggregate, {
      where: {
        serviceId: execution.serviceId,
        period: 'day',
        timestamp: new Date(now.getFullYear(), now.getMonth(), now.getDate()),
      },
      data: {
        totalExecutions: 1,
        successfulExecutions: execution.status === 'completed' ? 1 : 0,
        failedExecutions: execution.status === 'failed' ? 1 : 0,
        totalDuration: execution.duration,
        totalCost: execution.cost,
      },
    })
  }
}

// Collect metrics on every execution
on($.ServiceExecution.complete, async (execution) => {
  const collector = new MetricsCollector()
  await collector.recordExecution(execution)
})

Advanced Metrics

Track detailed performance metrics:

class AdvancedMetrics {
  async recordDetailedMetrics(execution: any) {
    // Performance metrics
    const performanceMetrics = {
      // Timing breakdown
      timing: {
        queueTime: execution.startTime - execution.createdTime,
        executionTime: execution.endTime - execution.startTime,
        totalTime: execution.endTime - execution.createdTime,
      },

      // Resource usage
      resources: {
        cpuTime: execution.cpuTime,
        memoryPeak: execution.memoryPeak,
        memoryAverage: execution.memoryAverage,
        networkIn: execution.networkIn,
        networkOut: execution.networkOut,
      },

      // AI/ML metrics
      ai: {
        model: execution.model,
        tokensInput: execution.tokensInput,
        tokensOutput: execution.tokensOutput,
        tokensTotal: execution.tokensTotal,
        temperature: execution.temperature,
        quality: execution.qualityScore,
      },

      // Business metrics
      business: {
        revenue: execution.revenue,
        cost: execution.cost,
        margin: execution.revenue - execution.cost,
        marginPercent: execution.revenue ? ((execution.revenue - execution.cost) / execution.revenue) * 100 : 0,
      },

      // Quality metrics
      quality: {
        accuracy: execution.accuracy,
        completeness: execution.completeness,
        relevance: execution.relevance,
        customerSatisfaction: execution.customerSatisfaction,
      },
    }

    await db.create($.DetailedMetric, {
      serviceId: execution.serviceId,
      executionId: execution.id,
      timestamp: new Date(),
      metrics: performanceMetrics,
    })

    return performanceMetrics
  }

  async calculateDerivedMetrics(serviceId: string, period: { start: Date; end: Date }) {
    // Get raw metrics
    const executions = await db.list($.ServiceExecution, {
      where: {
        serviceId,
        timestamp: { gte: period.start, lte: period.end },
      },
    })

    if (executions.length === 0) {
      return null
    }

    // Calculate derived metrics
    const successful = executions.filter((e) => e.status === 'completed')
    const failed = executions.filter((e) => e.status === 'failed')
    const durations = successful.map((e) => e.duration).sort((a, b) => a - b)

    return {
      // Availability
      availability: successful.length / executions.length,
      uptime: (successful.length / executions.length) * 100,

      // Performance percentiles
      p50: durations[Math.floor(durations.length * 0.5)],
      p75: durations[Math.floor(durations.length * 0.75)],
      p90: durations[Math.floor(durations.length * 0.9)],
      p95: durations[Math.floor(durations.length * 0.95)],
      p99: durations[Math.floor(durations.length * 0.99)],

      // Throughput
      throughput: executions.length / ((period.end.getTime() - period.start.getTime()) / 1000),
      requestsPerSecond: executions.length / ((period.end.getTime() - period.start.getTime()) / 1000),

      // Error rates
      errorRate: failed.length / executions.length,
      errorCount: failed.length,

      // Quality (guarded against periods with no successful executions)
      averageQuality: successful.length ? successful.reduce((sum, e) => sum + (e.qualityScore || 0), 0) / successful.length : 0,

      // Business
      totalRevenue: executions.reduce((sum, e) => sum + (e.revenue || 0), 0),
      totalCost: executions.reduce((sum, e) => sum + (e.cost || 0), 0),
      averageMargin: successful.length ? successful.reduce((sum, e) => sum + (e.revenue ? ((e.revenue - e.cost) / e.revenue) * 100 : 0), 0) / successful.length : 0,
    }
  }
}
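The percentile lookups in `calculateDerivedMetrics` index a sorted array with `Math.floor(length * p)`. Factored out as a standalone helper (the `percentile` name is ours, not part of the SDK), the same lookup looks like this, with the index clamped so an empty array returns `undefined` instead of reading out of bounds:

```typescript
// Percentile lookup over an ascending-sorted array, mirroring the
// p50/p95/p99 calculations above. Returns undefined for empty input.
function percentile(sorted: number[], p: number): number | undefined {
  if (sorted.length === 0) return undefined
  const index = Math.min(sorted.length - 1, Math.floor(sorted.length * p))
  return sorted[index]
}

const durations = [120, 200, 250, 400, 900].sort((a, b) => a - b)
console.log(percentile(durations, 0.5)) // 250
console.log(percentile(durations, 0.95)) // 900
```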

Logging Strategies

Structured Logging

Implement structured logging:

class ServiceLogger {
  private context: Record<string, any>

  constructor(context: Record<string, any>) {
    this.context = context
  }

  async log(level: string, message: string, data?: Record<string, any>) {
    const logEntry = {
      timestamp: new Date().toISOString(),
      level,
      message,
      context: this.context,
      data: data || {},
      environment: process.env.ENVIRONMENT,
      version: process.env.VERSION,
    }

    // Write to database
    await db.create($.LogEntry, logEntry)

    // Also log to console in development
    if (process.env.ENVIRONMENT === 'development') {
      console.log(JSON.stringify(logEntry, null, 2))
    }

    // Send critical logs to alerting system
    if (level === 'error' || level === 'critical') {
      await this.sendAlert(logEntry)
    }
  }

  async info(message: string, data?: Record<string, any>) {
    await this.log('info', message, data)
  }

  async warn(message: string, data?: Record<string, any>) {
    await this.log('warn', message, data)
  }

  async error(message: string, error?: Error, data?: Record<string, any>) {
    await this.log('error', message, {
      ...data,
      error: error
        ? {
            message: error.message,
            stack: error.stack,
            name: error.name,
          }
        : undefined,
    })
  }

  async critical(message: string, error?: Error, data?: Record<string, any>) {
    await this.log('critical', message, {
      ...data,
      error: error
        ? {
            message: error.message,
            stack: error.stack,
            name: error.name,
          }
        : undefined,
    })
  }

  private async sendAlert(logEntry: any) {
    await send($.Alert.create, {
      type: 'log-alert',
      severity: logEntry.level,
      message: logEntry.message,
      context: logEntry.context,
      data: logEntry.data,
    })
  }
}

// Usage in service execution
on($.ServiceRequest.created, async (request) => {
  const logger = new ServiceLogger({
    serviceId: request.serviceId,
    requestId: request.id,
    customerId: request.customerId,
  })

  await logger.info('Service request received', {
    inputs: request.inputs,
  })

  try {
    await logger.info('Starting service execution')

    const result = await executeService(request)

    await logger.info('Service execution completed', {
      duration: result.duration,
      outputs: result.outputs,
    })
  } catch (error) {
    await logger.error('Service execution failed', error as Error, {
      inputs: request.inputs,
    })
  }
})

Log Aggregation

Aggregate and search logs:

class LogAggregator {
  async searchLogs(query: { serviceId?: string; customerId?: string; level?: string; startTime?: Date; endTime?: Date; message?: string; limit?: number }) {
    const where: any = {}

    if (query.serviceId) where['context.serviceId'] = query.serviceId
    if (query.customerId) where['context.customerId'] = query.customerId
    if (query.level) where.level = query.level
    if (query.startTime || query.endTime) {
      where.timestamp = {}
      if (query.startTime) where.timestamp.gte = query.startTime
      if (query.endTime) where.timestamp.lte = query.endTime
    }
    if (query.message) where.message = { contains: query.message }

    return await db.list($.LogEntry, {
      where,
      orderBy: { timestamp: 'desc' },
      take: query.limit || 100,
    })
  }

  async getErrorLogs(serviceId: string, hours: number = 24) {
    const startTime = new Date(Date.now() - hours * 60 * 60 * 1000)

    return await this.searchLogs({
      serviceId,
      level: 'error',
      startTime,
    })
  }

  async analyzeLogPatterns(serviceId: string, period: { start: Date; end: Date }) {
    const logs = await this.searchLogs({
      serviceId,
      startTime: period.start,
      endTime: period.end,
      limit: 10000,
    })

    // Analyze patterns
    const patterns = {
      errorsByType: this.groupBy(
        logs.filter((l) => l.level === 'error'),
        'data.error.name'
      ),
      errorsByMessage: this.groupBy(
        logs.filter((l) => l.level === 'error'),
        'message'
      ),
      logsByLevel: this.groupBy(logs, 'level'),
      logsByHour: this.groupByHour(logs),
    }

    return patterns
  }

  private groupBy(items: any[], key: string) {
    return items.reduce((acc, item) => {
      const value = this.getNestedValue(item, key) || 'unknown'
      acc[value] = (acc[value] || 0) + 1
      return acc
    }, {})
  }

  private groupByHour(items: any[]) {
    return items.reduce((acc, item) => {
      const hour = new Date(item.timestamp).getHours()
      acc[hour] = (acc[hour] || 0) + 1
      return acc
    }, {})
  }

  private getNestedValue(obj: any, path: string) {
    return path.split('.').reduce((current, key) => current?.[key], obj)
  }
}
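The pattern analysis in `analyzeLogPatterns` relies on dot-path lookups into log entries. The same idea as a standalone sketch, mirroring the private `getNestedValue` and `groupBy` helpers above:

```typescript
// Resolve a dot-separated path like 'data.error.name' against a
// nested object, returning undefined if any segment is missing.
function getNestedValue(obj: unknown, path: string): unknown {
  return path.split('.').reduce<any>((current, key) => current?.[key], obj)
}

// Count items by the value found at the given path; missing values
// fall into an 'unknown' bucket, as in the aggregator above.
function groupBy(items: any[], key: string): Record<string, number> {
  return items.reduce((acc, item) => {
    const value = String(getNestedValue(item, key) ?? 'unknown')
    acc[value] = (acc[value] || 0) + 1
    return acc
  }, {} as Record<string, number>)
}

const logs = [
  { level: 'error', data: { error: { name: 'TimeoutError' } } },
  { level: 'error', data: { error: { name: 'TimeoutError' } } },
  { level: 'warn' },
]
console.log(groupBy(logs, 'data.error.name')) // { TimeoutError: 2, unknown: 1 }
```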

Tracing Implementation

Distributed Tracing

Implement distributed tracing:

import { v4 as uuidv4 } from 'uuid'

class TracingSystem {
  async startTrace(name: string, metadata?: Record<string, any>) {
    const trace = {
      id: uuidv4(),
      name,
      startTime: Date.now(),
      metadata: metadata || {},
      spans: [],
    }

    // Store trace
    await db.create($.Trace, trace)

    return trace
  }

  async startSpan(traceId: string, name: string, parentSpanId?: string, metadata?: Record<string, any>) {
    const span = {
      id: uuidv4(),
      traceId,
      parentSpanId,
      name,
      startTime: Date.now(),
      metadata: metadata || {},
    }

    // Store span
    await db.create($.Span, span)

    return span
  }

  async endSpan(spanId: string, result?: any, error?: Error) {
    const span = await db.findOne($.Span, {
      where: { id: spanId },
    })

    if (!span) return

    const endTime = Date.now()

    await db.update($.Span, {
      where: { id: spanId },
      data: {
        endTime,
        duration: endTime - span.startTime,
        result,
        error: error
          ? {
              message: error.message,
              stack: error.stack,
            }
          : undefined,
        status: error ? 'error' : 'success',
      },
    })
  }

  async endTrace(traceId: string) {
    const trace = await db.findOne($.Trace, {
      where: { id: traceId },
      include: { spans: true },
    })

    const endTime = Date.now()

    await db.update($.Trace, {
      where: { id: traceId },
      data: {
        endTime,
        duration: endTime - trace.startTime,
        totalSpans: trace.spans.length,
      },
    })
  }

  async getTrace(traceId: string) {
    return await db.findOne($.Trace, {
      where: { id: traceId },
      include: {
        spans: {
          orderBy: { startTime: 'asc' },
        },
      },
    })
  }

  async analyzeTrace(traceId: string) {
    const trace = await this.getTrace(traceId)

    if (!trace) return null

    // Build span tree
    const spanTree = this.buildSpanTree(trace.spans)

    // Calculate critical path
    const criticalPath = this.findCriticalPath(spanTree)

    // Identify slow spans
    const slowSpans = trace.spans.filter((s) => s.duration > 1000).sort((a, b) => b.duration - a.duration)

    return {
      trace,
      spanTree,
      criticalPath,
      slowSpans,
      totalDuration: trace.duration,
      spanCount: trace.spans.length,
    }
  }

  private buildSpanTree(spans: any[]) {
    const spanMap = new Map(spans.map((s) => [s.id, { ...s, children: [] }]))

    const roots: any[] = []

    for (const span of spanMap.values()) {
      if (span.parentSpanId) {
        const parent = spanMap.get(span.parentSpanId)
        if (parent) {
          parent.children.push(span)
        }
      } else {
        roots.push(span)
      }
    }

    return roots
  }

  private findCriticalPath(spanTree: any[]): any[] {
    // Find longest path through span tree
    let longestPath: any[] = []
    let longestDuration = 0

    const traverse = (node: any, path: any[], duration: number) => {
      const currentPath = [...path, node]
      const currentDuration = duration + node.duration

      if (node.children.length === 0) {
        if (currentDuration > longestDuration) {
          longestDuration = currentDuration
          longestPath = currentPath
        }
      } else {
        for (const child of node.children) {
          traverse(child, currentPath, currentDuration)
        }
      }
    }

    for (const root of spanTree) {
      traverse(root, [], 0)
    }

    return longestPath
  }
}

// Use tracing in service execution
on($.ServiceRequest.created, async (request) => {
  const tracing = new TracingSystem()

  // Start trace
  const trace = await tracing.startTrace('service-execution', {
    serviceId: request.serviceId,
    requestId: request.id,
  })

  try {
    // Research span
    const researchSpan = await tracing.startSpan(trace.id, 'research')
    const research = await performResearch(request)
    await tracing.endSpan(researchSpan.id, research)

    // Outline span
    const outlineSpan = await tracing.startSpan(trace.id, 'outline', researchSpan.id)
    const outline = await createOutline(research)
    await tracing.endSpan(outlineSpan.id, outline)

    // Content span with nested spans
    const contentSpan = await tracing.startSpan(trace.id, 'content', outlineSpan.id)

    // AI generation nested span
    const aiSpan = await tracing.startSpan(trace.id, 'ai-generation', contentSpan.id)
    const content = await generateContent(outline)
    await tracing.endSpan(aiSpan.id, content)

    // Quality check nested span
    const qualitySpan = await tracing.startSpan(trace.id, 'quality-check', contentSpan.id)
    const quality = await checkQuality(content)
    await tracing.endSpan(qualitySpan.id, quality)

    await tracing.endSpan(contentSpan.id, content)

    // End trace
    await tracing.endTrace(trace.id)
  } catch (error) {
    // End the trace even when execution fails, then rethrow so the failure propagates
    await tracing.endTrace(trace.id)
    throw error
  }
})

Alerting Setup

Alert Configuration

Configure comprehensive alerting:

class AlertingSystem {
  async configureAlerts(serviceId: string) {
    // Performance alerts
    await this.createAlert({
      serviceId,
      name: 'High Response Time',
      condition: {
        metric: 'p95_duration',
        operator: 'greater_than',
        threshold: 10000, // 10 seconds
        window: '5m',
      },
      severity: 'warning',
      actions: ['email', 'slack'],
    })

    await this.createAlert({
      serviceId,
      name: 'Very High Response Time',
      condition: {
        metric: 'p99_duration',
        operator: 'greater_than',
        threshold: 30000, // 30 seconds
        window: '5m',
      },
      severity: 'critical',
      actions: ['email', 'slack', 'pagerduty'],
    })

    // Error rate alerts
    await this.createAlert({
      serviceId,
      name: 'Elevated Error Rate',
      condition: {
        metric: 'error_rate',
        operator: 'greater_than',
        threshold: 0.05, // 5%
        window: '5m',
      },
      severity: 'warning',
      actions: ['email', 'slack'],
    })

    await this.createAlert({
      serviceId,
      name: 'High Error Rate',
      condition: {
        metric: 'error_rate',
        operator: 'greater_than',
        threshold: 0.1, // 10%
        window: '5m',
      },
      severity: 'critical',
      actions: ['email', 'slack', 'pagerduty'],
    })

    // Availability alerts
    await this.createAlert({
      serviceId,
      name: 'Low Availability',
      condition: {
        metric: 'availability',
        operator: 'less_than',
        threshold: 0.99, // 99%
        window: '1h',
      },
      severity: 'critical',
      actions: ['email', 'slack', 'pagerduty'],
    })

    // Business metric alerts
    await this.createAlert({
      serviceId,
      name: 'Revenue Drop',
      condition: {
        metric: 'revenue',
        operator: 'decrease_by',
        threshold: 0.2, // 20% decrease
        window: '1h',
        comparison: 'previous_period',
      },
      severity: 'warning',
      actions: ['email'],
    })

    // Quality alerts
    await this.createAlert({
      serviceId,
      name: 'Quality Degradation',
      condition: {
        metric: 'quality_score',
        operator: 'less_than',
        threshold: 0.9, // 90%
        window: '1h',
      },
      severity: 'warning',
      actions: ['email', 'slack'],
    })
  }

  private async createAlert(config: any) {
    return await db.create($.Alert, config)
  }

  async evaluateAlerts(serviceId: string) {
    const alerts = await db.list($.Alert, {
      where: { serviceId, enabled: true },
    })

    for (const alert of alerts) {
      const triggered = await this.evaluateCondition(alert.condition, serviceId)

      if (triggered) {
        await this.triggerAlert(alert)
      }
    }
  }

  private async evaluateCondition(condition: any, serviceId: string): Promise<boolean> {
    // Get metric value
    const value = await this.getMetricValue(serviceId, condition.metric, condition.window)

    // Evaluate condition
    switch (condition.operator) {
      case 'greater_than':
        return value > condition.threshold
      case 'less_than':
        return value < condition.threshold
      case 'equals':
        return value === condition.threshold
      case 'decrease_by': {
        const previousValue = await this.getPreviousMetricValue(serviceId, condition.metric, condition.window)
        if (previousValue === 0) return false
        const decrease = (previousValue - value) / previousValue
        return decrease > condition.threshold
      }
      default:
        return false
    }
  }

  private async getMetricValue(serviceId: string, metric: string, window: string): Promise<number> {
    const windowMs = this.parseWindow(window)
    const startTime = new Date(Date.now() - windowMs)

    const aggregate = await db.findOne($.MetricAggregate, {
      where: {
        serviceId,
        timestamp: { gte: startTime },
      },
    })

    return aggregate?.metrics[metric] || 0
  }

  private async getPreviousMetricValue(serviceId: string, metric: string, window: string): Promise<number> {
    const windowMs = this.parseWindow(window)
    const endTime = new Date(Date.now() - windowMs)
    const startTime = new Date(endTime.getTime() - windowMs)

    const aggregate = await db.findOne($.MetricAggregate, {
      where: {
        serviceId,
        timestamp: { gte: startTime, lte: endTime },
      },
    })

    return aggregate?.metrics[metric] || 0
  }

  private parseWindow(window: string): number {
    const match = window.match(/^(\d+)([smhd])$/)
    if (!match) return 300000 // default 5 minutes

    const value = parseInt(match[1])
    const unit = match[2]

    const multipliers: Record<string, number> = {
      s: 1000,
      m: 60000,
      h: 3600000,
      d: 86400000,
    }

    return value * multipliers[unit]
  }

  private async triggerAlert(alert: any) {
    // Create alert instance
    const instance = await db.create($.AlertInstance, {
      alertId: alert.id,
      serviceId: alert.serviceId,
      timestamp: new Date(),
      severity: alert.severity,
      message: this.formatAlertMessage(alert),
    })

    // Execute alert actions
    for (const action of alert.actions) {
      await this.executeAlertAction(action, instance)
    }
  }

  private formatAlertMessage(alert: any): string {
    return `${alert.name}: ${alert.condition.metric} ${alert.condition.operator} ${alert.condition.threshold}`
  }

  private async executeAlertAction(action: string, instance: any) {
    switch (action) {
      case 'email':
        await send($.Email.send, {
          to: '[email protected]',
          subject: `Alert: ${instance.message}`,
          template: 'alert',
          data: instance,
        })
        break

      case 'slack':
        await send($.Slack.sendMessage, {
          channel: '#alerts',
          text: instance.message,
          data: instance,
        })
        break

      case 'pagerduty':
        await send($.PagerDuty.createIncident, {
          title: instance.message,
          severity: instance.severity,
          data: instance,
        })
        break
    }
  }
}
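The window strings used in the alert conditions ('5m', '1h') are converted to milliseconds by `parseWindow`. The same parser as a standalone sketch, with the class's five-minute fallback for anything unparseable:

```typescript
// Parse a window string like '5m' or '1h' into milliseconds.
// Unrecognized input falls back to 5 minutes, as in the class above.
function parseWindow(window: string): number {
  const match = window.match(/^(\d+)([smhd])$/)
  if (!match) return 300_000

  const multipliers: Record<string, number> = {
    s: 1_000,
    m: 60_000,
    h: 3_600_000,
    d: 86_400_000,
  }

  return parseInt(match[1], 10) * multipliers[match[2]]
}

console.log(parseWindow('5m')) // 300000
console.log(parseWindow('1h')) // 3600000
```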

// Run alert evaluation periodically
on($.Schedule.everyMinute, async () => {
  const alerting = new AlertingSystem()
  const services = await db.list($.Service, {
    where: { status: 'active' },
  })

  for (const service of services) {
    await alerting.evaluateAlerts(service.id)
  }
})

Performance Monitoring

Real-Time Performance Tracking

Track performance in real-time:

class PerformanceMonitor {
  async trackPerformance(execution: any) {
    // Calculate performance score
    const score = await this.calculatePerformanceScore(execution)

    // Record performance
    await db.create($.PerformanceMetric, {
      serviceId: execution.serviceId,
      executionId: execution.id,
      timestamp: new Date(),
      score,
      metrics: {
        duration: execution.duration,
        cpuTime: execution.cpuTime,
        memoryPeak: execution.memoryPeak,
        latency: execution.latency,
      },
    })

    // Check for performance issues
    if (score < 0.7) {
      await this.investigatePerformance(execution)
    }

    return score
  }

  private async calculatePerformanceScore(execution: any): Promise<number> {
    // Get service SLA targets
    const service = await db.findOne($.Service, {
      where: { id: execution.serviceId },
    })

    const targetDuration = service.sla.responseTime
    const actualDuration = execution.duration

    // Score based on how close to target
    const durationScore = Math.max(0, 1 - actualDuration / (targetDuration * 2))

    // Factor in other metrics
    const cpuScore = this.scoreResource(execution.cpuTime, 1000) // Target 1 second CPU
    const memoryScore = this.scoreResource(execution.memoryPeak, 512 * 1024 * 1024) // Target 512MB

    // Weighted average
    return durationScore * 0.5 + cpuScore * 0.25 + memoryScore * 0.25
  }

  private scoreResource(actual: number, target: number): number {
    return Math.max(0, 1 - actual / (target * 2))
  }

  private async investigatePerformance(execution: any) {
    // Get trace for execution
    const trace = await db.findOne($.Trace, {
      where: { 'metadata.executionId': execution.id },
      include: { spans: true },
    })

    if (!trace) return

    // Analyze trace
    const tracing = new TracingSystem()
    const analysis = await tracing.analyzeTrace(trace.id)

    if (!analysis) return

    // Identify bottlenecks
    const bottlenecks = analysis.slowSpans.map((span) => ({
      name: span.name,
      duration: span.duration,
      percentage: (span.duration / trace.duration) * 100,
    }))

    // Create performance report
    await db.create($.PerformanceReport, {
      serviceId: execution.serviceId,
      executionId: execution.id,
      timestamp: new Date(),
      bottlenecks,
      recommendations: this.generateRecommendations(bottlenecks),
    })
  }

  private generateRecommendations(bottlenecks: any[]): string[] {
    const recommendations: string[] = []

    for (const bottleneck of bottlenecks) {
      if (bottleneck.name.includes('ai') && bottleneck.duration > 5000) {
        recommendations.push('Consider using a faster AI model or caching responses')
      }
      if (bottleneck.name.includes('database') && bottleneck.duration > 1000) {
        recommendations.push('Optimize database queries or add indexes')
      }
      if (bottleneck.name.includes('network') && bottleneck.duration > 2000) {
        recommendations.push('Consider caching external API responses')
      }
    }

    return recommendations
  }
}
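The weighted score in `calculatePerformanceScore` can be expressed as a pure function. The weights (0.5 / 0.25 / 0.25) match the class above; the 1-second CPU and 512 MB memory targets remain the same assumed defaults:

```typescript
// Score a single resource: 1.0 at zero usage, 0 at twice the target.
function scoreResource(actual: number, target: number): number {
  return Math.max(0, 1 - actual / (target * 2))
}

// Weighted performance score: duration counts for half, CPU and
// memory for a quarter each, mirroring the monitor above.
function performanceScore(duration: number, targetDuration: number, cpuTime: number, memoryPeak: number): number {
  const durationScore = scoreResource(duration, targetDuration)
  const cpuScore = scoreResource(cpuTime, 1000) // target: 1 second of CPU
  const memoryScore = scoreResource(memoryPeak, 512 * 1024 * 1024) // target: 512 MB
  return durationScore * 0.5 + cpuScore * 0.25 + memoryScore * 0.25
}

console.log(performanceScore(0, 5000, 0, 0)) // 1
```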

Error Tracking

Comprehensive Error Tracking

Track and categorize errors:

class ErrorTracker {
  async trackError(error: Error, context: any) {
    // Categorize error
    const category = this.categorizeError(error)

    // Create error record
    const errorRecord = await db.create($.ErrorRecord, {
      serviceId: context.serviceId,
      executionId: context.executionId,
      timestamp: new Date(),
      category,
      message: error.message,
      stack: error.stack,
      context,
      fingerprint: this.generateFingerprint(error),
    })

    // Check for error spike
    await this.checkErrorSpike(context.serviceId)

    // Group similar errors
    await this.groupSimilarErrors(errorRecord)

    return errorRecord
  }

  private categorizeError(error: Error): string {
    if (error.message.includes('timeout')) return 'timeout'
    if (error.message.includes('rate limit')) return 'rate-limit'
    if (error.message.includes('API')) return 'api-error'
    if (error.message.includes('validation')) return 'validation'
    if (error.message.includes('auth')) return 'authentication'
    return 'unknown'
  }

  private generateFingerprint(error: Error): string {
    // Generate consistent fingerprint for grouping
    const signature = `${error.name}:${error.message}:${error.stack?.split('\n')[1]}`
    return this.hash(signature)
  }

  private hash(str: string): string {
    // Simple hash function
    let hash = 0
    for (let i = 0; i < str.length; i++) {
      const char = str.charCodeAt(i)
      hash = (hash << 5) - hash + char
      hash = hash & hash
    }
    return hash.toString(36)
  }

  private async checkErrorSpike(serviceId: string) {
    const lastHour = new Date(Date.now() - 60 * 60 * 1000)

    const recentErrors = await db.count($.ErrorRecord, {
      where: {
        serviceId,
        timestamp: { gte: lastHour },
      },
    })

    const previousHour = new Date(Date.now() - 2 * 60 * 60 * 1000)
    const previousErrors = await db.count($.ErrorRecord, {
      where: {
        serviceId,
        timestamp: { gte: previousHour, lt: lastHour },
      },
    })

    // Alert if errors more than doubled (avoid dividing by a zero baseline)
    if (recentErrors > previousErrors * 2 && recentErrors > 10) {
      const increase = previousErrors > 0 ? `${((recentErrors / previousErrors - 1) * 100).toFixed(0)}%` : 'from a zero baseline'
      await send($.Alert.create, {
        type: 'error-spike',
        serviceId,
        message: `Error rate increased ${increase}`,
        data: {
          recentErrors,
          previousErrors,
        },
      })
    }
  }

  private async groupSimilarErrors(errorRecord: any) {
    // Find existing group
    let group = await db.findOne($.ErrorGroup, {
      where: {
        fingerprint: errorRecord.fingerprint,
        resolved: false,
      },
    })

    if (!group) {
      // Create new group (include serviceId so groups can be queried per service)
      group = await db.create($.ErrorGroup, {
        fingerprint: errorRecord.fingerprint,
        serviceId: errorRecord.serviceId,
        firstSeen: errorRecord.timestamp,
        lastSeen: errorRecord.timestamp,
        count: 1,
        category: errorRecord.category,
        message: errorRecord.message,
        resolved: false,
      })
    } else {
      // Update existing group
      await db.update($.ErrorGroup, {
        where: { id: group.id },
        data: {
          lastSeen: errorRecord.timestamp,
          count: { increment: 1 },
        },
      })
    }

    // Link error to group
    await db.update($.ErrorRecord, {
      where: { id: errorRecord.id },
      data: { groupId: group.id },
    })
  }
}
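The grouping key can be sketched standalone. This mirrors `generateFingerprint` and `hash` above: a 32-bit string hash over the error name, message, and top stack frame, so recurrences of the same failure map to one group while failures thrown from different call sites stay separate:

```typescript
// Stable fingerprint for error grouping, mirroring the tracker above.
function fingerprint(error: Error): string {
  const signature = `${error.name}:${error.message}:${error.stack?.split('\n')[1] ?? ''}`
  let hash = 0
  for (let i = 0; i < signature.length; i++) {
    hash = (hash << 5) - hash + signature.charCodeAt(i)
    hash |= 0 // constrain to 32 bits
  }
  return hash.toString(36)
}

// Same error object always yields the same fingerprint.
const fp = fingerprint(new Error('rate limit exceeded'))
console.log(typeof fp) // string
```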

SLA Monitoring

SLA Compliance Tracking

Track SLA compliance:

class SLAMonitor {
  async checkSLACompliance(serviceId: string, period: { start: Date; end: Date }) {
    // Get service SLAs
    const service = await db.findOne($.Service, {
      where: { id: serviceId },
    })

    if (!service?.sla) return null

    // Get executions
    const executions = await db.list($.ServiceExecution, {
      where: {
        serviceId,
        timestamp: { gte: period.start, lte: period.end },
      },
    })

    if (executions.length === 0) return null

    // Calculate SLA metrics
    const metrics = await this.calculateSLAMetrics(executions, service.sla)

    // Check compliance
    const compliance = {
      responseTime: metrics.p95Duration <= service.sla.responseTime,
      availability: metrics.availability >= service.sla.availability / 100,
      accuracy: metrics.accuracy >= service.sla.accuracy,
      overall:
        metrics.p95Duration <= service.sla.responseTime && metrics.availability >= service.sla.availability / 100 && metrics.accuracy >= service.sla.accuracy,
    }

    // Record SLA report
    const report = await db.create($.SLAReport, {
      serviceId,
      period,
      sla: service.sla,
      metrics,
      compliance,
      timestamp: new Date(),
    })

    // Alert on SLA violations
    if (!compliance.overall) {
      await this.handleSLAViolation(service, report)
    }

    return report
  }

  private async calculateSLAMetrics(executions: any[], sla: any) {
    const successful = executions.filter((e) => e.status === 'completed')
    const durations = successful.map((e) => e.duration).sort((a, b) => a - b)

    return {
      totalExecutions: executions.length,
      successfulExecutions: successful.length,
      failedExecutions: executions.length - successful.length,
      availability: successful.length / executions.length,
      p50Duration: durations[Math.floor(durations.length * 0.5)],
      p95Duration: durations[Math.floor(durations.length * 0.95)],
      p99Duration: durations[Math.floor(durations.length * 0.99)],
      accuracy: successful.reduce((sum, e) => sum + (e.qualityScore || 1), 0) / successful.length,
    }
  }

  private async handleSLAViolation(service: any, report: any) {
    await send($.Alert.create, {
      type: 'sla-violation',
      serviceId: service.id,
      severity: 'critical',
      message: `SLA violation detected for ${service.name}`,
      data: {
        sla: service.sla,
        metrics: report.metrics,
        compliance: report.compliance,
      },
    })

    // Notify customers if in SLA agreement
    const customers = await db.list($.Customer, {
      where: {
        subscriptions: {
          some: {
            serviceId: service.id,
            slaAgreement: true,
          },
        },
      },
    })

    for (const customer of customers) {
      await send($.Email.send, {
        to: customer.email,
        subject: `SLA Violation Notification: ${service.name}`,
        template: 'sla-violation',
        data: {
          service,
          report,
        },
      })
    }
  }
}

// Check SLA compliance daily
on($.Schedule.daily, async () => {
  const monitor = new SLAMonitor()
  const services = await db.list($.Service, {
    where: { status: 'active', sla: { not: null } },
  })

  const yesterday = {
    start: new Date(Date.now() - 24 * 60 * 60 * 1000),
    end: new Date(),
  }

  for (const service of services) {
    await monitor.checkSLACompliance(service.id, yesterday)
  }
})
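The checks above assume a simple SLA shape stored on each service. A minimal sketch of that shape and the compliance predicate, using the field names `SLAMonitor` reads (`responseTime` as a p95 latency target in milliseconds, `availability` as a percentage, `accuracy` as a 0-1 quality score); the concrete targets below are illustrative:

```typescript
// Hypothetical SLA shape matching the fields read by SLAMonitor above
interface SLA {
  responseTime: number // p95 latency target, in milliseconds
  availability: number // target availability, as a percentage (e.g. 99.9)
  accuracy: number // target mean quality score, 0-1
}

interface SLAMetrics {
  p95Duration: number
  availability: number // measured availability, as a fraction 0-1
  accuracy: number
}

const exampleSLA: SLA = { responseTime: 5000, availability: 99.9, accuracy: 0.95 }

// Mirrors the compliance check in checkSLACompliance: availability is divided
// by 100 because the SLA stores a percentage while metrics store a fraction
function isCompliant(sla: SLA, m: SLAMetrics): boolean {
  return m.p95Duration <= sla.responseTime && m.availability >= sla.availability / 100 && m.accuracy >= sla.accuracy
}
```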

Dashboard Creation

Comprehensive Service Dashboard

Create real-time dashboards:

class DashboardBuilder {
  async buildServiceDashboard(serviceId: string) {
    const now = new Date()
    const last24h = new Date(now.getTime() - 24 * 60 * 60 * 1000)

    // Get metrics
    const metrics = new AdvancedMetrics()
    const derived = await metrics.calculateDerivedMetrics(serviceId, {
      start: last24h,
      end: now,
    })

    // Get unresolved error groups
    const errors = await db.list($.ErrorGroup, {
      where: {
        serviceId,
        resolved: false,
      },
      orderBy: { count: 'desc' },
      take: 10,
    })

    // Get the latest SLA report written by the daily SLA check, so that
    // rendering a dashboard does not re-run the check or re-trigger alerts
    const slaReport = await db.findOne($.SLAReport, {
      where: { serviceId },
      orderBy: { timestamp: 'desc' },
    })

    // Build dashboard
    return {
      overview: {
        status: this.getServiceStatus(derived, slaReport),
        uptime: derived.uptime,
        requestsToday: derived.totalRequests,
        errorRate: derived.errorRate,
      },

      performance: {
        p50: derived.p50,
        p95: derived.p95,
        p99: derived.p99,
        throughput: derived.throughput,
        chart: await this.getPerformanceChart(serviceId, last24h, now),
      },

      reliability: {
        availability: derived.availability,
        errorCount: derived.errorCount,
        topErrors: errors,
      },

      sla: slaReport
        ? {
            compliant: slaReport.compliance.overall,
            responseTime: {
              target: slaReport.sla.responseTime,
              actual: slaReport.metrics.p95Duration,
              compliant: slaReport.compliance.responseTime,
            },
            availability: {
              target: slaReport.sla.availability,
              actual: slaReport.metrics.availability * 100,
              compliant: slaReport.compliance.availability,
            },
            accuracy: {
              target: slaReport.sla.accuracy,
              actual: slaReport.metrics.accuracy,
              compliant: slaReport.compliance.accuracy,
            },
          }
        : null,

      business: {
        revenue: derived.totalRevenue,
        cost: derived.totalCost,
        margin: derived.averageMargin,
        customers: await this.getActiveCustomers(serviceId, last24h),
      },

      recent: {
        errors: await this.getRecentErrors(serviceId, 10),
        slowRequests: await this.getSlowRequests(serviceId, 10),
      },
    }
  }

  private getServiceStatus(metrics: any, slaReport: any): string {
    if (slaReport && !slaReport.compliance.overall) return 'degraded'
    if (metrics.errorRate > 0.05) return 'warning'
    if (metrics.availability < 0.99) return 'warning'
    return 'healthy'
  }

  private async getPerformanceChart(serviceId: string, start: Date, end: Date) {
    // Get hourly aggregates
    const aggregates = await db.list($.MetricAggregate, {
      where: {
        serviceId,
        period: 'hour',
        timestamp: { gte: start, lte: end },
      },
      orderBy: { timestamp: 'asc' },
    })

    return aggregates.map((a) => ({
      timestamp: a.timestamp,
      avgDuration: a.metrics.totalExecutions > 0 ? a.metrics.totalDuration / a.metrics.totalExecutions : 0,
      requests: a.metrics.totalExecutions,
      errors: a.metrics.failedExecutions,
    }))
  }

  private async getActiveCustomers(serviceId: string, since: Date) {
    return await db.count($.ServiceExecution, {
      where: {
        serviceId,
        timestamp: { gte: since },
      },
      distinct: ['customerId'],
    })
  }

  private async getRecentErrors(serviceId: string, limit: number) {
    return await db.list($.ErrorRecord, {
      where: { serviceId },
      orderBy: { timestamp: 'desc' },
      take: limit,
    })
  }

  private async getSlowRequests(serviceId: string, limit: number) {
    return await db.list($.ServiceExecution, {
      where: { serviceId },
      orderBy: { duration: 'desc' },
      take: limit,
    })
  }
}

// Create dashboard API endpoint
on($.API.request, async (req) => {
  const match = req.path.match(/^\/api\/dashboard\/([^/]+)$/)
  if (match) {
    const builder = new DashboardBuilder()
    const dashboard = await builder.buildServiceDashboard(match[1])
    return { status: 200, body: dashboard }
  }
})
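Building the full dashboard runs several aggregate queries per request, so if the endpoint is polled frequently it helps to put a short TTL cache in front of `buildServiceDashboard`. A minimal in-memory sketch (the `TTLCache` class is illustrative, not part of sdk.do):

```typescript
// Minimal in-memory cache; entries expire ttlMs milliseconds after being set
class TTLCache<V> {
  private store = new Map<string, { value: V; expires: number }>()

  constructor(private ttlMs: number) {}

  get(key: string): V | undefined {
    const entry = this.store.get(key)
    if (!entry || entry.expires <= Date.now()) {
      this.store.delete(key)
      return undefined
    }
    return entry.value
  }

  set(key: string, value: V): void {
    this.store.set(key, { value, expires: Date.now() + this.ttlMs })
  }
}

// Usage sketch: cache dashboards for 30 seconds per service
const dashboardCache = new TTLCache<object>(30_000)
```

In the endpoint handler, check `dashboardCache.get(serviceId)` before calling `buildServiceDashboard`, and `set` the result afterwards; 30 seconds is a reasonable starting TTL for a near-real-time dashboard.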

Best Practices

1. Instrument Everything

Add observability to all service operations:

// ✅ Good: Comprehensive instrumentation
const logger = new ServiceLogger(context)
const metrics = new MetricsCollector()
const tracing = new TracingSystem()

logger.info('Starting operation')
const trace = await tracing.startTrace('operation')
try {
  const result = await performOperation()
  await metrics.recordExecution(result)
} finally {
  // End the trace even when the operation throws
  await tracing.endTrace(trace.id)
}

2. Use Structured Logs

Always use structured logging:

// ✅ Good: Structured
logger.info('Request completed', {
  duration: 1234,
  status: 'success',
  userId: 'user-123',
})

// ❌ Bad: Unstructured
console.log('Request completed in 1234ms for user-123')

3. Set Appropriate Alerts

Configure meaningful alerts:

// ✅ Good: Actionable alerts
// Alert on sustained issues, not transient blips
// Use appropriate thresholds based on SLAs

// ❌ Bad: Noisy alerts
// Alert on every single error
// Use arbitrary thresholds
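One way to encode "sustained, not transient" is to require the error rate to exceed the threshold in every bucket of a sliding window before firing. A minimal sketch (the 5% threshold and 5-bucket window are illustrative defaults; tune them to your SLAs):

```typescript
// Alert only when the error rate exceeds the threshold in every bucket
// of the most recent window, so a single bad minute stays quiet
function shouldAlert(errorRates: number[], threshold = 0.05, windowSize = 5): boolean {
  if (errorRates.length < windowSize) return false
  return errorRates.slice(-windowSize).every((rate) => rate > threshold)
}
```

Feed this per-minute error rates from the hourly or minutely aggregates and it fires only after five consecutive bad buckets.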

4. Monitor Business Metrics

Track business impact:

// Monitor technical AND business metrics
// Track revenue, costs, customer satisfaction
// Correlate technical issues with business impact
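To correlate technical and business metrics, roll up the per-execution `cost` recorded by `MetricsCollector` alongside revenue. A minimal sketch (the per-execution `revenue` field is an assumption; the dashboard's `totalRevenue` suggests it exists somewhere in the pipeline):

```typescript
interface ExecutionFinancials {
  revenue: number
  cost: number
}

// Roll up per-execution revenue and cost into totals and gross margin
function businessImpact(executions: ExecutionFinancials[]) {
  const revenue = executions.reduce((sum, e) => sum + e.revenue, 0)
  const cost = executions.reduce((sum, e) => sum + e.cost, 0)
  return {
    revenue,
    cost,
    margin: revenue > 0 ? (revenue - cost) / revenue : 0,
  }
}
```

Alerting when `margin` drops below a floor ties a technical regression (say, a model change doubling token usage) directly to its business impact.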

5. Regular Review

Review monitoring data regularly:

// Weekly reviews of dashboards
// Monthly SLA reports
// Quarterly optimization reviews

Next Steps