Manage

Manage and operate your Business-as-Code in production with comprehensive monitoring, alerting, incident response, and continuous maintenance.

Overview

The Manage phase ensures your Business-as-Code runs smoothly in production. Through automated monitoring, intelligent alerting, rapid incident response, and proactive maintenance, you maintain high availability, performance, and reliability.

This phase emphasizes:

  • Observability: Full visibility into system behavior
  • Reliability: High availability and fault tolerance
  • Incident Response: Rapid detection and resolution
  • Maintenance: Proactive system care
  • Cost Optimization: Efficient resource utilization

Core Management Primitives

Health Monitoring

Continuous system health checks:

// Comprehensive health monitoring
export const healthMonitoring = {
  // Database health
  database: async () => {
    const start = Date.now()
    try {
      await db.query('SELECT 1')
      const latency = Date.now() - start

      return {
        status: latency < 100 ? 'healthy' : 'degraded',
        latency,
        connections: await db.pool.size(),
      }
    } catch (error) {
      return { status: 'down', error: error.message }
    }
  },

  // Cache health
  cache: async () => {
    try {
      const testKey = `health-${Date.now()}`
      await cache.set(testKey, 'ok', { ttl: 1 })
      const value = await cache.get(testKey)

      return {
        status: value === 'ok' ? 'healthy' : 'degraded',
        hitRate: await cache.getHitRate(),
      }
    } catch (error) {
      return { status: 'down', error: error.message }
    }
  },

  // External service health
  externalServices: async () => {
    const services = ['stripe', 'sendgrid', 'openai', 's3']
    const results = await Promise.all(
      services.map(async (service) => {
        try {
          const start = Date.now()
          await checkServiceHealth(service)
          return {
            service,
            status: 'healthy',
            latency: Date.now() - start,
          }
        } catch (error) {
          return {
            service,
            status: 'down',
            error: error.message,
          }
        }
      })
    )

    return {
      status: results.every((r) => r.status === 'healthy') ? 'healthy' : 'degraded',
      services: results,
    }
  },
}

// Automated health monitoring
on($.Schedule.every('1 minute'), async () => {
  const [database, cache, external] = await Promise.all([
    healthMonitoring.database(),
    healthMonitoring.cache(),
    healthMonitoring.externalServices(),
  ])

  const overall = {
    database,
    cache,
    external,
    status: [database, cache, external].every((h) => h.status === 'healthy') ? 'healthy' : 'degraded',
    timestamp: new Date(),
  }

  // Record health metrics
  await send($.Metrics.record, {
    metric: 'system_health',
    value: overall.status === 'healthy' ? 1 : 0,
    details: overall,
  })

  // Alert on degraded health
  if (overall.status !== 'healthy') {
    await send($.Alert.send, {
      severity: 'high',
      type: 'health-check-failed',
      message: 'System health check failed',
      details: overall,
    })
  }
})
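The roll-up at the end of the handler reduces component results to a single status. A pure sketch of that logic (the types are illustrative, and this variant additionally propagates 'down' when any component is unreachable, where the handler above only distinguishes 'healthy' from 'degraded' at the top level):

```typescript
// Pure sketch of the status roll-up used by the scheduled health check.
type HealthStatus = 'healthy' | 'degraded' | 'down'

interface ComponentHealth {
  status: HealthStatus
}

function aggregateHealth(components: ComponentHealth[]): HealthStatus {
  // Any unreachable component takes the whole system to 'down'
  if (components.some((c) => c.status === 'down')) return 'down'
  // Otherwise healthy only when every component is healthy
  return components.every((c) => c.status === 'healthy') ? 'healthy' : 'degraded'
}
```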

Alerting System

Intelligent alerting with escalation:

// Alert rule definitions
const alertRules = {
  criticalErrors: {
    condition: (metrics) => metrics.errorRate > 0.05,
    severity: 'critical',
    escalation: {
      immediately: ['pagerduty'],
      after5min: ['sms', 'phone'],
    },
  },

  highLatency: {
    condition: (metrics) => metrics.p95Latency > 1000,
    severity: 'high',
    escalation: {
      immediately: ['slack'],
      after15min: ['email', 'pagerduty'],
    },
  },

  degradedPerformance: {
    condition: (metrics) => metrics.p95Latency > 500,
    severity: 'medium',
    escalation: {
      immediately: ['slack'],
    },
  },

  lowDiskSpace: {
    condition: (metrics) => metrics.diskUsage > 0.85,
    severity: 'medium',
    escalation: {
      immediately: ['slack'],
      after1hour: ['email'],
    },
  },
}

// Alert processing engine
on($.Metrics.updated, async (metrics, $) => {
  for (const [name, rule] of Object.entries(alertRules)) {
    if (rule.condition(metrics)) {
      // Check if alert already exists
      const existing = await db.findOne($.Alert, {
        where: { name, status: 'open' },
      })

      if (!existing) {
        // Create new alert
        const alert = await db.create($.Alert, {
          name,
          severity: rule.severity,
          metrics,
          status: 'open',
          createdAt: new Date(),
        })

        // Immediate escalation
        for (const channel of rule.escalation.immediately) {
          await send($.Alert.notify, {
            channel,
            alert,
            severity: rule.severity,
          })
        }

        // Schedule escalations
        for (const [timing, channels] of Object.entries(rule.escalation)) {
          if (timing !== 'immediately') {
            await $.Schedule.create({
              trigger: timing.replace('after', '+'),
              action: async () => {
                // Check if still open
                const current = await db.get($.Alert, alert.id)
                if (current.status === 'open') {
                  for (const channel of channels) {
                    await send($.Alert.notify, {
                      channel,
                      alert: current,
                      escalated: true,
                    })
                  }
                }
              },
            })
          }
        }
      }
    }
  }
})
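The escalation keys (`after5min`, `after15min`, `after1hour`) encode their delay in the name, which the engine forwards as a trigger string via `timing.replace('after', '+')`. If the scheduler wants a numeric offset instead, the keys can be parsed directly; a minimal sketch (the helper name is hypothetical):

```typescript
// Hypothetical helper: convert an escalation key like 'after5min' into a
// millisecond delay.
function escalationDelayMs(key: string): number {
  const match = key.match(/^after(\d+)(min|hour)$/)
  if (!match) throw new Error(`Unrecognized escalation key: ${key}`)
  const amount = Number(match[1])
  // Minutes or hours, expressed in milliseconds
  return amount * (match[2] === 'min' ? 60_000 : 3_600_000)
}
```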

Incident Management

Structured incident response:

// Incident lifecycle management
on($.Alert.created, async (alert, $) => {
  if (alert.severity === 'critical' || alert.severity === 'high') {
    // Create incident
    const incident = await db.create($.Incident, {
      title: alert.name,
      description: alert.message,
      severity: alert.severity,
      status: 'investigating',
      alertId: alert.id,
      createdAt: new Date(),
    })

    // Create incident channel
    const channel = await $.Slack.createChannel({
      name: `incident-${incident.id}`,
      topic: incident.title,
      members: await getOnCallTeam(),
    })

    // Post to channel
    await $.Slack.postMessage({
      channel: channel.id,
      blocks: [
        {
          type: 'header',
          text: { type: 'plain_text', text: `🚨 Incident #${incident.id}` },
        },
        {
          type: 'section',
          fields: [
            { type: 'mrkdwn', text: `*Severity:* ${incident.severity}` },
            { type: 'mrkdwn', text: `*Status:* ${incident.status}` },
          ],
        },
        {
          type: 'section',
          text: { type: 'mrkdwn', text: incident.description },
        },
      ],
    })

    // Update status page
    if (incident.severity === 'critical') {
      await $.StatusPage.update({
        status: 'major_outage',
        message: incident.title,
        incidentId: incident.id,
      })
    }
  }
})

// Incident resolution
on($.Incident.resolved, async (incident, $) => {
  // Calculate MTTR
  const mttr = incident.resolvedAt.getTime() - incident.createdAt.getTime() // milliseconds

  await $.Metrics.record('mttr', {
    duration: mttr,
    incident: incident.id,
    severity: incident.severity,
  })

  // Update status page
  await $.StatusPage.update({
    status: 'operational',
    message: 'All systems operational',
  })

  // Archive incident channel
  await $.Slack.archiveChannel(`incident-${incident.id}`)

  // Schedule postmortem
  if (incident.severity === 'critical' || incident.severity === 'high') {
    await $.Calendar.schedule({
      event: 'Incident Postmortem',
      time: '+24 hours',
      attendees: incident.responders,
      description: `Postmortem for incident #${incident.id}`,
    })

    // Generate postmortem template
    const postmortem = await $.ai.generate('incident-postmortem', {
      incident,
      timeline: await getIncidentTimeline(incident.id),
      impact: await calculateImpact(incident.id),
    })

    await db.update($.Incident, incident.id, {
      postmortem,
    })
  }
})
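The MTTR calculation above is simply resolution time minus creation time. A small, testable version in minutes:

```typescript
// MTTR in minutes, from the incident timestamps recorded above.
function mttrMinutes(createdAt: Date, resolvedAt: Date): number {
  return (resolvedAt.getTime() - createdAt.getTime()) / 60_000
}
```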

Automated Remediation

Self-healing systems:

// Auto-remediation rules
const remediationRules = {
  // High CPU usage
  highCPU: {
    condition: (metrics) => metrics.cpu > 80,
    action: async () => {
      await $.Infrastructure.scale({
        instances: '+2',
        reason: 'high-cpu-usage',
      })
    },
  },

  // High memory usage
  highMemory: {
    condition: (metrics) => metrics.memory > 85,
    action: async () => {
      // Clear caches
      await cache.flushOldest(0.2) // Remove oldest 20%

      // Trigger garbage collection
      await $.Runtime.gc()

      if ((await $.Metrics.current('memory')) > 85) {
        // Still high, scale up
        await $.Infrastructure.scale({
          instances: '+2',
          reason: 'high-memory-usage',
        })
      }
    },
  },

  // Database connection pool exhausted
  dbConnectionPool: {
    condition: (metrics) => metrics.dbConnections / metrics.dbPoolSize > 0.9,
    action: async () => {
      await $.Database.pool.resize({
        size: '+10',
        reason: 'high-connection-usage',
      })
    },
  },

  // Cache hit rate low
  lowCacheHitRate: {
    condition: (metrics) => metrics.cacheHitRate < 0.7,
    action: async () => {
      // Warm cache with popular items
      await $.Cache.warm({
        patterns: ['popular-products', 'active-users'],
        size: 1000,
      })
    },
  },

  // High error rate on external service
  externalServiceErrors: {
    condition: (metrics) => metrics.externalServiceErrorRate > 0.5,
    action: async (metrics) => {
      // Assumes the metrics payload names the failing service
      const service = metrics.failingService

      // Open circuit breaker
      await $.CircuitBreaker.open(service, {
        duration: '5 minutes',
        fallback: 'cached-response',
      })

      // Alert team
      await send($.Alert.send, {
        type: 'circuit-breaker-opened',
        service,
        errorRate: metrics.externalServiceErrorRate,
      })
    },
  },
}

// Apply auto-remediation
on($.Metrics.updated, async (metrics, $) => {
  for (const [name, rule] of Object.entries(remediationRules)) {
    if (rule.condition(metrics)) {
      try {
        await rule.action(metrics)

        await db.create($.RemediationAction, {
          rule: name,
          metrics,
          status: 'success',
          timestamp: new Date(),
        })
      } catch (error) {
        await send($.Alert.send, {
          type: 'remediation-failed',
          rule: name,
          error: error.message,
        })
      }
    }
  }
})
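The dispatch loop can be separated from its side effects for testing: evaluate every rule against a metrics snapshot and collect the names of the rules that fire. A pure sketch (the `Metrics` shape is illustrative):

```typescript
// Pure version of the dispatch loop above: no actions, just which rules fire.
interface Metrics {
  cpu: number
  memory: number
  cacheHitRate: number
}

interface Rule {
  condition: (metrics: Metrics) => boolean
}

function triggeredRules(rules: Record<string, Rule>, metrics: Metrics): string[] {
  return Object.entries(rules)
    .filter(([, rule]) => rule.condition(metrics))
    .map(([name]) => name)
}
```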

Backup and Recovery

Automated backup with point-in-time recovery:

// Backup strategy
const backupStrategy = {
  // Full daily backups
  daily: {
    schedule: '03:00',
    type: 'full',
    retention: '30 days',
    encrypted: true,
    locations: ['s3-primary', 's3-backup-region'],
  },

  // Hourly incremental backups
  hourly: {
    schedule: 'every hour',
    type: 'incremental',
    retention: '7 days',
    encrypted: true,
  },

  // Continuous WAL archiving
  continuous: {
    type: 'wal',
    retention: '7 days',
    encrypted: true,
  },
}

// Execute backups
on($.Schedule.daily('03:00'), async () => {
  const backup = await $.Database.backup({
    type: 'full',
    encrypt: true,
    compress: true,
  })

  // Upload to multiple locations
  await Promise.all([
    $.S3.upload({
      bucket: 'backups-primary',
      key: `full/${backup.timestamp}.tar.gz`,
      file: backup.path,
    }),
    $.S3.upload({
      bucket: 'backups-backup-region',
      key: `full/${backup.timestamp}.tar.gz`,
      file: backup.path,
    }),
  ])

  // Verify backup
  const verified = await $.Database.verifyBackup(backup.id)

  await send($.Monitoring.backup, {
    status: verified ? 'success' : 'failed',
    size: backup.size,
    duration: backup.duration,
    verified,
  })

  // Test restore (monthly)
  if (new Date().getDate() === 1) {
    const testRestore = await $.Database.restoreTest({
      backup: backup.id,
      environment: 'test',
    })

    await send($.Team.ops, {
      type: 'backup-restore-test',
      success: testRestore.success,
      duration: testRestore.duration,
    })
  }
})

// Point-in-time recovery
const recoverToPoint = async (timestamp: Date) => {
  // 1. Find nearest full backup
  const fullBackup = await db.findOne($.Backup, {
    where: {
      type: 'full',
      timestamp: { lte: timestamp },
    },
    orderBy: { timestamp: 'desc' },
  })

  // 2. Restore full backup
  await $.Database.restore(fullBackup.id)

  // 3. Apply WAL logs
  const walLogs = await db.list($.WALLog, {
    where: {
      timestamp: { gte: fullBackup.timestamp, lte: timestamp },
    },
    orderBy: { timestamp: 'asc' },
  })

  for (const log of walLogs) {
    await $.Database.replayWAL(log.id)
  }

  return {
    recovered: true,
    timestamp,
    fullBackup: fullBackup.id,
    walLogsApplied: walLogs.length,
  }
}
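The selection logic in `recoverToPoint` can be expressed as a pure function: choose the latest full backup at or before the target time, then the WAL segments between that backup and the target, oldest first. A sketch (the types are illustrative):

```typescript
// Pure version of the recovery plan used by recoverToPoint.
interface Snapshot {
  id: string
  timestamp: Date
}

function planRecovery(fullBackups: Snapshot[], walLogs: Snapshot[], target: Date) {
  // Most recent full backup at or before the target time
  const base = fullBackups
    .filter((b) => b.timestamp.getTime() <= target.getTime())
    .sort((a, b) => b.timestamp.getTime() - a.timestamp.getTime())[0]
  if (!base) throw new Error('No full backup precedes the target time')

  // WAL segments between the base backup and the target, in replay order
  const wal = walLogs
    .filter((w) => w.timestamp.getTime() >= base.timestamp.getTime() && w.timestamp.getTime() <= target.getTime())
    .sort((a, b) => a.timestamp.getTime() - b.timestamp.getTime())

  return { base, wal }
}
```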

Cost Management

Track and optimize operational costs:

// Cost tracking
on($.Metrics.cost.updated, async (costs, $) => {
  await db.create($.CostReport, {
    date: new Date(),
    compute: costs.compute,
    storage: costs.storage,
    bandwidth: costs.bandwidth,
    external: costs.external,
    total: costs.total,
  })

  // Forecast costs
  const forecast = await $.ai.forecast('costs', {
    historical: await getCostHistory('30d'),
    periods: 30,
  })

  // Alert on anomalies
  if (costs.total > forecast.expected * 1.2) {
    await send($.Alert.send, {
      type: 'cost-anomaly',
      current: costs.total,
      expected: forecast.expected,
      variance: ((costs.total - forecast.expected) / forecast.expected) * 100,
    })
  }
})
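The anomaly test compares actuals to the forecast with a 20% threshold. A small, testable version of that check (the helper name is hypothetical):

```typescript
// Testable version of the cost-anomaly check above: anomalous when actual
// exceeds the forecast by more than the threshold; variance as a percentage.
function costAnomaly(actual: number, expected: number, threshold = 0.2) {
  return {
    anomalous: actual > expected * (1 + threshold),
    variancePct: ((actual - expected) / expected) * 100,
  }
}
```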

// Cost optimization recommendations
on($.Schedule.weekly, async () => {
  const recommendations = await $.Cost.analyze({
    dimensions: ['compute', 'storage', 'bandwidth', 'unused'],
  })

  const actionable = recommendations.filter((r) => r.savings > 100 && r.risk === 'low')

  for (const rec of actionable) {
    await db.create($.Task, {
      title: `Cost Optimization: ${rec.title}`,
      description: rec.description,
      estimatedSavings: rec.savings,
      effort: rec.effort,
      priority: rec.savings > 500 ? 'high' : 'medium',
    })
  }
})

Operational Runbooks

Database Maintenance

# Vacuum and analyze
do manage db:vacuum
do manage db:analyze

# Reindex
do manage db:reindex

# Check for bloat
do manage db:bloat-check

# Optimize queries
do manage db:slow-queries --optimize

Performance Tuning

# Clear caches
do manage cache:clear --pattern "user:*"

# Warm caches
do manage cache:warm --popular

# Database query analysis
do manage db:explain --query "SELECT..."

# Connection pool management
do manage db:pool:resize --size 100

Security Operations

# Rotate secrets
do manage secrets:rotate --all

# Update SSL certificates
do manage ssl:renew

# Security scan
do manage security:scan

# Access audit
do manage audit:access --period 7d

Best Practices

Do's

  1. Automate everything - Reduce human error
  2. Monitor proactively - Detect before users report
  3. Document runbooks - Standardize responses
  4. Test recovery - Practice disaster recovery
  5. Track metrics - Measure operational health
  6. Regular maintenance - Prevent issues
  7. Learn from incidents - Continuous improvement

Don'ts

  1. Don't ignore alerts - Every alert should be actionable
  2. Don't skip backups - Data loss is catastrophic
  3. Don't perform manual operations - Automate repetitive tasks
  4. Don't accumulate debt - Address technical debt
  5. Don't over-alert - Alert fatigue reduces effectiveness

CLI Tools

# Health check
do manage health

# View incidents
do manage incidents --status open

# Trigger backup
do manage backup --type full

# Cost report
do manage costs --period 30d

# Performance analysis
do manage performance --endpoints all

# Generate runbook
do manage runbook --service api


Management Tip: The best incidents are the ones that never happen. Invest in prevention through automation and monitoring.