Monitoring and Observability Guide
Complete guide to implementing monitoring, logging, tracing, alerting, and dashboards for Services-as-Software
Learn how to build comprehensive monitoring and observability systems for your Services-as-Software, ensuring visibility into service health, performance, and customer experience.
Why Monitoring and Observability Matter
Effective monitoring and observability are critical for Services-as-Software:
- Proactive Issue Detection - Catch problems before customers experience them
- Performance Optimization - Identify bottlenecks and optimization opportunities
- SLA Compliance - Track and enforce service level agreements
- Customer Experience - Understand and improve customer satisfaction
- Business Intelligence - Drive product and business decisions with data
- Incident Response - Quickly diagnose and resolve issues
- Cost Management - Optimize resource usage and costs
Service Metrics
Basic Metrics Collection
Collect fundamental service metrics:
import { $, db, on, send } from 'sdk.do'
class MetricsCollector {
async recordExecution(execution: any) {
// Record execution metrics
await db.create($.Metric, {
serviceId: execution.serviceId,
customerId: execution.customerId,
timestamp: new Date(),
type: 'execution',
metrics: {
duration: execution.duration,
status: execution.status,
inputSize: execution.inputSize,
outputSize: execution.outputSize,
tokensUsed: execution.tokensUsed,
cost: execution.cost,
},
tags: {
environment: process.env.ENVIRONMENT,
version: execution.serviceVersion,
region: execution.region,
},
})
// Update aggregates
await this.updateAggregates(execution)
}
private async updateAggregates(execution: any) {
const now = new Date()
// Update hourly aggregates
await db.increment($.MetricAggregate, {
where: {
serviceId: execution.serviceId,
period: 'hour',
timestamp: new Date(now.getFullYear(), now.getMonth(), now.getDate(), now.getHours()),
},
data: {
totalExecutions: 1,
successfulExecutions: execution.status === 'completed' ? 1 : 0,
failedExecutions: execution.status === 'failed' ? 1 : 0,
totalDuration: execution.duration,
totalCost: execution.cost,
},
})
// Update daily aggregates
await db.increment($.MetricAggregate, {
where: {
serviceId: execution.serviceId,
period: 'day',
timestamp: new Date(now.getFullYear(), now.getMonth(), now.getDate()),
},
data: {
totalExecutions: 1,
successfulExecutions: execution.status === 'completed' ? 1 : 0,
failedExecutions: execution.status === 'failed' ? 1 : 0,
totalDuration: execution.duration,
totalCost: execution.cost,
},
})
}
}
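The aggregate rows above are keyed by timestamps truncated to the start of the hour or day. A small helper makes that bucketing explicit; this is a sketch, and the `bucketTimestamp` name is ours, not part of sdk.do:

```typescript
type Period = 'hour' | 'day'

// Truncate a timestamp to the start of its bucket, matching the keys
// used for the MetricAggregate rows above. Hypothetical helper, not an SDK API.
function bucketTimestamp(date: Date, period: Period): Date {
  if (period === 'hour') {
    return new Date(date.getFullYear(), date.getMonth(), date.getDate(), date.getHours())
  }
  return new Date(date.getFullYear(), date.getMonth(), date.getDate())
}
```

Keying aggregates on the truncated timestamp means every execution in the same hour increments the same row, which keeps the aggregate table small.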
// Collect metrics on every execution
on($.ServiceExecution.complete, async (execution) => {
const collector = new MetricsCollector()
await collector.recordExecution(execution)
})
Advanced Metrics
Track detailed performance metrics:
class AdvancedMetrics {
async recordDetailedMetrics(execution: any) {
// Performance metrics
const performanceMetrics = {
// Timing breakdown
timing: {
queueTime: execution.startTime - execution.createdTime,
executionTime: execution.endTime - execution.startTime,
totalTime: execution.endTime - execution.createdTime,
},
// Resource usage
resources: {
cpuTime: execution.cpuTime,
memoryPeak: execution.memoryPeak,
memoryAverage: execution.memoryAverage,
networkIn: execution.networkIn,
networkOut: execution.networkOut,
},
// AI/ML metrics
ai: {
model: execution.model,
tokensInput: execution.tokensInput,
tokensOutput: execution.tokensOutput,
tokensTotal: execution.tokensTotal,
temperature: execution.temperature,
quality: execution.qualityScore,
},
// Business metrics
business: {
revenue: execution.revenue,
cost: execution.cost,
margin: execution.revenue - execution.cost,
marginPercent: execution.revenue > 0 ? ((execution.revenue - execution.cost) / execution.revenue) * 100 : 0,
},
// Quality metrics
quality: {
accuracy: execution.accuracy,
completeness: execution.completeness,
relevance: execution.relevance,
customerSatisfaction: execution.customerSatisfaction,
},
}
await db.create($.DetailedMetric, {
serviceId: execution.serviceId,
executionId: execution.id,
timestamp: new Date(),
metrics: performanceMetrics,
})
return performanceMetrics
}
async calculateDerivedMetrics(serviceId: string, period: { start: Date; end: Date }) {
// Get raw metrics
const executions = await db.list($.ServiceExecution, {
where: {
serviceId,
timestamp: { gte: period.start, lte: period.end },
},
})
if (executions.length === 0) {
return null
}
// Calculate derived metrics
const successful = executions.filter((e) => e.status === 'completed')
const failed = executions.filter((e) => e.status === 'failed')
const durations = successful.map((e) => e.duration).sort((a, b) => a - b)
return {
// Volume
totalRequests: executions.length,
// Availability
availability: successful.length / executions.length,
uptime: (successful.length / executions.length) * 100,
// Performance percentiles
p50: durations[Math.floor(durations.length * 0.5)],
p75: durations[Math.floor(durations.length * 0.75)],
p90: durations[Math.floor(durations.length * 0.9)],
p95: durations[Math.floor(durations.length * 0.95)],
p99: durations[Math.floor(durations.length * 0.99)],
// Throughput
throughput: executions.length / ((period.end.getTime() - period.start.getTime()) / 1000),
requestsPerSecond: executions.length / ((period.end.getTime() - period.start.getTime()) / 1000),
// Error rates
errorRate: failed.length / executions.length,
errorCount: failed.length,
// Quality
averageQuality: successful.length ? successful.reduce((sum, e) => sum + (e.qualityScore || 0), 0) / successful.length : 0,
// Business
totalRevenue: executions.reduce((sum, e) => sum + (e.revenue || 0), 0),
totalCost: executions.reduce((sum, e) => sum + (e.cost || 0), 0),
averageMargin: successful.length ? successful.reduce((sum, e) => sum + (e.revenue > 0 ? ((e.revenue - e.cost) / e.revenue) * 100 : 0), 0) / successful.length : 0,
}
}
}
Logging Strategies
Structured Logging
Implement structured logging:
class ServiceLogger {
private context: Record<string, any>
constructor(context: Record<string, any>) {
this.context = context
}
async log(level: string, message: string, data?: Record<string, any>) {
const logEntry = {
timestamp: new Date().toISOString(),
level,
message,
context: this.context,
data: data || {},
environment: process.env.ENVIRONMENT,
version: process.env.VERSION,
}
// Write to database
await db.create($.LogEntry, logEntry)
// Also log to console in development
if (process.env.ENVIRONMENT === 'development') {
console.log(JSON.stringify(logEntry, null, 2))
}
// Send critical logs to alerting system
if (level === 'error' || level === 'critical') {
await this.sendAlert(logEntry)
}
}
async info(message: string, data?: Record<string, any>) {
await this.log('info', message, data)
}
async warn(message: string, data?: Record<string, any>) {
await this.log('warn', message, data)
}
async error(message: string, error?: Error, data?: Record<string, any>) {
await this.log('error', message, {
...data,
error: error
? {
message: error.message,
stack: error.stack,
name: error.name,
}
: undefined,
})
}
async critical(message: string, error?: Error, data?: Record<string, any>) {
await this.log('critical', message, {
...data,
error: error
? {
message: error.message,
stack: error.stack,
name: error.name,
}
: undefined,
})
}
private async sendAlert(logEntry: any) {
await send($.Alert.create, {
type: 'log-alert',
severity: logEntry.level,
message: logEntry.message,
context: logEntry.context,
data: logEntry.data,
})
}
}
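Production loggers typically filter by a minimum level before writing anything. A minimal, standalone sketch of level filtering (the numeric ordering here is an assumption, not an sdk.do convention):

```typescript
// Ordered severity map; higher numbers are more severe. Hypothetical values.
const LEVELS: Record<string, number> = { debug: 10, info: 20, warn: 30, error: 40, critical: 50 }

// Return true when a message at `level` meets the configured minimum level.
function shouldLog(level: string, minLevel: string): boolean {
  return (LEVELS[level] ?? 0) >= (LEVELS[minLevel] ?? 0)
}
```

A check like this at the top of `log()` lets you run with `minLevel: 'info'` in production while keeping debug output in development.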
// Usage in service execution
on($.ServiceRequest.created, async (request) => {
const logger = new ServiceLogger({
serviceId: request.serviceId,
requestId: request.id,
customerId: request.customerId,
})
await logger.info('Service request received', {
inputs: request.inputs,
})
try {
await logger.info('Starting service execution')
const result = await executeService(request)
await logger.info('Service execution completed', {
duration: result.duration,
outputs: result.outputs,
})
} catch (error) {
await logger.error('Service execution failed', error as Error, {
inputs: request.inputs,
})
}
})
Log Aggregation
Aggregate and search logs:
class LogAggregator {
async searchLogs(query: { serviceId?: string; customerId?: string; level?: string; startTime?: Date; endTime?: Date; message?: string; limit?: number }) {
const where: any = {}
if (query.serviceId) where['context.serviceId'] = query.serviceId
if (query.customerId) where['context.customerId'] = query.customerId
if (query.level) where.level = query.level
if (query.startTime || query.endTime) {
where.timestamp = {}
if (query.startTime) where.timestamp.gte = query.startTime
if (query.endTime) where.timestamp.lte = query.endTime
}
if (query.message) where.message = { contains: query.message }
return await db.list($.LogEntry, {
where,
orderBy: { timestamp: 'desc' },
take: query.limit || 100,
})
}
async getErrorLogs(serviceId: string, hours: number = 24) {
const startTime = new Date(Date.now() - hours * 60 * 60 * 1000)
return await this.searchLogs({
serviceId,
level: 'error',
startTime,
})
}
async analyzeLogPatterns(serviceId: string, period: { start: Date; end: Date }) {
const logs = await this.searchLogs({
serviceId,
startTime: period.start,
endTime: period.end,
limit: 10000,
})
// Analyze patterns
const patterns = {
errorsByType: this.groupBy(
logs.filter((l) => l.level === 'error'),
'data.error.name'
),
errorsByMessage: this.groupBy(
logs.filter((l) => l.level === 'error'),
'message'
),
logsByLevel: this.groupBy(logs, 'level'),
logsByHour: this.groupByHour(logs),
}
return patterns
}
private groupBy(items: any[], key: string) {
return items.reduce((acc, item) => {
const value = this.getNestedValue(item, key) || 'unknown'
acc[value] = (acc[value] || 0) + 1
return acc
}, {})
}
private groupByHour(items: any[]) {
return items.reduce((acc, item) => {
const hour = new Date(item.timestamp).getHours()
acc[hour] = (acc[hour] || 0) + 1
return acc
}, {})
}
private getNestedValue(obj: any, path: string) {
return path.split('.').reduce((current, key) => current?.[key], obj)
}
}
Tracing Implementation
Distributed Tracing
Implement distributed tracing:
import { v4 as uuidv4 } from 'uuid'
class TracingSystem {
async startTrace(name: string, metadata?: Record<string, any>) {
const trace = {
id: uuidv4(),
name,
startTime: Date.now(),
metadata: metadata || {},
spans: [],
}
// Store trace
await db.create($.Trace, trace)
return trace
}
async startSpan(traceId: string, name: string, parentSpanId?: string, metadata?: Record<string, any>) {
const span = {
id: uuidv4(),
traceId,
parentSpanId,
name,
startTime: Date.now(),
metadata: metadata || {},
}
// Store span
await db.create($.Span, span)
return span
}
async endSpan(spanId: string, result?: any, error?: Error) {
const span = await db.findOne($.Span, {
where: { id: spanId },
})
const endTime = Date.now()
await db.update($.Span, {
where: { id: spanId },
data: {
endTime,
duration: endTime - span.startTime,
result,
error: error
? {
message: error.message,
stack: error.stack,
}
: undefined,
status: error ? 'error' : 'success',
},
})
}
async endTrace(traceId: string) {
const trace = await db.findOne($.Trace, {
where: { id: traceId },
include: { spans: true },
})
const endTime = Date.now()
await db.update($.Trace, {
where: { id: traceId },
data: {
endTime,
duration: endTime - trace.startTime,
totalSpans: trace.spans.length,
},
})
}
async getTrace(traceId: string) {
return await db.findOne($.Trace, {
where: { id: traceId },
include: {
spans: {
orderBy: { startTime: 'asc' },
},
},
})
}
async analyzeTrace(traceId: string) {
const trace = await this.getTrace(traceId)
if (!trace) return null
// Build span tree
const spanTree = this.buildSpanTree(trace.spans)
// Calculate critical path
const criticalPath = this.findCriticalPath(spanTree)
// Identify slow spans
const slowSpans = trace.spans.filter((s) => s.duration > 1000).sort((a, b) => b.duration - a.duration)
return {
trace,
spanTree,
criticalPath,
slowSpans,
totalDuration: trace.duration,
spanCount: trace.spans.length,
}
}
private buildSpanTree(spans: any[]) {
const spanMap = new Map(spans.map((s) => [s.id, { ...s, children: [] }]))
const roots: any[] = []
for (const span of spanMap.values()) {
if (span.parentSpanId) {
const parent = spanMap.get(span.parentSpanId)
if (parent) {
parent.children.push(span)
}
} else {
roots.push(span)
}
}
return roots
}
private findCriticalPath(spanTree: any[]): any[] {
// Find longest path through span tree
let longestPath: any[] = []
let longestDuration = 0
const traverse = (node: any, path: any[], duration: number) => {
const currentPath = [...path, node]
const currentDuration = duration + node.duration
if (node.children.length === 0) {
if (currentDuration > longestDuration) {
longestDuration = currentDuration
longestPath = currentPath
}
} else {
for (const child of node.children) {
traverse(child, currentPath, currentDuration)
}
}
}
for (const root of spanTree) {
traverse(root, [], 0)
}
return longestPath
}
}
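Rebuilding the full span tree is overkill when you just want to see where a trace spends its time. A flat per-name breakdown is often enough; a standalone sketch over the span shape used above:

```typescript
interface SpanTiming {
  name: string
  duration: number // milliseconds
}

// Share of total trace time attributed to each span name, in percent.
// Overlapping spans are counted independently, so shares can exceed 100.
function spanBreakdown(spans: SpanTiming[], totalDuration: number): Record<string, number> {
  const shares: Record<string, number> = {}
  for (const span of spans) {
    shares[span.name] = (shares[span.name] || 0) + (span.duration / totalDuration) * 100
  }
  return shares
}
```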
// Use tracing in service execution
on($.ServiceRequest.created, async (request) => {
const tracing = new TracingSystem()
// Start trace
const trace = await tracing.startTrace('service-execution', {
serviceId: request.serviceId,
requestId: request.id,
})
try {
// Research span
const researchSpan = await tracing.startSpan(trace.id, 'research')
const research = await performResearch(request)
await tracing.endSpan(researchSpan.id, research)
// Outline span
const outlineSpan = await tracing.startSpan(trace.id, 'outline', researchSpan.id)
const outline = await createOutline(research)
await tracing.endSpan(outlineSpan.id, outline)
// Content span with nested spans
const contentSpan = await tracing.startSpan(trace.id, 'content', outlineSpan.id)
// AI generation nested span
const aiSpan = await tracing.startSpan(trace.id, 'ai-generation', contentSpan.id)
const content = await generateContent(outline)
await tracing.endSpan(aiSpan.id, content)
// Quality check nested span
const qualitySpan = await tracing.startSpan(trace.id, 'quality-check', contentSpan.id)
const quality = await checkQuality(content)
await tracing.endSpan(qualitySpan.id, quality)
await tracing.endSpan(contentSpan.id, content)
// End trace
await tracing.endTrace(trace.id)
} catch (error) {
// End the trace even on failure so partial spans are preserved
await tracing.endTrace(trace.id)
throw error
}
})
Alerting Setup
Alert Configuration
Configure comprehensive alerting:
class AlertingSystem {
async configureAlerts(serviceId: string) {
// Performance alerts
await this.createAlert({
serviceId,
name: 'High Response Time',
condition: {
metric: 'p95_duration',
operator: 'greater_than',
threshold: 10000, // 10 seconds
window: '5m',
},
severity: 'warning',
actions: ['email', 'slack'],
})
await this.createAlert({
serviceId,
name: 'Very High Response Time',
condition: {
metric: 'p99_duration',
operator: 'greater_than',
threshold: 30000, // 30 seconds
window: '5m',
},
severity: 'critical',
actions: ['email', 'slack', 'pagerduty'],
})
// Error rate alerts
await this.createAlert({
serviceId,
name: 'Elevated Error Rate',
condition: {
metric: 'error_rate',
operator: 'greater_than',
threshold: 0.05, // 5%
window: '5m',
},
severity: 'warning',
actions: ['email', 'slack'],
})
await this.createAlert({
serviceId,
name: 'High Error Rate',
condition: {
metric: 'error_rate',
operator: 'greater_than',
threshold: 0.1, // 10%
window: '5m',
},
severity: 'critical',
actions: ['email', 'slack', 'pagerduty'],
})
// Availability alerts
await this.createAlert({
serviceId,
name: 'Low Availability',
condition: {
metric: 'availability',
operator: 'less_than',
threshold: 0.99, // 99%
window: '1h',
},
severity: 'critical',
actions: ['email', 'slack', 'pagerduty'],
})
// Business metric alerts
await this.createAlert({
serviceId,
name: 'Revenue Drop',
condition: {
metric: 'revenue',
operator: 'decrease_by',
threshold: 0.2, // 20% decrease
window: '1h',
comparison: 'previous_period',
},
severity: 'warning',
actions: ['email'],
})
// Quality alerts
await this.createAlert({
serviceId,
name: 'Quality Degradation',
condition: {
metric: 'quality_score',
operator: 'less_than',
threshold: 0.9, // 90%
window: '1h',
},
severity: 'warning',
actions: ['email', 'slack'],
})
}
private async createAlert(config: any) {
return await db.create($.Alert, config)
}
async evaluateAlerts(serviceId: string) {
const alerts = await db.list($.Alert, {
where: { serviceId, enabled: true },
})
for (const alert of alerts) {
const triggered = await this.evaluateCondition(alert.condition, serviceId)
if (triggered) {
await this.triggerAlert(alert)
}
}
}
private async evaluateCondition(condition: any, serviceId: string): Promise<boolean> {
// Get metric value
const value = await this.getMetricValue(serviceId, condition.metric, condition.window)
// Evaluate condition
switch (condition.operator) {
case 'greater_than':
return value > condition.threshold
case 'less_than':
return value < condition.threshold
case 'equals':
return value === condition.threshold
case 'decrease_by': {
const previousValue = await this.getPreviousMetricValue(serviceId, condition.metric, condition.window)
if (previousValue === 0) return false
const decrease = (previousValue - value) / previousValue
return decrease > condition.threshold
}
default:
return false
}
}
private async getMetricValue(serviceId: string, metric: string, window: string): Promise<number> {
const windowMs = this.parseWindow(window)
const startTime = new Date(Date.now() - windowMs)
const aggregate = await db.findOne($.MetricAggregate, {
where: {
serviceId,
timestamp: { gte: startTime },
},
})
return aggregate?.metrics[metric] || 0
}
private async getPreviousMetricValue(serviceId: string, metric: string, window: string): Promise<number> {
const windowMs = this.parseWindow(window)
const endTime = new Date(Date.now() - windowMs)
const startTime = new Date(endTime.getTime() - windowMs)
const aggregate = await db.findOne($.MetricAggregate, {
where: {
serviceId,
timestamp: { gte: startTime, lte: endTime },
},
})
return aggregate?.metrics[metric] || 0
}
private parseWindow(window: string): number {
const match = window.match(/^(\d+)([smhd])$/)
if (!match) return 300000 // default 5 minutes
const value = parseInt(match[1])
const unit = match[2]
const multipliers: Record<string, number> = {
s: 1000,
m: 60000,
h: 3600000,
d: 86400000,
}
return value * multipliers[unit]
}
private async triggerAlert(alert: any) {
// Create alert instance
const instance = await db.create($.AlertInstance, {
alertId: alert.id,
serviceId: alert.serviceId,
timestamp: new Date(),
severity: alert.severity,
message: this.formatAlertMessage(alert),
})
// Execute alert actions
for (const action of alert.actions) {
await this.executeAlertAction(action, instance)
}
}
private formatAlertMessage(alert: any): string {
return `${alert.name}: ${alert.condition.metric} ${alert.condition.operator} ${alert.condition.threshold}`
}
private async executeAlertAction(action: string, instance: any) {
switch (action) {
case 'email':
await send($.Email.send, {
to: '[email protected]',
subject: `Alert: ${instance.message}`,
template: 'alert',
data: instance,
})
break
case 'slack':
await send($.Slack.sendMessage, {
channel: '#alerts',
text: instance.message,
data: instance,
})
break
case 'pagerduty':
await send($.PagerDuty.createIncident, {
title: instance.message,
severity: instance.severity,
data: instance,
})
break
}
}
}
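Because evaluateAlerts runs every minute, a condition that stays true would re-notify on every pass. A common remedy is to require the condition to hold for several consecutive evaluations and to suppress repeats within a cooldown window; a self-contained sketch (the function and type names are ours):

```typescript
interface AlertState {
  consecutive: number // how many checks in a row the condition has held
  lastFiredAt: number | null // epoch ms of the last notification
}

// Decide whether to fire: the condition must hold for `required`
// consecutive checks, and we notify at most once per `cooldownMs`.
function evaluateSustained(
  state: AlertState,
  conditionMet: boolean,
  now: number,
  required = 3,
  cooldownMs = 15 * 60 * 1000
): boolean {
  state.consecutive = conditionMet ? state.consecutive + 1 : 0
  if (state.consecutive < required) return false
  if (state.lastFiredAt !== null && now - state.lastFiredAt < cooldownMs) return false
  state.lastFiredAt = now
  return true
}
```

Wiring this in front of triggerAlert turns one-minute evaluation into "alert on sustained issues, not transient blips", which is exactly the best practice described later in this guide.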
// Run alert evaluation periodically
on($.Schedule.everyMinute, async () => {
const alerting = new AlertingSystem()
const services = await db.list($.Service, {
where: { status: 'active' },
})
for (const service of services) {
await alerting.evaluateAlerts(service.id)
}
})
Performance Monitoring
Real-Time Performance Tracking
Track performance in real-time:
class PerformanceMonitor {
async trackPerformance(execution: any) {
// Calculate performance score
const score = await this.calculatePerformanceScore(execution)
// Record performance
await db.create($.PerformanceMetric, {
serviceId: execution.serviceId,
executionId: execution.id,
timestamp: new Date(),
score,
metrics: {
duration: execution.duration,
cpuTime: execution.cpuTime,
memoryPeak: execution.memoryPeak,
latency: execution.latency,
},
})
// Check for performance issues
if (score < 0.7) {
await this.investigatePerformance(execution)
}
return score
}
private async calculatePerformanceScore(execution: any): Promise<number> {
// Get service SLA targets
const service = await db.findOne($.Service, {
where: { id: execution.serviceId },
})
const targetDuration = service.sla.responseTime
const actualDuration = execution.duration
// Score based on how close to target
const durationScore = Math.max(0, 1 - actualDuration / (targetDuration * 2))
// Factor in other metrics
const cpuScore = this.scoreResource(execution.cpuTime, 1000) // Target 1 second CPU
const memoryScore = this.scoreResource(execution.memoryPeak, 512 * 1024 * 1024) // Target 512MB
// Weighted average
return durationScore * 0.5 + cpuScore * 0.25 + memoryScore * 0.25
}
private scoreResource(actual: number, target: number): number {
return Math.max(0, 1 - actual / (target * 2))
}
private async investigatePerformance(execution: any) {
// Get trace for execution
const trace = await db.findOne($.Trace, {
where: { 'metadata.executionId': execution.id },
include: { spans: true },
})
if (!trace) return
// Analyze trace
const tracing = new TracingSystem()
const analysis = await tracing.analyzeTrace(trace.id)
if (!analysis) return
// Identify bottlenecks
const bottlenecks = analysis.slowSpans.map((span) => ({
name: span.name,
duration: span.duration,
percentage: (span.duration / trace.duration) * 100,
}))
// Create performance report
await db.create($.PerformanceReport, {
serviceId: execution.serviceId,
executionId: execution.id,
timestamp: new Date(),
bottlenecks,
recommendations: this.generateRecommendations(bottlenecks),
})
}
private generateRecommendations(bottlenecks: any[]): string[] {
const recommendations: string[] = []
for (const bottleneck of bottlenecks) {
if (bottleneck.name.includes('ai') && bottleneck.duration > 5000) {
recommendations.push('Consider using a faster AI model or caching responses')
}
if (bottleneck.name.includes('database') && bottleneck.duration > 1000) {
recommendations.push('Optimize database queries or add indexes')
}
if (bottleneck.name.includes('network') && bottleneck.duration > 2000) {
recommendations.push('Consider caching external API responses')
}
}
return recommendations
}
}
Error Tracking
Comprehensive Error Tracking
Track and categorize errors:
class ErrorTracker {
async trackError(error: Error, context: any) {
// Categorize error
const category = this.categorizeError(error)
// Create error record
const errorRecord = await db.create($.ErrorRecord, {
serviceId: context.serviceId,
executionId: context.executionId,
timestamp: new Date(),
category,
message: error.message,
stack: error.stack,
context,
fingerprint: this.generateFingerprint(error),
})
// Check for error spike
await this.checkErrorSpike(context.serviceId)
// Group similar errors
await this.groupSimilarErrors(errorRecord)
return errorRecord
}
private categorizeError(error: Error): string {
if (error.message.includes('timeout')) return 'timeout'
if (error.message.includes('rate limit')) return 'rate-limit'
if (error.message.includes('API')) return 'api-error'
if (error.message.includes('validation')) return 'validation'
if (error.message.includes('auth')) return 'authentication'
return 'unknown'
}
private generateFingerprint(error: Error): string {
// Generate consistent fingerprint for grouping
const signature = `${error.name}:${error.message}:${error.stack?.split('\n')[1]}`
return this.hash(signature)
}
private hash(str: string): string {
// Simple hash function
let hash = 0
for (let i = 0; i < str.length; i++) {
const char = str.charCodeAt(i)
hash = (hash << 5) - hash + char
hash = hash & hash
}
return hash.toString(36)
}
private async checkErrorSpike(serviceId: string) {
const lastHour = new Date(Date.now() - 60 * 60 * 1000)
const recentErrors = await db.count($.ErrorRecord, {
where: {
serviceId,
timestamp: { gte: lastHour },
},
})
const previousHour = new Date(Date.now() - 2 * 60 * 60 * 1000)
const previousErrors = await db.count($.ErrorRecord, {
where: {
serviceId,
timestamp: { gte: previousHour, lt: lastHour },
},
})
// Alert if errors doubled
if (recentErrors > previousErrors * 2 && recentErrors > 10) {
await send($.Alert.create, {
type: 'error-spike',
serviceId,
message: previousErrors > 0 ? `Error rate increased ${((recentErrors / previousErrors - 1) * 100).toFixed(0)}%` : `Errors spiked to ${recentErrors} in the last hour`,
data: {
recentErrors,
previousErrors,
},
})
}
}
private async groupSimilarErrors(errorRecord: any) {
// Find existing group
let group = await db.findOne($.ErrorGroup, {
where: {
fingerprint: errorRecord.fingerprint,
resolved: false,
},
})
if (!group) {
// Create new group
group = await db.create($.ErrorGroup, {
fingerprint: errorRecord.fingerprint,
firstSeen: errorRecord.timestamp,
lastSeen: errorRecord.timestamp,
count: 1,
category: errorRecord.category,
message: errorRecord.message,
resolved: false,
})
} else {
// Update existing group
await db.update($.ErrorGroup, {
where: { id: group.id },
data: {
lastSeen: errorRecord.timestamp,
count: { increment: 1 },
},
})
}
// Link error to group
await db.update($.ErrorRecord, {
where: { id: errorRecord.id },
data: { groupId: group.id },
})
}
}
SLA Monitoring
SLA Compliance Tracking
Track SLA compliance:
class SLAMonitor {
async checkSLACompliance(serviceId: string, period: { start: Date; end: Date }) {
// Get service SLAs
const service = await db.findOne($.Service, {
where: { id: serviceId },
})
if (!service.sla) return null
// Get executions
const executions = await db.list($.ServiceExecution, {
where: {
serviceId,
timestamp: { gte: period.start, lte: period.end },
},
})
// Calculate SLA metrics
const metrics = await this.calculateSLAMetrics(executions, service.sla)
// Check compliance
const compliance = {
responseTime: metrics.p95Duration <= service.sla.responseTime,
availability: metrics.availability >= service.sla.availability / 100,
accuracy: metrics.accuracy >= service.sla.accuracy,
overall:
metrics.p95Duration <= service.sla.responseTime && metrics.availability >= service.sla.availability / 100 && metrics.accuracy >= service.sla.accuracy,
}
// Record SLA report
const report = await db.create($.SLAReport, {
serviceId,
period,
sla: service.sla,
metrics,
compliance,
timestamp: new Date(),
})
// Alert on SLA violations
if (!compliance.overall) {
await this.handleSLAViolation(service, report)
}
return report
}
private async calculateSLAMetrics(executions: any[], sla: any) {
const successful = executions.filter((e) => e.status === 'completed')
const durations = successful.map((e) => e.duration).sort((a, b) => a - b)
return {
totalExecutions: executions.length,
successfulExecutions: successful.length,
failedExecutions: executions.length - successful.length,
availability: successful.length / executions.length,
p50Duration: durations[Math.floor(durations.length * 0.5)],
p95Duration: durations[Math.floor(durations.length * 0.95)],
p99Duration: durations[Math.floor(durations.length * 0.99)],
accuracy: successful.reduce((sum, e) => sum + (e.qualityScore || 1), 0) / successful.length,
}
}
private async handleSLAViolation(service: any, report: any) {
await send($.Alert.create, {
type: 'sla-violation',
serviceId: service.id,
severity: 'critical',
message: `SLA violation detected for ${service.name}`,
data: {
sla: service.sla,
metrics: report.metrics,
compliance: report.compliance,
},
})
// Notify customers if in SLA agreement
const customers = await db.list($.Customer, {
where: {
subscriptions: {
some: {
serviceId: service.id,
slaAgreement: true,
},
},
},
})
for (const customer of customers) {
await send($.Email.send, {
to: customer.email,
subject: 'Service Level Agreement Notification',
template: 'sla-violation',
data: {
service,
report,
},
})
}
}
}
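An availability target translates directly into an error budget, which is often the easier quantity to reason about when setting SLA thresholds. A quick standalone helper (a sketch, not part of the monitor above):

```typescript
// Minutes of allowed downtime in a period for a given availability target.
// For example, 99.9% over 30 days allows roughly 43.2 minutes of downtime.
function errorBudgetMinutes(availabilityPercent: number, periodDays: number): number {
  const totalMinutes = periodDays * 24 * 60
  return totalMinutes * (1 - availabilityPercent / 100)
}
```

Comparing downtime actually observed against this budget tells you how much headroom remains before the SLA in the report above is breached.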
// Check SLA compliance daily
on($.Schedule.daily, async () => {
const monitor = new SLAMonitor()
const services = await db.list($.Service, {
where: { status: 'active', sla: { not: null } },
})
const yesterday = {
start: new Date(Date.now() - 24 * 60 * 60 * 1000),
end: new Date(),
}
for (const service of services) {
await monitor.checkSLACompliance(service.id, yesterday)
}
})
Dashboard Creation
Comprehensive Service Dashboard
Create real-time dashboards:
class DashboardBuilder {
async buildServiceDashboard(serviceId: string) {
const now = new Date()
const last24h = new Date(now.getTime() - 24 * 60 * 60 * 1000)
// Get metrics
const metrics = new AdvancedMetrics()
const derived = await metrics.calculateDerivedMetrics(serviceId, {
start: last24h,
end: now,
})
// Get errors
const errors = await db.list($.ErrorGroup, {
where: {
serviceId,
resolved: false,
},
orderBy: { count: 'desc' },
take: 10,
})
// Get SLA status
const slaMonitor = new SLAMonitor()
const slaReport = await slaMonitor.checkSLACompliance(serviceId, {
start: last24h,
end: now,
})
// Build dashboard
return {
overview: {
status: this.getServiceStatus(derived, slaReport),
uptime: derived.uptime,
requestsToday: derived.totalRequests,
errorRate: derived.errorRate,
},
performance: {
p50: derived.p50,
p95: derived.p95,
p99: derived.p99,
throughput: derived.throughput,
chart: await this.getPerformanceChart(serviceId, last24h, now),
},
reliability: {
availability: derived.availability,
errorCount: derived.errorCount,
topErrors: errors,
},
sla: slaReport
? {
compliant: slaReport.compliance.overall,
responseTime: {
target: slaReport.sla.responseTime,
actual: slaReport.metrics.p95Duration,
compliant: slaReport.compliance.responseTime,
},
availability: {
target: slaReport.sla.availability,
actual: slaReport.metrics.availability * 100,
compliant: slaReport.compliance.availability,
},
accuracy: {
target: slaReport.sla.accuracy,
actual: slaReport.metrics.accuracy,
compliant: slaReport.compliance.accuracy,
},
}
: null,
business: {
revenue: derived.totalRevenue,
cost: derived.totalCost,
margin: derived.averageMargin,
customers: await this.getActiveCustomers(serviceId, last24h),
},
recent: {
errors: await this.getRecentErrors(serviceId, 10),
slowRequests: await this.getSlowRequests(serviceId, 10),
},
}
}
private getServiceStatus(metrics: any, slaReport: any): string {
if (slaReport && !slaReport.compliance.overall) return 'degraded'
if (metrics.errorRate > 0.05) return 'warning'
if (metrics.availability < 0.99) return 'warning'
return 'healthy'
}
private async getPerformanceChart(serviceId: string, start: Date, end: Date) {
// Get hourly aggregates
const aggregates = await db.list($.MetricAggregate, {
where: {
serviceId,
period: 'hour',
timestamp: { gte: start, lte: end },
},
orderBy: { timestamp: 'asc' },
})
return aggregates.map((a) => ({
timestamp: a.timestamp,
avgDuration: a.metrics.totalDuration / a.metrics.totalExecutions,
requests: a.metrics.totalExecutions,
errors: a.metrics.failedExecutions,
}))
}
private async getActiveCustomers(serviceId: string, since: Date) {
return await db.count($.ServiceExecution, {
where: {
serviceId,
timestamp: { gte: since },
},
distinct: ['customerId'],
})
}
private async getRecentErrors(serviceId: string, limit: number) {
return await db.list($.ErrorRecord, {
where: { serviceId },
orderBy: { timestamp: 'desc' },
take: limit,
})
}
private async getSlowRequests(serviceId: string, limit: number) {
return await db.list($.ServiceExecution, {
where: { serviceId },
orderBy: { duration: 'desc' },
take: limit,
})
}
}
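One subtlety: getPerformanceChart divides total duration by execution count, which yields NaN for an hour with no executions. A standalone version of that chart-point mapping with a zero guard (the aggregate shape is assumed from the examples above):

```typescript
interface HourlyAggregate {
  timestamp: Date
  metrics: { totalDuration: number; totalExecutions: number; failedExecutions: number }
}

// Map an aggregate row to a chart point, guarding the empty-hour case.
function toChartPoint(a: HourlyAggregate) {
  const n = a.metrics.totalExecutions
  return {
    timestamp: a.timestamp,
    avgDuration: n > 0 ? a.metrics.totalDuration / n : 0,
    requests: n,
    errors: a.metrics.failedExecutions,
  }
}
```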
// Create dashboard API endpoint
on($.API.request, async (req) => {
if (req.path === '/api/dashboard/:serviceId') {
const builder = new DashboardBuilder()
const dashboard = await builder.buildServiceDashboard(req.params.serviceId)
return { status: 200, body: dashboard }
}
})
Best Practices
1. Instrument Everything
Add observability to all service operations:
// ✅ Good: Comprehensive instrumentation
const logger = new ServiceLogger(context)
const metrics = new MetricsCollector()
const tracing = new TracingSystem()
await logger.info('Starting operation')
const trace = await tracing.startTrace('operation')
const result = await performOperation()
await metrics.recordExecution(result)
await tracing.endTrace(trace.id)
2. Use Structured Logs
Always use structured logging:
// ✅ Good: Structured
logger.info('Request completed', {
duration: 1234,
status: 'success',
userId: 'user-123',
})
// ❌ Bad: Unstructured
console.log('Request completed in 1234ms for user-123')
3. Set Appropriate Alerts
Configure meaningful alerts:
// ✅ Good: Actionable alerts
// Alert on sustained issues, not transient blips
// Use appropriate thresholds based on SLAs
// ❌ Bad: Noisy alerts
// Alert on every single error
// Use arbitrary thresholds
4. Monitor Business Metrics
Track business impact:
// Monitor technical AND business metrics
// Track revenue, costs, customer satisfaction
// Correlate technical issues with business impact
5. Regular Review
Review monitoring data regularly:
// Weekly reviews of dashboards
// Monthly SLA reports
// Quarterly optimization reviews
Next Steps
- Deploy Services → Go to production
- Learn Best Practices → Service optimization
- Explore Examples → Real-world implementations
- Review Core Concepts → Deepen understanding