Observability-Driven Design¶
This document defines ConnectSoft's observability-driven design principles. It is written for architects and engineers building systems where observability is built in from the start.
Every agent run carries traceId, agentId, skillId, tenantId, and moduleId. All code, docs, blueprints, and tests are stored as knowledge modules with semantic search, metadata, and vector DB retrieval. Observability is not optional: it is how we understand and improve the system.
Tip
Observability enables the Factory to learn and improve. Every action is traced, logged, and measured. This memory becomes our moat: over time, the Factory gets smarter and faster because it reuses prior solutions.
Why Observability First¶
Observability is built in from the start for four reasons:
Easier Debugging¶
- Full context - Correlation IDs link related operations
- Distributed tracing - See request flow across services
- Structured logs - Query and filter logs easily
Performance Tuning¶
- Metrics - Identify bottlenecks and slow operations
- Traces - See where time is spent in request flow
- Profiling - Pinpoint hot paths in CPU and memory usage
Incident Response¶
- Real-time monitoring - Dashboards show system health
- Alerting - Get notified of issues immediately
- Root cause analysis - Traces and logs help find root causes
Continuous Improvement¶
- Knowledge system - Learn from every operation
- Pattern recognition - Identify common issues and solutions
- Agent improvement - Agents learn from observability data
Signals We Care About¶
Logs¶
Structured Logging:
- JSON format with consistent schema
- Correlation IDs for request tracing
- Log levels: Debug, Info, Warning, Error, Critical

What We Log:
- Request/response details (sanitized)
- Business events (invoice created, payment processed)
- Errors and exceptions (with stack traces)
- Agent actions (generation runs, decisions)
Example:
{
  "timestamp": "2026-01-01T10:30:00Z",
  "level": "Information",
  "correlationId": "abc123",
  "tenantId": "tenant-456",
  "service": "InvoiceService",
  "message": "Invoice created",
  "properties": {
    "invoiceId": "inv-789",
    "amount": 1000.00
  }
}
Metrics¶
Technical Metrics:
- Request rate (requests per second)
- Latency (p50, p95, p99)
- Error rate (errors per second)
- Resource usage (CPU, memory, disk)

Business Metrics:
- Domain-specific KPIs (invoices created, payments processed)
- User activity (sign-ins, feature usage)
- Business events (subscriptions, cancellations)
Example Metrics:
- invoices_created_total - Counter of invoices created
- invoice_processing_duration_seconds - Histogram of processing time
- payment_success_rate - Gauge of payment success percentage
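Metrics like these could be registered with `System.Diagnostics.Metrics`, the API the OpenTelemetry .NET SDK collects from. A minimal sketch, with the meter name, tag key, and class name assumed for illustration:

```csharp
using System.Diagnostics.Metrics;

// Illustrative meter for the invoice service; the metric names mirror
// the examples above.
public static class InvoiceMetrics
{
    private static readonly Meter Meter = new("InvoiceService");

    // Counter: monotonically increasing count of invoices created.
    public static readonly Counter<long> InvoicesCreated =
        Meter.CreateCounter<long>("invoices_created_total");

    // Histogram: distribution of processing time in seconds,
    // from which p50/p95/p99 are derived.
    public static readonly Histogram<double> ProcessingDuration =
        Meter.CreateHistogram<double>("invoice_processing_duration_seconds");
}
```

Usage inside a use case might look like `InvoiceMetrics.InvoicesCreated.Add(1)` and `InvoiceMetrics.ProcessingDuration.Record(stopwatch.Elapsed.TotalSeconds)`; a gauge such as `payment_success_rate` would typically be an observable gauge computed from a callback.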
Traces¶
Distributed Tracing:
- OpenTelemetry spans for all operations
- Trace IDs link related operations
- Span hierarchy shows request flow

What We Trace:
- HTTP requests (incoming and outgoing)
- Database queries
- Event publishing/subscription
- Agent actions (generation runs, decisions)
Example Trace:
Trace: abc123
├─ Span: HTTP POST /invoices (100ms)
│ ├─ Span: CreateInvoiceUseCase (80ms)
│ │ ├─ Span: InvoiceRepository.Save (50ms)
│ │ └─ Span: EventBus.Publish (20ms)
│ └─ Span: AuditLogger.Log (10ms)
Knowledge¶
Knowledge Modules:
- Code, docs, blueprints stored as knowledge
- Semantic search via vector embeddings
- Metadata for filtering and organization

What We Store:
- Generated code with metadata
- Architecture blueprints and ADRs
- Test suites and quality metrics
- Agent decisions and outcomes
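As a sketch of what retrieval over these modules involves (the record shape and in-memory cosine search are illustrative only; production retrieval goes through the vector DB):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative shape of a stored knowledge module.
public record KnowledgeModule(
    string ModuleId,
    string Kind,          // "code", "doc", "blueprint", "test"
    string Content,
    float[] Embedding,    // vector from an embedding model
    Dictionary<string, string> Metadata);

public static class KnowledgeSearch
{
    // Return the k modules most similar to the query embedding.
    public static IEnumerable<KnowledgeModule> TopK(
        IEnumerable<KnowledgeModule> modules, float[] query, int k) =>
        modules
            .OrderByDescending(m => Cosine(m.Embedding, query))
            .Take(k);

    private static double Cosine(float[] a, float[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.Sqrt(na) * Math.Sqrt(nb));
    }
}
```

Metadata filtering (by tenant, module kind, or generation run) would narrow the candidate set before the similarity ranking.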
How Templates Implement Observability¶
OpenTelemetry Wiring¶
Built-in Integration:
- OpenTelemetry SDK configured in templates
- Automatic instrumentation for HTTP, database, messaging
- Custom spans for domain operations
Example:
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation()
        .AddSource("InvoiceService"))
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()
        .AddRuntimeInstrumentation());
Logging Conventions¶
Structured Logging:
- ILogger<T> injected into all classes
- Correlation IDs automatically included
- Consistent log format across services
Example:
_logger.LogInformation(
    "Invoice created. InvoiceId: {InvoiceId}, Amount: {Amount}",
    invoice.Id, invoice.Amount);
Correlation IDs¶
Automatic Propagation:
- Correlation ID generated at request entry
- Propagated through all layers (API → Application → Domain → Infrastructure)
- Included in logs, traces, and events
Example:
// Correlation ID automatically flows with the Activity context;
// _activitySource is an ActivitySource instance registered via AddSource
using var activity = _activitySource.StartActivity("CreateInvoice");
activity?.SetTag("invoiceId", invoice.Id);
Correlation Across Services and Agents¶
Service Correlation¶
How It Works:
- Correlation ID generated at API gateway or first service
- Propagated via HTTP headers (X-Correlation-Id)
- Included in all logs, traces, and events
Example Flow:
API Gateway (generates correlation ID: abc123)
→ Invoice Service (uses abc123)
→ Payment Service (uses abc123)
→ Notification Service (uses abc123)
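The entry-point side of this flow can be sketched as ASP.NET Core middleware; the class name is illustrative, and outgoing propagation would use a matching `DelegatingHandler` that copies the same header onto outbound requests:

```csharp
using Microsoft.AspNetCore.Http;

// Illustrative middleware: reuse the caller's correlation ID, or
// generate one when this service is the entry point.
public class CorrelationIdMiddleware
{
    private const string Header = "X-Correlation-Id";
    private readonly RequestDelegate _next;

    public CorrelationIdMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext context)
    {
        var correlationId = context.Request.Headers.TryGetValue(Header, out var value)
            ? value.ToString()
            : Guid.NewGuid().ToString();

        // Make the ID available to the logging scope and downstream calls,
        // and echo it back to the caller for support cases.
        context.Items[Header] = correlationId;
        context.Response.Headers[Header] = correlationId;

        await _next(context);
    }
}
```

Registered early in the pipeline (before routing), this ensures every log line and span in the request already has the ID available.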
Agent Correlation¶
How It Works:
- Agent actions linked via correlation IDs
- Factory run ID links all agent actions in a generation
- Agent ID, skill ID, tenant ID included in all logs
Example:
{
  "correlationId": "factory-run-456",
  "agentId": "engineering-agent-1",
  "skillId": "microservice-generation",
  "tenantId": "tenant-789",
  "moduleId": "invoice-service",
  "action": "code-generated"
}
Cross-System Correlation¶
Factory and Runtime:
- Factory generation runs linked to runtime deployments
- Agent actions linked to generated code
- Knowledge modules linked to generation runs
flowchart TD
FACTORY[Factory Generation Run<br/>correlationId: run-123] -->|Generates| CODE[Generated Code<br/>moduleId: invoice-service]
CODE -->|Deploys| RUNTIME[Runtime Service<br/>correlationId: run-123]
RUNTIME -->|Emits Events| EVENTS[Event Bus<br/>correlationId: run-123]
FACTORY -->|Stores| KNOWLEDGE[Knowledge System<br/>runId: run-123]
CODE -->|Stored As| KNOWLEDGE
RUNTIME -->|Logs| LOGS[Log Storage<br/>correlationId: run-123]
FACTORY -->|Logs| LOGS
RUNTIME -->|Traces| TRACES[Trace Storage<br/>traceId: run-123]
FACTORY -->|Traces| TRACES
style FACTORY fill:#2563EB,color:#fff
style RUNTIME fill:#4F46E5,color:#fff
style KNOWLEDGE fill:#10B981,color:#fff
Dashboards and Alerting¶
Types of Dashboards¶
Platform Dashboards:
- Service health (availability, latency, error rate)
- Business metrics (invoices created, payments processed)
- Resource usage (CPU, memory, database connections)

Factory Dashboard:
- Generation runs (success rate, duration)
- Agent performance (tasks completed, errors)
- Knowledge system (patterns stored, reuse rate)

Per-Service Dashboards:
- Service-specific metrics
- Request flow and traces
- Error patterns and trends
Alerting Principles¶
Avoid Alert Fatigue:
- Only alert on actionable issues
- Use SLOs (Service Level Objectives) for alerting
- Clear runbooks for each alert

Alert Types:
- Critical - Service down, data loss risk (immediate response)
- Warning - Degraded performance, high error rate (investigate)
- Info - Notable events, threshold breaches (monitor)

SLO-Based Alerting:
- Alert when SLO is at risk (e.g., error rate > 1%)
- Don't alert on every error
- Use burn rate for SLO alerts
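The burn-rate arithmetic behind SLO alerting can be sketched as follows; the threshold is illustrative (14.4 is a commonly cited fast-burn threshold for a 30-day error budget, since an hour at that rate consumes about 2% of the budget):

```csharp
// Sketch of SLO burn-rate math; the class name and paging threshold
// are assumptions, not a prescribed implementation.
public static class SloAlerts
{
    // Burn rate = observed error rate / error budget implied by the SLO.
    // A 99.9% availability SLO leaves a 0.1% error budget, so burn
    // rate 1.0 means the budget is consumed exactly over the window.
    public static double BurnRate(double errorRate, double sloTarget)
        => errorRate / (1 - sloTarget);

    // Page only on fast burn, not on individual errors.
    public static bool ShouldPage(double errorRate, double sloTarget)
        => BurnRate(errorRate, sloTarget) >= 14.4;
}

// Example: 2% errors against a 99.9% SLO gives a burn rate of about 20,
// i.e. the budget is being spent ~20x faster than the SLO allows.
```

In practice burn-rate alerts use multiple windows (e.g. a fast 1-hour window and a slower 6-hour window) so brief spikes don't page but sustained burn does.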
Example KPIs¶
Identity Platform:
- Sign-in success rate (> 99.9%)
- Token generation latency (p95 < 100ms)
- Active users per day

Audit Platform:
- Event ingestion rate (events per second)
- Query latency (p95 < 500ms)
- Storage growth rate

Factory:
- Generation success rate (> 95%)
- Average generation time (< 30 minutes)
- Knowledge reuse rate (> 50%)
Related Documents¶
- Cloud-Native Mindset - How observability fits cloud-native
- Event-Driven Mindset - How events enable observability
- Knowledge & Memory System - Knowledge storage and retrieval
- Microservice Template - How templates implement observability