Observability-Driven Design¶
This document defines ConnectSoft's observability-driven design principles. It is written for architects and engineers building systems where observability is built in from the start.
Every agent run carries traceId, agentId, skillId, tenantId, and moduleId. All code, docs, blueprints, and tests are stored as knowledge modules with semantic search, metadata, and vector DB retrieval. Observability is not optional: it is how we understand and improve the system.
Tip
Observability enables the Factory to learn and improve. Every action is traced, logged, and measured. This memory becomes our moat: over time, the Factory gets smarter and faster because it reuses prior solutions.
Why Observability First¶
Observability is built in from the start for four reasons:
Easier Debugging¶
- Full context - Correlation IDs link related operations
- Distributed tracing - See request flow across services
- Structured logs - Query and filter logs easily
Performance Tuning¶
- Metrics - Identify bottlenecks and slow operations
- Traces - See where time is spent in request flow
- Profiling - Pinpoint hot paths in CPU and memory usage
Incident Response¶
- Real-time monitoring - Dashboards show system health
- Alerting - Get notified of issues immediately
- Root cause analysis - Traces and logs help find root causes
Continuous Improvement¶
- Knowledge system - Learn from every operation
- Pattern recognition - Identify common issues and solutions
- Agent improvement - Agents learn from observability data
Signals We Care About¶
Logs¶
Structured Logging:
- JSON format with consistent schema
- Correlation IDs for request tracing
- Log levels: Debug, Info, Warning, Error, Critical

What We Log:
- Request/response details (sanitized)
- Business events (invoice created, payment processed)
- Errors and exceptions (with stack traces)
- Agent actions (generation runs, decisions)
Example:
{
  "timestamp": "2026-01-01T10:30:00Z",
  "level": "Information",
  "correlationId": "abc123",
  "tenantId": "tenant-456",
  "service": "InvoiceService",
  "message": "Invoice created",
  "properties": {
    "invoiceId": "inv-789",
    "amount": 1000.00
  }
}
Metrics¶
Technical Metrics:
- Request rate (requests per second)
- Latency (p50, p95, p99)
- Error rate (errors per second)
- Resource usage (CPU, memory, disk)

Business Metrics:
- Domain-specific KPIs (invoices created, payments processed)
- User activity (sign-ins, feature usage)
- Business events (subscriptions, cancellations)
Example Metrics:
- invoices_created_total - Counter of invoices created
- invoice_processing_duration_seconds - Histogram of processing time
- payment_success_rate - Gauge of payment success percentage
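Metrics like these could be registered with `System.Diagnostics.Metrics`, the API the OpenTelemetry .NET SDK collects from. A minimal sketch, with the meter name, tag key, and class name assumed for illustration:

```csharp
using System.Diagnostics.Metrics;

// Illustrative meter for the invoice service; the metric names mirror
// the examples above.
public static class InvoiceMetrics
{
    private static readonly Meter Meter = new("InvoiceService");

    // Counter: monotonically increasing count of invoices created.
    public static readonly Counter<long> InvoicesCreated =
        Meter.CreateCounter<long>("invoices_created_total");

    // Histogram: distribution of processing time in seconds,
    // from which p50/p95/p99 are derived.
    public static readonly Histogram<double> ProcessingDuration =
        Meter.CreateHistogram<double>("invoice_processing_duration_seconds");
}
```

Usage inside a use case might look like `InvoiceMetrics.InvoicesCreated.Add(1)` and `InvoiceMetrics.ProcessingDuration.Record(stopwatch.Elapsed.TotalSeconds)`; a gauge such as `payment_success_rate` would typically be an observable gauge computed from a callback.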
Traces¶
Distributed Tracing:
- OpenTelemetry spans for all operations
- Trace IDs link related operations
- Span hierarchy shows request flow

What We Trace:
- HTTP requests (incoming and outgoing)
- Database queries
- Event publishing/subscription
- Agent actions (generation runs, decisions)
Example Trace:
Trace: abc123
├─ Span: HTTP POST /invoices (100ms)
│ ├─ Span: CreateInvoiceUseCase (80ms)
│ │ ├─ Span: InvoiceRepository.Save (50ms)
│ │ └─ Span: EventBus.Publish (20ms)
│ └─ Span: AuditLogger.Log (10ms)
Knowledge¶
Knowledge Modules:
- Code, docs, blueprints stored as knowledge
- Semantic search via vector embeddings
- Metadata for filtering and organization

What We Store:
- Generated code with metadata
- Architecture blueprints and ADRs
- Test suites and quality metrics
- Agent decisions and outcomes
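As a sketch of what retrieval over these modules involves (the record shape and in-memory cosine search are illustrative only; production retrieval goes through the vector DB):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative shape of a stored knowledge module.
public record KnowledgeModule(
    string ModuleId,
    string Kind,          // "code", "doc", "blueprint", "test"
    string Content,
    float[] Embedding,    // vector from an embedding model
    Dictionary<string, string> Metadata);

public static class KnowledgeSearch
{
    // Return the k modules most similar to the query embedding.
    public static IEnumerable<KnowledgeModule> TopK(
        IEnumerable<KnowledgeModule> modules, float[] query, int k) =>
        modules
            .OrderByDescending(m => Cosine(m.Embedding, query))
            .Take(k);

    private static double Cosine(float[] a, float[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.Sqrt(na) * Math.Sqrt(nb));
    }
}
```

Metadata filtering (by tenant, module kind, or generation run) would narrow the candidate set before the similarity ranking.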
How Templates Implement Observability¶
OpenTelemetry Wiring¶
Built-in Integration:
- OpenTelemetry SDK configured in templates
- Automatic instrumentation for HTTP, database, messaging
- Custom spans for domain operations
Example:
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation()
        .AddSource("InvoiceService"))
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()
        .AddRuntimeInstrumentation());
Logging Conventions¶
Structured Logging:
- ILogger<T> injected into all classes
- Correlation IDs automatically included
- Consistent log format across services
Example:
_logger.LogInformation(
    "Invoice created. InvoiceId: {InvoiceId}, Amount: {Amount}",
    invoice.Id, invoice.Amount);
Correlation IDs¶
Automatic Propagation:
- Correlation ID generated at request entry
- Propagated through all layers (API → Application → Domain → Infrastructure)
- Included in logs, traces, and events
Example:
// Correlation ID automatically flows with the Activity context;
// _activitySource is an ActivitySource instance registered via AddSource
using var activity = _activitySource.StartActivity("CreateInvoice");
activity?.SetTag("invoiceId", invoice.Id);
Correlation Across Services and Agents¶
Service Correlation¶
How It Works:
- Correlation ID generated at API gateway or first service
- Propagated via HTTP headers (X-Correlation-Id)
- Included in all logs, traces, and events
Example Flow:
API Gateway (generates correlation ID: abc123)
→ Invoice Service (uses abc123)
→ Payment Service (uses abc123)
→ Notification Service (uses abc123)
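The entry-point side of this flow can be sketched as ASP.NET Core middleware; the class name is illustrative, and outgoing propagation would use a matching `DelegatingHandler` that copies the same header onto outbound requests:

```csharp
using Microsoft.AspNetCore.Http;

// Illustrative middleware: reuse the caller's correlation ID, or
// generate one when this service is the entry point.
public class CorrelationIdMiddleware
{
    private const string Header = "X-Correlation-Id";
    private readonly RequestDelegate _next;

    public CorrelationIdMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext context)
    {
        var correlationId = context.Request.Headers.TryGetValue(Header, out var value)
            ? value.ToString()
            : Guid.NewGuid().ToString();

        // Make the ID available to the logging scope and downstream calls,
        // and echo it back to the caller for support cases.
        context.Items[Header] = correlationId;
        context.Response.Headers[Header] = correlationId;

        await _next(context);
    }
}
```

Registered early in the pipeline (before routing), this ensures every log line and span in the request already has the ID available.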
Agent Correlation¶
How It Works:
- Agent actions linked via correlation IDs
- Factory run ID links all agent actions in a generation
- Agent ID, skill ID, tenant ID included in all logs
Example:
{
  "correlationId": "factory-run-456",
  "agentId": "engineering-agent-1",
  "skillId": "microservice-generation",
  "tenantId": "tenant-789",
  "moduleId": "invoice-service",
  "action": "code-generated"
}
Cross-System Correlation¶
Factory and Runtime:
- Factory generation runs linked to runtime deployments
- Agent actions linked to generated code
- Knowledge modules linked to generation runs
flowchart TD
FACTORY[Factory Generation Run<br/>correlationId: run-123] -->|Generates| CODE[Generated Code<br/>moduleId: invoice-service]
CODE -->|Deploys| RUNTIME[Runtime Service<br/>correlationId: run-123]
RUNTIME -->|Emits Events| EVENTS[Event Bus<br/>correlationId: run-123]
FACTORY -->|Stores| KNOWLEDGE[Knowledge System<br/>runId: run-123]
CODE -->|Stored As| KNOWLEDGE
RUNTIME -->|Logs| LOGS[Log Storage<br/>correlationId: run-123]
FACTORY -->|Logs| LOGS
RUNTIME -->|Traces| TRACES[Trace Storage<br/>traceId: run-123]
FACTORY -->|Traces| TRACES
style FACTORY fill:#2563EB,color:#fff
style RUNTIME fill:#4F46E5,color:#fff
style KNOWLEDGE fill:#10B981,color:#fff
Dashboards and Alerting¶
Types of Dashboards¶
Platform Dashboards:
- Service health (availability, latency, error rate)
- Business metrics (invoices created, payments processed)
- Resource usage (CPU, memory, database connections)

Factory Dashboard:
- Generation runs (success rate, duration)
- Agent performance (tasks completed, errors)
- Knowledge system (patterns stored, reuse rate)

Per-Service Dashboards:
- Service-specific metrics
- Request flow and traces
- Error patterns and trends
Alerting Principles¶
Avoid Alert Fatigue:
- Only alert on actionable issues
- Use SLOs (Service Level Objectives) for alerting
- Clear runbooks for each alert

Alert Types:
- Critical - Service down, data loss risk (immediate response)
- Warning - Degraded performance, high error rate (investigate)
- Info - Notable events, threshold breaches (monitor)

SLO-Based Alerting:
- Alert when SLO is at risk (e.g., error rate > 1%)
- Don't alert on every error
- Use burn rate for SLO alerts
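The burn-rate arithmetic behind SLO alerting can be sketched as follows; the threshold is illustrative (14.4 is a commonly cited fast-burn threshold for a 30-day error budget, since an hour at that rate consumes about 2% of the budget):

```csharp
// Sketch of SLO burn-rate math; the class name and paging threshold
// are assumptions, not a prescribed implementation.
public static class SloAlerts
{
    // Burn rate = observed error rate / error budget implied by the SLO.
    // A 99.9% availability SLO leaves a 0.1% error budget, so burn
    // rate 1.0 means the budget is consumed exactly over the window.
    public static double BurnRate(double errorRate, double sloTarget)
        => errorRate / (1 - sloTarget);

    // Page only on fast burn, not on individual errors.
    public static bool ShouldPage(double errorRate, double sloTarget)
        => BurnRate(errorRate, sloTarget) >= 14.4;
}

// Example: 2% errors against a 99.9% SLO gives a burn rate of about 20,
// i.e. the budget is being spent ~20x faster than the SLO allows.
```

In practice burn-rate alerts use multiple windows (e.g. a fast 1-hour window and a slower 6-hour window) so brief spikes don't page but sustained burn does.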
Example KPIs¶
Identity Platform:
- Sign-in success rate (> 99.9%)
- Token generation latency (p95 < 100ms)
- Active users per day

Audit Platform:
- Event ingestion rate (events per second)
- Query latency (p95 < 500ms)
- Storage growth rate

Factory:
- Generation success rate (> 95%)
- Average generation time (< 30 minutes)
- Knowledge reuse rate (> 50%)
Related Documents¶
- Cloud-Native Mindset - How observability fits cloud-native
- Event-Driven Mindset - How events enable observability
- Knowledge & Memory System - Knowledge storage and retrieval
- Microservice Template - How templates implement observability