Observability – Dashboards and Alerts¶
This document defines what dashboards exist (or should exist) for each platform/microservice category and minimal alerting rules required for production readiness. It is written for ops engineers, SREs, and developers setting up monitoring and alerting.
Observability enables detecting issues before customers report them, supporting rapid debugging, and providing visibility into business-level KPIs as well as technical metrics. This page defines the dashboard and alerting standards for ConnectSoft systems.
Important
Production Readiness Requirement: No production service is considered "live" without basic dashboards and alerts. Health checks, metrics, logging, and tracing are required for all production services.
Observability Goals¶
Core Goals:
- Detect Issues Before Customers Report - Proactive monitoring and alerting
- Support Rapid Debugging - Logs, metrics, traces for quick root cause analysis
- Provide Business Visibility - Business-level KPIs alongside technical metrics
- Enable SLO-Based Alerting - Alert on SLO breaches, not raw metrics
- Reduce Alert Fatigue - Alerts must be actionable and meaningful
See: Observability-Driven Design for observability principles.
See: Support and SLA Policy for SLO definitions.
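SLO-based alerting, as listed in the goals above, works against an error budget derived from the SLO target rather than against raw metrics. A minimal Python sketch of that calculation follows; the 99.9% target and the request counts are illustrative values, not ConnectSoft SLO figures:

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    slo_target: e.g. 0.999 for a 99.9% availability SLO.
    """
    allowed_failures = (1 - slo_target) * total_requests  # budget expressed in requests
    if allowed_failures == 0:
        return 0.0
    return max(0.0, 1 - failed_requests / allowed_failures)

# 1,000,000 requests against a 99.9% SLO allow 1,000 failures;
# 250 failures spend a quarter of the budget, leaving roughly 0.75.
print(error_budget_remaining(0.999, 1_000_000, 250))
```

Alerting on budget burn rate (how fast the remaining fraction shrinks) rather than on a raw error count is what keeps alerts tied to the SLO.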
Core Dashboards¶
Dashboard Categories¶
Dashboard Types:
- Platform Overview Dashboard - Health of core platforms (Identity, Audit, Config, Bot)
- Per-Service Dashboard - Technical metrics for individual services
- Business KPI Dashboard - Domain-specific events and KPIs
- Factory Dashboard - Factory runs, agent performance, knowledge system
Dashboard Table¶
| Dashboard Type | Metrics/Views | Who Uses It |
|---|---|---|
| Platform Overview | Health of core platforms, error rates, availability | Ops/SRE, architects |
| Service Technical | Requests, errors, latency, resource usage | Dev + Ops |
| Business Metrics | Domain-specific events and KPIs | Product, business, ops |
| Factory Dashboard | Run success rate, agent performance, queue depth | Factory team, ops |
Platform Overview Dashboard:
- Health status of Identity, Audit, Config, Bot platforms
- Error rates and availability per platform
- Request rates and latency trends
- Resource usage (CPU, memory, queues)
Service Technical Dashboard:
- Request rate (requests per second)
- Error rate (errors per second, error percentage)
- Latency (p50, p95, p99)
- Resource usage (CPU, memory, disk, network)
- Health check status
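The p50/p95/p99 latency views above are normally computed by the metrics backend; as a reference for what those numbers mean, a nearest-rank percentile can be sketched in Python (the sample latencies are illustrative):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: p is in [0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Two slow outliers barely move p50 but dominate the tail percentiles.
latencies_ms = [12, 15, 11, 240, 14, 13, 16, 500, 12, 14]
print(percentile(latencies_ms, 50))  # 14
print(percentile(latencies_ms, 95))  # 500
```

This is why dashboards track p95/p99 alongside p50: averages and medians hide exactly the tail behavior users complain about.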
Business Metrics Dashboard:
- Domain-specific events (sign-ins, jobs processed, etc.)
- Business KPIs (conversion rates, user activity, etc.)
- Custom metrics relevant to the SaaS solution
See: Monitoring and Dashboards for detailed monitoring practices.
Minimal Alerting Rules¶
Baseline Alert Rules¶
Required Alerts for Any Production Service:
- High Error Rate - 5xx or domain-specific failures above threshold
- Increased Latency - Latency beyond SLO threshold
- Health Check Failures - Service reported unhealthy
- Resource Saturation - CPU, memory, queue backlog above threshold
Alert Rules Table¶
| Alert Type | Condition (Conceptual) | Severity | Action |
|---|---|---|---|
| Error Rate Spike | Error rate > 5% for 5 minutes | Sev1/2 | Page on-call, investigate |
| Latency Degradation | p95 latency > threshold (2x baseline) for 10 minutes | Sev2 | Investigate, consider rollback |
| Health Check Failure | Service reported unhealthy for 2 consecutive checks | Sev1/2 | Page on-call, check logs |
| Queue Backlog | Messages queued > expected threshold for 10 minutes | Sev2/3 | Investigate processing rate |
| Resource Saturation | CPU > 80% or memory > 90% for 15 minutes | Sev2 | Scale or optimize |
| SLO Breach | Availability below SLO threshold | Sev1/2 | Page on-call, investigate |
Alert Conditions:
- Error Rate Spike - Sustained high error rate indicates regression
- Latency Degradation - Performance regression affecting user experience
- Health Check Failure - Service is down or degraded
- Queue Backlog - Processing bottleneck or downstream issue
- Resource Saturation - Capacity issue requiring scaling
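The "sustained for N minutes" conditions in the table above are normally expressed in the alerting backend's rule language; conceptually, the evaluation reduces to the following Python sketch (the Sample type and 60-second scrape interval are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class Sample:
    timestamp: float   # seconds since some epoch
    error_rate: float  # fraction of failed requests, e.g. 0.07 = 7%

def sustained_breach(samples: list[Sample], threshold: float, window_s: float) -> bool:
    """True only if every sample in the trailing window exceeds the threshold."""
    if not samples:
        return False
    cutoff = samples[-1].timestamp - window_s
    window = [s for s in samples if s.timestamp >= cutoff]
    return bool(window) and all(s.error_rate > threshold for s in window)

# Three scrapes 60 s apart, all above 5%: a sustained breach over 2 minutes.
scrapes = [Sample(0, 0.06), Sample(60, 0.08), Sample(120, 0.07)]
print(sustained_breach(scrapes, threshold=0.05, window_s=120))  # True
```

Requiring every sample in the window to breach (rather than any one spike) is what suppresses transient blips and keeps the alert actionable.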
Warning
Alert Fatigue: Alerts MUST be actionable. Too many false positives or non-actionable alerts lead to alert fatigue and missed critical issues. Review and tune alerts regularly based on actual incidents.
See: Incident Response Runbook for incident response procedures.
Platform-Specific Observability¶
Identity Platform¶
Key Metrics:
- Login Success/Failure Rates - Authentication success and failure rates
- Token Issuance Errors - Errors in token generation or validation
- Request Rate - Authentication requests per second
- Latency - p50, p95, p99 latency for auth endpoints
- Active Users - Concurrent authenticated users
Key Events:
- Failed authentication attempts (security monitoring)
- Token refresh events
- User provisioning events
Alerts:
- High failure rate (> 5% for 5 minutes)
- Token issuance errors
- Latency degradation (> 2x baseline)
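The "> 2x baseline" latency alert above needs a baseline to compare against; computing it as a median over recent p95 readings is one reasonable choice, sketched below, not a mandated approach:

```python
import statistics

def latency_degraded(history_p95_ms: list[float], current_p95_ms: float,
                     factor: float = 2.0) -> bool:
    """Flag degradation when the current p95 exceeds factor x a median-of-history baseline."""
    baseline = statistics.median(history_p95_ms)
    return current_p95_ms > factor * baseline

# Baseline is ~180 ms, so 450 ms crosses the 2x threshold; 300 ms does not.
print(latency_degraded([170, 180, 190, 185, 175], 450))  # True
print(latency_degraded([170, 180, 190, 185, 175], 300))  # False
```

A median baseline is robust to occasional slow readings in the history window, which a mean would not be.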
See: Identity Platform Runbook for Identity Platform operations.
Audit Platform¶
Key Metrics:
- Events Ingested per Second - Throughput of event ingestion
- Dropped/Failed Events - Events that failed to ingest
- Query Performance - Query latency and success rate
- Storage Usage - Storage growth and retention
- Tamper Detection - Integrity check failures
Key Events:
- Event ingestion failures
- Query timeouts
- Storage capacity warnings
Alerts:
- High drop rate (> 1% for 5 minutes)
- Query performance degradation
- Storage capacity warnings (> 80%)
See: Audit Platform Runbook for Audit Platform operations.
Config Platform¶
Key Metrics:
- Config Read/Write Errors - Errors accessing configuration
- Cache Hit Rate - Configuration cache effectiveness
- Update Latency - Time to propagate config changes
- Feature Flag Usage - Feature flag evaluation rates
- Version History - Config version changes
Key Events:
- Config update events
- Feature flag changes
- Cache invalidation events
Alerts:
- High read/write error rate (> 1% for 5 minutes)
- Low cache hit rate (< 80%)
- Config update failures
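The low-cache-hit-rate alert above reduces to a simple ratio check; a minimal sketch follows, with the 80% floor matching the alert condition and the no-traffic behavior being an illustrative choice:

```python
def cache_hit_rate(hits: int, misses: int) -> float:
    """Fraction of config reads served from cache."""
    total = hits + misses
    return hits / total if total else 1.0  # no traffic: treat as healthy

def low_hit_rate_alert(hits: int, misses: int, threshold: float = 0.80) -> bool:
    return cache_hit_rate(hits, misses) < threshold

print(cache_hit_rate(750, 250))      # 0.75
print(low_hit_rate_alert(750, 250))  # True: below the 80% floor
```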
See: Config Platform API Overview for Config Platform details.
Bot Platform¶
Key Metrics:
- Request Counts - Bot requests per second
- Error Rates - Bot request failures
- Latency to Model/Provider - Response time from AI providers
- Conversation Completion Rate - Successful conversation completions
- Token Usage - AI model token consumption
Key Events:
- Bot request failures
- Model provider errors
- Conversation timeouts
Alerts:
- High error rate (> 5% for 5 minutes)
- High latency to model provider (> 5 seconds)
- Model provider errors
See: Bot Platform API Overview for Bot Platform details.
SaaS and Factory-Generated Service Observability¶
Required Observability¶
Every SaaS Service Must Expose:
- Standard Health Endpoints - /health (liveness), /ready (readiness)
- Technical Metrics - Request rate, error rate, latency
- Business Events - Domain-specific events logged/audited
- Structured Logging - Logs with correlation IDs
- Distributed Tracing - Traces integrated with central tracing backend
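The liveness/readiness distinction above can be sketched with Python's standard library; a real service would use its framework's health check support, and the READY flag and JSON bodies here are illustrative:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

READY = {"value": True}  # flipped to False during startup/shutdown

class HealthHandler(BaseHTTPRequestHandler):
    """Minimal /health (liveness) and /ready (readiness) endpoints."""

    def do_GET(self):
        if self.path == "/health":          # liveness: the process is up
            self._respond(200, {"status": "healthy"})
        elif self.path == "/ready":         # readiness: safe to route traffic here
            code = 200 if READY["value"] else 503
            self._respond(code, {"status": "ready" if code == 200 else "not ready"})
        else:
            self._respond(404, {"error": "not found"})

    def _respond(self, code, body):
        payload = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # silence per-request logging in the example
        pass

server = HTTPServer(("127.0.0.1", 0), HealthHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_address[1]}"

health_status = urllib.request.urlopen(base + "/health").status
ready_status = urllib.request.urlopen(base + "/ready").status
server.shutdown()
print(health_status, ready_status)  # 200 200
```

Keeping /health and /ready separate matters: a service can be alive (do not restart it) yet not ready (do not send it traffic), e.g. while warming caches or waiting on a dependency.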
Observability Checklist¶
Production Readiness Checklist:
- Health check endpoint implemented and wired to monitoring
- HTTP/gRPC metrics exported (request rate, error rate, latency)
- Domain events logged/audited where relevant
- Tracing integrated with central tracing backend
- Dashboards created for service metrics
- Alerts configured for error rate, latency, health checks
- Custom KPIs defined and monitored (if applicable)
- Log aggregation configured with correlation IDs
See: Microservice Template for template with health checks and metrics.
See: Observability-Driven Design for observability principles.
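The structured-logging-with-correlation-IDs items in the checklist above can be sketched as follows; the JSON field names and the "orders-service" logger name are illustrative assumptions, not ConnectSoft conventions:

```python
import json
import logging
import uuid
from contextvars import ContextVar

# Carries the request's correlation ID across async boundaries within a request.
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, carrying the correlation ID."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders-service")
log.addHandler(handler)
log.setLevel(logging.INFO)

# At request entry, reuse the inbound correlation ID (or mint one) so that
# every log line from this request can be joined in the aggregation backend.
correlation_id.set(str(uuid.uuid4()))
log.info("order accepted")
```

Because every line is JSON with the same correlation_id field, the log aggregation backend can reassemble a single request's trail across services without fragile text parsing.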
Factory-Generated Service Observability¶
Factory Template Includes:
- Health check endpoints (liveness, readiness)
- Metrics export (OpenTelemetry or similar)
- Structured logging with correlation IDs
- Distributed tracing integration
- Standard dashboards and alerts
Customization:
- Services can add custom metrics and KPIs
- Domain-specific events can be added
- Business metrics can be integrated
See: Factory Overview for Factory details.
See: Agent Microservice Standard Blueprint for microservice structure.
Related Documents¶
- Operations and SRE Overview - Operations overview
- Deployment and Rollback Runbook - Deployment procedures
- Incident Response Runbook - Incident response procedures
- Monitoring and Dashboards - Detailed monitoring practices
- Observability-Driven Design - Observability principles
- Microservice Template - Template with health checks and metrics
- Support and SLA Policy - SLO definitions