Observability – Dashboards and Alerts¶
This document defines what dashboards exist (or should exist) for each platform/microservice category and minimal alerting rules required for production readiness. It is written for ops engineers, SREs, and developers setting up monitoring and alerting.
Observability enables detecting issues before customers report them, supporting rapid debugging, and providing visibility into business-level KPIs as well as technical metrics. This page defines the dashboard and alerting standards for ConnectSoft systems.
Important
Production Readiness Requirement: No production service is considered "live" without basic dashboards and alerts. Health checks, metrics, logging, and tracing are required for all production services.
Observability Goals¶
Core Goals:
- Detect Issues Before Customers Report - Proactive monitoring and alerting
- Support Rapid Debugging - Logs, metrics, traces for quick root cause analysis
- Provide Business Visibility - Business-level KPIs alongside technical metrics
- Enable SLO-Based Alerting - Alert on SLO breaches, not raw metrics
- Reduce Alert Fatigue - Alerts must be actionable and meaningful
See: Observability-Driven Design for observability principles.
See: Support and SLA Policy for SLO definitions.
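SLO-based alerting, as listed in the goals above, works against an error budget derived from the SLO target rather than against raw metrics. A minimal Python sketch of that calculation follows; the 99.9% target and the request counts are illustrative values, not ConnectSoft SLO figures:

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    slo_target: e.g. 0.999 for a 99.9% availability SLO.
    """
    allowed_failures = (1 - slo_target) * total_requests  # budget expressed in requests
    if allowed_failures == 0:
        return 0.0
    return max(0.0, 1 - failed_requests / allowed_failures)

# 1,000,000 requests against a 99.9% SLO allow 1,000 failures;
# 250 failures spend a quarter of the budget, leaving roughly 0.75.
print(error_budget_remaining(0.999, 1_000_000, 250))
```

Alerting on budget burn rate (how fast the remaining fraction shrinks) rather than on a raw error count is what keeps alerts tied to the SLO.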
Core Dashboards¶
Dashboard Categories¶
Dashboard Types:
- Platform Overview Dashboard - Health of core platforms (Identity, Audit, Config, Bot)
- Per-Service Dashboard - Technical metrics for individual services
- Business KPI Dashboard - Domain-specific events and KPIs
- Factory Dashboard - Factory runs, agent performance, knowledge system
Dashboard Table¶
| Dashboard Type | Metrics/Views | Who Uses It |
|---|---|---|
| Platform Overview | Health of core platforms, error rates, availability | Ops/SRE, architects |
| Service Technical | Requests, errors, latency, resource usage | Dev + Ops |
| Business Metrics | Domain-specific events and KPIs | Product, business, ops |
| Factory Dashboard | Run success rate, agent performance, queue depth | Factory team, ops |
Platform Overview Dashboard:
- Health status of Identity, Audit, Config, Bot platforms
- Error rates and availability per platform
- Request rates and latency trends
- Resource usage (CPU, memory, queues)
Service Technical Dashboard:
- Request rate (requests per second)
- Error rate (errors per second, error percentage)
- Latency (p50, p95, p99)
- Resource usage (CPU, memory, disk, network)
- Health check status
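The p50/p95/p99 latency views above are normally computed by the metrics backend; as a reference for what those numbers mean, a nearest-rank percentile can be sketched in Python (the sample latencies are illustrative):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: p is in [0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Two slow outliers barely move p50 but dominate the tail percentiles.
latencies_ms = [12, 15, 11, 240, 14, 13, 16, 500, 12, 14]
print(percentile(latencies_ms, 50))  # 14
print(percentile(latencies_ms, 95))  # 500
```

This is why dashboards track p95/p99 alongside p50: averages and medians hide exactly the tail behavior users complain about.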
Business Metrics Dashboard:
- Domain-specific events (sign-ins, jobs processed, etc.)
- Business KPIs (conversion rates, user activity, etc.)
- Custom metrics relevant to the SaaS solution
See: Monitoring and Dashboards for detailed monitoring practices.
Minimal Alerting Rules¶
Baseline Alert Rules¶
Required Alerts for Any Production Service:
- High Error Rate - 5xx or domain-specific failures above threshold
- Increased Latency - Latency beyond SLO threshold
- Health Check Failures - Service reported unhealthy
- Resource Saturation - CPU, memory, queue backlog above threshold
Alert Rules Table¶
| Alert Type | Condition (Conceptual) | Severity | Action |
|---|---|---|---|
| Error Rate Spike | Error rate > 5% for 5 minutes | Sev1/2 | Page on-call, investigate |
| Latency Degradation | p95 latency > threshold (2x baseline) for 10 minutes | Sev2 | Investigate, consider rollback |
| Health Check Failure | Service reported unhealthy for 2 consecutive checks | Sev1/2 | Page on-call, check logs |
| Queue Backlog | Messages queued > expected threshold for 10 minutes | Sev2/3 | Investigate processing rate |
| Resource Saturation | CPU > 80% or memory > 90% for 15 minutes | Sev2 | Scale or optimize |
| SLO Breach | Availability below SLO threshold | Sev1/2 | Page on-call, investigate |
Alert Conditions:
- Error Rate Spike - Sustained high error rate indicates regression
- Latency Degradation - Performance regression affecting user experience
- Health Check Failure - Service is down or degraded
- Queue Backlog - Processing bottleneck or downstream issue
- Resource Saturation - Capacity issue requiring scaling
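The "sustained for N minutes" conditions in the table above are normally expressed in the alerting backend's rule language; conceptually, the evaluation reduces to the following Python sketch (the Sample type and 60-second scrape interval are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class Sample:
    timestamp: float   # seconds since some epoch
    error_rate: float  # fraction of failed requests, e.g. 0.07 = 7%

def sustained_breach(samples: list[Sample], threshold: float, window_s: float) -> bool:
    """True only if every sample in the trailing window exceeds the threshold."""
    if not samples:
        return False
    cutoff = samples[-1].timestamp - window_s
    window = [s for s in samples if s.timestamp >= cutoff]
    return bool(window) and all(s.error_rate > threshold for s in window)

# Three scrapes 60 s apart, all above 5%: a sustained breach over 2 minutes.
scrapes = [Sample(0, 0.06), Sample(60, 0.08), Sample(120, 0.07)]
print(sustained_breach(scrapes, threshold=0.05, window_s=120))  # True
```

Requiring every sample in the window to breach (rather than any one spike) is what suppresses transient blips and keeps the alert actionable.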
Warning
Alert Fatigue: Alerts MUST be actionable. Too many false positives or non-actionable alerts lead to alert fatigue and missed critical issues. Review and tune alerts regularly based on actual incidents.
See: Incident Response Runbook for incident response procedures.
Platform-Specific Observability¶
Identity Platform¶
Key Metrics:
- Login Success/Failure Rates - Authentication success and failure rates
- Token Issuance Errors - Errors in token generation or validation
- Request Rate - Authentication requests per second
- Latency - p50, p95, p99 latency for auth endpoints
- Active Users - Concurrent authenticated users
Key Events:
- Failed authentication attempts (security monitoring)
- Token refresh events
- User provisioning events
Alerts:
- High failure rate (> 5% for 5 minutes)
- Token issuance errors
- Latency degradation (> 2x baseline)
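The "> 2x baseline" latency alert above needs a baseline to compare against; computing it as a median over recent p95 readings is one reasonable choice, sketched below, not a mandated approach:

```python
import statistics

def latency_degraded(history_p95_ms: list[float], current_p95_ms: float,
                     factor: float = 2.0) -> bool:
    """Flag degradation when the current p95 exceeds factor x a median-of-history baseline."""
    baseline = statistics.median(history_p95_ms)
    return current_p95_ms > factor * baseline

# Baseline is ~180 ms, so 450 ms crosses the 2x threshold; 300 ms does not.
print(latency_degraded([170, 180, 190, 185, 175], 450))  # True
print(latency_degraded([170, 180, 190, 185, 175], 300))  # False
```

A median baseline is robust to occasional slow readings in the history window, which a mean would not be.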
See: Identity Platform Runbook for Identity Platform operations.
Audit Platform¶
Key Metrics:
- Events Ingested per Second - Throughput of event ingestion
- Dropped/Failed Events - Events that failed to ingest
- Query Performance - Query latency and success rate
- Storage Usage - Storage growth and retention
- Tamper Detection - Integrity check failures
Key Events:
- Event ingestion failures
- Query timeouts
- Storage capacity warnings
Alerts:
- High drop rate (> 1% for 5 minutes)
- Query performance degradation
- Storage capacity warnings (> 80%)
See: Audit Platform Runbook for Audit Platform operations.
Config Platform¶
Key Metrics:
- Config Read/Write Errors - Errors accessing configuration
- Cache Hit Rate - Configuration cache effectiveness
- Update Latency - Time to propagate config changes
- Feature Flag Usage - Feature flag evaluation rates
- Version History - Config version changes
Key Events:
- Config update events
- Feature flag changes
- Cache invalidation events
Alerts:
- High read/write error rate (> 1% for 5 minutes)
- Low cache hit rate (< 80%)
- Config update failures
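The low-cache-hit-rate alert above reduces to a simple ratio check; a minimal sketch follows, with the 80% floor matching the alert condition and the no-traffic behavior being an illustrative choice:

```python
def cache_hit_rate(hits: int, misses: int) -> float:
    """Fraction of config reads served from cache."""
    total = hits + misses
    return hits / total if total else 1.0  # no traffic: treat as healthy

def low_hit_rate_alert(hits: int, misses: int, threshold: float = 0.80) -> bool:
    return cache_hit_rate(hits, misses) < threshold

print(cache_hit_rate(750, 250))      # 0.75
print(low_hit_rate_alert(750, 250))  # True: below the 80% floor
```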
See: Config Platform API Overview for Config Platform details.
Bot Platform¶
Key Metrics:
- Request Counts - Bot requests per second
- Error Rates - Bot request failures
- Latency to Model/Provider - Response time from AI providers
- Conversation Completion Rate - Successful conversation completions
- Token Usage - AI model token consumption
Key Events:
- Bot request failures
- Model provider errors
- Conversation timeouts
Alerts:
- High error rate (> 5% for 5 minutes)
- High latency to model provider (> 5 seconds)
- Model provider errors
See: Bot Platform API Overview for Bot Platform details.
SaaS and Factory-Generated Service Observability¶
Required Observability¶
Every SaaS Service Must Expose:
- Standard Health Endpoints - /health (liveness), /ready (readiness)
- Technical Metrics - Request rate, error rate, latency
- Business Events - Domain-specific events logged/audited
- Structured Logging - Logs with correlation IDs
- Distributed Tracing - Traces integrated with central tracing backend
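The liveness/readiness distinction above can be sketched with Python's standard library; a real service would use its framework's health check support, and the READY flag and JSON bodies here are illustrative:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

READY = {"value": True}  # flipped to False during startup/shutdown

class HealthHandler(BaseHTTPRequestHandler):
    """Minimal /health (liveness) and /ready (readiness) endpoints."""

    def do_GET(self):
        if self.path == "/health":          # liveness: the process is up
            self._respond(200, {"status": "healthy"})
        elif self.path == "/ready":         # readiness: safe to route traffic here
            code = 200 if READY["value"] else 503
            self._respond(code, {"status": "ready" if code == 200 else "not ready"})
        else:
            self._respond(404, {"error": "not found"})

    def _respond(self, code, body):
        payload = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # silence per-request logging in the example
        pass

server = HTTPServer(("127.0.0.1", 0), HealthHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_address[1]}"

health_status = urllib.request.urlopen(base + "/health").status
ready_status = urllib.request.urlopen(base + "/ready").status
server.shutdown()
print(health_status, ready_status)  # 200 200
```

Keeping /health and /ready separate matters: a service can be alive (do not restart it) yet not ready (do not send it traffic), e.g. while warming caches or waiting on a dependency.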
Observability Checklist¶
Production Readiness Checklist:
- Health check endpoint implemented and wired to monitoring
- HTTP/gRPC metrics exported (request rate, error rate, latency)
- Domain events logged/audited where relevant
- Tracing integrated with central tracing backend
- Dashboards created for service metrics
- Alerts configured for error rate, latency, health checks
- Custom KPIs defined and monitored (if applicable)
- Log aggregation configured with correlation IDs
See: Microservice Template for template with health checks and metrics.
See: Observability-Driven Design for observability principles.
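The structured-logging-with-correlation-IDs items in the checklist above can be sketched as follows; the JSON field names and the "orders-service" logger name are illustrative assumptions, not ConnectSoft conventions:

```python
import json
import logging
import uuid
from contextvars import ContextVar

# Carries the request's correlation ID across async boundaries within a request.
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, carrying the correlation ID."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders-service")
log.addHandler(handler)
log.setLevel(logging.INFO)

# At request entry, reuse the inbound correlation ID (or mint one) so that
# every log line from this request can be joined in the aggregation backend.
correlation_id.set(str(uuid.uuid4()))
log.info("order accepted")
```

Because every line is JSON with the same correlation_id field, the log aggregation backend can reassemble a single request's trail across services without fragile text parsing.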
Factory-Generated Service Observability¶
Factory Template Includes:
- Health check endpoints (liveness, readiness)
- Metrics export (OpenTelemetry or similar)
- Structured logging with correlation IDs
- Distributed tracing integration
- Standard dashboards and alerts
Customization:
- Services can add custom metrics and KPIs
- Domain-specific events can be added
- Business metrics can be integrated
See: Factory Overview for Factory details.
See: Agent Microservice Standard Blueprint for microservice structure.
Related Documents¶
- Operations and SRE Overview - Operations overview
- Deployment and Rollback Runbook - Deployment procedures
- Incident Response Runbook - Incident response procedures
- Monitoring and Dashboards - Detailed monitoring practices
- Observability-Driven Design - Observability principles
- Microservice Template - Template with health checks and metrics
- Support and SLA Policy - SLO definitions