Observability – Dashboards and Alerts

This document defines the dashboards that exist (or should exist) for each platform and microservice category, and the minimal alerting rules required for production readiness. It is written for ops engineers, SREs, and developers who set up monitoring and alerting for ConnectSoft systems.

Observability makes it possible to detect issues before customers report them, debug them quickly, and track business-level KPIs alongside technical metrics.

Important

Production Readiness Requirement: No production service is considered "live" without basic dashboards and alerts. Health checks, metrics, logging, and tracing are required for all production services.

Observability Goals

Core Goals:

  • Detect Issues Before Customers Report - Proactive monitoring and alerting
  • Support Rapid Debugging - Logs, metrics, traces for quick root cause analysis
  • Provide Business Visibility - Business-level KPIs alongside technical metrics
  • Enable SLO-Based Alerting - Alert on SLO breaches, not raw metrics
  • Reduce Alert Fatigue - Alerts must be actionable and meaningful

See: Observability-Driven Design for observability principles.

See: Support and SLA Policy for SLO definitions.

Core Dashboards

Dashboard Categories

Dashboard Types:

  • Platform Overview Dashboard - Health of core platforms (Identity, Audit, Config, Bot)
  • Service Technical Dashboard - Technical metrics for individual services
  • Business Metrics Dashboard - Domain-specific events and KPIs
  • Factory Dashboard - Factory runs, agent performance, knowledge system

Dashboard Table

Dashboard Type | Metrics/Views | Who Uses It
Platform Overview | Health of core platforms, error rates, availability | Ops/SRE, architects
Service Technical | Requests, errors, latency, resource usage | Dev + Ops
Business Metrics | Domain-specific events and KPIs | Product, business, ops
Factory Dashboard | Run success rate, agent performance, queue depth | Factory team, ops

Platform Overview Dashboard:

  • Health status of Identity, Audit, Config, Bot platforms
  • Error rates and availability per platform
  • Request rates and latency trends
  • Resource usage (CPU, memory, queues)

Service Technical Dashboard:

  • Request rate (requests per second)
  • Error rate (errors per second, error percentage)
  • Latency (p50, p95, p99)
  • Resource usage (CPU, memory, disk, network)
  • Health check status
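
As an illustration of the latency panel, the sketch below computes p50, p95, and p99 from raw latency samples using the nearest-rank method. The function names and the sample data are hypothetical. Many monitoring backends estimate percentiles from histogram buckets instead, so treat this as a way to reason about the numbers rather than as the platform's implementation.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Return the three percentiles shown on the service dashboard."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}


# Hypothetical data: one request at each latency from 1 ms to 100 ms.
latencies_ms = list(range(1, 101))
summary = latency_summary(latencies_ms)
print(summary)
```

Percentiles are preferred over averages here because a small share of very slow requests can be hidden by a healthy mean.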

Business Metrics Dashboard:

  • Domain-specific events (sign-ins, jobs processed, etc.)
  • Business KPIs (conversion rates, user activity, etc.)
  • Custom metrics relevant to the SaaS solution

See: Monitoring and Dashboards for detailed monitoring practices.

Minimal Alerting Rules

Baseline Alert Rules

Required Alerts for Any Production Service:

  • High Error Rate - 5xx or domain-specific failures above threshold
  • Increased Latency - Latency beyond SLO threshold
  • Health Check Failures - Service reported unhealthy
  • Resource Saturation - CPU, memory, queue backlog above threshold

Alert Rules Table

Alert Type | Condition (Conceptual) | Severity | Action
Error Rate Spike | Error rate > 5% for 5 minutes | Sev1/2 | Page on-call, investigate
Latency Degradation | p95 latency > threshold (2x baseline) for 10 minutes | Sev2 | Investigate, consider rollback
Health Check Failure | Service reported unhealthy for 2 consecutive checks | Sev1/2 | Page on-call, check logs
Queue Backlog | Messages queued > expected threshold for 10 minutes | Sev2/3 | Investigate processing rate
Resource Saturation | CPU > 80% or memory > 90% for 15 minutes | Sev2 | Scale or optimize
SLO Breach | Availability below SLO threshold | Sev1/2 | Page on-call, investigate

Alert Conditions:

  • Error Rate Spike - Sustained high error rate indicates regression
  • Latency Degradation - Performance regression affecting user experience
  • Health Check Failure - Service is down or degraded
  • Queue Backlog - Processing bottleneck or downstream issue
  • Resource Saturation - Capacity issue requiring scaling
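
Most of the conditions above share one shape: a value must stay above a threshold for a sustained period before the alert fires. The sketch below shows one way to evaluate that shape, using the "error rate > 5% for 5 minutes" rule as the example. The `Rule` type, the `is_firing` function, and the sample format are assumptions made for illustration. In practice the alerting backend usually performs this evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    threshold: float   # the alert condition is value > threshold
    for_seconds: int   # how long the breach must persist


def is_firing(rule, samples, now):
    """samples: list of (timestamp_seconds, value), oldest first.

    Fires only when the history reaches back to the start of the window
    and every sample inside the window exceeds the threshold.
    """
    window_start = now - rule.for_seconds
    recent = [v for t, v in samples if window_start <= t <= now]
    has_full_window = bool(samples) and samples[0][0] <= window_start
    return has_full_window and bool(recent) and all(
        v > rule.threshold for v in recent
    )


# Error rate > 5% for 5 minutes, sampled every 60 seconds (hypothetical data).
ERROR_RATE_RULE = Rule("error_rate_spike", threshold=0.05, for_seconds=300)
samples = [(t, 0.08) for t in range(0, 601, 60)]
firing = is_firing(ERROR_RATE_RULE, samples, now=600)
print(firing)
```

Requiring the breach to hold for the whole window is what keeps short spikes from paging anyone, which supports the goal of reducing alert fatigue.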

Warning

Alert Fatigue: Alerts MUST be actionable. Too many false positives or non-actionable alerts lead to alert fatigue and missed critical issues. Review and tune alerts regularly based on actual incidents.

See: Incident Response Runbook for incident response procedures.

Platform-Specific Observability

Identity Platform

Key Metrics:

  • Login Success/Failure Rates - Authentication success and failure rates
  • Token Issuance Errors - Errors in token generation or validation
  • Request Rate - Authentication requests per second
  • Latency - p50, p95, p99 latency for auth endpoints
  • Active Users - Concurrent authenticated users

Key Events:

  • Failed authentication attempts (security monitoring)
  • Token refresh events
  • User provisioning events

Alerts:

  • High failure rate (> 5% for 5 minutes)
  • Token issuance errors
  • Latency degradation (> 2x baseline)

See: Identity Platform Runbook for Identity Platform operations.

Audit Platform

Key Metrics:

  • Events Ingested per Second - Throughput of event ingestion
  • Dropped/Failed Events - Events that failed to ingest
  • Query Performance - Query latency and success rate
  • Storage Usage - Storage growth and retention
  • Tamper Detection - Integrity check failures

Key Events:

  • Event ingestion failures
  • Query timeouts
  • Storage capacity warnings

Alerts:

  • High drop rate (> 1% for 5 minutes)
  • Query performance degradation
  • Storage capacity warnings (> 80%)
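
The drop-rate alert is a ratio of two counters. The sketch below derives it from two snapshots of cumulative counters, assuming hypothetical counter names `ingested` (successfully stored events) and `dropped` (events that failed to ingest). The real metric names depend on how the Audit Platform exports its metrics.

```python
DROP_RATE_THRESHOLD = 0.01  # "> 1%" from the alert list above


def drop_rate(prev, curr):
    """Fraction of events dropped between two counter snapshots.

    prev and curr are dicts of cumulative counters. Returns None when a
    counter went backwards, which usually means the process restarted.
    """
    ingested = curr["ingested"] - prev["ingested"]
    dropped = curr["dropped"] - prev["dropped"]
    if ingested < 0 or dropped < 0:
        return None
    total = ingested + dropped
    return dropped / total if total else 0.0


# Hypothetical snapshots taken five minutes apart.
prev = {"ingested": 1000, "dropped": 5}
curr = {"ingested": 1980, "dropped": 25}
rate = drop_rate(prev, curr)
print(rate, rate > DROP_RATE_THRESHOLD)
```

Working from counter deltas rather than absolute totals keeps the rate tied to the most recent window, so old failures do not mask or exaggerate current behaviour.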

See: Audit Platform Runbook for Audit Platform operations.

Config Platform

Key Metrics:

  • Config Read/Write Errors - Errors accessing configuration
  • Cache Hit Rate - Configuration cache effectiveness
  • Update Latency - Time to propagate config changes
  • Feature Flag Usage - Feature flag evaluation rates
  • Version History - Config version changes

Key Events:

  • Config update events
  • Feature flag changes
  • Cache invalidation events

Alerts:

  • High read/write error rate (> 1% for 5 minutes)
  • Low cache hit rate (< 80%)
  • Config update failures

See: Config Platform API Overview for Config Platform details.

Bot Platform

Key Metrics:

  • Request Counts - Bot requests per second
  • Error Rates - Bot request failures
  • Latency to Model/Provider - Response time from AI providers
  • Conversation Completion Rate - Successful conversation completions
  • Token Usage - AI model token consumption

Key Events:

  • Bot request failures
  • Model provider errors
  • Conversation timeouts

Alerts:

  • High error rate (> 5% for 5 minutes)
  • High latency to model provider (> 5 seconds)
  • Model provider errors
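
Latency to the model provider has to be measured around the outbound call itself. The sketch below uses a context manager to record the duration and outcome of each call. `LatencyRecorder`, the provider name, and the in-memory sample list are hypothetical. A real service would more likely record these values through its metrics library.

```python
import time
from contextlib import contextmanager

PROVIDER_LATENCY_THRESHOLD_S = 5.0  # "> 5 seconds" from the alert list above


class LatencyRecorder:
    def __init__(self):
        self.samples = []  # tuples of (provider, seconds, ok)

    @contextmanager
    def track(self, provider):
        """Time the wrapped block and record whether it raised."""
        start = time.perf_counter()
        ok = True
        try:
            yield
        except Exception:
            ok = False
            raise
        finally:
            elapsed = time.perf_counter() - start
            self.samples.append((provider, elapsed, ok))

    def slow_calls(self, threshold_s=PROVIDER_LATENCY_THRESHOLD_S):
        return [s for s in self.samples if s[1] > threshold_s]


recorder = LatencyRecorder()
with recorder.track("example-provider"):
    pass  # the call to the AI provider would go here

print(recorder.samples, recorder.slow_calls())
```

Recording failed calls alongside successful ones lets the same samples feed both the latency alert and the model provider error alert.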

See: Bot Platform API Overview for Bot Platform details.

SaaS and Factory-Generated Service Observability

Required Observability

Every SaaS Service Must Expose:

  • Standard Health Endpoints - /health (liveness), /ready (readiness)
  • Technical Metrics - Request rate, error rate, latency
  • Business Events - Domain-specific events logged/audited
  • Structured Logging - Logs with correlation IDs
  • Distributed Tracing - Traces integrated with central tracing backend
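
A minimal sketch of the two health endpoints follows, using only the Python standard library. The paths `/health` and `/ready` come from the list above. The response bodies, the `ServiceState` class, and the port are assumptions. Liveness answers "is the process running", while readiness answers "can it take traffic right now".

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class ServiceState:
    """Hypothetical holder for readiness information."""

    def __init__(self):
        self.dependencies_ready = False


def route(path, state):
    """Map a request path to (status_code, body)."""
    if path == "/health":
        # Liveness: the process is up and able to answer.
        return 200, {"status": "alive"}
    if path == "/ready":
        # Readiness: only report ready once dependencies are usable.
        if state.dependencies_ready:
            return 200, {"status": "ready"}
        return 503, {"status": "not ready"}
    return 404, {"error": "not found"}


def make_handler(state):
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            status, body = route(self.path, state)
            payload = json.dumps(body).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return Handler


def serve(state, port=8080):
    """Blocks forever; call from the service entry point."""
    HTTPServer(("0.0.0.0", port), make_handler(state)).serve_forever()


state = ServiceState()
print(route("/health", state), route("/ready", state))
```

Keeping the two endpoints separate means a service that is still warming up is taken out of rotation without being restarted.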

Observability Checklist

Production Readiness Checklist:

  • Health check endpoint implemented and wired to monitoring
  • HTTP/gRPC metrics exported (request rate, error rate, latency)
  • Domain events logged/audited where relevant
  • Tracing integrated with central tracing backend
  • Dashboards created for service metrics
  • Alerts configured for error rate, latency, health checks
  • Custom KPIs defined and monitored (if applicable)
  • Log aggregation configured with correlation IDs
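
The last checklist item can be sketched with the standard `logging` and `contextvars` modules: each request sets a correlation ID once, and every log line written while handling it carries that ID as a JSON field. The field names, the `handle_request` function, and the `"-"` default are assumptions, not a ConnectSoft logging schema.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the correlation ID for the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def build_logger(name, stream):
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


def handle_request(logger, incoming_id=None):
    """Reuse the caller's ID when present, otherwise generate one."""
    token = correlation_id.set(incoming_id or str(uuid.uuid4()))
    try:
        logger.info("request received")
        logger.info("request completed")
    finally:
        correlation_id.reset(token)


buffer = io.StringIO()
logger = build_logger("example-service", buffer)
handle_request(logger, incoming_id="req-123")
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(records)
```

Because the ID lives in a context variable rather than in function arguments, code deeper in the call stack logs with the right ID without having to pass it around.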

See: Microservice Template for a template with health checks and metrics.

See: Observability-Driven Design for observability principles.

Factory-Generated Service Observability

Factory Template Includes:

  • Health check endpoints (liveness, readiness)
  • Metrics export (OpenTelemetry or similar)
  • Structured logging with correlation IDs
  • Distributed tracing integration
  • Standard dashboards and alerts

Customization:

  • Services can add custom metrics and KPIs
  • Domain-specific events can be added
  • Business metrics can be integrated

See: Factory Overview for Factory details.

See: Agent Microservice Standard Blueprint for microservice structure.