Monitoring and Dashboards¶
This document describes monitoring, alerting, and dashboard practices for ConnectSoft systems. It is written for operations teams and SREs setting up and maintaining monitoring.
ConnectSoft uses comprehensive monitoring covering technical metrics, business metrics, logs, traces, and knowledge. Dashboards provide real-time visibility into system health, and alerting ensures rapid incident response.
Tip
Monitor what matters. Don't alert on everything—alert on actionable issues. Use SLOs (Service Level Objectives) for alerting, not raw metrics.
What We Monitor¶
Technical Metrics¶
Availability:
- Service uptime percentage
- Health check success rate
- Endpoint availability

Performance:
- Request latency (p50, p95, p99)
- Request rate (requests per second)
- Throughput (operations per second)

Errors:
- Error rate (errors per second)
- Error percentage (errors / total requests)
- Error types and patterns

Resources:
- CPU usage
- Memory usage
- Disk usage
- Network usage
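How these metrics are collected depends on each service's stack. As an illustration only, the sketch below records request latency, volume, and errors with the Python prometheus_client library; the metric names, buckets, and port are hypothetical, not ConnectSoft conventions.

```python
# Minimal sketch: exposing the request metrics above with prometheus_client.
# Metric names, buckets, and port are illustrative only.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency; p50/p95/p99 are derived from these buckets at query time.",
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)
REQUESTS = Counter("http_requests_total", "All requests (for rate and error %).")
REQUEST_ERRORS = Counter("http_request_errors_total", "Failed requests.")

def handle_request(handler):
    """Wrap a request handler so volume, errors, and latency are recorded."""
    REQUESTS.inc()
    start = time.perf_counter()
    try:
        return handler()
    except Exception:
        REQUEST_ERRORS.inc()
        raise
    finally:
        REQUEST_LATENCY.observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(8000)  # exposes a /metrics scrape endpoint on port 8000
```

Percentiles such as p95 and p99 are then computed from the histogram buckets at query time rather than inside the service itself.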
Business Metrics¶
Identity Platform:
- Sign-ins per day/hour
- Token generation rate
- Active users
- Failed authentication attempts

Audit Platform:
- Events ingested per second
- Events queried per second
- Storage growth rate
- Retention compliance

Factory:
- Generation runs per day
- Generation success rate
- Average generation time
- Knowledge reuse rate

SaaS Platforms:
- Tenant count
- API usage per tenant
- Feature usage
- Subscription metrics
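Business metrics such as API usage per tenant usually need an extra dimension on the counter. A minimal sketch, again assuming Prometheus-style labeled counters; the metric and label names are illustrative:

```python
# Sketch: per-tenant API usage and sign-in counters using labels.
# Metric and label names are illustrative, not ConnectSoft conventions.
from prometheus_client import Counter

API_CALLS = Counter(
    "tenant_api_calls_total",
    "API calls per tenant and endpoint.",
    ["tenant_id", "endpoint"],
)
SIGN_INS = Counter(
    "sign_ins_total",
    "Sign-in attempts by outcome.",
    ["result"],  # "success" or "failure"
)

def record_api_call(tenant_id: str, endpoint: str) -> None:
    API_CALLS.labels(tenant_id=tenant_id, endpoint=endpoint).inc()

def record_sign_in(success: bool) -> None:
    SIGN_INS.labels(result="success" if success else "failure").inc()
```

Per-tenant labels can turn into high-cardinality series, so this approach assumes tenant counts stay modest.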
Types of Dashboards¶
Platform Dashboards¶
Identity Platform Dashboard:
- Sign-in success rate
- Token generation latency
- Active users
- Error rate
- Resource usage

Audit Platform Dashboard:
- Event ingestion rate
- Query latency
- Storage growth
- Backlog (if any)
- Error rate

Config Platform Dashboard:
- Configuration requests
- Feature flag usage
- Update frequency
- Error rate

Bot Platform Dashboard:
- Conversation volume
- Response latency
- User satisfaction
- Error rate
Factory Dashboard¶
Generation Metrics:
- Generation runs (success rate, duration)
- Runs per day/hour
- Average generation time
- Failed runs

Agent Performance:
- Agent task completion rate
- Agent execution time
- Agent errors
- Agent queue length

Knowledge System:
- Patterns stored
- Knowledge reuse rate
- Query performance
- Storage growth

Infrastructure:
- API request rate
- API latency
- Error rate
- Resource usage
Per-Service Dashboards¶
Service-Specific:
- Request rate and latency
- Error rate and types
- Resource usage
- Business metrics
- Dependency health

Example: the Invoice Service dashboard shows invoice creation rate, payment processing rate, error rate, and dependency health (database, messaging, etc.).
Alerting Principles¶
Avoid Alert Fatigue¶
Principles:
- Alert on actionable issues: only alert when action is needed
- Use SLOs for alerting: alert when an SLO is at risk, not on every error
- Clear runbooks: every alert must have a runbook
- Proper severity: use appropriate severity levels

Don't Alert On:
- Every error (alert on an error rate threshold)
- Normal variations (alert on anomalies)
- Non-actionable metrics (alert on actionable issues)
Use SLOs for Alerting¶
SLO-Based Alerting:
- Error budget: alert when the error budget is at risk
- Latency SLO: alert when latency exceeds the SLO
- Availability SLO: alert when availability drops below the SLO

Example:
- SLO: 99.9% availability (error budget: 0.1%)
- Alert: when the error-budget consumption rate would exhaust the budget in < 7 days
- Don't alert: on every individual error (only alert when the SLO is at risk)
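As a worked version of this example: a 99.9% availability SLO over a 30-day window leaves an error budget of 0.1% of requests, roughly 43 minutes of downtime. The sketch below is an illustration, not ConnectSoft's actual alerting rules (which would normally be expressed in the monitoring system); it estimates how long the remaining budget lasts at the current burn rate and flags when that drops under 7 days.

```python
# Sketch: error-budget burn-rate check for a 99.9% availability SLO.
# Thresholds and window are illustrative only.
SLO_TARGET = 0.999                # 99.9% availability
WINDOW_DAYS = 30.0                # SLO evaluation window
ERROR_BUDGET = 1.0 - SLO_TARGET   # 0.1% of requests may fail

def days_until_budget_exhausted(error_ratio: float, budget_used: float) -> float:
    """Days until the budget runs out if the current error ratio continues.

    error_ratio: errors / total requests over a recent window (e.g. 1 hour).
    budget_used: fraction of the 30-day error budget already consumed (0..1).
    """
    if error_ratio <= 0:
        return float("inf")
    burn_rate = error_ratio / ERROR_BUDGET        # 1.0 = exactly on budget
    remaining_days = WINDOW_DAYS * (1.0 - budget_used)
    return remaining_days / burn_rate

def should_alert(error_ratio: float, budget_used: float) -> bool:
    # Alert when the budget would be gone in under 7 days, per the example above.
    return days_until_budget_exhausted(error_ratio, budget_used) < 7.0

# Example: 0.5% errors over the last hour with 20% of the budget already used
# -> burn rate 5x -> 24 remaining budget-days / 5 = 4.8 days -> alert fires.
```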
Alert Severity Levels¶
Critical:
- Service down
- Data loss risk
- Security breach
- SLO violation (immediate risk)

Warning:
- Degraded performance
- High error rate
- SLO at risk (but not violated)
- Resource constraints

Info:
- Notable events
- Threshold breaches (non-critical)
- Maintenance notifications
Clear Runbooks¶
Every Alert Must Have:
- Runbook: step-by-step resolution guide
- Owner: who responds to this alert
- Escalation: when to escalate
- Documentation: links to relevant docs
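One way to make these requirements enforceable is to treat them as mandatory fields on every alert definition. The sketch below is a hypothetical Python illustration; in practice the same information is typically attached as annotations on the monitoring system's alert rules.

```python
# Sketch: required metadata for every alert definition. Field names are
# illustrative; they usually live as annotations on alert rules instead.
from dataclasses import dataclass

@dataclass(frozen=True)
class AlertDefinition:
    name: str
    severity: str          # "critical", "warning", or "info"
    runbook_url: str       # step-by-step resolution guide
    owner: str             # team or on-call rotation that responds
    escalation: str        # when and to whom to escalate
    docs_url: str          # links to relevant documentation

    def __post_init__(self) -> None:
        # Reject alert definitions missing any required metadata.
        missing = [f for f in ("runbook_url", "owner", "escalation", "docs_url")
                   if not getattr(self, f)]
        if missing:
            raise ValueError(f"Alert '{self.name}' is missing: {', '.join(missing)}")
```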
Example KPIs¶
Identity Platform KPIs¶
Availability:
- Target: 99.9% uptime
- Measurement: health check success rate
- Alert: < 99.9% for > 5 minutes

Performance:
- Target: p95 latency < 100ms
- Measurement: token generation latency
- Alert: p95 > 100ms for > 5 minutes

Business:
- Sign-ins per day: track user activity
- Active users: track user engagement
- Failed authentication rate: track security issues
Audit Platform KPIs¶
Ingestion:
- Target: process events within 1 second
- Measurement: event processing latency
- Alert: processing latency > 1 second

Query:
- Target: p95 query latency < 500ms
- Measurement: query response time
- Alert: p95 > 500ms for > 5 minutes

Storage:
- Storage growth rate: track capacity needs
- Retention compliance: track compliance
- Backlog: track processing capacity
Factory KPIs¶
Generation:
- Success rate: > 95%
- Average time: < 30 minutes
- Alert: success rate < 95% or average time > 30 minutes

Agents:
- Task completion rate: > 98%
- Average execution time: per agent type
- Alert: completion rate < 98%

Knowledge:
- Reuse rate: > 50%
- Query performance: < 100ms
- Storage growth: track capacity needs
Related Documents¶
- Operations Overview: operations documentation overview
- Incident Management: incident response process
- Observability-Driven Design: observability principles
- Factory Operations: Factory monitoring