Monitoring and Dashboards¶
This document describes monitoring, alerting, and dashboard practices for ConnectSoft systems. It is written for operations teams and SREs setting up and maintaining monitoring.
ConnectSoft uses comprehensive monitoring covering technical metrics, business metrics, logs, traces, and knowledge. Dashboards provide real-time visibility into system health, and alerting ensures rapid incident response.
Tip
Monitor what matters. Don't alert on everything—alert on actionable issues. Use SLOs (Service Level Objectives) for alerting, not raw metrics.
What We Monitor¶
Technical Metrics¶
Availability:
- Service uptime percentage
- Health check success rate
- Endpoint availability

Performance:
- Request latency (p50, p95, p99)
- Request rate (requests per second)
- Throughput (operations per second)

Errors:
- Error rate (errors per second)
- Error percentage (errors / total requests)
- Error types and patterns

Resources:
- CPU usage
- Memory usage
- Disk usage
- Network usage
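How these metrics are collected depends on each service's stack. As an illustration only, the sketch below records request latency, volume, and errors with the Python prometheus_client library; the metric names, buckets, and port are hypothetical, not ConnectSoft conventions.

```python
# Minimal sketch: exposing the request metrics above with prometheus_client.
# Metric names, buckets, and port are illustrative only.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency; p50/p95/p99 are derived from these buckets at query time.",
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)
REQUESTS = Counter("http_requests_total", "All requests (for rate and error %).")
REQUEST_ERRORS = Counter("http_request_errors_total", "Failed requests.")

def handle_request(handler):
    """Wrap a request handler so volume, errors, and latency are recorded."""
    REQUESTS.inc()
    start = time.perf_counter()
    try:
        return handler()
    except Exception:
        REQUEST_ERRORS.inc()
        raise
    finally:
        REQUEST_LATENCY.observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(8000)  # exposes a /metrics scrape endpoint on port 8000
```

Percentiles such as p95 and p99 are then computed from the histogram buckets at query time rather than inside the service itself.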
Business Metrics¶
Identity Platform:
- Sign-ins per day/hour
- Token generation rate
- Active users
- Failed authentication attempts

Audit Platform:
- Events ingested per second
- Events queried per second
- Storage growth rate
- Retention compliance

Factory:
- Generation runs per day
- Generation success rate
- Average generation time
- Knowledge reuse rate

SaaS Platforms:
- Tenant count
- API usage per tenant
- Feature usage
- Subscription metrics
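Business metrics such as API usage per tenant usually need an extra dimension on the counter. A minimal sketch, again assuming Prometheus-style labeled counters; the metric and label names are illustrative:

```python
# Sketch: per-tenant API usage and sign-in counters using labels.
# Metric and label names are illustrative, not ConnectSoft conventions.
from prometheus_client import Counter

API_CALLS = Counter(
    "tenant_api_calls_total",
    "API calls per tenant and endpoint.",
    ["tenant_id", "endpoint"],
)
SIGN_INS = Counter(
    "sign_ins_total",
    "Sign-in attempts by outcome.",
    ["result"],  # "success" or "failure"
)

def record_api_call(tenant_id: str, endpoint: str) -> None:
    API_CALLS.labels(tenant_id=tenant_id, endpoint=endpoint).inc()

def record_sign_in(success: bool) -> None:
    SIGN_INS.labels(result="success" if success else "failure").inc()
```

Per-tenant labels can turn into high-cardinality series, so this approach assumes tenant counts stay modest.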
Types of Dashboards¶
Platform Dashboards¶
Identity Platform Dashboard:
- Sign-in success rate
- Token generation latency
- Active users
- Error rate
- Resource usage

Audit Platform Dashboard:
- Event ingestion rate
- Query latency
- Storage growth
- Backlog (if any)
- Error rate

Config Platform Dashboard:
- Configuration requests
- Feature flag usage
- Update frequency
- Error rate

Bot Platform Dashboard:
- Conversation volume
- Response latency
- User satisfaction
- Error rate
Factory Dashboard¶
Generation Metrics:
- Generation runs (success rate, duration)
- Runs per day/hour
- Average generation time
- Failed runs

Agent Performance:
- Agent task completion rate
- Agent execution time
- Agent errors
- Agent queue length

Knowledge System:
- Patterns stored
- Knowledge reuse rate
- Query performance
- Storage growth

Infrastructure:
- API request rate
- API latency
- Error rate
- Resource usage
Per-Service Dashboards¶
Service-Specific:
- Request rate and latency
- Error rate and types
- Resource usage
- Business metrics
- Dependency health

Example: the Invoice Service dashboard shows invoice creation rate, payment processing rate, error rate, and dependency health (database, messaging, etc.).
Alerting Principles¶
Avoid Alert Fatigue¶
Principles:
- Alert on actionable issues: only alert when action is needed
- Use SLOs for alerting: alert when an SLO is at risk, not on every error
- Clear runbooks: every alert must have a runbook
- Proper severity: use appropriate severity levels

Don't Alert On:
- Every error (alert on an error rate threshold)
- Normal variations (alert on anomalies)
- Non-actionable metrics (alert on actionable issues)
Use SLOs for Alerting¶
SLO-Based Alerting:
- Error budget: alert when the error budget is at risk
- Latency SLO: alert when latency exceeds the SLO
- Availability SLO: alert when availability drops below the SLO

Example:
- SLO: 99.9% availability (error budget: 0.1%)
- Alert: when the error-budget consumption rate would exhaust the budget in < 7 days
- Don't alert: on every individual error (only alert when the SLO is at risk)
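As a worked version of this example: a 99.9% availability SLO over a 30-day window leaves an error budget of 0.1% of requests, roughly 43 minutes of downtime. The sketch below is an illustration, not ConnectSoft's actual alerting rules (which would normally be expressed in the monitoring system); it estimates how long the remaining budget lasts at the current burn rate and flags when that drops under 7 days.

```python
# Sketch: error-budget burn-rate check for a 99.9% availability SLO.
# Thresholds and window are illustrative only.
SLO_TARGET = 0.999                # 99.9% availability
WINDOW_DAYS = 30.0                # SLO evaluation window
ERROR_BUDGET = 1.0 - SLO_TARGET   # 0.1% of requests may fail

def days_until_budget_exhausted(error_ratio: float, budget_used: float) -> float:
    """Days until the budget runs out if the current error ratio continues.

    error_ratio: errors / total requests over a recent window (e.g. 1 hour).
    budget_used: fraction of the 30-day error budget already consumed (0..1).
    """
    if error_ratio <= 0:
        return float("inf")
    burn_rate = error_ratio / ERROR_BUDGET        # 1.0 = exactly on budget
    remaining_days = WINDOW_DAYS * (1.0 - budget_used)
    return remaining_days / burn_rate

def should_alert(error_ratio: float, budget_used: float) -> bool:
    # Alert when the budget would be gone in under 7 days, per the example above.
    return days_until_budget_exhausted(error_ratio, budget_used) < 7.0

# Example: 0.5% errors over the last hour with 20% of the budget already used
# -> burn rate 5x -> 24 remaining budget-days / 5 = 4.8 days -> alert fires.
```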
Alert Severity Levels¶
Critical:
- Service down
- Data loss risk
- Security breach
- SLO violation (immediate risk)

Warning:
- Degraded performance
- High error rate
- SLO at risk (but not violated)
- Resource constraints

Info:
- Notable events
- Threshold breaches (non-critical)
- Maintenance notifications
Clear Runbooks¶
Every Alert Must Have:
- Runbook: step-by-step resolution guide
- Owner: who responds to this alert
- Escalation: when to escalate
- Documentation: links to relevant docs
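One way to make these requirements enforceable is to treat them as mandatory fields on every alert definition. The sketch below is a hypothetical Python illustration; in practice the same information is typically attached as annotations on the monitoring system's alert rules.

```python
# Sketch: required metadata for every alert definition. Field names are
# illustrative; they usually live as annotations on alert rules instead.
from dataclasses import dataclass

@dataclass(frozen=True)
class AlertDefinition:
    name: str
    severity: str          # "critical", "warning", or "info"
    runbook_url: str       # step-by-step resolution guide
    owner: str             # team or on-call rotation that responds
    escalation: str        # when and to whom to escalate
    docs_url: str          # links to relevant documentation

    def __post_init__(self) -> None:
        # Reject alert definitions missing any required metadata.
        missing = [f for f in ("runbook_url", "owner", "escalation", "docs_url")
                   if not getattr(self, f)]
        if missing:
            raise ValueError(f"Alert '{self.name}' is missing: {', '.join(missing)}")
```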
Example KPIs¶
Identity Platform KPIs¶
Availability:
- Target: 99.9% uptime
- Measurement: health check success rate
- Alert: < 99.9% for > 5 minutes

Performance:
- Target: p95 latency < 100ms
- Measurement: token generation latency
- Alert: p95 > 100ms for > 5 minutes

Business:
- Sign-ins per day: track user activity
- Active users: track user engagement
- Failed authentication rate: track security issues
Audit Platform KPIs¶
Ingestion:
- Target: process events within 1 second
- Measurement: event processing latency
- Alert: processing latency > 1 second

Query:
- Target: p95 query latency < 500ms
- Measurement: query response time
- Alert: p95 > 500ms for > 5 minutes

Storage:
- Storage growth rate: track capacity needs
- Retention compliance: track compliance
- Backlog: track processing capacity
Factory KPIs¶
Generation:
- Success rate: > 95%
- Average time: < 30 minutes
- Alert: success rate < 95% or average time > 30 minutes

Agents:
- Task completion rate: > 98%
- Average execution time: per agent type
- Alert: completion rate < 98%

Knowledge:
- Reuse rate: > 50%
- Query performance: < 100ms
- Storage growth: track capacity needs
Related Documents¶
- Operations Overview: operations documentation overview
- Incident Management: incident response process
- Observability-Driven Design: observability principles
- Factory Operations: Factory monitoring