Incident Management¶
This document defines incident management procedures and practices for ConnectSoft systems. It is written for operations teams, SREs, and engineers responding to incidents.
Incident management ensures rapid response, effective resolution, and continuous improvement. This document covers incident classification, response procedures, communication, and post-incident reviews.
Important
All incidents require postmortems. Blameless postmortems focus on learning, not blame. Update runbooks after every incident.
Incident Severity Levels¶
SEV-1: Critical¶
Definition: Complete service outage or data loss risk affecting all users.
Examples:
- Identity Platform completely down (no logins possible)
- Factory completely unavailable (no generations possible)
- Data corruption or loss risk
- Security breach

Response Time: Immediate (< 15 minutes)
Resolution Target: < 1 hour
Escalation: CTO, VP Engineering notified
SEV-2: High¶
Definition: Significant service degradation affecting many users.
Examples:
- High error rate (> 10%)
- Performance degradation (latency > 2x normal)
- Partial service outage
- Data inconsistency

Response Time: < 30 minutes
Resolution Target: < 4 hours
Escalation: Engineering leads notified
SEV-3: Medium¶
Definition: Service degradation affecting some users or non-critical features.
Examples:
- Moderate error rate (1-10%)
- Performance issues (latency 1.5-2x normal)
- Feature degradation
- Non-critical service unavailable

Response Time: < 2 hours
Resolution Target: < 24 hours
Escalation: Team lead notified
SEV-4: Low¶
Definition: Minor issues with workarounds or affecting few users.
Examples:
- Low error rate (< 1%)
- Minor performance issues
- Cosmetic issues
- Documentation issues

Response Time: < 8 hours (business hours)
Resolution Target: < 1 week
Escalation: None required
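The thresholds above can be encoded so that triage tooling applies them consistently. The following is a minimal sketch, not an existing ConnectSoft tool: the names `Severity`, `Targets`, `TARGETS`, and `classify` are hypothetical, and the inputs (an error-rate fraction, a latency ratio against normal, and a full-outage flag) are assumed to come from monitoring. Qualitative criteria such as a security breach or data loss risk still need human judgment.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class Targets:
    response: timedelta
    resolution: timedelta
    escalation: str


# Targets copied from the severity definitions above.
# SEV-4 response time is in business hours, which this sketch does not model.
TARGETS = {
    Severity.SEV1: Targets(timedelta(minutes=15), timedelta(hours=1), "CTO, VP Engineering"),
    Severity.SEV2: Targets(timedelta(minutes=30), timedelta(hours=4), "Engineering leads"),
    Severity.SEV3: Targets(timedelta(hours=2), timedelta(hours=24), "Team lead"),
    Severity.SEV4: Targets(timedelta(hours=8), timedelta(weeks=1), "None required"),
}


def classify(error_rate: float, latency_ratio: float, full_outage: bool = False) -> Severity:
    """Suggest a severity from metrics.

    error_rate is a fraction (0.12 means 12%); latency_ratio is
    current latency divided by normal latency (hypothetical inputs).
    """
    if full_outage:
        return Severity.SEV1
    if error_rate > 0.10 or latency_ratio > 2.0:
        return Severity.SEV2
    if error_rate >= 0.01 or latency_ratio >= 1.5:
        return Severity.SEV3
    return Severity.SEV4
```

For example, `classify(0.12, 1.0)` suggests SEV-2, while an error rate of exactly 10% stays at SEV-3 because the SEV-2 threshold is strictly greater than 10%.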
Roles and Responsibilities¶
On-Call Engineer¶
Responsibilities:
- Respond to alerts and incidents
- Triage and diagnose issues
- Execute runbook procedures
- Escalate when needed
- Document incident actions

Skills Required:
- System knowledge
- Troubleshooting skills
- Runbook familiarity
- Communication skills
Incident Commander¶
Responsibilities:
- Coordinate incident response
- Make decisions during incident
- Communicate status updates
- Manage escalation
- Ensure postmortem happens

When Assigned:
- SEV-1 incidents (always)
- SEV-2 incidents (if complex)
- Multi-team incidents
Communications Lead¶
Responsibilities:
- Update status page
- Communicate with stakeholders
- Manage customer communications
- Document timeline

When Assigned:
- SEV-1 incidents (always)
- Customer-facing incidents
- Public incidents
Subject Matter Experts (SMEs)¶
Responsibilities:
- Provide domain expertise
- Assist with diagnosis
- Help with resolution
- Review postmortems

When Involved:
- Complex incidents
- Domain-specific issues
- Architecture decisions needed
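The assignment rules for these roles can be expressed as a small function. This is an illustrative sketch only: `IncidentContext` and `required_roles` are hypothetical names, and the boolean flags are assumed to be set by whoever triages the incident.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentContext:
    severity: int                          # 1 to 4, matching SEV-1 to SEV-4
    is_complex: bool = False
    multi_team: bool = False
    customer_facing: bool = False
    public: bool = False
    domain_specific: bool = False
    needs_architecture_decision: bool = False


def required_roles(ctx: IncidentContext) -> list[str]:
    """Return the roles to staff, following the 'When Assigned' rules above."""
    roles = ["On-Call Engineer"]  # always responds first
    if ctx.severity == 1 or (ctx.severity == 2 and ctx.is_complex) or ctx.multi_team:
        roles.append("Incident Commander")
    if ctx.severity == 1 or ctx.customer_facing or ctx.public:
        roles.append("Communications Lead")
    if ctx.is_complex or ctx.domain_specific or ctx.needs_architecture_decision:
        roles.append("Subject Matter Experts")
    return roles
```

A plain SEV-1 yields the on-call engineer, an incident commander, and a communications lead. A SEV-3 with no flags set yields only the on-call engineer.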
Standard Incident Workflow¶
Incident Response Flow¶
flowchart TD
DETECT[Incident Detected<br/>Alert/Report] --> TRIAGE[Triage<br/>Classify Severity]
TRIAGE -->|SEV-1| COMMANDER[Assign Incident Commander]
TRIAGE -->|SEV-2/3/4| ONCALL[On-Call Engineer]
COMMANDER --> DIAGNOSE[Diagnose<br/>Identify Root Cause]
ONCALL --> DIAGNOSE
DIAGNOSE --> MITIGATE[Mitigate<br/>Stop Bleeding]
MITIGATE --> RESOLVE[Resolve<br/>Fix Root Cause]
RESOLVE --> VERIFY[Verify<br/>Confirm Resolution]
VERIFY -->|Resolved| COMMUNICATE[Communicate<br/>Status Update]
VERIFY -->|Not Resolved| DIAGNOSE
COMMUNICATE --> POSTMORTEM[Postmortem<br/>Learn & Improve]
POSTMORTEM --> UPDATE[Update Runbooks<br/>Document Learnings]
style DETECT fill:#EF4444,color:#fff
style RESOLVE fill:#10B981,color:#fff
style POSTMORTEM fill:#2563EB,color:#fff
Detailed Steps¶
1. Detection
   - Alert fires or incident reported
   - On-call engineer notified
   - Initial assessment
2. Triage
   - Classify severity (SEV-1/2/3/4)
   - Assign incident commander (if SEV-1)
   - Notify stakeholders
3. Diagnosis
   - Check logs, metrics, traces
   - Identify root cause
   - Document findings
4. Mitigation
   - Stop the bleeding (restart, rollback, disable feature)
   - Restore service if possible
   - Document actions
5. Resolution
   - Fix root cause
   - Deploy fix
   - Verify fix works
6. Verification
   - Confirm service restored
   - Verify no regressions
   - Monitor metrics
7. Communication
   - Update status page
   - Notify stakeholders
   - Document timeline
8. Postmortem
   - Schedule postmortem (within 48 hours)
   - Conduct blameless review
   - Document learnings
   - Update runbooks
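The workflow can be modeled as a small state machine, which makes the one backward edge explicit: a failed verification returns to diagnosis. The sketch below is an assumption-laden illustration, not an existing tool. `Stage`, `TRANSITIONS`, and `IncidentWorkflow` are hypothetical names, and the incident commander assignment is folded into the triage stage for brevity.

```python
from enum import Enum


class Stage(Enum):
    DETECT = "detect"
    TRIAGE = "triage"
    DIAGNOSE = "diagnose"
    MITIGATE = "mitigate"
    RESOLVE = "resolve"
    VERIFY = "verify"
    COMMUNICATE = "communicate"
    POSTMORTEM = "postmortem"
    UPDATE_RUNBOOKS = "update_runbooks"


# Allowed transitions, mirroring the flow described above.
TRANSITIONS = {
    Stage.DETECT: {Stage.TRIAGE},
    Stage.TRIAGE: {Stage.DIAGNOSE},
    Stage.DIAGNOSE: {Stage.MITIGATE},
    Stage.MITIGATE: {Stage.RESOLVE},
    Stage.RESOLVE: {Stage.VERIFY},
    Stage.VERIFY: {Stage.COMMUNICATE, Stage.DIAGNOSE},  # not resolved: diagnose again
    Stage.COMMUNICATE: {Stage.POSTMORTEM},
    Stage.POSTMORTEM: {Stage.UPDATE_RUNBOOKS},
    Stage.UPDATE_RUNBOOKS: set(),
}


class IncidentWorkflow:
    """Tracks the current stage and rejects transitions the flow does not allow."""

    def __init__(self) -> None:
        self.stage = Stage.DETECT
        self.history = [Stage.DETECT]

    def advance(self, to: Stage) -> None:
        if to not in TRANSITIONS[self.stage]:
            raise ValueError(f"cannot move from {self.stage.name} to {to.name}")
        self.stage = to
        self.history.append(to)
```

Recording `history` gives the timeline a head start: each transition can be timestamped by the caller and reused in the postmortem.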
Postmortems and Learnings¶
Postmortem Process¶
Timeline:
- Schedule: Within 48 hours of incident
- Duration: 1-2 hours
- Attendees: Incident responders, SMEs, stakeholders

Structure:
- Incident Summary
  - What happened
  - Timeline
  - Impact
- Root Cause Analysis
  - What caused the incident
  - Why it happened
  - Contributing factors
- What Went Well
  - What worked during response
  - Positive actions taken
- What Could Be Better
  - What didn't work
  - What could be improved
- Action Items
  - Prevent recurrence
  - Improve detection
  - Improve response
  - Update runbooks
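A postmortem skeleton can be generated from this structure so that every review starts from the same outline. The sketch below is hypothetical: `POSTMORTEM_SECTIONS`, `postmortem_deadline`, and `render_template` are illustrative names, and the 48-hour window is assumed to be measured from a timestamp the caller supplies, since the process above does not say whether the clock starts at detection or resolution.

```python
from datetime import datetime, timedelta, timezone

# Sections and prompts taken from the postmortem structure above.
POSTMORTEM_SECTIONS = {
    "Incident Summary": ["What happened", "Timeline", "Impact"],
    "Root Cause Analysis": ["What caused the incident", "Why it happened", "Contributing factors"],
    "What Went Well": ["What worked during response", "Positive actions taken"],
    "What Could Be Better": ["What didn't work", "What could be improved"],
    "Action Items": ["Prevent recurrence", "Improve detection", "Improve response", "Update runbooks"],
}


def postmortem_deadline(incident_time: datetime) -> datetime:
    """Latest time by which the postmortem should be scheduled (48 hours)."""
    return incident_time + timedelta(hours=48)


def render_template(title: str, incident_time: datetime) -> str:
    """Build a plain-text postmortem outline with empty prompts to fill in."""
    lines = [
        f"Postmortem: {title}",
        f"Schedule by: {postmortem_deadline(incident_time).isoformat()}",
    ]
    for section, prompts in POSTMORTEM_SECTIONS.items():
        lines.append(section)
        lines.extend(f"  - {prompt}:" for prompt in prompts)
    return "\n".join(lines)


if __name__ == "__main__":
    started = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # example timestamp
    print(render_template("Example login outage", started))
```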
Blameless Postmortems¶
Principles:
- Focus on systems, not people: systems failed, not people
- Learn, don't blame: the goal is improvement, not punishment
- Assume good intentions: people did their best
- Focus on prevention: how to prevent recurrence

Questions to Ask:
- What in the system allowed this to happen?
- How can we prevent this?
- How can we detect this earlier?
- How can we respond better?
Runbook Updates¶
After Every Incident:
- Review Runbook
  - Was runbook followed?
  - Was runbook accurate?
  - Was runbook helpful?
- Update Runbook
  - Add new scenario if needed
  - Update procedures based on learnings
  - Fix inaccuracies
  - Add new checks
- Test Runbook
  - Verify procedures work
  - Test commands and scripts
  - Verify accuracy
Important
Update runbooks after every incident. Outdated runbooks cause confusion and delays. Treat runbooks as critical documentation that must be accurate and current.
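One way to enforce this rule is to flag runbooks that have not been touched since the most recent incident that used them. The following is a minimal sketch under stated assumptions: `stale_runbooks` is a hypothetical helper, and both the runbook update timestamps and the mapping of incidents to runbooks are assumed to be available from your own tracking systems.

```python
from __future__ import annotations

from datetime import datetime


def stale_runbooks(
    last_updated: dict[str, datetime],
    incidents: list[tuple[str, datetime]],
) -> list[str]:
    """Return runbooks with an incident newer than their last update.

    last_updated maps runbook name to its last edit time.
    incidents is a list of (runbook name, incident resolution time).
    A runbook with no recorded update is always reported as stale.
    """
    stale = set()
    for runbook, resolved_at in incidents:
        updated = last_updated.get(runbook)
        if updated is None or updated < resolved_at:
            stale.add(runbook)
    return sorted(stale)
```

Running a check like this as part of postmortem follow-up turns the update requirement into a trackable action item rather than a reminder.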
Related Documents¶
- Operations Overview - Operations documentation overview
- Monitoring & Dashboards - Monitoring and alerting
- Identity Platform Runbook - Platform-specific runbooks
- Audit Platform Runbook - Platform-specific runbooks
- Factory Operations - Factory operations