Incident Management¶

This document defines incident management procedures and practices for ConnectSoft systems. It is written for operations teams, SREs, and engineers responding to incidents.

Incident management ensures rapid response, effective resolution, and continuous improvement. This document covers incident classification, response procedures, communication, and post-incident reviews.

Important

All incidents require postmortems. Blameless postmortems focus on learning, not blame. Update runbooks after every incident.

Incident Severity Levels¶

SEV-1: Critical¶

Definition: Complete service outage or data loss risk affecting all users.

Examples:

- Identity Platform completely down (no logins possible)
- Factory completely unavailable (no generations possible)
- Data corruption or loss risk
- Security breach

Response Time: Immediate (< 15 minutes)

Resolution Target: < 1 hour

Escalation: CTO, VP Engineering notified

SEV-2: High¶

Definition: Significant service degradation affecting many users.

Examples:

- High error rate (> 10%)
- Performance degradation (latency > 2x normal)
- Partial service outage
- Data inconsistency

Response Time: < 30 minutes

Resolution Target: < 4 hours

Escalation: Engineering leads notified

SEV-3: Medium¶

Definition: Service degradation affecting some users or non-critical features.

Examples:

- Moderate error rate (1-10%)
- Performance issues (latency 1.5-2x normal)
- Feature degradation
- Non-critical service unavailable

Response Time: < 2 hours

Resolution Target: < 24 hours

Escalation: Team lead notified

SEV-4: Low¶

Definition: Minor issues with workarounds or affecting few users.

Examples:

- Low error rate (< 1%)
- Minor performance issues
- Cosmetic issues
- Documentation issues

Response Time: < 8 hours (business hours)

Resolution Target: < 1 week

Escalation: None required
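The four severity levels above can be captured as data plus a small triage helper. The following is a minimal sketch, not an official ConnectSoft tool: the names `SeverityPolicy`, `POLICIES`, and `classify` are hypothetical, and the boundary handling (exactly 10% errors or exactly 2x latency counted as SEV-3) is an assumption read from the ranges listed above.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Tuple


@dataclass(frozen=True)
class SeverityPolicy:
    respond_within: timedelta
    resolve_within: timedelta
    escalate_to: Tuple[str, ...]


# Targets copied from the severity definitions above.
# Note: the SEV-4 response target is 8 *business* hours; a business-hours
# calendar is not modeled in this sketch.
POLICIES = {
    "SEV-1": SeverityPolicy(timedelta(minutes=15), timedelta(hours=1),
                            ("CTO", "VP Engineering")),
    "SEV-2": SeverityPolicy(timedelta(minutes=30), timedelta(hours=4),
                            ("Engineering leads",)),
    "SEV-3": SeverityPolicy(timedelta(hours=2), timedelta(hours=24),
                            ("Team lead",)),
    "SEV-4": SeverityPolicy(timedelta(hours=8), timedelta(weeks=1), ()),
}


def classify(error_rate: float, latency_ratio: float, *,
             full_outage: bool = False,
             data_loss_risk: bool = False,
             security_breach: bool = False) -> str:
    """Suggest a severity from coarse signals.

    error_rate is a fraction (0.05 means 5%); latency_ratio is current
    latency divided by normal latency. A human should confirm the result.
    """
    if full_outage or data_loss_risk or security_breach:
        return "SEV-1"
    if error_rate > 0.10 or latency_ratio > 2.0:
        return "SEV-2"
    if error_rate >= 0.01 or latency_ratio >= 1.5:
        return "SEV-3"
    return "SEV-4"
```

For example, `classify(0.05, 1.0)` suggests SEV-3, and `POLICIES["SEV-3"].resolve_within` gives the 24-hour resolution target. Qualitative examples such as "partial service outage" or "cosmetic issues" are not covered by the numeric signals and still need human judgment.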

Roles and Responsibilities¶

On-Call Engineer¶

Responsibilities:

- Respond to alerts and incidents
- Triage and diagnose issues
- Execute runbook procedures
- Escalate when needed
- Document incident actions

Skills Required:

- System knowledge
- Troubleshooting skills
- Runbook familiarity
- Communication skills

Incident Commander¶

Responsibilities:

- Coordinate incident response
- Make decisions during the incident
- Communicate status updates
- Manage escalation
- Ensure the postmortem happens

When Assigned:

- SEV-1 incidents (always)
- SEV-2 incidents (if complex)
- Multi-team incidents

Communications Lead¶

Responsibilities:

- Update status page
- Communicate with stakeholders
- Manage customer communications
- Document timeline

When Assigned:

- SEV-1 incidents (always)
- Customer-facing incidents
- Public incidents

Subject Matter Experts (SMEs)¶

Responsibilities:

- Provide domain expertise
- Assist with diagnosis
- Help with resolution
- Review postmortems

When Involved:

- Complex incidents
- Domain-specific issues
- Architecture decisions needed
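The assignment rules for these roles can be expressed as a small lookup. This is an illustrative sketch only: the function `roles_for` and its flag names are hypothetical, and it assumes the on-call engineer is always part of the response, which the text implies but does not state outright.

```python
from typing import List


def roles_for(severity: str, *,
              is_complex: bool = False,
              multi_team: bool = False,
              customer_facing: bool = False,
              public: bool = False,
              domain_specific: bool = False,
              architecture_decision: bool = False) -> List[str]:
    """Return the roles to staff for an incident, per the rules above."""
    # Assumption: the on-call engineer responds to every incident.
    roles = ["On-Call Engineer"]

    # Incident Commander: always for SEV-1, for complex SEV-2,
    # and for any multi-team incident.
    if severity == "SEV-1" or (severity == "SEV-2" and is_complex) or multi_team:
        roles.append("Incident Commander")

    # Communications Lead: always for SEV-1, plus customer-facing
    # or public incidents.
    if severity == "SEV-1" or customer_facing or public:
        roles.append("Communications Lead")

    # SMEs: complex or domain-specific incidents, or when an
    # architecture decision is needed.
    if is_complex or domain_specific or architecture_decision:
        roles.append("Subject Matter Experts")

    return roles
```

For instance, `roles_for("SEV-2", is_complex=True)` returns the on-call engineer, an incident commander, and SMEs, while `roles_for("SEV-3")` returns only the on-call engineer.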

Standard Incident Workflow¶

Incident Response Flow¶

flowchart TD
    DETECT[Incident Detected<br/>Alert/Report] --> TRIAGE[Triage<br/>Classify Severity]
    TRIAGE -->|SEV-1| COMMANDER[Assign Incident Commander]
    TRIAGE -->|SEV-2/3/4| ONCALL[On-Call Engineer]

    COMMANDER --> DIAGNOSE[Diagnose<br/>Identify Root Cause]
    ONCALL --> DIAGNOSE

    DIAGNOSE --> MITIGATE[Mitigate<br/>Stop Bleeding]
    MITIGATE --> RESOLVE[Resolve<br/>Fix Root Cause]

    RESOLVE --> VERIFY[Verify<br/>Confirm Resolution]
    VERIFY -->|Resolved| COMMUNICATE[Communicate<br/>Status Update]
    VERIFY -->|Not Resolved| DIAGNOSE

    COMMUNICATE --> POSTMORTEM[Postmortem<br/>Learn & Improve]
    POSTMORTEM --> UPDATE[Update Runbooks<br/>Document Learnings]

    style DETECT fill:#EF4444,color:#fff
    style RESOLVE fill:#10B981,color:#fff
    style POSTMORTEM fill:#2563EB,color:#fff


Detailed Steps¶

1. Detection
    - Alert fires or incident reported
    - On-call engineer notified
    - Initial assessment

2. Triage
    - Classify severity (SEV-1/2/3/4)
    - Assign incident commander (if SEV-1)
    - Notify stakeholders

3. Diagnosis
    - Check logs, metrics, traces
    - Identify root cause
    - Document findings

4. Mitigation
    - Stop the bleeding (restart, rollback, disable feature)
    - Restore service if possible
    - Document actions

5. Resolution
    - Fix root cause
    - Deploy fix
    - Verify fix works

6. Verification
    - Confirm service restored
    - Verify no regressions
    - Monitor metrics

7. Communication
    - Update status page
    - Notify stakeholders
    - Document timeline

8. Postmortem
    - Schedule postmortem (within 48 hours)
    - Conduct blameless review
    - Document learnings
    - Update runbooks
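The workflow above behaves like a small state machine: each stage has a fixed successor, except verification, which either moves on to communication or loops back to diagnosis. The sketch below models that under simplifying assumptions. The `Stage`, `TRANSITIONS`, and `Incident` names are hypothetical, and the commander-assignment branch from the flowchart is folded into the triage stage.

```python
from enum import Enum


class Stage(Enum):
    DETECT = "detect"
    TRIAGE = "triage"
    DIAGNOSE = "diagnose"
    MITIGATE = "mitigate"
    RESOLVE = "resolve"
    VERIFY = "verify"
    COMMUNICATE = "communicate"
    POSTMORTEM = "postmortem"
    UPDATE_RUNBOOKS = "update_runbooks"


# Allowed next stages, following the flowchart and the eight steps above.
TRANSITIONS = {
    Stage.DETECT: {Stage.TRIAGE},
    Stage.TRIAGE: {Stage.DIAGNOSE},
    Stage.DIAGNOSE: {Stage.MITIGATE},
    Stage.MITIGATE: {Stage.RESOLVE},
    Stage.RESOLVE: {Stage.VERIFY},
    # "Not Resolved" loops back to diagnosis.
    Stage.VERIFY: {Stage.COMMUNICATE, Stage.DIAGNOSE},
    Stage.COMMUNICATE: {Stage.POSTMORTEM},
    Stage.POSTMORTEM: {Stage.UPDATE_RUNBOOKS},
    Stage.UPDATE_RUNBOOKS: set(),
}


class Incident:
    """Tracks which workflow stage an incident is in."""

    def __init__(self) -> None:
        self.stage = Stage.DETECT
        self.history = [Stage.DETECT]

    def advance(self, next_stage: Stage) -> None:
        """Move to next_stage, rejecting transitions the workflow forbids."""
        if next_stage not in TRANSITIONS[self.stage]:
            raise ValueError(
                f"cannot go from {self.stage.value} to {next_stage.value}"
            )
        self.stage = next_stage
        self.history.append(next_stage)
```

Encoding the transitions as data makes it hard to skip a stage by accident: for example, an incident cannot jump from verification straight to the postmortem without a status update in between.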

Postmortems and Learnings¶

Postmortem Process¶

Timeline:

- Schedule: Within 48 hours of incident
- Duration: 1-2 hours
- Attendees: Incident responders, SMEs, stakeholders
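The 48-hour scheduling window is easy to track programmatically. The following is a small sketch with hypothetical helper names; it assumes the clock starts when the incident is resolved, which the text does not specify (it could equally start at detection).

```python
from datetime import datetime, timedelta, timezone

# Window taken from the timeline above.
POSTMORTEM_WINDOW = timedelta(hours=48)


def postmortem_due(resolved_at: datetime) -> datetime:
    """Latest time by which the postmortem should be scheduled.

    Assumption: the window is measured from resolution time.
    """
    return resolved_at + POSTMORTEM_WINDOW


def is_overdue(resolved_at: datetime, now: datetime) -> bool:
    """True if the scheduling window has already passed."""
    return now > postmortem_due(resolved_at)


resolved = datetime(2024, 3, 1, 9, 30, tzinfo=timezone.utc)
print(postmortem_due(resolved).isoformat())  # 2024-03-03T09:30:00+00:00
```

Using timezone-aware datetimes avoids ambiguity when responders are spread across regions.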

Structure:

1. Incident Summary
    - What happened
    - Timeline
    - Impact
2. Root Cause Analysis
    - What caused the incident
    - Why it happened
    - Contributing factors
3. What Went Well
    - What worked during response
    - Positive actions taken
4. What Could Be Better
    - What didn't work
    - What could be improved
5. Action Items
    - Prevent recurrence
    - Improve detection
    - Improve response
    - Update runbooks
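A postmortem draft can be checked against this structure before the review meeting. The sketch below is illustrative: `REQUIRED_SECTIONS` mirrors the outline above, while the `missing_items` helper and the nested-dictionary draft format are assumptions, not an existing ConnectSoft template.

```python
from typing import Dict, List

# Section and item names copied from the postmortem structure above.
REQUIRED_SECTIONS = {
    "Incident Summary": ["What happened", "Timeline", "Impact"],
    "Root Cause Analysis": [
        "What caused the incident",
        "Why it happened",
        "Contributing factors",
    ],
    "What Went Well": ["What worked during response", "Positive actions taken"],
    "What Could Be Better": ["What didn't work", "What could be improved"],
    "Action Items": [
        "Prevent recurrence",
        "Improve detection",
        "Improve response",
        "Update runbooks",
    ],
}


def missing_items(draft: Dict[str, Dict[str, str]]) -> List[str]:
    """List every 'Section / Item' that is absent or blank in the draft."""
    missing = []
    for section, items in REQUIRED_SECTIONS.items():
        filled = draft.get(section, {})
        for item in items:
            if not str(filled.get(item, "")).strip():
                missing.append(f"{section} / {item}")
    return missing


# Hypothetical draft with only the summary written so far.
draft = {
    "Incident Summary": {
        "What happened": "Login requests failed for all users.",
        "Timeline": "Detected 09:02, mitigated 09:20, resolved 09:41.",
        "Impact": "No logins possible for 39 minutes.",
    }
}
print(len(missing_items(draft)))  # 11
```

A check like this only confirms that each section has some text; whether the analysis is sound remains a matter for the review itself.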

Blameless Postmortems¶

Principles:

- Focus on systems, not people: systems failed, not people
- Learn, don't blame: the goal is improvement, not punishment
- Assume good intentions: people did their best
- Focus on prevention: how to prevent recurrence

Questions to Ask:

- What in the system allowed this to happen?
- How can we prevent this?
- How can we detect this earlier?
- How can we respond better?

Runbook Updates¶

After Every Incident:

1. Review Runbook
    - Was the runbook followed?
    - Was the runbook accurate?
    - Was the runbook helpful?
2. Update Runbook
    - Add new scenario if needed
    - Update procedures based on learnings
    - Fix inaccuracies
    - Add new checks
3. Test Runbook
    - Verify procedures work
    - Test commands and scripts
    - Verify accuracy