Incident Management

This document defines incident management procedures and practices for ConnectSoft systems. It is written for operations teams, SREs, and engineers responding to incidents.

Incident management ensures rapid response, effective resolution, and continuous improvement. This document covers incident classification, response procedures, communication, and post-incident reviews.

Important

All incidents require postmortems. Blameless postmortems focus on learning, not blame. Update runbooks after every incident.

Incident Severity Levels

SEV-1: Critical

Definition: Complete service outage or data loss risk affecting all users.

Examples:

- Identity Platform completely down (no logins possible)
- Factory completely unavailable (no generations possible)
- Data corruption or loss risk
- Security breach

Response Time: Immediate (< 15 minutes)
Resolution Target: < 1 hour
Escalation: CTO, VP Engineering notified

SEV-2: High

Definition: Significant service degradation affecting many users.

Examples:

- High error rate (> 10%)
- Performance degradation (latency > 2x normal)
- Partial service outage
- Data inconsistency

Response Time: < 30 minutes
Resolution Target: < 4 hours
Escalation: Engineering leads notified

SEV-3: Medium

Definition: Service degradation affecting some users or non-critical features.

Examples:

- Moderate error rate (1-10%)
- Performance issues (latency 1.5-2x normal)
- Feature degradation
- Non-critical service unavailable

Response Time: < 2 hours
Resolution Target: < 24 hours
Escalation: Team lead notified

SEV-4: Low

Definition: Minor issues with workarounds or affecting few users.

Examples:

- Low error rate (< 1%)
- Minor performance issues
- Cosmetic issues
- Documentation issues

Response Time: < 8 hours (business hours)
Resolution Target: < 1 week
Escalation: None required
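
The thresholds and targets above can be encoded so that alerting tools and humans apply them the same way. The following is a minimal sketch, not an official ConnectSoft tool: the names `SeverityPolicy`, `POLICIES`, and `classify` are hypothetical, and the handling of exact boundary values (10% and 1% error rate, 2x and 1.5x latency) is an assumption, since the definitions above do not say which level a boundary value belongs to. SEV-1 conditions such as a security breach cannot be derived from metrics, so they are passed in as an explicit flag.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SeverityPolicy:
    name: str
    response_within: timedelta
    resolve_within: timedelta
    escalation: str


# Targets copied from the severity definitions above.
# Note: the SEV-4 response target is 8 *business* hours; business-hour
# calendars are not modeled in this sketch.
POLICIES = {
    1: SeverityPolicy("SEV-1", timedelta(minutes=15), timedelta(hours=1),
                      "CTO, VP Engineering"),
    2: SeverityPolicy("SEV-2", timedelta(minutes=30), timedelta(hours=4),
                      "Engineering leads"),
    3: SeverityPolicy("SEV-3", timedelta(hours=2), timedelta(hours=24),
                      "Team lead"),
    4: SeverityPolicy("SEV-4", timedelta(hours=8), timedelta(weeks=1),
                      "None required"),
}


def classify(error_rate: float, latency_ratio: float,
             critical: bool = False) -> int:
    """Return a severity level (1-4) from basic health signals.

    error_rate    -- fraction of failed requests, 0.0 to 1.0
    latency_ratio -- current latency divided by normal latency
    critical      -- True for complete outage, data loss risk, or a
                     security breach (judged by a human or a separate check)
    """
    if critical:
        return 1
    if error_rate > 0.10 or latency_ratio > 2.0:
        return 2
    # Assumption: exact boundary values (10%, 1%, 2x, 1.5x) fall into SEV-3.
    if error_rate >= 0.01 or latency_ratio >= 1.5:
        return 3
    return 4
```

For example, `POLICIES[classify(0.15, 1.0)].escalation` yields `"Engineering leads"`. A metric-based result is only a starting point; the on-call engineer can always raise the severity during triage.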

Roles and Responsibilities

On-Call Engineer

Responsibilities:

- Respond to alerts and incidents
- Triage and diagnose issues
- Execute runbook procedures
- Escalate when needed
- Document incident actions

Skills Required:

- System knowledge
- Troubleshooting skills
- Runbook familiarity
- Communication skills

Incident Commander

Responsibilities:

- Coordinate incident response
- Make decisions during the incident
- Communicate status updates
- Manage escalation
- Ensure the postmortem happens

When Assigned:

- SEV-1 incidents (always)
- SEV-2 incidents (if complex)
- Multi-team incidents

Communications Lead

Responsibilities:

- Update status page
- Communicate with stakeholders
- Manage customer communications
- Document timeline

When Assigned:

- SEV-1 incidents (always)
- Customer-facing incidents
- Public incidents

Subject Matter Experts (SMEs)

Responsibilities:

- Provide domain expertise
- Assist with diagnosis
- Help with resolution
- Review postmortems

When Involved:

- Complex incidents
- Domain-specific issues
- Architecture decisions needed
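
The assignment rules for the four roles can be summarized in a single function. This is an illustrative sketch only: `roles_for` and its flag names are hypothetical, and it assumes the on-call engineer is always involved, which the role description implies but does not state outright.

```python
from typing import List


def roles_for(severity: int,
              complex_incident: bool = False,
              multi_team: bool = False,
              customer_facing: bool = False,
              public: bool = False,
              domain_specific: bool = False,
              architecture_decision: bool = False) -> List[str]:
    """Return the roles to staff for an incident, per the rules above."""
    # Assumption: the on-call engineer responds to every incident.
    roles = ["On-Call Engineer"]

    # Incident Commander: SEV-1 always, complex SEV-2, or multi-team.
    if severity == 1 or (severity == 2 and complex_incident) or multi_team:
        roles.append("Incident Commander")

    # Communications Lead: SEV-1 always, customer-facing, or public.
    if severity == 1 or customer_facing or public:
        roles.append("Communications Lead")

    # SMEs: complex incidents, domain-specific issues, architecture decisions.
    if complex_incident or domain_specific or architecture_decision:
        roles.append("Subject Matter Experts")

    return roles
```

A plain SEV-3 returns only the on-call engineer, while `roles_for(2, complex_incident=True)` adds an Incident Commander and SMEs.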

Standard Incident Workflow

Incident Response Flow

flowchart TD
    DETECT[Incident Detected<br/>Alert/Report] --> TRIAGE[Triage<br/>Classify Severity]
    TRIAGE -->|SEV-1| COMMANDER[Assign Incident Commander]
    TRIAGE -->|SEV-2/3/4| ONCALL[On-Call Engineer]

    COMMANDER --> DIAGNOSE[Diagnose<br/>Identify Root Cause]
    ONCALL --> DIAGNOSE

    DIAGNOSE --> MITIGATE[Mitigate<br/>Stop Bleeding]
    MITIGATE --> RESOLVE[Resolve<br/>Fix Root Cause]

    RESOLVE --> VERIFY[Verify<br/>Confirm Resolution]
    VERIFY -->|Resolved| COMMUNICATE[Communicate<br/>Status Update]
    VERIFY -->|Not Resolved| DIAGNOSE

    COMMUNICATE --> POSTMORTEM[Postmortem<br/>Learn & Improve]
    POSTMORTEM --> UPDATE[Update Runbooks<br/>Document Learnings]

    style DETECT fill:#EF4444,color:#fff
    style RESOLVE fill:#10B981,color:#fff
    style POSTMORTEM fill:#2563EB,color:#fff
Detailed Steps

1. Detection
   - Alert fires or incident reported
   - On-call engineer notified
   - Initial assessment

2. Triage
   - Classify severity (SEV-1/2/3/4)
   - Assign incident commander (if SEV-1)
   - Notify stakeholders

3. Diagnosis
   - Check logs, metrics, traces
   - Identify root cause
   - Document findings

4. Mitigation
   - Stop the bleeding (restart, rollback, disable feature)
   - Restore service if possible
   - Document actions

5. Resolution
   - Fix root cause
   - Deploy fix
   - Verify fix works

6. Verification
   - Confirm service restored
   - Verify no regressions
   - Monitor metrics

7. Communication
   - Update status page
   - Notify stakeholders
   - Document timeline

8. Postmortem
   - Schedule postmortem (within 48 hours)
   - Conduct blameless review
   - Document learnings
   - Update runbooks
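
The workflow above is effectively a small state machine: each stage has one successor, except verification, which loops back to diagnosis when the fix did not hold. The sketch below models that so tooling can reject out-of-order transitions, such as closing an incident without a postmortem. The `Stage`, `ALLOWED`, and `IncidentWorkflow` names are hypothetical and do not describe an existing ConnectSoft system.

```python
from enum import Enum
from typing import Dict, List, Set


class Stage(Enum):
    DETECT = "Detect"
    TRIAGE = "Triage"
    DIAGNOSE = "Diagnose"
    MITIGATE = "Mitigate"
    RESOLVE = "Resolve"
    VERIFY = "Verify"
    COMMUNICATE = "Communicate"
    POSTMORTEM = "Postmortem"
    UPDATE_RUNBOOKS = "Update Runbooks"


# Allowed transitions, mirroring the edges of the response flowchart.
ALLOWED: Dict[Stage, Set[Stage]] = {
    Stage.DETECT: {Stage.TRIAGE},
    Stage.TRIAGE: {Stage.DIAGNOSE},
    Stage.DIAGNOSE: {Stage.MITIGATE},
    Stage.MITIGATE: {Stage.RESOLVE},
    Stage.RESOLVE: {Stage.VERIFY},
    # Verification either passes or sends the team back to diagnosis.
    Stage.VERIFY: {Stage.COMMUNICATE, Stage.DIAGNOSE},
    Stage.COMMUNICATE: {Stage.POSTMORTEM},
    Stage.POSTMORTEM: {Stage.UPDATE_RUNBOOKS},
    Stage.UPDATE_RUNBOOKS: set(),
}


class IncidentWorkflow:
    """Tracks the current stage and refuses transitions the flow forbids."""

    def __init__(self) -> None:
        self.stage = Stage.DETECT
        self.history: List[Stage] = [Stage.DETECT]

    def advance(self, target: Stage) -> None:
        if target not in ALLOWED[self.stage]:
            raise ValueError(
                f"cannot move from {self.stage.value} to {target.value}"
            )
        self.stage = target
        self.history.append(target)
```

Keeping the full `history` rather than only the current stage means repeated diagnose-verify loops remain visible when the timeline is reconstructed for the postmortem.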

Postmortems and Learnings

Postmortem Process

Timeline:

- Schedule: Within 48 hours of incident
- Duration: 1-2 hours
- Attendees: Incident responders, SMEs, stakeholders
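
The 48-hour scheduling rule is simple date arithmetic, sketched below. The function names are hypothetical. The document does not say whether the 48 hours run from detection or from resolution, so the caller chooses which timestamp to pass; that choice is an assumption to settle in team policy.

```python
from datetime import datetime, timedelta, timezone

POSTMORTEM_WINDOW = timedelta(hours=48)


def postmortem_due_by(incident_time: datetime) -> datetime:
    """Latest time by which the postmortem should be scheduled."""
    return incident_time + POSTMORTEM_WINDOW


def is_postmortem_late(incident_time: datetime,
                       scheduled_for: datetime) -> bool:
    """True if the proposed meeting time falls outside the 48-hour window."""
    return scheduled_for > postmortem_due_by(incident_time)


# Usage: timezone-aware timestamps avoid ambiguity for distributed teams.
incident = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
proposed = incident + timedelta(hours=47)
late = is_postmortem_late(incident, proposed)  # within the window
```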

Structure:

  1. Incident Summary
     - What happened
     - Timeline
     - Impact

  2. Root Cause Analysis
     - What caused the incident
     - Why it happened
     - Contributing factors

  3. What Went Well
     - What worked during response
     - Positive actions taken

  4. What Could Be Better
     - What didn't work
     - What could be improved

  5. Action Items
     - Prevent recurrence
     - Improve detection
     - Improve response
     - Update runbooks
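
One way to keep postmortems consistent is to generate a blank document from the five sections of this structure. The sketch below is an assumption about how a team might do that, not an existing ConnectSoft template; `POSTMORTEM_SECTIONS` and `render_template` are hypothetical names, and the plain-text layout is arbitrary.

```python
from typing import Dict, List

# Section headings and prompts taken from the postmortem structure above.
POSTMORTEM_SECTIONS: Dict[str, List[str]] = {
    "Incident Summary": ["What happened", "Timeline", "Impact"],
    "Root Cause Analysis": [
        "What caused the incident",
        "Why it happened",
        "Contributing factors",
    ],
    "What Went Well": ["What worked during response", "Positive actions taken"],
    "What Could Be Better": ["What didn't work", "What could be improved"],
    "Action Items": [
        "Prevent recurrence",
        "Improve detection",
        "Improve response",
        "Update runbooks",
    ],
}


def render_template(incident_title: str) -> str:
    """Return a blank postmortem document with one prompt per line."""
    lines = [f"Postmortem: {incident_title}", ""]
    for section, prompts in POSTMORTEM_SECTIONS.items():
        lines.append(section)
        for prompt in prompts:
            lines.append(f"  - {prompt}: TODO")
        lines.append("")
    return "\n".join(lines)
```

Calling `render_template("Identity Platform login outage")` produces a skeleton whose remaining `TODO` markers make unfinished sections easy to spot before the review meeting.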

Blameless Postmortems

Principles:

- Focus on systems, not people: systems failed, not people
- Learn, don't blame: the goal is improvement, not punishment
- Assume good intentions: people did their best
- Focus on prevention: how to prevent recurrence

Questions to Ask:

- What in the system allowed this to happen?
- How can we prevent this?
- How can we detect this earlier?
- How can we respond better?

Runbook Updates

After Every Incident:

  1. Review Runbook
     - Was the runbook followed?
     - Was the runbook accurate?
     - Was the runbook helpful?

  2. Update Runbook
     - Add new scenario if needed
     - Update procedures based on learnings
     - Fix inaccuracies
     - Add new checks

  3. Test Runbook
     - Verify procedures work
     - Test commands and scripts
     - Verify accuracy

Important

Update runbooks after every incident. Outdated runbooks cause confusion and delays. Treat runbooks as critical documentation that must be accurate and current.