Runbook – Incident Response¶
This document defines the incident response process: severity levels, triage steps, communication patterns, and roles and responsibilities. It is written for ops engineers, SREs, on-call engineers, and anyone responding to production incidents.
Incident response ensures rapid detection, diagnosis, mitigation, and learning. This runbook applies to the AI Factory, the core platforms, and production SaaS solutions built on them.
Note
Incident vs Bug: Smaller internal hiccups may be tracked as normal bugs, not incidents. Incidents are SLA breaches, customer-facing outages, or security issues that require immediate response.
Scope and Definitions¶
What Counts as an Incident¶
Incident Criteria:
- SLA Breach - Service availability or performance below SLO
- Customer-Facing Outage - Service unavailable or degraded for customers
- Security Issue - Security breach, data exposure, or unauthorized access
- Data Loss Risk - Potential data loss or corruption
- Critical System Failure - Factory, Identity, or Audit Platform down
Applies To:
- AI Factory (runs, orchestration, storage)
- Core Platforms (Identity, Audit, Config, Bot)
- Production SaaS solutions built on ConnectSoft platforms
Not Incidents:
- Minor bugs affecting single users
- Non-critical feature issues
- Internal tooling problems (unless blocking production)
- Planned maintenance windows
See: Support and SLA Policy for SLA definitions.
Severity Levels¶
Severity Definitions¶
| Severity | Description | Example | Response Time | Resolution Target |
|---|---|---|---|---|
| Sev1 | Critical outage, major customer impact | Auth down, all tenants affected | Immediate (< 15 min) | < 1 hour |
| Sev2 | Degraded service, limited impact | One region/product degraded, workaround available | < 30 minutes | < 4 hours |
| Sev3 | Minor issues, non-urgent | Intermittent errors, low impact | < 4 hours | < 24 hours |
Sev1 Examples:
- Identity Platform completely down (no logins possible)
- Factory completely unavailable (no generations possible)
- Data corruption or loss risk
- Security breach
Sev2 Examples:
- High error rate (> 10%) affecting some tenants
- Performance degradation (latency > 2x normal)
- Partial service outage (one region or feature)
- Data inconsistency
Sev3 Examples:
- Intermittent errors affecting few users
- Non-critical feature broken
- Minor performance issues
- Low-impact bugs
Important
Paging Requirements: Sev1 incidents require immediate paging of Ops + Dmitry + On-call dev. Sev2 incidents page Ops + On-call dev. Sev3 incidents are handled async by Ops. Always err on the side of paging for unclear severity.
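These paging rules boil down to a severity-to-responder lookup. The sketch below is a minimal illustration in Python, assuming severities are passed as plain strings; the names (`PAGING_TARGETS`, `who_to_page`) are illustrative and not part of any existing tooling.

```python
# Illustrative sketch of the paging rules above; names are assumptions, not existing tooling.
PAGING_TARGETS = {
    "Sev1": ["Ops", "Dmitry", "On-call dev"],  # immediate page, response < 15 min
    "Sev2": ["Ops", "On-call dev"],            # response < 30 min
    "Sev3": ["Ops"],                           # handled async, response < 4 hours
}

def who_to_page(severity: str) -> list[str]:
    """Return who gets paged for a given severity."""
    # When severity is unclear, err on the side of paging (treat as Sev1).
    return PAGING_TARGETS.get(severity, PAGING_TARGETS["Sev1"])
```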
Triage and Initial Response¶
Triage Flow¶
Step-by-Step Triage:
1. Acknowledge Alert and Assign Incident Commander (IC)
    - First responder acknowledges the alert
    - Assigns the incident commander (IC) role
    - IC coordinates response and communication
2. Confirm Severity Level Using Predefined Criteria
    - Review severity definitions
    - Assess customer impact
    - Assign severity level (Sev1/2/3)
3. Check Dashboards, Logs, and Alerts for Quick Diagnosis
    - Review dashboards for anomalies
    - Check error logs for patterns
    - Review recent deployments or changes
    - Check traces (if available)
4. Decide: Mitigate, Rollback, or Escalate
    - Mitigate - Quick fix or workaround available
    - Rollback - Recent deployment likely cause
    - Escalate - Requires deeper investigation or expertise
5. Update Incident Log with Timestamps and Actions (see the example entry after this list)
    - Document all actions taken
    - Record timestamps for key events
    - Update status and next steps
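Keeping the incident log traceable is easier when every action is recorded as a structured, timestamped entry. Below is a minimal sketch, assuming the log is kept as in-memory structured data; the field names and helpers (`IncidentLogEntry`, `log_action`) are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Illustrative only: field names and structure are assumptions, not a prescribed schema.
@dataclass
class IncidentLogEntry:
    timestamp: datetime   # when the action was taken (UTC)
    actor: str            # who performed the action (e.g. the IC)
    action: str           # what was done: "acknowledged alert", "rolled back deployment", ...
    status: str           # current incident status: "investigating", "mitigating", "resolved"
    next_steps: str = ""  # agreed next steps, if any

def log_action(log: list, actor: str, action: str, status: str, next_steps: str = "") -> None:
    """Append a timestamped entry so every action stays traceable."""
    log.append(IncidentLogEntry(datetime.now(timezone.utc), actor, action, status, next_steps))

# Hypothetical usage during triage
incident_log: list[IncidentLogEntry] = []
log_action(incident_log, "IC", "Acknowledged alert, assumed IC role", "investigating")
log_action(incident_log, "IC", "Confirmed Sev2: elevated error rate in one region", "investigating")
log_action(incident_log, "On-call dev", "Rolled back latest deployment", "mitigating", "Monitor error rate")
```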
Triage Checklist:
- Alert acknowledged, IC assigned
- Severity level confirmed
- Dashboards checked for anomalies
- Logs reviewed for error patterns
- Recent changes reviewed
- Decision made (mitigate/rollback/escalate)
- Incident log updated
Warning
No Blind Patching: Guessing or patching blindly in production is not allowed. All actions must be traceable and documented. If unsure, escalate or roll back rather than guess.
See: Deployment and Rollback Runbook for rollback procedures.
See: Observability – Dashboards and Alerts for monitoring details.
Communication and Stakeholder Management¶
Communication Channels¶
Internal Tech Team:
- Channel: Slack/Teams incident channel
- Content: Symptoms, suspected cause, actions taken
- Cadence: Every 15–30 minutes for Sev1, hourly for Sev2
Business/Management:
- Channel: Email/Slack summary
- Content: Impact summary, estimated restore time
- Cadence: Initial notification, then updates every hour for Sev1
Affected Customers (if applicable):
- Channel: Status page / email
- Content: Status and workaround (if available)
- Cadence: Initial notification, updates every 30–60 minutes
Communication Table¶
| Audience | What They Need | Channel | Cadence |
|---|---|---|---|
| Internal Tech | Symptoms, suspected cause, actions | Slack/Teams, incident doc | Every 15–30 min (Sev1) |
| Business/Management | Impact summary, estimated restore | Email/Slack summary | Hourly (Sev1) |
| Affected Customers | Status and workaround | Status page / email | Every 30–60 min |
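The cadences above can also be captured as a small config so responders know when the next update is due. Below is a minimal sketch, assuming cadences are tracked in minutes; the names (`UPDATE_CADENCE_MINUTES`, `next_update_due`) are illustrative, not existing tooling.

```python
from datetime import datetime, timedelta

# Illustrative only: cadences from the table above expressed as a config; names are assumptions.
UPDATE_CADENCE_MINUTES = {
    "internal_tech": {"Sev1": 15, "Sev2": 60},  # every 15–30 min (Sev1), hourly (Sev2)
    "business":      {"Sev1": 60},              # hourly for Sev1
    "customers":     {"Sev1": 30, "Sev2": 60},  # every 30–60 min
}

def next_update_due(audience: str, severity: str, last_update: datetime) -> datetime | None:
    """Return when the next status update is due, or None if no cadence is defined."""
    minutes = UPDATE_CADENCE_MINUTES.get(audience, {}).get(severity)
    return last_update + timedelta(minutes=minutes) if minutes is not None else None
```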
Incident Status Update Template¶
Template:
## Incident Status Update
**Time:** [Timestamp]
**Severity:** [Sev1/2/3]
**Status:** [Investigating / Mitigating / Resolved]
**Summary:**
[One to two sentences describing the issue and current status]
**Impact:**
- [Affected services/features]
- [Affected users/tenants]
- [Estimated restore time]
**Actions Taken:**
- [Action 1]
- [Action 2]
- [Next steps]
**Next Update:**
[When next update will be provided]
Tip
Status Update Template: Use a simple template for incident status updates (one to two paragraphs, bullets for actions). Keep updates concise and focused on what stakeholders need to know.
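For consistency, the template can also be filled programmatically from a handful of fields. Below is a minimal sketch, assuming updates are posted as plain Markdown text; the helper name (`render_status_update`) and all values in the usage example are hypothetical.

```python
# Illustrative sketch: filling the status update template programmatically. Names are assumptions.
STATUS_UPDATE_TEMPLATE = """\
## Incident Status Update
**Time:** {time}
**Severity:** {severity}
**Status:** {status}
**Summary:**
{summary}
**Impact:**
{impact}
**Actions Taken:**
{actions}
**Next Update:**
{next_update}
"""

def render_status_update(**fields: str) -> str:
    """Render the status update template with the given field values."""
    return STATUS_UPDATE_TEMPLATE.format(**fields)

# Hypothetical example values
print(render_status_update(
    time="2024-05-03 14:30 UTC",
    severity="Sev2",
    status="Mitigating",
    summary="Elevated error rates in one region; rollback of the latest deployment is in progress.",
    impact="- API requests in the affected region\n- Tenants in that region only\n- Estimated restore: 15:30 UTC",
    actions="- Rolled back deployment\n- Monitoring error rates",
    next_update="15:00 UTC",
))
```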
Post-Incident Review and Follow-Up¶
Post-Incident Review Process¶
Review Timeline:
- Sev1 - Postmortem within 48 hours
- Sev2 - Postmortem within 1 week
- Sev3 - Optional postmortem or async review
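The review timeline can be expressed as a simple deadline lookup. Below is a minimal sketch, assuming the deadline is computed from the incident's resolution time; the names (`POSTMORTEM_DEADLINE`, `postmortem_due`) are illustrative, not existing tooling.

```python
from datetime import datetime, timedelta

# Illustrative sketch of the review timeline above; names are assumptions, not existing tooling.
POSTMORTEM_DEADLINE = {
    "Sev1": timedelta(hours=48),  # postmortem within 48 hours
    "Sev2": timedelta(weeks=1),   # postmortem within 1 week
    "Sev3": None,                 # optional postmortem or async review
}

def postmortem_due(severity: str, resolved_at: datetime) -> datetime | None:
    """Return the postmortem deadline, or None when a postmortem is optional."""
    deadline = POSTMORTEM_DEADLINE.get(severity)
    return resolved_at + deadline if deadline is not None else None
```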
Postmortem Structure:
1. Incident Summary
    - What happened
    - Timeline of events
    - Impact assessment
2. Root Cause Analysis
    - Root cause identified
    - Contributing factors
    - Why it wasn't prevented
3. Remediations
    - Immediate fixes applied
    - Short-term mitigations
    - Long-term prevention steps
4. Action Items
    - Tasks created for fixes
    - Runbook updates needed
    - Process improvements
Post-Incident Checklist:
- Postmortem scheduled and conducted
- Root cause identified and documented
- Contributing factors documented
- Remediations planned and tracked
- Runbooks updated with learnings
- ADRs/BDRs created or updated (if needed)
- Action items created and assigned
Decision
Post-Incident Review Requirement: Every Sev1 incident must have a documented post-incident review. Postmortems are blameless and focus on learning, not blame. Runbooks must be updated based on postmortem findings.
See: How to Write a Good ADR for ADR guidance.
See: How to Write a Good BDR for BDR guidance.
Related Documents¶
- Operations and SRE Overview - Operations overview
- Deployment and Rollback Runbook - Deployment and rollback procedures
- Observability – Dashboards and Alerts - Monitoring and alerting
- Support and SLA Policy - SLA definitions
- Security & Compliance - Security incident procedures