
Runbook – Incident Response

This document defines the incident response process: severity levels, triage steps, communication patterns, and roles and responsibilities. It is written for ops engineers, SREs, on-call engineers, and anyone responding to production incidents.

Incident response ensures rapid detection, diagnosis, mitigation, and learning. This runbook applies to the AI Factory, the core platforms, and production SaaS solutions built on them.

Note

Incident vs Bug: Smaller internal hiccups may be tracked as normal bugs, not incidents. Incidents are SLA breaches, customer-facing outages, or security issues that require immediate response.

Scope and Definitions

What Counts as an Incident

Incident Criteria:

  • SLA Breach - Service availability or performance below SLO
  • Customer-Facing Outage - Service unavailable or degraded for customers
  • Security Issue - Security breach, data exposure, or unauthorized access
  • Data Loss Risk - Potential data loss or corruption
  • Critical System Failure - Factory, Identity, or Audit Platform down

Applies To:

  • AI Factory (runs, orchestration, storage)
  • Core Platforms (Identity, Audit, Config, Bot)
  • Production SaaS solutions built on ConnectSoft platforms

Not Incidents:

  • Minor bugs affecting single users
  • Non-critical feature issues
  • Internal tooling problems (unless blocking production)
  • Planned maintenance windows

See: Support and SLA Policy for SLA definitions.

Severity Levels

Severity Definitions

| Severity | Description | Example | Response Time | Resolution Target |
| --- | --- | --- | --- | --- |
| Sev1 | Critical outage, major customer impact | Auth down, all tenants affected | Immediate (< 15 min) | < 1 hour |
| Sev2 | Degraded service, limited impact | One region/product degraded, workaround available | < 30 minutes | < 4 hours |
| Sev3 | Minor issues, non-urgent | Intermittent errors, low impact | < 4 hours | < 24 hours |

Sev1 Examples:

  • Identity Platform completely down (no logins possible)
  • Factory completely unavailable (no generations possible)
  • Data corruption or loss risk
  • Security breach

Sev2 Examples:

  • High error rate (> 10%) affecting some tenants
  • Performance degradation (latency > 2x normal)
  • Partial service outage (one region or feature)
  • Data inconsistency

Sev3 Examples:

  • Intermittent errors affecting few users
  • Non-critical feature broken
  • Minor performance issues
  • Low-impact bugs

Important

Paging Requirements: Sev1 incidents require immediate paging of Ops + Dmitry + On-call dev. Sev2 incidents page Ops + On-call dev. Sev3 incidents are handled async by Ops. Always err on the side of paging for unclear severity.
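
The severity matrix and paging rules above translate naturally into configuration for alerting glue or a chat-ops bot. The following is a minimal sketch only: the values mirror the table and the paging note, but the `SeverityPolicy` structure, the role names, and the `who_to_page` helper are illustrative assumptions, not an existing internal API.

```python
# Minimal sketch: severity definitions and paging targets from the matrix above.
# SeverityPolicy, SEVERITY_POLICIES, and who_to_page are illustrative names only.
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class SeverityPolicy:
    pages: tuple[str, ...]        # roles paged when the incident is opened
    response_time: timedelta      # time to first response
    resolution_target: timedelta  # target time to resolve or mitigate

SEVERITY_POLICIES = {
    "Sev1": SeverityPolicy(("Ops", "Dmitry", "On-call dev"),
                           response_time=timedelta(minutes=15),
                           resolution_target=timedelta(hours=1)),
    "Sev2": SeverityPolicy(("Ops", "On-call dev"),
                           response_time=timedelta(minutes=30),
                           resolution_target=timedelta(hours=4)),
    "Sev3": SeverityPolicy(("Ops",),  # Sev3 is handled async by Ops
                           response_time=timedelta(hours=4),
                           resolution_target=timedelta(hours=24)),
}

def who_to_page(severity: str) -> tuple[str, ...]:
    # Unclear or unknown severity: err on the side of paging, i.e. treat it as Sev1.
    return SEVERITY_POLICIES.get(severity, SEVERITY_POLICIES["Sev1"]).pages
```

The fallback to Sev1 in `who_to_page` encodes the rule above: when severity is unclear, err on the side of paging.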

Triage and Initial Response

Triage Flow

Step-by-Step Triage:

  1. Acknowledge Alert and Assign Incident Commander (IC)
       • First responder acknowledges the alert
       • First responder assigns the incident commander (IC) role
       • IC coordinates response and communication

  2. Confirm Severity Level Using Predefined Criteria
       • Review severity definitions
       • Assess customer impact
       • Assign severity level (Sev1/2/3)

  3. Check Dashboards, Logs, and Alerts for Quick Diagnosis
       • Review dashboards for anomalies
       • Check error logs for patterns
       • Review recent deployments or changes
       • Check traces (if available)

  4. Decide: Mitigate, Rollback, or Escalate
       • Mitigate - Quick fix or workaround available
       • Rollback - Recent deployment likely cause
       • Escalate - Requires deeper investigation or expertise

  5. Update Incident Log with Timestamps and Actions (see the sketch after this list)
       • Document all actions taken
       • Record timestamps for key events
       • Update status and next steps
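
Step 5 is easier to follow when every action is appended to a structured, timestamped log as it happens rather than reconstructed afterwards. Below is a minimal sketch assuming an append-only JSON-lines file; the file path, field names, and `log_action` helper are illustrative, not a prescribed format.

```python
# Minimal sketch: append-only incident log with timestamps and actions.
# The file path, field names, and helper are assumptions for illustration.
import json
from datetime import datetime, timezone

INCIDENT_LOG = "incident-log.jsonl"  # hypothetical per-incident log file

def log_action(actor: str, action: str, status: str, next_steps: str = "") -> None:
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # when the key event happened
        "actor": actor,            # e.g. "IC", "On-call dev"
        "action": action,          # what was done
        "status": status,          # Investigating / Mitigating / Resolved
        "next_steps": next_steps,
    }
    with open(INCIDENT_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

# Example:
# log_action("IC", "Rolled back latest Factory deployment", "Mitigating",
#            "Watch error rate for 30 minutes")
```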

Triage Checklist:

  • Alert acknowledged, IC assigned
  • Severity level confirmed
  • Dashboards checked for anomalies
  • Logs reviewed for error patterns
  • Recent changes reviewed
  • Decision made (mitigate/rollback/escalate)
  • Incident log updated

Warning

No Blind Patching: Guessing or patching blindly in production is not allowed. All actions must be traceable and documented. If unsure, escalate or rollback rather than guessing.

See: Deployment and Rollback Runbook for rollback procedures.

See: Observability – Dashboards and Alerts for monitoring details.

Communication and Stakeholder Management

Communication Channels

Internal Tech Team:

  • Channel: Slack/Teams incident channel
  • Content: Symptoms, suspected cause, actions taken
  • Cadence: Every 15–30 minutes for Sev1, hourly for Sev2

Business/Management:

  • Channel: Email/Slack summary
  • Content: Impact summary, estimated restore time
  • Cadence: Initial notification, then updates every hour for Sev1

Affected Customers (if applicable):

  • Channel: Status page / email
  • Content: Status and workaround (if available)
  • Cadence: Initial notification, updates every 30–60 minutes

Communication Table

| Audience | What They Need | Channel | Cadence |
| --- | --- | --- | --- |
| Internal Tech | Symptoms, suspected cause, actions | Slack/Teams, incident doc | Every 15–30 min (Sev1) |
| Business/Management | Impact summary, estimated restore | Email/Slack summary | Hourly (Sev1) |
| Affected Customers | Status and workaround | Status page / email | Every 30–60 min |
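
The cadence column can also be enforced mechanically, for example by a reminder that computes when the next update is due for each audience. The sketch below is illustrative: the intervals take the lower bound of each documented range, and the audience keys and `next_update_due` helper are assumptions.

```python
# Minimal sketch: next-update reminders derived from the communication table above.
# Intervals use the lower bound of each documented range; names are illustrative.
from datetime import datetime, timedelta, timezone

UPDATE_CADENCE = {
    ("internal-tech", "Sev1"): timedelta(minutes=15),  # every 15–30 min
    ("internal-tech", "Sev2"): timedelta(hours=1),     # hourly
    ("business", "Sev1"): timedelta(hours=1),          # hourly
    ("customers", "Sev1"): timedelta(minutes=30),      # every 30–60 min
}

def next_update_due(audience: str, severity: str, last_update: datetime) -> datetime | None:
    interval = UPDATE_CADENCE.get((audience, severity))
    return last_update + interval if interval is not None else None

# Example: next_update_due("customers", "Sev1", datetime.now(timezone.utc))
```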

Incident Status Update Template

Template:

## Incident Status Update

**Time:** [Timestamp]
**Severity:** [Sev1/2/3]
**Status:** [Investigating / Mitigating / Resolved]

**Summary:**
[One to two sentences describing the issue and current status]

**Impact:**
- [Affected services/features]
- [Affected users/tenants]
- [Estimated restore time]

**Actions Taken:**
- [Action 1]
- [Action 2]
- [Next steps]

**Next Update:**
[When next update will be provided]
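
If updates are posted by a script or bot, the same template can be filled in programmatically so every update has a consistent shape. This is only one possible approach; the field names mirror the template above, and `render_status_update` is an illustrative helper, not an existing tool.

```python
# Minimal sketch: rendering the status update template above from a dict of fields.
# Field names mirror the template; how the result is posted is out of scope.
STATUS_UPDATE_TEMPLATE = """\
## Incident Status Update

**Time:** {time}
**Severity:** {severity}
**Status:** {status}

**Summary:**
{summary}

**Impact:**
{impact}

**Actions Taken:**
{actions}

**Next Update:**
{next_update}
"""

def render_status_update(fields: dict[str, str]) -> str:
    return STATUS_UPDATE_TEMPLATE.format(**fields)
```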

Tip

Status Update Template: Use a simple template for incident status updates (one to two paragraphs, bullets for actions). Keep updates concise and focused on what stakeholders need to know.

Post-Incident Review and Follow-Up

Post-Incident Review Process

Review Timeline:

  • Sev1 - Postmortem within 48 hours
  • Sev2 - Postmortem within 1 week
  • Sev3 - Optional postmortem or async review

Postmortem Structure:

  1. Incident Summary
       • What happened
       • Timeline of events
       • Impact assessment

  2. Root Cause Analysis
       • Root cause identified
       • Contributing factors
       • Why it wasn't prevented

  3. Remediations
       • Immediate fixes applied
       • Short-term mitigations
       • Long-term prevention steps

  4. Action Items
       • Tasks created for fixes
       • Runbook updates needed
       • Process improvements
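
The review timeline and the structure above lend themselves to a small helper that opens a postmortem skeleton with the correct due date as soon as an incident is resolved. The sketch below mirrors the section headings and deadlines from this runbook; everything else (the function name, the incident ID format, the Sev3 handling) is an illustrative assumption.

```python
# Minimal sketch: postmortem skeleton and due date from the timeline and structure above.
# Section headings mirror the runbook; names and the Sev3 handling are illustrative.
from datetime import datetime, timedelta, timezone

POSTMORTEM_DEADLINE = {
    "Sev1": timedelta(hours=48),  # postmortem within 48 hours
    "Sev2": timedelta(weeks=1),   # postmortem within 1 week
    # Sev3: optional postmortem or async review, so no hard deadline
}

SECTIONS = ("Incident Summary", "Root Cause Analysis", "Remediations", "Action Items")

def postmortem_skeleton(incident_id: str, severity: str, resolved_at: datetime) -> str:
    deadline = POSTMORTEM_DEADLINE.get(severity)
    due = (resolved_at + deadline).date().isoformat() if deadline else "optional / async review"
    lines = [f"# Postmortem: {incident_id} ({severity})", f"Due: {due}", ""]
    for section in SECTIONS:
        lines.append(f"## {section}")
        lines.append("")  # filled in during the blameless review
    return "\n".join(lines)

# Example: print(postmortem_skeleton("INC-123", "Sev1", datetime.now(timezone.utc)))
```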

Post-Incident Checklist:

  • Postmortem scheduled and conducted
  • Root cause identified and documented
  • Contributing factors documented
  • Remediations planned and tracked
  • Runbooks updated with learnings
  • ADRs/BDRs created or updated (if needed)
  • Action items created and assigned

Decision

Post-Incident Review Requirement: Every Sev1 incident must have a documented post-incident review. Postmortems are blameless and focus on learning, not blame. Runbooks must be updated based on postmortem findings.

See: How to Write a Good ADR for ADR guidance.

See: How to Write a Good BDR for BDR guidance.