
Runbook – Incident Response

This document defines the incident response process: severity levels, triage steps, communication patterns, and roles and responsibilities. It is written for ops engineers, SREs, on-call engineers, and anyone responding to production incidents.

Incident response ensures rapid detection, diagnosis, mitigation, and learning. This runbook applies to the AI Factory, the core platforms, and production SaaS solutions built on them.

Note

Incident vs Bug: Smaller internal hiccups may be tracked as normal bugs, not incidents. Incidents are SLA breaches, customer-facing outages, or security issues that require immediate response.

Scope and Definitions

What Counts as an Incident

Incident Criteria:

  • SLA Breach - Service availability or performance below SLO
  • Customer-Facing Outage - Service unavailable or degraded for customers
  • Security Issue - Security breach, data exposure, or unauthorized access
  • Data Loss Risk - Potential data loss or corruption
  • Critical System Failure - Factory, Identity, or Audit Platform down

Applies To:

  • AI Factory (runs, orchestration, storage)
  • Core Platforms (Identity, Audit, Config, Bot)
  • Production SaaS solutions built on ConnectSoft platforms

Not Incidents:

  • Minor bugs affecting single users
  • Non-critical feature issues
  • Internal tooling problems (unless blocking production)
  • Planned maintenance windows

See: Support and SLA Policy for SLA definitions.

Severity Levels

Severity Definitions

| Severity | Description | Example | Response Time | Resolution Target |
| --- | --- | --- | --- | --- |
| Sev1 | Critical outage, major customer impact | Auth down, all tenants affected | Immediate (< 15 min) | < 1 hour |
| Sev2 | Degraded service, limited impact | One region/product degraded, workaround available | < 30 minutes | < 4 hours |
| Sev3 | Minor issues, non-urgent | Intermittent errors, low impact | < 4 hours | < 24 hours |

Sev1 Examples:

  • Identity Platform completely down (no logins possible)
  • Factory completely unavailable (no generations possible)
  • Data corruption or loss risk
  • Security breach

Sev2 Examples:

  • High error rate (> 10%) affecting some tenants
  • Performance degradation (latency > 2x normal)
  • Partial service outage (one region or feature)
  • Data inconsistency

Sev3 Examples:

  • Intermittent errors affecting few users
  • Non-critical feature broken
  • Minor performance issues
  • Low-impact bugs

Important

Paging Requirements: Sev1 incidents require immediate paging of Ops + Dmitry + On-call dev. Sev2 incidents page Ops + On-call dev. Sev3 incidents are handled async by Ops. Always err on the side of paging for unclear severity.
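
The severity matrix and paging rules above translate naturally into configuration for alerting glue or a chat-ops bot. The following is a minimal sketch only: the values mirror the table and the paging note, but the `SeverityPolicy` structure, the role names, and the `who_to_page` helper are illustrative assumptions, not an existing internal API.

```python
# Minimal sketch: severity definitions and paging targets from the matrix above.
# SeverityPolicy, SEVERITY_POLICIES, and who_to_page are illustrative names only.
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class SeverityPolicy:
    pages: tuple[str, ...]        # roles paged when the incident is opened
    response_time: timedelta      # time to first response
    resolution_target: timedelta  # target time to resolve or mitigate

SEVERITY_POLICIES = {
    "Sev1": SeverityPolicy(("Ops", "Dmitry", "On-call dev"),
                           response_time=timedelta(minutes=15),
                           resolution_target=timedelta(hours=1)),
    "Sev2": SeverityPolicy(("Ops", "On-call dev"),
                           response_time=timedelta(minutes=30),
                           resolution_target=timedelta(hours=4)),
    "Sev3": SeverityPolicy(("Ops",),  # Sev3 is handled async by Ops
                           response_time=timedelta(hours=4),
                           resolution_target=timedelta(hours=24)),
}

def who_to_page(severity: str) -> tuple[str, ...]:
    # Unclear or unknown severity: err on the side of paging, i.e. treat it as Sev1.
    return SEVERITY_POLICIES.get(severity, SEVERITY_POLICIES["Sev1"]).pages
```

The fallback to Sev1 in `who_to_page` encodes the rule above: when severity is unclear, err on the side of paging.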

Triage and Initial Response

Triage Flow

Step-by-Step Triage:

  1. Acknowledge Alert and Assign Incident Commander (IC)
       • First responder acknowledges the alert
       • First responder assigns the incident commander (IC) role
       • IC coordinates response and communication

  2. Confirm Severity Level Using Predefined Criteria
       • Review severity definitions
       • Assess customer impact
       • Assign severity level (Sev1/2/3)

  3. Check Dashboards, Logs, and Alerts for Quick Diagnosis
       • Review dashboards for anomalies
       • Check error logs for patterns
       • Review recent deployments or changes
       • Check traces (if available)

  4. Decide: Mitigate, Rollback, or Escalate
       • Mitigate - Quick fix or workaround available
       • Rollback - Recent deployment likely cause
       • Escalate - Requires deeper investigation or expertise

  5. Update Incident Log with Timestamps and Actions (see the sketch after this list)
       • Document all actions taken
       • Record timestamps for key events
       • Update status and next steps
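
Step 5 is easier to follow when every action is appended to a structured, timestamped log as it happens rather than reconstructed afterwards. Below is a minimal sketch assuming an append-only JSON-lines file; the file path, field names, and `log_action` helper are illustrative, not a prescribed format.

```python
# Minimal sketch: append-only incident log with timestamps and actions.
# The file path, field names, and helper are assumptions for illustration.
import json
from datetime import datetime, timezone

INCIDENT_LOG = "incident-log.jsonl"  # hypothetical per-incident log file

def log_action(actor: str, action: str, status: str, next_steps: str = "") -> None:
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # when the key event happened
        "actor": actor,            # e.g. "IC", "On-call dev"
        "action": action,          # what was done
        "status": status,          # Investigating / Mitigating / Resolved
        "next_steps": next_steps,
    }
    with open(INCIDENT_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

# Example:
# log_action("IC", "Rolled back latest Factory deployment", "Mitigating",
#            "Watch error rate for 30 minutes")
```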

Triage Checklist:

  • Alert acknowledged, IC assigned
  • Severity level confirmed
  • Dashboards checked for anomalies
  • Logs reviewed for error patterns
  • Recent changes reviewed
  • Decision made (mitigate/rollback/escalate)
  • Incident log updated

Warning

No Blind Patching: Guessing or patching blindly in production is not allowed. All actions must be traceable and documented. If unsure, escalate or rollback rather than guessing.

See: Deployment and Rollback Runbook for rollback procedures.

See: Observability – Dashboards and Alerts for monitoring details.

Communication and Stakeholder Management

Communication Channels

Internal Tech Team:

  • Channel: Slack/Teams incident channel
  • Content: Symptoms, suspected cause, actions taken
  • Cadence: Every 15–30 minutes for Sev1, hourly for Sev2

Business/Management:

  • Channel: Email/Slack summary
  • Content: Impact summary, estimated restore time
  • Cadence: Initial notification, then updates every hour for Sev1

Affected Customers (if applicable):

  • Channel: Status page / email
  • Content: Status and workaround (if available)
  • Cadence: Initial notification, updates every 30–60 minutes

Communication Table

| Audience | What They Need | Channel | Cadence |
| --- | --- | --- | --- |
| Internal Tech | Symptoms, suspected cause, actions | Slack/Teams, incident doc | Every 15–30 min (Sev1) |
| Business/Management | Impact summary, estimated restore | Email/Slack summary | Hourly (Sev1) |
| Affected Customers | Status and workaround | Status page / email | Every 30–60 min |
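
The cadence column can also be enforced mechanically, for example by a reminder that computes when the next update is due for each audience. The sketch below is illustrative: the intervals take the lower bound of each documented range, and the audience keys and `next_update_due` helper are assumptions.

```python
# Minimal sketch: next-update reminders derived from the communication table above.
# Intervals use the lower bound of each documented range; names are illustrative.
from datetime import datetime, timedelta, timezone

UPDATE_CADENCE = {
    ("internal-tech", "Sev1"): timedelta(minutes=15),  # every 15–30 min
    ("internal-tech", "Sev2"): timedelta(hours=1),     # hourly
    ("business", "Sev1"): timedelta(hours=1),          # hourly
    ("customers", "Sev1"): timedelta(minutes=30),      # every 30–60 min
}

def next_update_due(audience: str, severity: str, last_update: datetime) -> datetime | None:
    interval = UPDATE_CADENCE.get((audience, severity))
    return last_update + interval if interval is not None else None

# Example: next_update_due("customers", "Sev1", datetime.now(timezone.utc))
```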

Incident Status Update Template

Template:

## Incident Status Update

**Time:** [Timestamp]
**Severity:** [Sev1/2/3]
**Status:** [Investigating / Mitigating / Resolved]

**Summary:**
[One to two sentences describing the issue and current status]

**Impact:**
- [Affected services/features]
- [Affected users/tenants]
- [Estimated restore time]

**Actions Taken:**
- [Action 1]
- [Action 2]
- [Next steps]

**Next Update:**
[When next update will be provided]
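
If updates are posted by a script or bot, the same template can be filled in programmatically so every update has a consistent shape. This is only one possible approach; the field names mirror the template above, and `render_status_update` is an illustrative helper, not an existing tool.

```python
# Minimal sketch: rendering the status update template above from a dict of fields.
# Field names mirror the template; how the result is posted is out of scope.
STATUS_UPDATE_TEMPLATE = """\
## Incident Status Update

**Time:** {time}
**Severity:** {severity}
**Status:** {status}

**Summary:**
{summary}

**Impact:**
{impact}

**Actions Taken:**
{actions}

**Next Update:**
{next_update}
"""

def render_status_update(fields: dict[str, str]) -> str:
    return STATUS_UPDATE_TEMPLATE.format(**fields)
```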

Tip

Status Update Template: Use a simple template for incident status updates (one to two paragraphs, bullets for actions). Keep updates concise and focused on what stakeholders need to know.

Post-Incident Review and Follow-Up

Post-Incident Review Process

Review Timeline:

  • Sev1 - Postmortem within 48 hours
  • Sev2 - Postmortem within 1 week
  • Sev3 - Optional postmortem or async review

Postmortem Structure:

  1. Incident Summary
       • What happened
       • Timeline of events
       • Impact assessment

  2. Root Cause Analysis
       • Root cause identified
       • Contributing factors
       • Why it wasn't prevented

  3. Remediations
       • Immediate fixes applied
       • Short-term mitigations
       • Long-term prevention steps

  4. Action Items
       • Tasks created for fixes
       • Runbook updates needed
       • Process improvements
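
The review timeline and the structure above lend themselves to a small helper that opens a postmortem skeleton with the correct due date as soon as an incident is resolved. The sketch below mirrors the section headings and deadlines from this runbook; everything else (the function name, the incident ID format, the Sev3 handling) is an illustrative assumption.

```python
# Minimal sketch: postmortem skeleton and due date from the timeline and structure above.
# Section headings mirror the runbook; names and the Sev3 handling are illustrative.
from datetime import datetime, timedelta, timezone

POSTMORTEM_DEADLINE = {
    "Sev1": timedelta(hours=48),  # postmortem within 48 hours
    "Sev2": timedelta(weeks=1),   # postmortem within 1 week
    # Sev3: optional postmortem or async review, so no hard deadline
}

SECTIONS = ("Incident Summary", "Root Cause Analysis", "Remediations", "Action Items")

def postmortem_skeleton(incident_id: str, severity: str, resolved_at: datetime) -> str:
    deadline = POSTMORTEM_DEADLINE.get(severity)
    due = (resolved_at + deadline).date().isoformat() if deadline else "optional / async review"
    lines = [f"# Postmortem: {incident_id} ({severity})", f"Due: {due}", ""]
    for section in SECTIONS:
        lines.append(f"## {section}")
        lines.append("")  # filled in during the blameless review
    return "\n".join(lines)

# Example: print(postmortem_skeleton("INC-123", "Sev1", datetime.now(timezone.utc)))
```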

Post-Incident Checklist:

  • Postmortem scheduled and conducted
  • Root cause identified and documented
  • Contributing factors documented
  • Remediations planned and tracked
  • Runbooks updated with learnings
  • ADRs/BDRs created or updated (if needed)
  • Action items created and assigned

Decision

Post-Incident Review Requirement: Every Sev1 incident must have a documented post-incident review. Postmortems are blameless and focus on learning, not blame. Runbooks must be updated based on postmortem findings.

See: How to Write a Good ADR for ADR guidance.

See: How to Write a Good BDR for BDR guidance.