Runbook – Deployment and Rollback¶

This document defines a standard deployment checklist and rollback plan for Factory-generated services, including the telemetry required to make deployment and rollback decisions safely. It is written for ops engineers, SREs, and developers performing deployments.

This runbook applies to Factory-generated microservices and platform services. It assumes CI/CD pipelines are in place (e.g., Azure DevOps) with standard build, test, and deploy steps.

Note

Project-Specific Details: Deployment details can be specialized per project, but this runbook provides the default template. All Factory-generated services should follow these principles even if implementation details vary.

Scope and Assumptions¶

Scope¶

Applies To:

Factory-generated microservices
Platform services (Identity, Audit, Config, Bot) where possible
SaaS solutions built on ConnectSoft platforms

Assumes:

CI/CD pipelines are in place (e.g., Azure DevOps)
Standard deployment steps (build, test, deploy)
Health checks are implemented
Observability is configured (metrics, logs, traces)

Assumptions¶

Infrastructure:

Infrastructure-as-Code (Bicep/Terraform) for infrastructure changes
Container-based deployments (Docker images)
Blue/green or rolling update deployment strategy

Observability:

Health check endpoints available
Metrics exported (request rate, error rate, latency)
Logging configured with correlation IDs
Tracing available (if applicable)

See: Microservice Template for a service template that includes health checks and metrics.

See: Observability-Driven Design for observability principles.

Pre-Deployment Checklist¶

Pre-Deployment Requirements:

All tests (unit/integration) are passing
ADRs/BDRs relevant to this change are linked in the PR or ticket
Deployment artifacts built and stored (images, packages)
Backups/snapshots taken for critical data stores, if applicable
Feature flags and config entries reviewed
Change communicated to relevant stakeholders (if production)
Rollback plan documented and tested (if applicable)
Database migrations reviewed and rollback plan exists (if applicable)

Important

Production Deployment Gate: Deployments to production MUST NOT proceed if the pre-deployment checklist is not complete. Incomplete checklists lead to incidents and rollbacks.
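
The gate above could be enforced as a pipeline step rather than by convention. The following is a minimal sketch, not the Factory's actual implementation: the `ChecklistItem` type, the item names, and the `applicable` flag (used for the "if applicable" items) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    name: str
    done: bool
    applicable: bool = True  # "if applicable" items can be marked not applicable


def incomplete_items(items: list[ChecklistItem]) -> list[str]:
    """Return the names of applicable items that are not yet done."""
    return [item.name for item in items if item.applicable and not item.done]


def production_gate(items: list[ChecklistItem]) -> bool:
    """Allow a production deployment only when no applicable item is open."""
    return not incomplete_items(items)


# Hypothetical checklist state for one release.
checklist = [
    ChecklistItem("tests_passing", done=True),
    ChecklistItem("adrs_linked", done=True),
    ChecklistItem("artifacts_stored", done=True),
    ChecklistItem("backups_taken", done=False, applicable=False),
    ChecklistItem("flags_and_config_reviewed", done=True),
    ChecklistItem("stakeholders_notified", done=False),
]

if __name__ == "__main__":
    print(production_gate(checklist))   # False
    print(incomplete_items(checklist))  # ['stakeholders_notified']
```

Returning the list of open items, not just a boolean, lets the pipeline report exactly what is blocking the release.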

See: How to Write a Good ADR for ADR guidance.

See: How to Write a Good BDR for BDR guidance.

Deployment Procedure¶

Generic Deployment Steps¶

Standard Procedure:

1. Confirm Environment and Version
   - Verify target environment (dev/test/staging/production)
   - Confirm deployment version/artifact
   - Check deployment pipeline status
2. Apply Infrastructure Changes (if any)
   - Review infrastructure changes (IaC)
   - Apply infrastructure updates first (if needed)
   - Verify infrastructure health
3. Deploy Service
   - Trigger deployment via CI/CD pipeline
   - Use blue/green or rolling update strategy
   - Monitor initial logs/health during rollout
4. Validate Health Checks Pass
   - Wait for health checks to pass
   - Verify liveness and readiness endpoints
   - Check initial metrics for anomalies
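
The "wait for health checks to pass" step is often scripted as a polling loop. The sketch below is one possible shape, using only the Python standard library. The function names, the retry counts, and the requirement of three consecutive passes are assumptions, not values defined by this runbook.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout_s: float = 2.0) -> bool:
    """Return True when the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection errors, timeouts, and 4xx/5xx responses count as failures.
        return False


def wait_until_healthy(
    probes: list[Callable[[], bool]],
    attempts: int = 30,
    delay_s: float = 2.0,
    required_consecutive: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll all probes until they pass `required_consecutive` times in a row."""
    streak = 0
    for _ in range(attempts):
        if all(probe() for probe in probes):
            streak += 1
            if streak >= required_consecutive:
                return True
        else:
            streak = 0  # a single failure resets the streak
        sleep(delay_s)
    return False
```

A caller would pass one probe per endpoint, for example `lambda: http_probe(base_url + "/health")` and `lambda: http_probe(base_url + "/ready")`, where `base_url` is the service address in the target environment. Requiring several consecutive passes avoids declaring success on a single lucky response during a rolling update.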

Deployment Flow:

```mermaid
flowchart TD
    A[Start Deployment] --> B[Confirm Environment & Version]
    B --> C[Apply Infrastructure Changes]
    C --> D[Deploy Service]
    D --> E[Monitor Health Checks]
    E --> F{Health Checks Pass?}
    F -->|Yes| G[Post-Deployment Verification]
    F -->|No| H[Rollback]
    G --> I[Deployment Complete]
```
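
The same flow can be expressed as a small orchestration function. This is a sketch under stated assumptions: each step is a hypothetical callable returning `True` on success, and the handling of failures that the flowchart does not cover (before deployment, during deployment, and during verification) is a choice made here for illustration.

```python
from typing import Callable

Step = Callable[[], bool]  # a step returns True on success


def run_deployment(
    confirm: Step,
    apply_infrastructure: Step,
    deploy_service: Step,
    health_checks_pass: Step,
    post_deployment_verification: Step,
    rollback: Callable[[], None],
) -> str:
    """Follow the deployment flow and return the final state."""
    # Assumption: a failure before the service is deployed stops the run
    # without a service rollback (the flowchart does not cover this case).
    if not confirm() or not apply_infrastructure():
        return "aborted"

    # Flowchart branch: failed health checks lead to rollback.
    # Assumption: a failed deploy step is treated the same way.
    if not deploy_service() or not health_checks_pass():
        rollback()
        return "rolled_back"

    # Assumption: failed verification is reported for a human decision
    # instead of triggering an automatic rollback.
    if not post_deployment_verification():
        return "verification_failed"
    return "complete"
```

Passing the steps in as callables keeps the control flow testable without touching a real pipeline.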

See: Post-Deployment Verification for verification steps.

See: Rollback Strategy and Steps for rollback procedures.

Post-Deployment Verification¶

Post-Deployment Checklist:

Service health checks are green
No spike in error rate or latency
Key business flows tested (smoke tests)
Dashboards reflect expected metrics
Logs show no new critical errors
Custom KPIs are within expected ranges
No resource saturation (CPU, memory, queues)

Verification Steps:

1. Health Checks
   - Verify /health endpoint returns 200 OK
   - Verify /ready endpoint returns 200 OK
   - Check health check metrics in dashboard
2. Error Rate
   - Check error rate metric (should be < 1%)
   - Review error logs for new error patterns
   - Verify no increase in 5xx errors
3. Latency
   - Check p50, p95, p99 latency metrics
   - Verify latency is within SLO thresholds
   - Compare to pre-deployment baseline
4. Business Flows
   - Run smoke tests for critical user flows
   - Verify key business operations work
   - Check domain-specific metrics
5. Resource Usage
   - Monitor CPU, memory, disk usage
   - Check queue backlogs (if applicable)
   - Verify no resource saturation
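
Several of the checks above reduce to comparing a post-deployment metrics snapshot against thresholds and the pre-deployment baseline. The following sketch illustrates that comparison. The `Snapshot` fields and finding names are hypothetical; the 1% error rate and 500 ms p95 come from this runbook's examples, while the 1.5x latency tolerance and 90% saturation limit are assumptions that each service would need to set for itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    error_rate: float          # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float
    cpu_utilization: float     # 0.0 to 1.0
    memory_utilization: float  # 0.0 to 1.0


def verification_findings(
    current: Snapshot,
    baseline: Snapshot,
    slo_p95_ms: float = 500.0,        # example SLO from the telemetry table
    max_error_rate: float = 0.01,     # "should be < 1%"
    max_latency_growth: float = 1.5,  # assumed tolerance vs. baseline
    saturation_limit: float = 0.9,    # assumed saturation threshold
) -> list[str]:
    """Return the names of failed checks; an empty list means all passed."""
    findings = []
    if current.error_rate >= max_error_rate:
        findings.append("error_rate")
    if current.p95_latency_ms > slo_p95_ms:
        findings.append("latency_slo")
    if current.p95_latency_ms > baseline.p95_latency_ms * max_latency_growth:
        findings.append("latency_regression")
    if current.cpu_utilization > saturation_limit:
        findings.append("cpu_saturation")
    if current.memory_utilization > saturation_limit:
        findings.append("memory_saturation")
    return findings
```

Health endpoints, smoke tests, and log review are not covered by this sketch and still need their own checks.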

Tip

Automated Smoke Tests: Use automated smoke tests where possible. They catch issues faster than manual verification and can be integrated into deployment pipelines.

See: Observability – Dashboards and Alerts for dashboard and alert details.

Rollback Strategy and Steps¶

Rollback Strategy¶

Preferred Strategy:

Quick Rollback - Revert to the previous release if issues can't be mitigated within a defined time window
Rollback Triggers - Sustained elevated errors, severe incidents, or failed health checks
Rollback Window - Rollback should be possible within minutes, not hours

Rollback Decision Criteria:

Error rate > 5% sustained for > 5 minutes
Latency sustained above 2x the pre-deployment baseline
Health checks failing for > 2 minutes
Critical business flows broken
Data integrity concerns
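
The numeric criteria above can be evaluated mechanically from a series of metric samples. The sketch below shows one way to test for "sustained" breaches. The `Sample` type and reason names are hypothetical, and because the runbook gives no duration for the latency criterion, the 5-minute default used for it here is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Sample:
    timestamp_s: float   # seconds, in ascending order
    error_rate: float    # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float
    healthy: bool


def sustained(
    samples: list[Sample],
    breached: Callable[[Sample], bool],
    longer_than_s: float,
) -> bool:
    """True if the latest unbroken run of breaches spans more than longer_than_s."""
    streak_start = None
    for sample in reversed(samples):
        if not breached(sample):
            break
        streak_start = sample.timestamp_s
    if streak_start is None:
        return False
    return samples[-1].timestamp_s - streak_start > longer_than_s


def rollback_reasons(
    samples: list[Sample],
    baseline_p95_ms: float,
    critical_flow_broken: bool = False,
    data_integrity_concern: bool = False,
    latency_window_s: float = 300.0,  # assumed; the runbook gives no duration
) -> list[str]:
    """Return the rollback criteria that are currently met."""
    reasons = []
    if sustained(samples, lambda s: s.error_rate > 0.05, 300.0):
        reasons.append("error_rate")
    if sustained(samples, lambda s: s.p95_latency_ms > 2 * baseline_p95_ms, latency_window_s):
        reasons.append("latency")
    if sustained(samples, lambda s: not s.healthy, 120.0):
        reasons.append("health_checks")
    if critical_flow_broken:
        reasons.append("critical_flow")
    if data_integrity_concern:
        reasons.append("data_integrity")
    return reasons
```

Any non-empty result is a signal to consider rollback. The last two criteria are passed in as flags because they usually come from smoke tests or human judgment, not from metrics.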

Rollback Checklist¶

Rollback Steps:

Identify last known good version
Communicate rollback decision to stakeholders
Trigger rollback in CI/CD (redeploy previous artifacts)
Monitor health checks, logs, and metrics
Verify service returns to normal operation
Log incident and create follow-up ticket for root cause analysis
Update runbook with learnings

Rollback Procedure:

1. Identify Last Known Good Version
   - Check deployment history
   - Identify previous stable version
   - Verify rollback artifacts are available
2. Communicate Rollback Decision
   - Notify stakeholders (Slack/Teams)
   - Update incident log
   - Document reason for rollback
3. Trigger Rollback
   - Use CI/CD pipeline to redeploy previous version
   - Monitor deployment progress
   - Verify rollback completes successfully
4. Monitor and Verify
   - Check health checks return to normal
   - Verify error rate decreases
   - Confirm latency returns to baseline
   - Run smoke tests to verify functionality
5. Post-Rollback Actions
   - Document incident and rollback
   - Create follow-up ticket for root cause analysis
   - Update runbook with learnings
   - Schedule postmortem if Sev1/Sev2
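
The first step of the procedure, identifying the last known good version, can be sketched as a query over deployment history. The `Deployment` record, its status values, and the sample versions below are hypothetical; a real pipeline would read this data from its own deployment log.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    version: str
    deployed_at: str          # ISO 8601 UTC timestamp, sortable as text
    status: str               # hypothetical values: "stable", "failed"
    artifact_available: bool  # rollback artifact still in the registry


def last_known_good(
    history: list[Deployment], current_version: str
) -> Optional[Deployment]:
    """Pick the most recent stable release, other than the current one,
    whose artifacts can still be redeployed."""
    candidates = [
        d
        for d in history
        if d.version != current_version
        and d.status == "stable"
        and d.artifact_available
    ]
    return max(candidates, key=lambda d: d.deployed_at, default=None)


# Hypothetical deployment history; "1.7.0" is the release being rolled back.
history = [
    Deployment("1.4.0", "2024-05-01T10:00:00Z", "stable", True),
    Deployment("1.5.0", "2024-05-08T10:00:00Z", "stable", False),
    Deployment("1.6.0", "2024-05-15T10:00:00Z", "failed", True),
    Deployment("1.7.0", "2024-05-22T10:00:00Z", "stable", True),
]

if __name__ == "__main__":
    target = last_known_good(history, current_version="1.7.0")
    print(target.version if target else "no rollback target")  # 1.4.0
```

A `None` result means no redeployable stable release exists, which is worth discovering before an incident, not during one.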

Warning

Irreversible Migrations: If irreversible database or data migrations were executed, redeploying the previous version may not be enough, and rollback may need a data-level plan. Migration ADRs must include a rollback strategy. Always test migration rollback procedures in staging before production.

See: Incident Response Runbook for incident procedures.

Required Telemetry and Observability¶

Minimal Telemetry Requirements¶

Required Signals:

Health Endpoints - Liveness and readiness checks
Request Rate - Requests per second
Error Rate - Errors per second, error percentage
Latency - p50, p95, p99 latency metrics
Logs - Structured logs with correlation IDs
Traces - Distributed tracing (if available)
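
To make the request rate, error rate, and latency signals concrete, the sketch below derives them from raw request records using the standard library. It assumes that errors are counted as 5xx responses and that percentiles use the "inclusive" interpolation method; a metrics backend may define both differently.

```python
import statistics


def summarize(
    durations_ms: list[float], status_codes: list[int], window_s: float
) -> dict[str, float]:
    """Compute the required signals for one observation window."""
    if len(durations_ms) < 2 or len(durations_ms) != len(status_codes):
        raise ValueError("need at least two requests with matching records")

    total = len(status_codes)
    errors = sum(1 for code in status_codes if code >= 500)  # assumption: 5xx only

    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(durations_ms, n=100, method="inclusive")

    return {
        "request_rate_per_s": total / window_s,
        "errors_per_s": errors / window_s,
        "error_percentage": 100 * errors / total,
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }
```

Percentiles are reported alongside the error percentage because an average latency can hide a slow tail that only p95 and p99 reveal.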

Telemetry Table:

Signal	Why It Matters	Example
Health Checks	Basic liveness/readiness	`/health` endpoint returns 200 OK
Error Rate	Detect regressions quickly	Error rate < 1%
Latency	Performance regressions	p95 latency < 500ms
Request Rate	Traffic patterns and anomalies	Requests/sec within expected range
Custom KPIs	Business-level impact	Domain-specific metrics (sign-ins, jobs processed)
Logs	Debugging and root cause analysis	Structured logs with correlation IDs
Traces	End-to-end request flow	Distributed traces across services
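
As an illustration of the "structured logs with correlation IDs" signal, the following sketch emits JSON log lines that carry a per-request correlation ID. The field names, the `correlation_id` context variable, and the logger name are assumptions; Factory-generated services may use a different logging stack and schema.

```python
import json
import logging
import uuid
from contextvars import ContextVar

# Holds the ID of the request currently being handled ("-" when unset).
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("deployment-demo")  # hypothetical logger name
    correlation_id.set(str(uuid.uuid4()))  # normally taken from a request header
    log.info("rollout %s", "started")
```

Because every line for a request shares the same `correlation_id`, logs from different services can be joined during root cause analysis.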