Runbook – Deployment and Rollback¶
This document defines a standard deployment checklist for Factory-generated services and a rollback plan, including required telemetry to make decisions safely. It is written for ops engineers, SREs, and developers performing deployments.
This runbook applies to Factory-generated microservices and platform services. It assumes CI/CD pipelines (e.g., Azure DevOps) are in place with standard build, test, and deploy steps.
Note
Project-Specific Details: Deployment details can be specialized per project, but this runbook provides the default template. All Factory-generated services should follow these principles even if implementation details vary.
Scope and Assumptions¶
Scope¶
Applies To:
- Factory-generated microservices
- Platform services (Identity, Audit, Config, Bot) where possible
- SaaS solutions built on ConnectSoft platforms
Assumes:
- CI/CD pipelines are in place (e.g., Azure DevOps)
- Standard deployment steps (build, test, deploy)
- Health checks are implemented
- Observability is configured (metrics, logs, traces)
Assumptions¶
Infrastructure:
- Infrastructure-as-Code (Bicep/Terraform) for infrastructure changes
- Container-based deployments (Docker images)
- Blue/green or rolling update deployment strategy
Observability:
- Health check endpoints available
- Metrics exported (request rate, error rate, latency)
- Logging configured with correlation IDs
- Tracing available (if applicable)
See: Microservice Template for a template with health checks and metrics.
See: Observability-Driven Design for observability principles.
Pre-Deployment Checklist¶
Pre-Deployment Requirements:
- All tests (unit/integration) are passing
- ADRs/BDRs relevant to this change are linked in the PR or ticket
- Deployment artifacts built and stored (images, packages)
- Backups/snapshots taken for critical data stores, if applicable
- Feature flags and config entries reviewed
- Change communicated to relevant stakeholders (if production)
- Rollback plan documented and tested (if applicable)
- Database migrations reviewed and rollback plan exists (if applicable)
Important
Production Deployment Gate: Deployments to production MUST NOT proceed if the pre-deployment checklist is not complete. Incomplete checklists lead to incidents and rollbacks.
See: How to Write a Good ADR for ADR guidance.
See: How to Write a Good BDR for BDR guidance.
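The production gate described above can be enforced mechanically in a pipeline step. The following is a minimal sketch, assuming the checklist is recorded as a mapping of item names to booleans. The item keys and the `missing_items` and `gate_deployment` functions are hypothetical names chosen for illustration, not part of any Factory tooling.

```python
# Hypothetical keys mirroring the pre-deployment checklist in this runbook.
REQUIRED_ITEMS = (
    "tests_passing",
    "adrs_bdrs_linked",
    "artifacts_stored",
    "feature_flags_reviewed",
)
# Items the checklist marks "if applicable" or "if production".
CONDITIONAL_ITEMS = (
    "backups_taken",
    "stakeholders_notified",
    "rollback_plan_tested",
    "migrations_reviewed",
)


def missing_items(checklist: dict[str, bool], applicable: set[str]) -> list[str]:
    """Return the items that apply to this change but are not confirmed."""
    required = list(REQUIRED_ITEMS) + [i for i in CONDITIONAL_ITEMS if i in applicable]
    return [item for item in required if not checklist.get(item, False)]


def gate_deployment(environment: str, checklist: dict[str, bool], applicable: set[str]) -> bool:
    """Block production deployments while any applicable item is open.

    Non-production environments are allowed through in this sketch;
    a stricter policy could block them as well.
    """
    missing = missing_items(checklist, applicable)
    return not (environment == "production" and missing)
```

A pipeline step could call `gate_deployment("production", checklist, {"backups_taken", "migrations_reviewed"})` and fail the stage when it returns `False`, printing the result of `missing_items` so the deployer knows what is still open.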
Deployment Procedure¶
Generic Deployment Steps¶
Standard Procedure:
1. Confirm Environment and Version
   - Verify target environment (dev/test/staging/production)
   - Confirm deployment version/artifact
   - Check deployment pipeline status
2. Apply Infrastructure Changes (if any)
   - Review infrastructure changes (IaC)
   - Apply infrastructure updates first (if needed)
   - Verify infrastructure health
3. Deploy Service
   - Trigger deployment via CI/CD pipeline
   - Use blue/green or rolling update strategy
   - Monitor initial logs/health during rollout
4. Validate Health Checks Pass
   - Wait for health checks to pass
   - Verify liveness and readiness endpoints
   - Check initial metrics for anomalies
Deployment Flow:
flowchart TD
    A[Start Deployment] --> B[Confirm Environment & Version]
    B --> C[Apply Infrastructure Changes]
    C --> D[Deploy Service]
    D --> E[Monitor Health Checks]
    E --> F{Health Checks Pass?}
    F -->|Yes| G[Post-Deployment Verification]
    F -->|No| H[Rollback]
    G --> I[Deployment Complete]
See: Post-Deployment Verification for verification steps.
See: Rollback Strategy and Steps for rollback procedures.
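The "wait for health checks, then verify or roll back" decision in the deployment flow can be sketched as a polling loop with a timeout. This is an illustrative sketch, not the pipeline's actual implementation: the function names, the default timeout, the polling interval, and the requirement of three consecutive successes are all assumptions. The clock and sleep functions are injectable so the loop can be exercised without real waiting.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str, timeout_s: float = 2.0) -> bool:
    """Return True only if a GET to `url` answers 200 OK."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError (4xx/5xx) and socket timeouts all derive from OSError.
        return False


def wait_for_healthy(
    probe: Callable[[], bool],
    timeout_s: float = 120.0,        # assumed default
    interval_s: float = 5.0,         # assumed default
    required_successes: int = 3,     # assumed: avoid trusting a single lucky probe
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it succeeds `required_successes` times in a row.

    Returns False if the next poll would fall after the timeout.
    """
    deadline = clock() + timeout_s
    streak = 0
    while True:
        streak = streak + 1 if probe() else 0
        if streak >= required_successes:
            return True
        if clock() + interval_s > deadline:
            return False
        sleep(interval_s)


def next_step(healthy: bool) -> str:
    """Mirror the decision node in the deployment flow."""
    return "post_deployment_verification" if healthy else "rollback"
```

In practice the probe would combine both endpoints, for example `lambda: http_probe(base + "/health") and http_probe(base + "/ready")`, where `base` is a placeholder for the service URL.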
Post-Deployment Verification¶
Post-Deployment Checklist:
- Service health checks are green
- No spike in error rate or latency
- Key business flows tested (smoke tests)
- Dashboards reflect expected metrics
- Logs show no new critical errors
- Custom KPIs are within expected ranges
- No resource saturation (CPU, memory, queues)
Verification Steps:
1. Health Checks
   - Verify /health endpoint returns 200 OK
   - Verify /ready endpoint returns 200 OK
   - Check health check metrics in dashboard
2. Error Rate
   - Check error rate metric (should be < 1%)
   - Review error logs for new error patterns
   - Verify no increase in 5xx errors
3. Latency
   - Check p50, p95, p99 latency metrics
   - Verify latency is within SLO thresholds
   - Compare to pre-deployment baseline
4. Business Flows
   - Run smoke tests for critical user flows
   - Verify key business operations work
   - Check domain-specific metrics
5. Resource Usage
   - Monitor CPU, memory, disk usage
   - Check queue backlogs (if applicable)
   - Verify no resource saturation
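The metric checks in the verification steps can be expressed as a comparison of a post-deployment snapshot against thresholds and a pre-deployment baseline. The sketch below is illustrative only. The `Snapshot` shape and the `verify_release` function are hypothetical, the 1% error rate comes from the steps above, the 500 ms p95 figure is only an example SLO, and the latency regression tolerance and CPU threshold are assumptions to be set per service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Metrics sampled over one observation window (hypothetical shape)."""
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float
    cpu_utilization: float  # fraction of allocated CPU, 0.0 to 1.0


def verify_release(
    current: Snapshot,
    baseline: Snapshot,
    max_error_rate: float = 0.01,          # "should be < 1%"
    slo_p95_ms: float = 500.0,             # example SLO; set per service
    max_latency_regression: float = 1.2,   # assumed tolerance vs. baseline
    max_cpu: float = 0.85,                 # assumed saturation threshold
) -> list[str]:
    """Return a list of findings; an empty list means verification passed."""
    findings = []
    if current.error_rate >= max_error_rate:
        findings.append(f"error rate {current.error_rate:.2%} is not below {max_error_rate:.2%}")
    if current.p95_latency_ms > slo_p95_ms:
        findings.append(f"p95 latency {current.p95_latency_ms:.0f} ms exceeds SLO {slo_p95_ms:.0f} ms")
    if current.p95_latency_ms > baseline.p95_latency_ms * max_latency_regression:
        findings.append("p95 latency regressed against the pre-deployment baseline")
    if current.cpu_utilization > max_cpu:
        findings.append(f"CPU utilization {current.cpu_utilization:.0%} suggests saturation")
    return findings
```

Returning findings rather than a single boolean keeps the output usable in a deployment log: each entry names the check that failed, which helps when deciding between mitigation and rollback.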
Tip
Automated Smoke Tests: Automate smoke tests where possible. Automated tests catch issues faster than manual verification and can be integrated into deployment pipelines.
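One way to automate smoke tests is a small runner that executes every check and reports all failures instead of stopping at the first one. This is a minimal sketch: `run_smoke_tests` is a hypothetical helper, and each check is assumed to be a callable that raises an exception when a business flow is broken.

```python
from typing import Callable


def run_smoke_tests(checks: dict[str, Callable[[], None]]) -> dict[str, str]:
    """Run every check and return {name: error message} for the failures.

    A check passes if it returns without raising.
    """
    failures = {}
    for name, check in checks.items():
        try:
            check()
        except Exception as exc:
            # Keep going so one run reports every broken flow.
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures
```

A pipeline stage could register one check per critical user flow and fail when the returned mapping is non-empty.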
See: Observability – Dashboards and Alerts for dashboard and alert details.
Rollback Strategy and Steps¶
Rollback Strategy¶
Preferred Strategy:
- Quick Rollback - Revert to the previous release if issues can't be mitigated within a defined time window
- Rollback Triggers - Sustained elevated errors, severe incidents, or failed health checks
- Rollback Window - Rollback should be possible within minutes, not hours
Rollback Decision Criteria:
- Error rate > 5% sustained for > 5 minutes
- Latency sustained at more than 2x the pre-deployment baseline
- Health checks failing for > 2 minutes
- Critical business flows broken
- Data integrity concerns
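The quantitative criteria above all follow the same pattern: a condition that has held continuously for some duration. The sketch below shows one way to evaluate that over a series of metric samples. The `Sample` shape and function names are hypothetical, the latency window is an assumption (the criteria do not state one), and reaching the stated duration is treated as sustained. Broken business flows and data integrity concerns are judgment calls and are not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Sample:
    timestamp_s: float
    error_rate: float       # fraction, 0.0 to 1.0
    p95_latency_ms: float
    healthy: bool


def sustained(samples: Sequence[Sample], breached: Callable[[Sample], bool], duration_s: float) -> bool:
    """True if `breached` held for every sample in the trailing `duration_s`.

    `samples` must be ordered oldest to newest.
    """
    if not samples:
        return False
    newest = samples[-1].timestamp_s
    start = None
    for sample in reversed(samples):
        if not breached(sample):
            break
        start = sample.timestamp_s
    return start is not None and newest - start >= duration_s


def rollback_reasons(
    samples: Sequence[Sample],
    baseline_p95_ms: float,
    latency_window_s: float = 300.0,  # assumption: no window is given for latency
) -> list[str]:
    """Return the rollback criteria that are currently met."""
    reasons = []
    if sustained(samples, lambda s: s.error_rate > 0.05, 300.0):
        reasons.append("error rate > 5% for 5 minutes")
    if sustained(samples, lambda s: s.p95_latency_ms > 2 * baseline_p95_ms, latency_window_s):
        reasons.append("p95 latency > 2x baseline")
    if sustained(samples, lambda s: not s.healthy, 120.0):
        reasons.append("health checks failing for 2 minutes")
    return reasons
```

Walking backward from the newest sample means a brief spike followed by recovery does not trigger a rollback; only an unbroken run of breaching samples counts.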
Rollback Checklist¶
Rollback Steps:
- Identify last known good version
- Communicate rollback decision to stakeholders
- Trigger rollback in CI/CD (redeploy previous artifacts)
- Monitor health checks, logs, and metrics
- Verify service returns to normal operation
- Log incident and create follow-up ticket for root cause analysis
- Update runbook with learnings
Rollback Procedure:
1. Identify Last Known Good Version
   - Check deployment history
   - Identify previous stable version
   - Verify rollback artifacts are available
2. Communicate Rollback Decision
   - Notify stakeholders (Slack/Teams)
   - Update incident log
   - Document reason for rollback
3. Trigger Rollback
   - Use CI/CD pipeline to redeploy previous version
   - Monitor deployment progress
   - Verify rollback completes successfully
4. Monitor and Verify
   - Check health checks return to normal
   - Verify error rate decreases
   - Confirm latency returns to baseline
   - Run smoke tests to verify functionality
5. Post-Rollback Actions
   - Document incident and rollback
   - Create follow-up ticket for root cause analysis
   - Update runbook with learnings
   - Schedule postmortem if Sev1/Sev2
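The first step of the procedure, identifying the last known good version, can be sketched as a search through deployment history. The `Deployment` record, its status values, and the `last_known_good` function are hypothetical; real history would come from the CI/CD system, whose API is not assumed here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Deployment:
    version: str
    status: str               # assumed values: "succeeded", "failed", "rolled_back"
    artifact_available: bool  # whether the image/package can still be redeployed


def last_known_good(history: Sequence[Deployment], current_version: str) -> Optional[Deployment]:
    """Return the most recent stable deployment other than the current one.

    `history` is ordered oldest to newest. A candidate must have succeeded
    and must still have its artifacts available for redeployment.
    """
    for deployment in reversed(history):
        if deployment.version == current_version:
            continue
        if deployment.status == "succeeded" and deployment.artifact_available:
            return deployment
    return None
```

A `None` result means no redeployable stable version exists, which is worth discovering before a deployment rather than during an incident.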
Warning
Irreversible Migrations: If irreversible database or data migrations were executed, redeploying the previous artifacts may not be enough, and the rollback may also need a data-level plan. Migration ADRs must include a rollback strategy. Always test migration rollback procedures in staging before production.
See: Incident Response Runbook for incident procedures.
Required Telemetry and Observability¶
Minimal Telemetry Requirements¶
Required Signals:
- Health Endpoints - Liveness and readiness checks
- Request Rate - Requests per second
- Error Rate - Errors per second, error percentage
- Latency - p50, p95, p99 latency metrics
- Logs - Structured logs with correlation IDs
- Traces - Distributed tracing (if available)
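To make the error rate and latency signals concrete, the sketch below derives them from raw request data using Python's standard `statistics` module. The function names are hypothetical, and counting only 5xx responses as errors is an assumption; production systems normally compute these values in the metrics backend rather than in application code.

```python
import statistics
from typing import Sequence


def latency_percentiles(latencies_ms: Sequence[float]) -> dict[str, float]:
    """Return p50, p95 and p99 for a window of request latencies.

    Needs at least two data points.
    """
    # n=100 yields 99 cut points; index i-1 holds the i-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def error_rate(status_codes: Sequence[int]) -> float:
    """Fraction of responses with a 5xx status (assumed error definition)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes)
```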
Telemetry Table:
| Signal | Why It Matters | Example |
|---|---|---|
| Health Checks | Basic liveness/readiness | /health endpoint returns 200 OK |
| Error Rate | Detect regressions quickly | Error rate < 1% |
| Latency | Performance regressions | p95 latency < 500ms |
| Request Rate | Traffic patterns and anomalies | Requests/sec within expected range |
| Custom KPIs | Business-level impact | Domain-specific metrics (sign-ins, jobs processed) |
| Logs | Debugging and root cause analysis | Structured logs with correlation IDs |
| Traces | End-to-end request flow | Distributed traces across services |
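As an illustration of the "structured logs with correlation IDs" row, the following sketch emits JSON log lines that carry a correlation ID using only the Python standard library. The field names, the `build_logger` helper, and the use of a context variable to hold the ID are assumptions; Factory-generated services may use a different logging stack.

```python
import contextvars
import json
import logging

# Holds the ID of the request currently being handled; "-" when none is set.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines from the root logger
    return logger
```

A request handler would call `correlation_id.set(...)` with the incoming ID (or a newly generated one) before doing any work, so every log line written during that request can be joined in the log backend during an incident.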
See: Observability – Dashboards and Alerts for dashboard and alert details.
See: Observability-Driven Design for observability principles.
Related Documents¶
- Operations and SRE Overview - Operations overview
- Incident Response Runbook - Incident response procedures
- Observability – Dashboards and Alerts - Monitoring and alerting
- Microservice Template - Template with health checks and metrics
- Observability-Driven Design - Observability principles
- Support and SLA Policy - SLA definitions