Runbook – Deployment and Rollback

This runbook defines a standard deployment checklist and rollback plan for Factory-generated services, together with the telemetry required to make deployment and rollback decisions safely. It is written for ops engineers, SREs, and developers performing deployments.

It applies to Factory-generated microservices and platform services, and assumes that CI/CD pipelines (e.g., Azure DevOps) with standard steps are already in place. Details can be specialized per project, but this is the default template.

Note

Project-Specific Details: Deployment details can be specialized per project, but this runbook provides the default template. All Factory-generated services should follow these principles even if implementation details vary.

Scope and Assumptions

Scope

Applies To:

  • Factory-generated microservices
  • Platform services (Identity, Audit, Config, Bot) where possible
  • SaaS solutions built on ConnectSoft platforms

Assumes:

  • CI/CD pipelines are in place (e.g., Azure DevOps)
  • Pipelines run the standard deployment steps (build, test, deploy)
  • Health checks are implemented
  • Observability is configured (metrics, logs, traces)

Assumptions

Infrastructure:

  • Infrastructure-as-Code (Bicep/Terraform) for infrastructure changes
  • Container-based deployments (Docker images)
  • Blue/green or rolling update deployment strategy

Observability:

  • Health check endpoints available
  • Metrics exported (request rate, error rate, latency)
  • Logging configured with correlation IDs
  • Tracing available (if applicable)

See: Microservice Template for a template that includes health checks and metrics.

See: Observability-Driven Design for observability principles.

Pre-Deployment Checklist

Pre-Deployment Requirements:

  • All tests (unit/integration) are passing
  • ADRs/BDRs relevant to this change are linked in the PR or ticket
  • Deployment artifacts built and stored (images, packages)
  • Backups/snapshots taken for critical data stores, if applicable
  • Feature flags and config entries reviewed
  • Change communicated to relevant stakeholders (if production)
  • Rollback plan documented and tested (if applicable)
  • Database migrations reviewed and rollback plan exists (if applicable)

Important

Production Deployment Gate: Deployments to production MUST NOT proceed if the pre-deployment checklist is not complete. Incomplete checklists lead to incidents and rollbacks.
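
The gate above can be enforced mechanically in a pipeline step. The following is a minimal sketch, not the Factory's actual implementation: `ChecklistItem`, `incomplete_items`, and `may_deploy` are hypothetical names, and letting non-production environments proceed with open items is an assumption, since the runbook only states the rule for production.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    name: str
    done: bool
    applicable: bool = True  # "if applicable" items can be marked not applicable


def incomplete_items(items):
    """Return the names of applicable checklist items that are not done."""
    return [item.name for item in items if item.applicable and not item.done]


def may_deploy(environment, items):
    """Block production when the checklist is incomplete.

    Assumption: other environments may proceed, but the open items are
    still returned so the pipeline can surface them as warnings.
    """
    missing = incomplete_items(items)
    if environment == "production" and missing:
        return False, missing
    return True, missing
```

A pipeline step could call `may_deploy("production", items)` and fail the stage when the first element of the result is `False`, printing the missing items.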

See: How to Write a Good ADR for ADR guidance.

See: How to Write a Good BDR for BDR guidance.

Deployment Procedure

Generic Deployment Steps

Standard Procedure:

  1. Confirm Environment and Version
     • Verify target environment (dev/test/staging/production)
     • Confirm deployment version/artifact
     • Check deployment pipeline status

  2. Apply Infrastructure Changes (if any)
     • Review infrastructure changes (IaC)
     • Apply infrastructure updates first (if needed)
     • Verify infrastructure health

  3. Deploy Service
     • Trigger deployment via CI/CD pipeline
     • Use blue/green or rolling update strategy
     • Monitor initial logs/health during rollout

  4. Validate Health Checks Pass
     • Wait for health checks to pass
     • Verify liveness and readiness endpoints
     • Check initial metrics for anomalies

Deployment Flow:

flowchart TD
    A[Start Deployment] --> B[Confirm Environment & Version]
    B --> C[Apply Infrastructure Changes]
    C --> D[Deploy Service]
    D --> E[Monitor Health Checks]
    E --> F{Health Checks Pass?}
    F -->|Yes| G[Post-Deployment Verification]
    F -->|No| H[Rollback]
    G --> I[Deployment Complete]
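
The flow above can also be read as a small orchestration function. This is an illustrative sketch only: `run_deployment` and its callable parameters are hypothetical stand-ins for whatever the CI/CD pipeline actually invokes at each stage.

```python
def run_deployment(confirm, apply_infra, deploy, health_ok, verify, rollback):
    """Run the deployment stages in flowchart order.

    Each argument is a zero-argument callable supplied by the caller.
    health_ok must return True or False; a False result triggers rollback
    instead of post-deployment verification.
    """
    confirm()        # Confirm Environment & Version
    apply_infra()    # Apply Infrastructure Changes
    deploy()         # Deploy Service
    if not health_ok():  # Health Checks Pass?
        rollback()
        return "rolled_back"
    verify()         # Post-Deployment Verification
    return "complete"
```

Passing the stages in as callables keeps the ordering logic separate from the pipeline tooling, so the same flow can be exercised in a dry run.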

See: Post-Deployment Verification for verification steps.

See: Rollback Strategy and Steps for rollback procedures.

Post-Deployment Verification

Post-Deployment Checklist:

  • Service health checks are green
  • No spike in error rate or latency
  • Key business flows tested (smoke tests)
  • Dashboards reflect expected metrics
  • Logs show no new critical errors
  • Custom KPIs are within expected ranges
  • No resource saturation (CPU, memory, queues)

Verification Steps:

  1. Health Checks
     • Verify /health endpoint returns 200 OK
     • Verify /ready endpoint returns 200 OK
     • Check health check metrics in dashboard

  2. Error Rate
     • Check error rate metric (should be < 1%)
     • Review error logs for new error patterns
     • Verify no increase in 5xx errors

  3. Latency
     • Check p50, p95, p99 latency metrics
     • Verify latency is within SLO thresholds
     • Compare to pre-deployment baseline

  4. Business Flows
     • Run smoke tests for critical user flows
     • Verify key business operations work
     • Check domain-specific metrics

  5. Resource Usage
     • Monitor CPU, memory, disk usage
     • Check queue backlogs (if applicable)
     • Verify no resource saturation
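
The health check step lends itself to automation. The sketch below probes the /health and /ready endpoints named above using only the Python standard library; `http_status` and `check_endpoints` are hypothetical helper names, and the base URL is supplied by the caller.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # non-2xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return None      # DNS failure, refused connection, timeout, etc.


def check_endpoints(base_url, fetch=http_status, paths=("/health", "/ready")):
    """Map each path to True when it answers 200 OK, otherwise False."""
    base = base_url.rstrip("/")
    return {path: fetch(base + path) == 200 for path in paths}
```

The `fetch` parameter exists so the check can be tested without a network; in a pipeline, `all(check_endpoints(base_url).values())` would gate the next step.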

Tip

Automated Smoke Tests: Automate smoke tests where possible. Automated tests catch issues faster than manual verification and can be integrated into deployment pipelines.

See: Observability – Dashboards and Alerts for dashboard and alert details.

Rollback Strategy and Steps

Rollback Strategy

Preferred Strategy:

  • Quick Rollback - Revert to the previous release if issues can't be mitigated within a defined time window
  • Rollback Triggers - Sustained elevated errors, severe incidents, or failed health checks
  • Rollback Window - Rollback should be possible within minutes, not hours

Rollback Decision Criteria:

  • Error rate > 5% sustained for > 5 minutes
  • Latency sustained at more than 2x the pre-deployment baseline
  • Health checks failing for > 2 minutes
  • Critical business flows broken
  • Data integrity concerns
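
The criteria above can be evaluated against a series of metric samples. This is a minimal sketch under stated assumptions: `Sample`, `sustained`, and `rollback_reasons` are hypothetical names, p95 is assumed as the latency measure, and because the runbook gives no duration for the latency criterion, the caller must supply `latency_window_s`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    t: float           # seconds, increasing from one sample to the next
    error_rate: float  # fraction of failed requests (0.05 == 5%)
    p95_ms: float      # assumption: p95 is the latency measure compared
    healthy: bool      # result of the health check at time t


def sustained(samples, predicate, min_seconds):
    """True if predicate holds for the trailing run of samples and that
    run spans more than min_seconds."""
    run_start = None
    for sample in reversed(samples):
        if not predicate(sample):
            break
        run_start = sample.t
    return run_start is not None and samples[-1].t - run_start > min_seconds


def rollback_reasons(samples, baseline_p95_ms, latency_window_s,
                     business_flows_ok=True, data_integrity_ok=True):
    """Return the rollback criteria that are currently met (empty if none)."""
    reasons = []
    if sustained(samples, lambda s: s.error_rate > 0.05, 5 * 60):
        reasons.append("error rate > 5% for > 5 minutes")
    if sustained(samples, lambda s: s.p95_ms > 2 * baseline_p95_ms,
                 latency_window_s):
        reasons.append("latency > 2x baseline sustained")
    if sustained(samples, lambda s: not s.healthy, 2 * 60):
        reasons.append("health checks failing for > 2 minutes")
    if not business_flows_ok:
        reasons.append("critical business flows broken")
    if not data_integrity_ok:
        reasons.append("data integrity concerns")
    return reasons
```

A non-empty result is a signal to start the rollback checklist; the decision itself stays with the on-call engineer.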

Rollback Checklist

Rollback Steps:

  • Identify last known good version
  • Communicate rollback decision to stakeholders
  • Trigger rollback in CI/CD (redeploy previous artifacts)
  • Monitor health checks, logs, and metrics
  • Verify service returns to normal operation
  • Log incident and create follow-up ticket for root cause analysis
  • Update runbook with learnings

Rollback Procedure:

  1. Identify Last Known Good Version
     • Check deployment history
     • Identify previous stable version
     • Verify rollback artifacts are available

  2. Communicate Rollback Decision
     • Notify stakeholders (Slack/Teams)
     • Update incident log
     • Document reason for rollback

  3. Trigger Rollback
     • Use CI/CD pipeline to redeploy previous version
     • Monitor deployment progress
     • Verify rollback completes successfully

  4. Monitor and Verify
     • Check health checks return to normal
     • Verify error rate decreases
     • Confirm latency returns to baseline
     • Run smoke tests to verify functionality

  5. Post-Rollback Actions
     • Document incident and rollback
     • Create follow-up ticket for root cause analysis
     • Update runbook with learnings
     • Schedule postmortem if Sev1/Sev2
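
The first step of the procedure, identifying the last known good version, can be sketched as a lookup over deployment history. The `Deployment` record and its fields are hypothetical; real history would come from the CI/CD system, and what counts as "stable" is assumed here to be a flag set after post-deployment verification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    version: str
    stable: bool              # assumption: set after post-deployment verification
    artifact_available: bool  # rollback artifacts still exist in the registry


def last_known_good(history, current_version):
    """Return the most recent stable deployment before current_version.

    history is ordered oldest to newest. Returns None when no earlier
    deployment is both stable and still has its artifacts available.
    Raises ValueError if current_version is not in history.
    """
    versions = [d.version for d in history]
    current_index = versions.index(current_version)
    for deployment in reversed(history[:current_index]):
        if deployment.stable and deployment.artifact_available:
            return deployment
    return None
```

Returning `None` rather than guessing makes the "no safe rollback target" case explicit, which is when a data-level or forward-fix plan is needed.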

Warning

Irreversible Migrations: If irreversible database or data migrations were executed, redeploying the previous artifact may not be enough, and rollback may need a data-level plan. Migration ADRs must include a rollback strategy. Always test migration rollback procedures in staging before production.

See: Incident Response Runbook for incident procedures.

Required Telemetry and Observability

Minimal Telemetry Requirements

Required Signals:

  • Health Endpoints - Liveness and readiness checks
  • Request Rate - Requests per second
  • Error Rate - Errors per second, error percentage
  • Latency - p50, p95, p99 latency metrics
  • Logs - Structured logs with correlation IDs
  • Traces - Distributed tracing (if available)
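
As an illustration of the logging requirement, the sketch below emits structured JSON logs that carry a correlation ID, using only the Python standard library. `JsonFormatter`, `make_logger`, and `with_correlation` are hypothetical names, and the JSON field names are assumptions; Factory services may use a different logging stack and schema.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via LoggerAdapter's extra; None if the caller did not bind one.
            "correlation_id": getattr(record, "correlation_id", None),
        })


def make_logger(name, stream=None):
    """Create a logger that writes JSON lines to stream (stderr by default)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def with_correlation(logger, correlation_id=None):
    """Bind a correlation ID (generated if omitted) to every log call."""
    cid = correlation_id or str(uuid.uuid4())
    return logging.LoggerAdapter(logger, {"correlation_id": cid})
```

Binding the ID once per request or deployment run means every line from that unit of work can be filtered together during root cause analysis.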

Telemetry Table:

Signal        | Why It Matters                     | Example
--------------|------------------------------------|---------------------------------------------------
Health Checks | Basic liveness/readiness           | /health endpoint returns 200 OK
Error Rate    | Detect regressions quickly         | Error rate < 1%
Latency       | Performance regressions            | p95 latency < 500ms
Request Rate  | Traffic patterns and anomalies     | Requests/sec within expected range
Custom KPIs   | Business-level impact              | Domain-specific metrics (sign-ins, jobs processed)
Logs          | Debugging and root cause analysis  | Structured logs with correlation IDs
Traces        | End-to-end request flow            | Distributed traces across services
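
To make the error rate and latency rows concrete, the sketch below derives both from raw request records and compares them with the example thresholds in the table. It is illustrative only: `percentile`, `summarize`, and `within_thresholds` are hypothetical names, counting only 5xx responses as errors is an assumption, and the nearest-rank percentile method may differ from what a given metrics backend reports.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct
    percent of all samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests):
    """Summarize (status_code, latency_ms) pairs into the core signals."""
    requests = list(requests)
    if not requests:
        raise ValueError("no requests")
    errors = sum(1 for status, _ in requests if status >= 500)  # 5xx only
    latencies = [latency_ms for _, latency_ms in requests]
    return {
        "error_rate": errors / len(requests),
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }


def within_thresholds(summary, max_error_rate=0.01, max_p95_ms=500):
    """Check the table's example thresholds: error rate < 1%, p95 < 500ms."""
    return (summary["error_rate"] < max_error_rate
            and summary["p95_ms"] < max_p95_ms)
```

The default thresholds mirror the examples in the table; real values should come from each service's SLOs.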

See: Observability – Dashboards and Alerts for dashboard and alert details.

See: Observability-Driven Design for observability principles.