Operations and SRE Overview
This document explains how operations and SRE fit with the Factory, platforms, and SaaS solutions. It defines responsibilities, environments, reliability principles, and collaboration patterns, and is written for ops engineers, SREs, architects, developers, and anyone responsible for running ConnectSoft systems in production.
Operations keeps the Factory, platforms, and SaaS solutions healthy in production: the Factory accelerates build-time, and ops keeps the runtime trustworthy. This page defines the operational model and the responsibilities shared between ops and development.
Important
Observability and Rollback Are First-Class: Observability and rollback capabilities must be designed in from the start, not added as afterthoughts. Every production service must have health checks, metrics, logging, and tracing. Every deployment must have a rollback plan.
Role of Ops in the ConnectSoft Ecosystem
Ops/SRE as the Glue:
Operations and SRE serve as the glue ensuring Factory, platforms, and SaaS stay healthy in production. Ops keeps the runtime trustworthy while Factory accelerates build-time.
Key Responsibilities:
- Reliability - Ensure services meet SLOs and SLAs
- Observability - Monitor, alert, and debug production systems
- Incident Response - Respond to incidents, minimize impact, learn from failures
- Capacity Planning - Scale systems to meet demand
- Change Management - Deploy changes safely with rollback capabilities
See: Observability-Driven Design for observability principles.
See: Support and SLA Policy for SLA definitions.
Responsibilities Across Factory, Platforms, and SaaS
Shared Responsibility Model:
Devs own code quality and resilience; ops owns runtime health and reliability. This is not a "throw it over the wall" model: devs and ops collaborate throughout the lifecycle.
| Area | Ops/SRE Responsibilities | Dev/Architect Responsibilities |
|---|---|---|
| AI Factory | Monitor runs, infra health, storage, queues | Maintain templates, agents, run safety |
| Identity Platform | Keep auth service available and secure | Design flows, implement features |
| Audit Platform | Ensure logs/events are stored and queryable | Emit meaningful events |
| Config Platform | Keep config store available and backed up | Use config correctly, no hardcoded values |
| Bot Platform | Keep bot service available and responsive | Design conversation flows, implement features |
| SaaS Solutions | Deploy, monitor, and handle incidents | Ship resilient code, fix root causes |
Collaboration Points:
- Design Phase - Ops reviews architecture for operability
- Development - Devs implement health checks, metrics, logging
- Deployment - Ops and devs collaborate on deployment procedures
- Incidents - Devs participate in incident analysis and fixes
- Postmortems - Shared learning and runbook updates
See: Factory Operations for Factory-specific ops.
See: Identity Platform Runbook for Identity Platform ops.
See: Audit Platform Runbook for Audit Platform ops.
Environments and Deployment Targets
Environment Model
Typical Environments:
- Dev - Development environment for local testing
- Test - Automated testing environment
- Staging - Pre-production environment mirroring production
- Production - Live customer-facing environment
Environment Characteristics:
- Factory-generated services should be deployable across all environments via pipelines
- Each environment should mirror production as closely as possible
- Staging should be used for final validation before production deployment
Note
Infrastructure Flexibility: The exact infrastructure (AKS, Azure Container Apps, VMs, etc.) is configurable per project, but the ops model remains consistent. Factory-generated services follow the same operational patterns regardless of deployment target.
Deployment Targets
Supported Targets:
- Azure Kubernetes Service (AKS) - Container orchestration
- Azure Container Apps - Serverless containers
- Azure App Service - Managed web apps
- Azure Virtual Machines - Traditional VMs (legacy or special cases)
Deployment Principles:
- All deployments via CI/CD pipelines (Azure DevOps)
- Infrastructure-as-Code (Bicep/Terraform)
- Blue/green or rolling updates where possible
- Health checks required before traffic routing
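The last two principles (health checks before traffic routing, with a rollback path) can be illustrated with a minimal sketch. It is not the ConnectSoft pipeline: the function name `deploy_with_rollback` and its four callables are hypothetical stand-ins for whatever the actual CI/CD stages do, and the retry count and delay are arbitrary assumptions.

```python
import time
from typing import Callable


def deploy_with_rollback(
    deploy: Callable[[], None],
    is_healthy: Callable[[], bool],
    route_traffic: Callable[[], None],
    rollback: Callable[[], None],
    attempts: int = 3,
    delay_seconds: float = 5.0,
) -> bool:
    """Deploy a new version and gate traffic on its health check.

    All four callables are hypothetical hooks supplied by the caller.
    Returns True if traffic was routed, False if the release was rolled back.
    """
    deploy()
    for attempt in range(attempts):
        if is_healthy():
            # Only a healthy release is allowed to receive traffic.
            route_traffic()
            return True
        if attempt < attempts - 1:
            # Give the new version time to warm up before probing again.
            time.sleep(delay_seconds)
    # The release never became healthy: restore the previous version.
    rollback()
    return False
```

Passing the steps in as callables keeps the gate independent of the deployment target, which matches the idea that the ops model stays the same across AKS, Container Apps, App Service, and VMs.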
See: Deployment and Rollback Runbook for deployment procedures.
Reliability Principles and SLOs (High-Level)
Reliability Principles
Core Principles:
- SLOs Defined - Critical platforms and SaaS services have defined SLOs (availability, latency, error budget)
- Health Checks Required - Everything has health checks (liveness, readiness)
- Basic Metrics - Request rate, error rate, latency metrics for all services
- Logging/Tracing - Structured logging and distributed tracing for debugging
- Error Budgets - Each SLO implies an error budget (the allowed unreliability, 100% minus the target); exhausting the budget triggers a review
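To make the liveness/readiness distinction concrete, here is a small framework-free sketch. The class name `HealthChecks`, the status strings, and the response shape are assumptions made for illustration; they are not taken from the Microservice Template or any ConnectSoft library.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class HealthChecks:
    """Hypothetical holder for a service's health probes."""

    # Named dependency checks (database, queue, ...), each returning True when usable.
    readiness_checks: Dict[str, Callable[[], bool]] = field(default_factory=dict)

    def liveness(self) -> dict:
        # Liveness answers only "is the process running?" and ignores dependencies,
        # so a failing database does not cause the orchestrator to restart the service.
        return {"status": "alive"}

    def readiness(self) -> dict:
        # Readiness answers "can this instance serve traffic right now?".
        results = {}
        for name, check in self.readiness_checks.items():
            try:
                results[name] = bool(check())
            except Exception:
                # A check that raises is treated as a failed dependency.
                results[name] = False
        status = "ready" if all(results.values()) else "not_ready"
        return {"status": status, "checks": results}
```

In practice these two methods would sit behind two separate HTTP endpoints so the platform can probe them independently.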
SLO Concepts:
| Category | Example SLO (Conceptual) | Notes |
|---|---|---|
| Identity | 99.9% auth API availability | Critical for all services |
| Audit Ingest | 99% of events processed in < 5 seconds | Throughput and latency |
| Config Service | No config outage > 5 minutes | Availability and recovery |
| Factory Runs | 95% of runs complete successfully | Success rate |
| SaaS Solutions | 99.5% availability, p95 latency < 500 ms | Per-service SLOs |
Tip
SLO Refinement: SLOs should be refined per product, but this page sets the mental model. Start with conservative SLOs and refine based on actual performance and business needs.
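As a worked example of what an availability target means in practice, the sketch below converts an SLO percentage into an error budget in minutes. The 30-day window and the function names are assumptions; the page itself does not prescribe a measurement window.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


def budget_remaining_fraction(
    slo_percent: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget
```

Under the 30-day assumption, the conceptual 99.9% Identity target allows roughly 43.2 minutes of downtime, and the 99.5% SaaS target allows roughly 216 minutes. A negative remaining fraction is the signal that would trigger the review described above.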
See: Support and SLA Policy for SLA definitions.
See: Observability – Dashboards and Alerts for monitoring and alerting.
Collaboration Between Dev, Ops, and Squads
Shared Responsibility Model
Dev Responsibilities:
- Write resilient code with proper error handling
- Implement health checks, metrics, logging
- Fix root causes of incidents
- Participate in postmortems and runbook updates
Ops Responsibilities:
- Monitor production systems
- Respond to incidents and alerts
- Deploy changes safely
- Maintain runbooks and operational procedures
Squad Responsibilities (Dev+Ops Mindset):
- Squads combine dev and ops expertise
- Collaborate on deployments and postmortems
- Share knowledge through ADRs/BDRs and runbooks
- Own end-to-end reliability
Incident Response Collaboration
Who Is Paged for What:
| Severity | Who Gets Paged | Response Time |
|---|---|---|
| Sev1 | Ops + Dmitry + On-call dev | Immediate (< 15 min) |
| Sev2 | Ops + On-call dev | < 30 minutes |
| Sev3 | Ops (async) | < 4 hours |
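The paging table can also be kept as data, so that alerting tools and runbooks read from a single source. The sketch below is one possible encoding: `PagingRule`, `PAGING_POLICY`, and `rule_for` are hypothetical names, and treating Sev3 as "no immediate page" is an interpretation of the "(async)" note in the table.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PagingRule:
    responders: Tuple[str, ...]   # who is notified
    response_minutes: int         # upper bound on response time
    page_immediately: bool        # False means async handling (assumed for Sev3)


# Values mirror the severity table above.
PAGING_POLICY: Dict[str, PagingRule] = {
    "Sev1": PagingRule(("Ops", "Dmitry", "On-call dev"), 15, True),
    "Sev2": PagingRule(("Ops", "On-call dev"), 30, True),
    "Sev3": PagingRule(("Ops",), 4 * 60, False),
}


def rule_for(severity: str) -> PagingRule:
    """Look up the paging rule, failing loudly on an unknown severity."""
    try:
        return PAGING_POLICY[severity]
    except KeyError:
        raise ValueError(f"Unknown severity: {severity!r}") from None
```

Failing on an unknown severity, rather than defaulting silently, avoids an incident going unpaged because of a typo in an alert definition.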
Dev Participation in Incidents:
- Devs participate in incident analysis and debugging
- Devs provide context on recent changes
- Devs implement fixes and verify resolution
- Devs contribute to postmortems and runbook updates
See: Incident Response Runbook for incident procedures.
Knowledge Sharing
ADRs/BDRs and Runbooks:
- ADRs document architectural decisions affecting operations
- BDRs document business decisions affecting reliability
- Runbooks document operational procedures
- All three must be shared across teams, not siloed
Postmortem Process:
- Every Sev1 incident requires a postmortem
- Postmortems are blameless and focus on learning
- Runbooks are updated based on postmortem findings
- ADRs/BDRs may be created or updated based on learnings
See: How to Write a Good ADR for ADR guidance.
See: How to Write a Good BDR for BDR guidance.
Related Documents
- Deployment and Rollback Runbook - Deployment procedures
- Incident Response Runbook - Incident response process
- Observability – Dashboards and Alerts - Monitoring and alerting
- Factory Operations - Factory-specific operations
- Identity Platform Runbook - Identity Platform operations
- Audit Platform Runbook - Audit Platform operations
- Observability-Driven Design - Observability principles
- Support and SLA Policy - SLA definitions
- Microservice Template - Template with health checks and metrics