Operations and SRE Overview

This document explains how operations and SRE fit with the Factory, platforms, and SaaS solutions, defining responsibilities, environments, reliability principles, and collaboration patterns. It is written for ops engineers, SREs, architects, developers, and anyone responsible for running ConnectSoft systems in production.

Operations ensures Factory, platforms, and SaaS stay healthy in production. Ops keeps the runtime trustworthy while Factory accelerates build-time. This page defines the operational model and shared responsibilities.

Important

Observability and Rollback Are First-Class: Observability and rollback capabilities are first-class requirements, not afterthoughts. Every production service must have health checks, metrics, logging, and tracing. Every deployment must have a rollback plan.

Role of Ops in the ConnectSoft Ecosystem

Ops/SRE as the Glue:

Operations and SRE are the glue between build time and run time: the Factory accelerates how services are built, and ops keeps the Factory, platforms, and SaaS solutions healthy and trustworthy once they run in production.

Key Responsibilities:

Reliability - Ensure services meet SLOs and SLAs
Observability - Monitor, alert, and debug production systems
Incident Response - Respond to incidents, minimize impact, learn from failures
Capacity Planning - Scale systems to meet demand
Change Management - Deploy changes safely with rollback capabilities

See: Observability-Driven Design for observability principles.

See: Support and SLA Policy for SLA definitions.

Responsibilities Across Factory, Platforms, and SaaS

Shared Responsibility Model:

Devs own code quality and resilience; ops owns runtime health and reliability. This is not a "throw it over the wall" handoff: devs and ops collaborate throughout the lifecycle.

Area	Ops/SRE Responsibilities	Dev/Architect Responsibilities
AI Factory	Monitor runs, infra health, storage, queues	Maintain templates, agents, run safety
Identity Platform	Keep auth service available and secure	Design flows, implement features
Audit Platform	Ensure logs/events are stored & queryable	Emit meaningful events
Config Platform	Keep config store available and backed up	Use config correctly, no hardcoded values
Bot Platform	Keep bot service available and responsive	Design conversation flows, implement features
SaaS Solutions	Deploy, monitor, and handle incidents	Ship resilient code, fix root causes

Collaboration Points:

Design Phase - Ops reviews architecture for operability
Development - Devs implement health checks, metrics, logging
Deployment - Ops and devs collaborate on deployment procedures
Incidents - Devs participate in incident analysis and fixes
Postmortems - Shared learning and runbook updates

See: Factory Operations for Factory-specific ops.

See: Identity Platform Runbook for Identity Platform ops.

See: Audit Platform Runbook for Audit Platform ops.

Environments and Deployment Targets

Environment Model

Typical Environments:

Dev - Development environment for local testing
Test - Automated testing environment
Staging - Pre-production environment mirroring production
Production - Live customer-facing environment

Environment Characteristics:

Factory-generated services should be deployable across all environments via pipelines
Each environment should mirror production as closely as possible
Staging should be used for final validation before production deployment
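
The promotion path implied by the environment list can be sketched as a simple ordering check. This is a minimal Python sketch, not a ConnectSoft pipeline definition: the lowercase environment names, `ENVIRONMENTS`, and `can_promote` are hypothetical, and the rule that every earlier environment must validate a build first is an assumption that goes beyond what this page states.

```python
from typing import Iterable

# Assumed promotion order, taken from the environment list above.
ENVIRONMENTS = ("dev", "test", "staging", "production")


def can_promote(target: str, validated_in: Iterable[str]) -> bool:
    """Return True if every environment before `target` has validated the build."""
    if target not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {target!r}")
    # Every environment that precedes the target in the promotion order.
    required = ENVIRONMENTS[: ENVIRONMENTS.index(target)]
    validated = set(validated_in)
    return all(env in validated for env in required)
```

Under these assumptions, `can_promote("production", ["dev", "test"])` is false because staging has not validated the build yet.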

Note

Infrastructure Flexibility: The exact infrastructure (AKS, Azure Container Apps, VMs, etc.) is configurable per project, but the ops model remains consistent. Factory-generated services follow the same operational patterns regardless of deployment target.

Deployment Targets

Supported Targets:

Azure Kubernetes Service (AKS) - Container orchestration
Azure Container Apps - Serverless containers
Azure App Service - Managed web apps
Azure Virtual Machines - Traditional VMs (legacy or special cases)

Deployment Principles:

All deployments via CI/CD pipelines (Azure DevOps)
Infrastructure-as-Code (Bicep/Terraform)
Blue/green or rolling updates where possible
Health checks required before traffic routing
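
The last two principles (safe rollout, health checks before traffic) can be illustrated with a small gate function. This is a sketch under stated assumptions rather than the actual pipeline logic: `deploy_with_health_gate` and its four callbacks are hypothetical names, and a real pipeline would typically wait between checks and delegate these steps to the deployment target.

```python
from typing import Callable


def deploy_with_health_gate(
    deploy_new: Callable[[], None],
    is_healthy: Callable[[], bool],
    route_traffic: Callable[[], None],
    rollback: Callable[[], None],
    checks_required: int = 3,
) -> bool:
    """Deploy, then route traffic only after consecutive passing health checks.

    Returns True if traffic was switched, False if the release was rolled back.
    """
    deploy_new()
    for _ in range(checks_required):
        if not is_healthy():
            # A single failed check aborts the rollout before traffic is routed.
            rollback()
            return False
    route_traffic()
    return True
```

Passing the steps in as callbacks keeps the gate independent of the deployment target, which matches the idea that the ops model stays the same across AKS, Container Apps, App Service, and VMs.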

See: Deployment and Rollback Runbook for deployment procedures.

Reliability Principles and SLOs (High-Level)

Reliability Principles

Core Principles:

SLOs Defined - Critical platforms and SaaS services have defined SLOs (availability, latency, error budget)
Health Checks Required - Everything has health checks (liveness, readiness)
Basic Metrics - Request rate, error rate, latency metrics for all services
Logging/Tracing - Structured logging and distributed tracing for debugging
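Error Budgets - Each SLO implies an error budget, the amount of unreliability it tolerates (for example, 0.1% of requests for a 99.9% target); breaching the budget triggers a review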
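
The difference between liveness and readiness can be sketched as follows. This is an illustrative Python sketch, not the health-check implementation in the Microservice Template: the `HealthChecks` class, its method names, and the response shapes are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class HealthChecks:
    """Hypothetical registry that separates liveness from readiness."""

    # Maps a dependency name to a probe that returns True when it is reachable.
    dependencies: Dict[str, Callable[[], bool]] = field(default_factory=dict)

    def liveness(self) -> dict:
        # Liveness: the process is running and able to answer at all.
        return {"status": "alive"}

    def readiness(self) -> dict:
        # Readiness: every dependency probe must pass before taking traffic.
        results = {name: bool(probe()) for name, probe in self.dependencies.items()}
        status = "ready" if all(results.values()) else "not_ready"
        return {"status": status, "dependencies": results}
```

A service that is alive but not ready should stay running and simply receive no traffic until its dependencies recover, which is why the two checks are kept separate.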

SLO Concepts:

Category	Example SLO (Conceptual)	Notes
Identity	99.9% auth API availability	Critical for all services
Audit Ingest	99% of events processed in < 5 seconds	Throughput and latency
Config Service	No config outage > 5 minutes	Availability and recovery
Factory Runs	95% of runs complete successfully	Success rate
SaaS Solutions	99.5% availability, p95 latency < 500ms	Per-service SLOs
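
To make the conceptual targets in the table concrete, the arithmetic behind an availability SLO and a p95 latency check can be sketched as below. The function names are hypothetical, the 30-day window is an assumption (this page does not define a measurement window), and the percentile uses the nearest-rank method, which is one of several common definitions.

```python
from typing import Sequence


def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Downtime an availability SLO tolerates over the window, in minutes."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


def error_budget_remaining(slo_percent: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once it is exceeded)."""
    allowed_failures = total_requests * (1 - slo_percent / 100)
    if allowed_failures == 0:
        return 0.0
    return 1 - failed_requests / allowed_failures


def p95(latencies_ms: Sequence[float]) -> float:
    """95th percentile latency using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

With these assumptions, a 99.9% target tolerates about 43.2 minutes of downtime in 30 days, and a 99.5% target about 216 minutes.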

Tip

SLO Refinement: SLOs should be refined per product, but this page sets the mental model. Start with conservative SLOs and refine based on actual performance and business needs.

See: Support and SLA Policy for SLA definitions.

See: Observability – Dashboards and Alerts for monitoring and alerting.

Collaboration Between Dev, Ops, and Squads

Shared Responsibility Model

Dev Responsibilities:

Write resilient code with proper error handling
Implement health checks, metrics, logging
Fix root causes of incidents
Participate in postmortems and runbook updates
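
As one illustration of the logging responsibility, structured logs can be emitted with the Python standard library alone. This is a sketch under assumptions, not the logging setup that Factory-generated services use: `JsonFormatter`, `build_logger`, and the `trace_id` field name are hypothetical.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Present only when the caller passes extra={"trace_id": ...}.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger("orders").info("order placed", extra={"trace_id": "abc123"})` then produces one JSON line that can be correlated with a distributed trace.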

Ops Responsibilities:

Monitor production systems
Respond to incidents and alerts
Deploy changes safely
Maintain runbooks and operational procedures

Squad Responsibilities (Dev+Ops Mindset):

Squads combine dev and ops expertise
Collaborate on deployments and postmortems
Share knowledge through ADRs/BDRs and runbooks
Own end-to-end reliability

Incident Response Collaboration

Who Is Paged for What:

Severity	Who Gets Paged	Response Time
Sev1	Ops + Dmitry + On-call dev	Immediate (< 15 min)
Sev2	Ops + On-call dev	< 30 minutes
Sev3	Ops (async)	< 4 hours
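
The paging table above can also be expressed as data, so that alert routing and response-time checks read from a single place. This is a minimal sketch: `PagingRule`, `PAGING_POLICY`, and the helper functions are hypothetical names, and only the responders and time limits come from the table.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Tuple


@dataclass(frozen=True)
class PagingRule:
    responders: Tuple[str, ...]
    respond_within: timedelta
    page_immediately: bool  # False means an async notification rather than a page


# Values copied from the severity table; the structure itself is an assumption.
PAGING_POLICY = {
    "Sev1": PagingRule(("Ops", "Dmitry", "On-call dev"), timedelta(minutes=15), True),
    "Sev2": PagingRule(("Ops", "On-call dev"), timedelta(minutes=30), True),
    "Sev3": PagingRule(("Ops",), timedelta(hours=4), False),
}


def rule_for(severity: str) -> PagingRule:
    """Look up the paging rule for a severity, rejecting unknown levels."""
    try:
        return PAGING_POLICY[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None


def is_response_late(severity: str, elapsed: timedelta) -> bool:
    """The table uses strict limits (for example "< 30 minutes"), so reaching the limit counts as late."""
    return elapsed >= rule_for(severity).respond_within
```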

Dev Participation in Incidents:

Devs participate in incident analysis and debugging
Devs provide context on recent changes
Devs implement fixes and verify resolution
Devs contribute to postmortems and runbook updates

See: Incident Response Runbook for incident procedures.

ADRs/BDRs and Runbooks:

ADRs document architectural decisions affecting operations
BDRs document business decisions affecting reliability
Runbooks document operational procedures
All must be shared knowledge, not siloed

Postmortem Process:

Every Sev1 incident requires a postmortem
Postmortems are blameless and focus on learning
Runbooks are updated based on postmortem findings
ADRs/BDRs may be created or updated based on learnings

See: How to Write a Good ADR for ADR guidance.

See: How to Write a Good BDR for BDR guidance.

Related Documents

Deployment and Rollback Runbook - Deployment procedures
Incident Response Runbook - Incident response process
Observability – Dashboards and Alerts - Monitoring and alerting
Factory Operations - Factory-specific operations
Identity Platform Runbook - Identity Platform operations
Audit Platform Runbook - Audit Platform operations
Observability-Driven Design - Observability principles
Support and SLA Policy - SLA definitions
Microservice Template - Template with health checks and metrics