
Operations and SRE Overview

This document explains how operations and SRE fit with the Factory, platforms, and SaaS solutions, defining responsibilities, environments, reliability principles, and collaboration patterns. It is written for ops engineers, SREs, architects, developers, and anyone responsible for running ConnectSoft systems in production.

Operations ensures Factory, platforms, and SaaS stay healthy in production. Ops keeps the runtime trustworthy while Factory accelerates build-time. This page defines the operational model and shared responsibilities.

Important

Observability and Rollback First: Observability and rollback are first-class requirements, not afterthoughts. Every production service must have health checks, metrics, logging, and tracing. Every deployment must have a rollback plan.

Role of Ops in the ConnectSoft Ecosystem

Ops/SRE as the Glue:

Operations and SRE connect the Factory, platforms, and SaaS solutions at runtime. The Factory accelerates build-time; ops keeps what it produces trustworthy once it is running in production.

Key Responsibilities:

  • Reliability - Ensure services meet SLOs and SLAs
  • Observability - Monitor, alert, and debug production systems
  • Incident Response - Respond to incidents, minimize impact, learn from failures
  • Capacity Planning - Scale systems to meet demand
  • Change Management - Deploy changes safely with rollback capabilities

See: Observability-Driven Design for observability principles.

See: Support and SLA Policy for SLA definitions.

Responsibilities Across Factory, Platforms, and SaaS

Shared Responsibility Model:

Devs own code quality and resilience; ops owns runtime health and reliability. This is not a "throw it over the wall" model: devs and ops collaborate throughout the lifecycle.

| Area | Ops/SRE Responsibilities | Dev/Architect Responsibilities |
|---|---|---|
| AI Factory | Monitor runs, infra health, storage, queues | Maintain templates, agents, run safety |
| Identity Platform | Keep auth service available and secure | Design flows, implement features |
| Audit Platform | Ensure logs/events are stored & queryable | Emit meaningful events |
| Config Platform | Keep config store available and backed up | Use config correctly, no hardcoded values |
| Bot Platform | Keep bot service available and responsive | Design conversation flows, implement features |
| SaaS Solutions | Deploy, monitor, and handle incidents | Ship resilient code, fix root causes |

Collaboration Points:

  • Design Phase - Ops reviews architecture for operability
  • Development - Devs implement health checks, metrics, logging
  • Deployment - Ops and devs collaborate on deployment procedures
  • Incidents - Devs participate in incident analysis and fixes
  • Postmortems - Shared learning and runbook updates

See: Factory Operations for Factory-specific ops.

See: Identity Platform Runbook for Identity Platform ops.

See: Audit Platform Runbook for Audit Platform ops.

Environments and Deployment Targets

Environment Model

Typical Environments:

  • Dev - Development environment for local testing
  • Test - Automated testing environment
  • Staging - Pre-production environment mirroring production
  • Production - Live customer-facing environment

Environment Characteristics:

  • Factory-generated services should be deployable across all environments via pipelines
  • Each environment should mirror production as closely as possible
  • Staging should be used for final validation before production deployment
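The environment model above can be sketched as a simple promotion gate. This is only an illustration: the source lists the four environments but does not prescribe a strict promotion sequence, so the ordering in `PROMOTION_ORDER` and the helper names `next_environment` and `can_deploy_to` are assumptions.

```python
from enum import Enum
from typing import Optional, Set


class Environment(Enum):
    DEV = "dev"
    TEST = "test"
    STAGING = "staging"
    PRODUCTION = "production"


# Assumed promotion order (hypothetical): the page names these environments
# but does not state that builds must pass through them strictly in sequence.
PROMOTION_ORDER = [
    Environment.DEV,
    Environment.TEST,
    Environment.STAGING,
    Environment.PRODUCTION,
]


def next_environment(current: Environment) -> Optional[Environment]:
    """Return the environment a build is promoted to next, or None at the end."""
    index = PROMOTION_ORDER.index(current)
    if index + 1 < len(PROMOTION_ORDER):
        return PROMOTION_ORDER[index + 1]
    return None


def can_deploy_to(target: Environment, validated_in: Set[Environment]) -> bool:
    """Allow a deployment only if every earlier environment validated the build."""
    earlier = PROMOTION_ORDER[: PROMOTION_ORDER.index(target)]
    return all(env in validated_in for env in earlier)
```

With this gate, a build validated only in Dev and Test would be refused for Production until Staging has validated it, which matches the role of Staging as the final validation step.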

Note

Infrastructure Flexibility: The exact infrastructure (AKS, Azure Container Apps, VMs, etc.) is configurable per project, but the ops model remains consistent. Factory-generated services follow the same operational patterns regardless of deployment target.

Deployment Targets

Supported Targets:

  • Azure Kubernetes Service (AKS) - Container orchestration
  • Azure Container Apps - Serverless containers
  • Azure App Service - Managed web apps
  • Azure Virtual Machines - Traditional VMs (legacy or special cases)

Deployment Principles:

  • All deployments via CI/CD pipelines (Azure DevOps)
  • Infrastructure-as-Code (Bicep/Terraform)
  • Blue/green or rolling updates where possible
  • Health checks required before traffic routing
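The last principle, gating traffic on health checks with rollback as the fallback, might look roughly like the sketch below. The four callables are hypothetical placeholders for whatever the pipeline actually does on a given target (AKS, Container Apps, and so on); the retry count and delay are arbitrary defaults, not ConnectSoft policy.

```python
import time
from typing import Callable


def deploy_with_gate(
    deploy_new_version: Callable[[], None],
    is_ready: Callable[[], bool],
    route_traffic: Callable[[], None],
    rollback: Callable[[], None],
    attempts: int = 5,
    delay_seconds: float = 2.0,
) -> bool:
    """Deploy, wait for readiness, then route traffic; roll back otherwise.

    Returns True if traffic was routed to the new version and False if the
    deployment was rolled back.
    """
    deploy_new_version()
    for attempt in range(attempts):
        if is_ready():
            # Traffic is only switched after a successful readiness check.
            route_traffic()
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    # The new version never became ready: keep traffic on the old version.
    rollback()
    return False
```

Passing the steps in as callables keeps the gate independent of the deployment target, which is in line with the idea that the ops model stays the same while the infrastructure varies.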

See: Deployment and Rollback Runbook for deployment procedures.

Reliability Principles and SLOs (High-Level)

Reliability Principles

Core Principles:

  • SLOs Defined - Critical platforms and SaaS services have defined SLOs (availability, latency, error budget)
  • Health Checks Required - Everything has health checks (liveness, readiness)
  • Basic Metrics - Request rate, error rate, latency metrics for all services
  • Logging/Tracing - Structured logging and distributed tracing for debugging
  • Error Budgets - SLOs define error budgets; breaches trigger reviews
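To make the liveness/readiness distinction concrete, here is a minimal sketch. It assumes readiness is defined as "all required dependencies respond"; the `HealthReport` type and the probe names are hypothetical and not taken from any Factory template.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class HealthReport:
    healthy: bool
    failed_checks: List[str]


def liveness() -> HealthReport:
    """Liveness: the process is running and able to answer at all."""
    return HealthReport(healthy=True, failed_checks=[])


def readiness(dependency_checks: Dict[str, Callable[[], bool]]) -> HealthReport:
    """Readiness: every dependency needed to serve traffic responds."""
    failed = []
    for name, check in dependency_checks.items():
        try:
            ok = check()
        except Exception:
            # A probe that raises is treated as a failed dependency.
            ok = False
        if not ok:
            failed.append(name)
    return HealthReport(healthy=not failed, failed_checks=failed)
```

A service can be live but not ready, for example while its database is unreachable. In that state it should stay running but receive no traffic.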

SLO Concepts:

| Category | Example SLO (Conceptual) | Notes |
|---|---|---|
| Identity | 99.9% auth API availability | Critical for all services |
| Audit Ingest | 99% of events processed in < 5 seconds | Throughput and latency |
| Config Service | No config outage > 5 minutes | Availability and recovery |
| Factory Runs | 95% of runs complete successfully | Success rate |
| SaaS Solutions | 99.5% availability, p95 latency < 500ms | Per-service SLOs |
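The availability and latency targets above translate into numbers that can be checked. The sketch below assumes a 30-day measurement window and a nearest-rank percentile; the source specifies neither, so both are illustrative choices.

```python
from typing import List


def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(
    slo_percent: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Fraction of the error budget still unspent (negative means a breach)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget


def p95(latencies_ms: List[float]) -> float:
    """95th percentile latency using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

Under these assumptions, a 99.9% availability SLO allows roughly 43.2 minutes of downtime per 30 days, and a 99.5% SLO allows 216 minutes. A service at 99.5% that has already been down for 54 minutes has three quarters of its budget left.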

Tip

SLO Refinement: SLOs should be refined per product, but this page sets the mental model. Start with conservative SLOs and refine based on actual performance and business needs.

See: Support and SLA Policy for SLA definitions.

See: Observability – Dashboards and Alerts for monitoring and alerting.

Collaboration Between Dev, Ops, and Squads

Shared Responsibility Model

Dev Responsibilities:

  • Write resilient code with proper error handling
  • Implement health checks, metrics, logging
  • Fix root causes of incidents
  • Participate in postmortems and runbook updates

Ops Responsibilities:

  • Monitor production systems
  • Respond to incidents and alerts
  • Deploy changes safely
  • Maintain runbooks and operational procedures

Squad Responsibilities (Dev+Ops Mindset):

  • Squads combine dev and ops expertise
  • Collaborate on deployments and postmortems
  • Share knowledge through ADRs/BDRs and runbooks
  • Own end-to-end reliability

Incident Response Collaboration

Who Is Paged for What:

| Severity | Who Gets Paged | Response Time |
|---|---|---|
| Sev1 | Ops + Dmitry + On-call dev | Immediate (< 15 min) |
| Sev2 | Ops + On-call dev | < 30 minutes |
| Sev3 | Ops (async) | < 4 hours |
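The paging table can be encoded as data so that tooling and humans read the same policy. This is a sketch only: `PagingRule`, `PAGING_POLICY`, and `response_deadline` are hypothetical names, and a real setup would more likely live in the paging tool's own configuration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Tuple


@dataclass(frozen=True)
class PagingRule:
    paged: Tuple[str, ...]
    respond_within: timedelta


# Values copied from the severity table; Sev3 is handled asynchronously by ops.
PAGING_POLICY = {
    "Sev1": PagingRule(("Ops", "Dmitry", "On-call dev"), timedelta(minutes=15)),
    "Sev2": PagingRule(("Ops", "On-call dev"), timedelta(minutes=30)),
    "Sev3": PagingRule(("Ops",), timedelta(hours=4)),
}


def response_deadline(severity: str, opened_at: datetime) -> datetime:
    """Latest time by which the first response to an incident is due."""
    return opened_at + PAGING_POLICY[severity].respond_within
```

For example, a Sev1 incident opened at 12:00 would need a first response by 12:15, while a Sev3 opened at the same time would be due by 16:00.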

Dev Participation in Incidents:

  • Devs participate in incident analysis and debugging
  • Devs provide context on recent changes
  • Devs implement fixes and verify resolution
  • Devs contribute to postmortems and runbook updates

See: Incident Response Runbook for incident procedures.

Knowledge Sharing

ADRs/BDRs and Runbooks:

  • ADRs document architectural decisions affecting operations
  • BDRs document business decisions affecting reliability
  • Runbooks document operational procedures
  • All must be shared knowledge, not siloed

Postmortem Process:

  • Every Sev1 incident requires a postmortem
  • Postmortems are blameless and focus on learning
  • Runbooks are updated based on postmortem findings
  • ADRs/BDRs may be created or updated based on learnings

See: How to Write a Good ADR for ADR guidance.

See: How to Write a Good BDR for BDR guidance.