DevOps / SRE Quickstart

This quickstart is for DevOps engineers, SREs, and platform engineers responsible for deployment, monitoring, and incident response. It orients you to ConnectSoft's CI/CD pipelines, infrastructure, monitoring, and operations.

As a DevOps engineer or SRE at ConnectSoft, you'll manage CI/CD pipelines, deploy services, monitor systems, respond to incidents, and keep services reliable and performant.

Your Goals as DevOps / SRE

Manage CI/CD Pipelines - Configure and maintain build and deployment pipelines
Deploy Services - Deploy services to Azure environments
Monitor Systems - Set up monitoring, dashboards, and alerts
Respond to Incidents - Handle incidents and maintain runbooks
Ensure Reliability - Ensure services meet SLAs and performance targets

CI/CD and Environments Overview

Pipeline Structure

Generated pipelines follow a standard structure:

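stages:
  # Build stage: install the .NET SDK, compile, and run tests
  - stage: Build
    jobs:
      - job: BuildAndTest
        steps:
          - task: UseDotNet@2
          - script: dotnet build
          - script: dotnet test

  # Deploy stage: runs after Build succeeds (stages run in order by default)
  - stage: Deploy
    jobs:
      - job: DeployToDev
        steps:
          - script: az deployment group create ...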

See: CI/CD Guidelines for detailed guidance.

Environments

Standard environments:

Development - For development and testing
Staging - For pre-production testing
Production - For live customer traffic

Deployment Patterns

Blue/Green - Zero-downtime deployments
Canary - Gradual rollout to a subset of traffic
Rolling - Standard rolling update

See: CI/CD Guidelines for deployment patterns.
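To make the canary pattern above concrete, here is a minimal sketch of one common way to pick "a subset of traffic": hash a stable request key into 100 buckets and send the lowest buckets to the canary. The function name `routes_to_canary` and the idea of keying on a user or session ID are assumptions for illustration; ConnectSoft's actual traffic splitting may be handled by the platform (for example at the load balancer) rather than in code.

```python
import hashlib


def routes_to_canary(request_key: str, canary_percent: int) -> bool:
    """Return True if this request should be served by the canary version.

    request_key is a hypothetical stable identifier (user ID, session ID).
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # Hash the key so the same caller always lands in the same bucket.
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return bucket < canary_percent
```

Because the bucket depends only on the key, raising `canary_percent` from 10 to 50 keeps every caller who was already on the canary there, which makes a gradual rollout easier to reason about.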

Monitoring, Dashboards, and Alerts

What We Monitor

Technical Metrics - CPU, memory, request rates, latency, error rates
Business Metrics - User actions, transactions, feature usage
Platform Metrics - Factory runs, agent performance, knowledge system

See: Monitoring & Dashboards for detailed guidance.
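As an illustration of two of the technical metrics listed above, the sketch below derives an error rate and a latency percentile from raw request samples. The `RequestSample` type, the choice to count only 5xx responses as errors, and the nearest-rank percentile method are all assumptions; a real monitoring backend will usually compute these for you.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RequestSample:
    """One observed request (hypothetical shape, for illustration only)."""
    latency_ms: float
    status_code: int


def error_rate(samples: Sequence[RequestSample]) -> float:
    """Fraction of requests that returned a 5xx status (assumed error definition)."""
    if not samples:
        return 0.0
    errors = sum(1 for s in samples if s.status_code >= 500)
    return errors / len(samples)


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of values at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

A dashboard panel for "p95 latency" would correspond to `percentile(latencies, 95)` over the samples in the chosen time window.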

Dashboards

Service Dashboards - Per-service metrics and health
Platform Dashboards - Factory and platform metrics
Business Dashboards - Business KPIs and metrics

Alerting Principles

Avoid Alert Fatigue - Only alert on actionable issues
Use SLOs - Set Service Level Objectives
Clear Runbooks - Each alert should have a runbook

See: Monitoring & Dashboards for alerting guidance.
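One way to connect "Use SLOs" with "Only alert on actionable issues" is to alert on error-budget burn rate rather than on raw error counts. The sketch below shows the arithmetic under stated assumptions: the SLO is expressed as a success-ratio target such as 0.999, and the default paging threshold of 14.4 is a commonly cited fast-burn value for a one-hour window against a 30-day SLO. All function names and defaults here are hypothetical, not ConnectSoft's configured values.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates over total_requests."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    return (1.0 - slo_target) * total_requests


def burn_rate(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    1.0 means the budget is being spent exactly as fast as the SLO permits.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total_requests == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target
    return (failed_requests / total_requests) / allowed_error_rate


def should_page(slo_target: float, total_requests: int, failed_requests: int,
                threshold: float = 14.4) -> bool:
    """Page only when the budget is burning faster than the (assumed) threshold."""
    return burn_rate(slo_target, total_requests, failed_requests) > threshold
```

For example, with a 99.9% target, 20 failures in 1,000 requests is a burn rate of roughly 20 and would page, while 1 failure in 1,000 is a burn rate of roughly 1 and would not.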

Incident & Runbook References

Incident Management Process

Detection - Alerts or user reports
Triage - Assess severity and impact
Mitigation - Take immediate action to restore service
Resolution - Fix root cause
Postmortem - Document learnings

See: Incident Management for the detailed process.
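The five phases above can be modeled as a simple ordered state machine, which is handy if you track incidents in tooling. This is a sketch only: the `Phase` and `Incident` names are hypothetical, and the strictly linear progression is an assumption (real incidents sometimes loop back from mitigation to triage).

```python
from enum import Enum


class Phase(Enum):
    """Incident phases in the order listed in this quickstart."""
    DETECTION = 1
    TRIAGE = 2
    MITIGATION = 3
    RESOLUTION = 4
    POSTMORTEM = 5


class Incident:
    """Tracks which phase an incident is in (hypothetical helper)."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.phase = Phase.DETECTION
        self.history = [Phase.DETECTION]

    def advance(self) -> Phase:
        """Move to the next phase; raise if the incident is already closed out."""
        if self.phase is Phase.POSTMORTEM:
            raise ValueError("incident is already in its final phase")
        self.phase = Phase(self.phase.value + 1)
        self.history.append(self.phase)
        return self.phase
```

Keeping the `history` list makes it easy to attach timestamps later and report time-to-mitigate or time-to-resolve in the postmortem.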

Runbooks

Identity Platform Runbook - Identity Platform operations
Audit Platform Runbook - Audit Platform operations
Factory Operations - Factory operations
Monitoring & Dashboards - Monitoring setup

Common Incidents

Service Down:
1. Check health endpoints
2. Review logs and metrics
3. Check recent deployments
4. Review runbook for service
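The first step of the Service Down checklist, checking health endpoints, can be scripted. The sketch below uses only the Python standard library; the function names are hypothetical, and it assumes each service exposes an HTTP health URL that returns a 2xx status when healthy. The `fetch` parameter exists so the check can be exercised without network access.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 503.
        return err.code
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return 0


def check_health(urls: Iterable[str],
                 fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each health URL to True when it returns a 2xx status."""
    return {url: 200 <= fetch(url) < 300 for url in urls}
```

Calling `check_health` with the health URLs of the affected service and its dependencies gives a quick picture of what is actually down before you move on to logs and metrics.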

Performance Degradation:
1. Check metrics for bottlenecks
2. Review recent changes
3. Check resource utilization
4. Scale if needed
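For the "check resource utilization, scale if needed" steps, a proportional rule is a reasonable starting point: scale the instance count by the ratio of observed to target utilization. The sketch below has the same shape as the formula documented for the Kubernetes Horizontal Pod Autoscaler; whether ConnectSoft services scale this way is an assumption, and all parameter names and defaults are hypothetical.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Suggest an instance count that would bring utilization back to target.

    Utilization values share any consistent unit, e.g. average CPU percent.
    """
    if current_replicas < 1:
        raise ValueError("current_replicas must be at least 1")
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")
    # Proportional rule: more load per instance than targeted means more instances.
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    # Clamp to the allowed range so a metric spike cannot scale without bound.
    return max(min_replicas, min(max_replicas, raw))
```

For example, four instances averaging 90% CPU against a 60% target suggests six instances; the same four instances at 30% suggests two.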

Factory Run Failed:
1. Check Factory logs
2. Review agent execution logs
3. Check Azure DevOps integration
4. Review runbook for Factory

Common Tasks

Task: Set Up New Service Pipeline

Review Generated Pipeline - Check generated azure-pipelines.yml
Configure Variables - Set environment-specific variables
Set Up Environments - Configure dev, staging, production
Test Pipeline - Run pipeline and verify deployment
Set Up Monitoring - Configure dashboards and alerts
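The "Configure Variables" step above usually comes down to a shared baseline plus per-environment overrides. The sketch below shows that layering in plain Python. Every key and value here is hypothetical and chosen only for illustration; in practice these settings would typically live in pipeline variables or variable groups rather than in code.

```python
from typing import Any, Dict

# Baseline applied to every environment (hypothetical values).
BASE_VARIABLES: Dict[str, Any] = {
    "log_level": "Information",
    "replicas": 1,
    "enable_debug_endpoints": False,
}

# Only the settings that differ per environment (hypothetical values).
ENVIRONMENT_OVERRIDES: Dict[str, Dict[str, Any]] = {
    "dev": {"log_level": "Debug", "enable_debug_endpoints": True},
    "staging": {"replicas": 2},
    "production": {"replicas": 3, "log_level": "Warning"},
}


def variables_for(environment: str) -> Dict[str, Any]:
    """Return the baseline merged with the overrides for one environment."""
    if environment not in ENVIRONMENT_OVERRIDES:
        raise KeyError(f"unknown environment: {environment!r}")
    # Later entries win, so overrides replace baseline values.
    return {**BASE_VARIABLES, **ENVIRONMENT_OVERRIDES[environment]}
```

Keeping overrides small makes it obvious at review time how staging differs from production, which helps when a deployment behaves differently between the two.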