DevOps / SRE Quickstart

This quickstart orients DevOps engineers, SREs, and platform engineers to ConnectSoft's CI/CD pipelines, infrastructure, monitoring, and operations. It is aimed at anyone responsible for deployment, monitoring, and incident response.

As a DevOps engineer or SRE at ConnectSoft, you will manage CI/CD pipelines, deploy services, monitor systems, respond to incidents, and keep services reliable and performant.

Your Goals as DevOps / SRE

  • Manage CI/CD Pipelines - Configure and maintain build and deployment pipelines
  • Deploy Services - Deploy services to Azure environments
  • Monitor Systems - Set up monitoring, dashboards, and alerts
  • Respond to Incidents - Handle incidents and maintain runbooks
  • Ensure Reliability - Keep services within their SLAs and performance targets

Top Docs to Read

  1. CI/CD Guidelines - Pipeline structure and best practices
  2. Observability-Driven Design - Monitoring and observability patterns
  3. Operations Overview - Entry point to the operations documentation
  4. Monitoring & Dashboards - Monitoring setup and dashboards
  5. Incident Management - Incident response process
  6. Factory Operations - Operating the Factory itself

CI/CD and Environments Overview

Pipeline Structure

Generated pipelines follow a standard structure:

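stages:
  # Build stage: compile the service and run its tests
  - stage: Build
    jobs:
      - job: BuildAndTest
        steps:
          - task: UseDotNet@2      # make the .NET SDK available on the agent
          - script: dotnet build
          - script: dotnet test

  # Deploy stage: stages run sequentially by default, so this follows Build
  - stage: Deploy
    jobs:
      - job: DeployToDev
        steps:
          - script: az deployment group create ...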

See: CI/CD Guidelines for detailed guidance.

Environments

Standard environments:

  • Development - For development and testing
  • Staging - For pre-production testing
  • Production - For live customer traffic

Deployment Patterns

  • Blue/Green - Switch traffic between two identical environments for zero-downtime deployments
  • Canary - Gradual rollout to a subset of traffic
  • Rolling - Replace instances in batches until all run the new version

See: CI/CD Guidelines for deployment patterns.
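
To make the canary pattern concrete, here is a minimal sketch of how a request could be assigned to the canary slice. It is not ConnectSoft's routing implementation: the function name, the hashing scheme, and the `ROLLOUT_STEPS` schedule are all illustrative assumptions.

```python
import hashlib

# Hypothetical rollout schedule: widen the canary slice step by step.
ROLLOUT_STEPS = [5, 25, 50, 100]


def routes_to_canary(request_key: str, canary_percent: int) -> bool:
    """Return True if this request key falls inside the canary slice.

    The key (for example a user or tenant id) is hashed so the same key
    always lands in the same bucket, which keeps a user on one version.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket 0..99
    return bucket < canary_percent
```

Because the bucket is fixed per key, raising the percentage only ever adds keys to the canary; a key already on the new version is not moved back.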

Monitoring, Dashboards, and Alerts

What We Monitor

  • Technical Metrics - CPU, memory, request rates, latency, error rates
  • Business Metrics - User actions, transactions, feature usage
  • Platform Metrics - Factory runs, agent performance, knowledge system

See: Monitoring & Dashboards for detailed guidance.
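
As an illustration of two of the technical metrics listed above, the sketch below derives an error rate and a latency percentile from raw request samples. The `RequestSample` type and the choice to count only 5xx responses as errors are assumptions for this example, not the platform's actual metric definitions.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class RequestSample:
    latency_ms: float
    status_code: int


def error_rate(samples: List[RequestSample]) -> float:
    """Fraction of requests that returned a 5xx status (assumed definition)."""
    if not samples:
        return 0.0
    errors = sum(1 for s in samples if s.status_code >= 500)
    return errors / len(samples)


def latency_percentile(samples: List[RequestSample], percentile: float) -> float:
    """Nearest-rank percentile of latency, e.g. percentile=95 for p95."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(s.latency_ms for s in samples)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]
```

In practice the monitoring backend computes these values; the sketch only shows what the numbers on a service dashboard mean.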

Dashboards

  • Service Dashboards - Per-service metrics and health
  • Platform Dashboards - Factory and platform metrics
  • Business Dashboards - Business KPIs and metrics

Alerting Principles

  • Avoid Alert Fatigue - Only alert on actionable issues
  • Use SLOs - Define Service Level Objectives and alert when they are at risk
  • Clear Runbooks - Each alert should have a runbook

See: Monitoring & Dashboards for alerting guidance.
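
One common way to tie alerts to SLOs is to alert on how fast the error budget is being consumed rather than on raw error counts. The sketch below shows the arithmetic; the function names and the paging threshold are hypothetical and should be replaced by whatever the Monitoring & Dashboards guidance specifies.

```python
def error_budget(slo_target: float) -> float:
    """Allowed failure fraction, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than allowed the budget is being spent.

    A value of 1.0 means the budget lasts exactly the SLO window.
    """
    budget = error_budget(slo_target)
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return observed_error_rate / budget


def should_page(observed_error_rate: float, slo_target: float,
                threshold: float) -> bool:
    """Page only when the burn rate reaches the (hypothetical) threshold."""
    return burn_rate(observed_error_rate, slo_target) >= threshold
```

For example, a 0.2% error rate against a 99.9% SLO is a burn rate of about 2: the budget would be gone in half the SLO window. Whether that warrants a page or a ticket depends on the threshold the team chooses.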

Incident & Runbook References

Incident Management Process

  1. Detection - Alerts or user reports
  2. Triage - Assess severity and impact
  3. Mitigation - Take immediate action to restore service
  4. Resolution - Fix root cause
  5. Postmortem - Document learnings

See: Incident Management for detailed process.
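
The five phases above form a strict sequence, which can be modelled as a small state machine. This is a sketch for illustration only; the `Incident` class is hypothetical and is not part of any ConnectSoft tooling.

```python
from enum import IntEnum


class Phase(IntEnum):
    DETECTION = 1
    TRIAGE = 2
    MITIGATION = 3
    RESOLUTION = 4
    POSTMORTEM = 5


class Incident:
    """Tracks an incident through the phases in order, with no skipping."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.phase = Phase.DETECTION
        self.history = [Phase.DETECTION]

    def advance(self) -> Phase:
        """Move to the next phase; fail once the postmortem is reached."""
        if self.phase is Phase.POSTMORTEM:
            raise ValueError("incident already reached postmortem")
        self.phase = Phase(self.phase + 1)
        self.history.append(self.phase)
        return self.phase
```

Keeping the history makes it easy to record when each phase started, which is useful input for the postmortem.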

Runbooks

Common Incidents

Service Down:

  1. Check health endpoints
  2. Review logs and metrics
  3. Check recent deployments
  4. Review runbook for service
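
The first step for a service-down incident, checking health endpoints, can be scripted. The sketch below uses only the Python standard library; the endpoint names and URLs in the usage note are placeholders, and treating any 2xx status as healthy is an assumption.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict


def fetch_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code for url, or 0 if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, timeout, ...


def check_health(endpoints: Dict[str, str],
                 fetch: Callable[[str], int] = fetch_status) -> Dict[str, bool]:
    """Map each endpoint name to True (2xx response) or False."""
    return {name: 200 <= fetch(url) < 300 for name, url in endpoints.items()}
```

A call such as `check_health({"orders": "https://orders.example.invalid/health"})` returns a dictionary of service names to health flags. The `fetch` parameter exists so the check can be exercised without network access.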

Performance Degradation:

  1. Check metrics for bottlenecks
  2. Review recent changes
  3. Check resource utilization
  4. Scale if needed
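
For the "scale if needed" decision, a simple proportional rule gives a starting point: scale the instance count by the ratio of observed to target utilization. This is a sketch under assumed limits, not the autoscaling policy actually configured for ConnectSoft services.

```python
import math


def desired_replicas(current: int, observed_utilization: float,
                     target_utilization: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Proportional scaling: replicas grow with observed/target utilization.

    min_replicas and max_replicas are hypothetical bounds; the result is
    clamped to them so a metric spike cannot scale without limit.
    """
    if current < 1 or target_utilization <= 0:
        raise ValueError("current must be >= 1 and target_utilization > 0")
    raw = math.ceil(current * observed_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))
```

With 4 instances at 90% CPU and a 60% target, the rule suggests 6 instances. Check resource utilization and recent changes first, since scaling hides rather than fixes a regression.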

Factory Run Failed:

  1. Check Factory logs
  2. Review agent execution logs
  3. Check Azure DevOps integration
  4. Review runbook for Factory

Common Tasks

Task: Set Up New Service Pipeline

  1. Review Generated Pipeline - Check generated azure-pipelines.yml
  2. Configure Variables - Set environment-specific variables
  3. Set Up Environments - Configure dev, staging, production
  4. Test Pipeline - Run pipeline and verify deployment
  5. Set Up Monitoring - Configure dashboards and alerts
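
Before the first pipeline run, it helps to confirm that every environment has the variables it needs (steps 2 and 3 above). The sketch below is illustrative: the variable names in `REQUIRED_VARIABLES` are hypothetical, and the real list comes from the generated azure-pipelines.yml.

```python
from typing import Dict, List

# Hypothetical variable names; replace with those the pipeline references.
REQUIRED_VARIABLES = ("resourceGroup", "serviceUrl")
ENVIRONMENTS = ("dev", "staging", "production")


def missing_variables(config: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Return {environment: [missing variable names]} for incomplete setups.

    An empty result means every environment defines every required variable.
    """
    missing: Dict[str, List[str]] = {}
    for env in ENVIRONMENTS:
        values = config.get(env, {})
        absent = [name for name in REQUIRED_VARIABLES if not values.get(name)]
        if absent:
            missing[env] = absent
    return missing
```

Running a check like this locally is cheaper than discovering an unset variable halfway through a deployment stage.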

See: CI/CD Guidelines for detailed guidance.

Task: Deploy Service Update

  1. Review Changes - Check what changed
  2. Run Tests - Ensure tests pass
  3. Deploy to Dev - Deploy to development first
  4. Verify - Check health and metrics
  5. Deploy to Staging - Deploy to staging
  6. Deploy to Production - Deploy to production

See: CI/CD Guidelines for deployment guidance.
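
The promotion order above can be expressed as a loop that stops at the first environment that fails verification. In this sketch `deploy` and `verify` are hypothetical callables standing in for a pipeline run and a health and metrics check. Verifying after every environment, not only after dev, is an assumption added here.

```python
from typing import Callable, List

PROMOTION_ORDER = ["dev", "staging", "production"]


def promote(deploy: Callable[[str], None],
            verify: Callable[[str], bool]) -> List[str]:
    """Deploy to each environment in order and return those that verified.

    Promotion halts as soon as one environment fails verification, so a
    bad build never reaches the environments after it.
    """
    completed: List[str] = []
    for env in PROMOTION_ORDER:
        deploy(env)
        if not verify(env):
            break
        completed.append(env)
    return completed
```

If the returned list is shorter than `PROMOTION_ORDER`, the next entry is the environment to investigate.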

Task: Investigate Performance Issue

  1. Check Dashboards - Review service metrics
  2. Review Logs - Check for errors or warnings
  3. Check Traces - Review distributed traces
  4. Identify Bottleneck - Find slow operations
  5. Fix and Deploy - Fix the issue and deploy the change

See: Observability-Driven Design for debugging guidance.
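
Steps 3 and 4 amount to ranking the operations in a trace by duration. The sketch below shows the idea on a simplified span model; the `Span` type is an assumption and does not match any particular tracing backend's schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def slowest_spans(spans: List[Span], top_n: int = 3) -> List[Span]:
    """Return the top_n spans by total duration, slowest first."""
    return sorted(spans, key=lambda s: s.duration_ms, reverse=True)[:top_n]
```

Note that a parent span's duration includes its children, so the slowest entry is often the request itself; the bottleneck is usually the slowest span that has no slow children.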