DevOps / SRE Quickstart
This quickstart orients DevOps engineers, SREs, and platform engineers to ConnectSoft's CI/CD pipelines, infrastructure, monitoring, and operations. It assumes you are responsible for deployment, monitoring, and incident response.
As a DevOps engineer or SRE at ConnectSoft, you'll manage CI/CD pipelines, deploy services, monitor systems, respond to incidents, and keep services reliable and performant.
Your Goals as DevOps / SRE
- Manage CI/CD Pipelines - Configure and maintain build and deployment pipelines
- Deploy Services - Deploy services to Azure environments
- Monitor Systems - Set up monitoring, dashboards, and alerts
- Respond to Incidents - Handle incidents and maintain runbooks
- Ensure Reliability - Ensure services meet SLAs and performance targets
Top Docs to Read
- CI/CD Guidelines - Pipeline structure and best practices
- Observability-Driven Design - Monitoring and observability patterns
- Operations Overview - Operations documentation overview
- Monitoring & Dashboards - Monitoring setup and dashboards
- Incident Management - Incident response process
- Factory Operations - Operating the Factory itself
CI/CD and Environments Overview
Pipeline Structure
Generated pipelines follow a standard structure:
    stages:
    - stage: Build
      jobs:
      - job: BuildAndTest
        steps:
        - task: UseDotNet@2
        - script: dotnet build
        - script: dotnet test
    - stage: Deploy
      jobs:
      - job: DeployToDev
        steps:
        - script: az deployment group create ...
See: CI/CD Guidelines for detailed guidance.
Environments
Standard environments:
- Development - For development and testing
- Staging - For pre-production testing
- Production - For live customer traffic
Deployment Patterns
- Blue/Green - Switch traffic between two identical environments for zero-downtime deployments
- Canary - Gradual rollout to a subset of traffic
- Rolling - Replace instances in batches until all run the new version
See: CI/CD Guidelines for deployment patterns.
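To make the canary pattern more concrete, here is a minimal Python sketch of the promote-or-roll-back decision taken at each traffic step. The step percentages, the 1% error-rate threshold, and the names `CANARY_STEPS`, `MAX_ERROR_RATE`, and `run_canary` are illustrative assumptions, not ConnectSoft settings. The actual traffic shifting would be done by your deployment tooling.

```python
from typing import Callable, Sequence

# Hypothetical values for illustration only; tune them per service.
CANARY_STEPS: Sequence[int] = (5, 25, 50, 100)  # percent of traffic on the new version
MAX_ERROR_RATE = 0.01                           # abort if more than 1% of requests fail


def run_canary(
    error_rate_at: Callable[[int], float],
    steps: Sequence[int] = CANARY_STEPS,
    max_error_rate: float = MAX_ERROR_RATE,
) -> tuple[str, int]:
    """Walk through the traffic steps and stop at the first unhealthy one.

    error_rate_at(percent) stands in for "shift this much traffic to the
    new version, wait, then read the error rate from monitoring".
    Returns ("promoted", last_step) or ("rolled_back", failing_step).
    """
    for percent in steps:
        if error_rate_at(percent) > max_error_rate:
            return ("rolled_back", percent)
    return ("promoted", steps[-1])
```

Passing the error-rate lookup in as a function keeps the decision logic testable without a live environment. In practice it would query whatever metrics backend the service dashboards use.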
Monitoring, Dashboards, and Alerts
What We Monitor
- Technical Metrics - CPU, memory, request rates, latency, error rates
- Business Metrics - User actions, transactions, feature usage
- Platform Metrics - Factory runs, agent performance, knowledge system
See: Monitoring & Dashboards for detailed guidance.
Dashboards
- Service Dashboards - Per-service metrics and health
- Platform Dashboards - Factory and platform metrics
- Business Dashboards - Business KPIs and metrics
Alerting Principles
- Avoid Alert Fatigue - Only alert on actionable issues
- Use SLOs - Define Service Level Objectives and alert when they are at risk
- Clear Runbooks - Each alert should have a runbook
See: Monitoring & Dashboards for alerting guidance.
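One common way to combine the first two principles is to alert on how fast a service spends its error budget rather than on every individual error. The sketch below is a minimal illustration of that arithmetic. The function names and the paging threshold of 10 are assumptions made for this example, not values from the Monitoring & Dashboards guide.

```python
def _check_target(slo_target: float) -> None:
    # An SLO of exactly 0 or 1 leaves no meaningful error budget to measure.
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")


def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    slo_target is the success-rate objective, e.g. 0.999 for 99.9%.
    """
    _check_target(slo_target)
    if total == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total
    return 1.0 - failed / allowed_failures


def burn_rate(slo_target: float, total: int, failed: int) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    _check_target(slo_target)
    if total == 0:
        return 0.0
    return (failed / total) / (1.0 - slo_target)


def should_page(slo_target: float, total: int, failed: int,
                threshold: float = 10.0) -> bool:
    """Page only when the budget is burning fast (threshold is hypothetical)."""
    return burn_rate(slo_target, total, failed) >= threshold
```

With a 99.9% objective, 2 failures in 1,000 requests is a burn rate of about 2: worth watching on a dashboard, but below the example paging threshold.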
Incident & Runbook References
Incident Management Process
- Detection - Alerts or user reports
- Triage - Assess severity and impact
- Mitigation - Take immediate action to restore service
- Resolution - Fix root cause
- Postmortem - Document learnings
See: Incident Management for the detailed process.
Runbooks
- Identity Platform Runbook - Identity Platform operations
- Audit Platform Runbook - Audit Platform operations
- Factory Operations - Factory operations
- Monitoring & Dashboards - Monitoring setup
Common Incidents
Service Down:
1. Check health endpoints
2. Review logs and metrics
3. Check recent deployments
4. Review runbook for service

Performance Degradation:
1. Check metrics for bottlenecks
2. Review recent changes
3. Check resource utilization
4. Scale if needed

Factory Run Failed:
1. Check Factory logs
2. Review agent execution logs
3. Check Azure DevOps integration
4. Review runbook for Factory
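The first step of the Service Down checklist, checking health endpoints, is easy to script. The following is a rough sketch using only the Python standard library. It assumes each service exposes an HTTP health URL that returns 200 when healthy; the URLs in the usage note are placeholders, and the function names are made up for this example.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with a 4xx/5xx status
    except OSError:
        return 0         # DNS failure, refused connection, or timeout


def failing_endpoints(
    urls: Iterable[str],
    get_status: Callable[[str], int] = http_status,
) -> dict[str, int]:
    """Map each endpoint that did not return 200 to the status it returned."""
    results = {url: get_status(url) for url in urls}
    return {url: status for url, status in results.items() if status != 200}
```

Calling `failing_endpoints(["https://identity.example.com/health", "https://audit.example.com/health"])` returns an empty dict when every endpoint is healthy. A status of 0 points at a network or DNS problem rather than an application error, which helps decide whether to look at logs or at recent infrastructure changes next.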
Common Tasks
Task: Set Up New Service Pipeline
- Review Generated Pipeline - Check the generated azure-pipelines.yml
- Configure Variables - Set environment-specific variables
- Set Up Environments - Configure dev, staging, production
- Test Pipeline - Run pipeline and verify deployment
- Set Up Monitoring - Configure dashboards and alerts
See: CI/CD Guidelines for detailed guidance.
Task: Deploy Service Update
- Review Changes - Check what changed
- Run Tests - Ensure tests pass
- Deploy to Dev - Deploy to development first
- Verify - Check health and metrics
- Deploy to Staging - Deploy to staging
- Deploy to Production - Deploy to production
See: CI/CD Guidelines for deployment guidance.
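The promotion order above can be expressed as a small gate: deploy to one environment, verify it, and continue only if the check passes. This Python sketch is illustrative only. The environment names, the `promote` function, and the choice to verify after every environment (the list above names verification explicitly only after dev) are assumptions.

```python
from typing import Callable, Sequence

# Order taken from the steps above; the identifiers themselves are illustrative.
ENVIRONMENTS: Sequence[str] = ("dev", "staging", "production")


def promote(
    deploy: Callable[[str], None],
    verify: Callable[[str], bool],
    environments: Sequence[str] = ENVIRONMENTS,
) -> list[str]:
    """Deploy to each environment in order, stopping at the first failed check.

    deploy(env) performs the deployment; verify(env) checks health and metrics.
    Returns the environments that were deployed and verified successfully.
    """
    verified: list[str] = []
    for env in environments:
        deploy(env)
        if not verify(env):
            break  # do not promote an unverified build any further
        verified.append(env)
    return verified
```

If the returned list is shorter than the environment list, the first missing entry is the environment where verification failed and where investigation should start.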
Task: Investigate Performance Issue
- Check Dashboards - Review service metrics
- Review Logs - Check for errors or warnings
- Check Traces - Review distributed traces
- Identify Bottleneck - Find slow operations
- Fix and Deploy - Fix issue and deploy fix
See: Observability-Driven Design for debugging guidance.
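For the Identify Bottleneck step, ranking operations by tail latency is often more telling than looking at averages. The sketch below assumes span data has already been exported from the tracing backend as `(operation_name, duration_ms)` pairs. That input shape, the operation names in the usage note, and the function name are assumptions for illustration.

```python
from collections import defaultdict
from statistics import quantiles
from typing import Iterable


def slowest_operations(
    spans: Iterable[tuple[str, float]], top: int = 3
) -> list[tuple[str, float]]:
    """Rank operations by 95th-percentile duration, slowest first.

    spans: (operation_name, duration_ms) pairs.
    Returns up to `top` entries of (operation_name, p95_ms).
    """
    by_operation: dict[str, list[float]] = defaultdict(list)
    for operation, duration_ms in spans:
        by_operation[operation].append(duration_ms)

    ranked = []
    for operation, durations in by_operation.items():
        if len(durations) >= 2:
            # n=20 yields 19 cut points; the last one is the 95th percentile.
            p95 = quantiles(durations, n=20, method="inclusive")[-1]
        else:
            p95 = durations[0]  # a single sample is its own percentile
        ranked.append((operation, p95))

    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:top]
```

For example, feeding in spans for `db.query`, `http.get`, and `cache.get` returns them ordered by p95, so the first entry is the most likely place to start looking for the slow operation.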
Related Documents
- CI/CD Guidelines - Pipeline best practices
- Observability-Driven Design - Monitoring patterns
- Operations Overview - Operations documentation
- Monitoring & Dashboards - Monitoring setup
- Incident Management - Incident response
- Factory Operations - Factory operations
- Identity Platform Runbook - Identity Platform ops
- Audit Platform Runbook - Audit Platform ops