Factory Operations
This document provides operational procedures and troubleshooting guides for the ConnectSoft AI Software Factory. It is written for operations teams and SREs running the Factory.
The Factory combines multiple agents, an orchestrator, and knowledge storage. This runbook covers monitoring, handling failed runs, scaling, and upgrades.
Note: This runbook focuses on operational procedures. For architecture details, see Factory Overview.
Architecture Overview for Ops
Key Components
Factory Services:
flowchart TD
    subgraph "Factory Services"
        ORCH[Orchestrator<br/>Workflow Management]
        AGENTS[Agent Host<br/>Agent Execution]
        KNOWLEDGE[Knowledge System<br/>Pattern Storage]
        API[Factory API<br/>REST API]
    end
    subgraph "Storage"
        VECTOR[Vector DB<br/>Semantic Search]
        METADATA[Metadata DB<br/>Structured Data]
        ADO[Azure DevOps<br/>Repos & Work Items]
    end
    subgraph "Dependencies"
        AZURE[Azure Services<br/>Key Vault, Service Bus]
    end
    API -->|Orchestrates| ORCH
    ORCH -->|Coordinates| AGENTS
    AGENTS -->|Query/Store| KNOWLEDGE
    KNOWLEDGE -->|Uses| VECTOR
    KNOWLEDGE -->|Uses| METADATA
    AGENTS -->|Reads/Writes| ADO
    AGENTS -->|Uses| AZURE
    style ORCH fill:#2563EB,color:#fff
    style AGENTS fill:#4F46E5,color:#fff
    style KNOWLEDGE fill:#10B981,color:#fff
Components:
- Orchestrator - Coordinates agent workflows
- Agent Host - Executes agents (Vision, Architect, Engineering, QA, DevOps)
- Knowledge System - Stores patterns, blueprints, code
- Factory API - REST API for Factory operations
Dependencies:
- Vector DB - Semantic search for knowledge
- Metadata DB - Structured metadata storage
- Azure DevOps - Code repositories and work items
- Azure Services - Key Vault, Service Bus, etc.
Health Checks and Monitoring
Health Check Endpoints
Factory API Health:
Expected Response:
{
  "status": "healthy",
  "components": {
    "orchestrator": "healthy",
    "agentHost": "healthy",
    "knowledgeSystem": "healthy"
  }
}
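As a minimal sketch of how an operator script might interpret this response, the Python snippet below parses the payload and lists any component that is not reporting `healthy`. Only the field names come from the expected response above; the function name, the `degraded` status value, and the `factory` label used for a non-healthy top-level status are illustrative assumptions, not part of the Factory API.

```python
import json


def unhealthy_components(payload: str) -> list:
    """Return the names of components whose status is not "healthy"."""
    body = json.loads(payload)
    problems = []
    # Assumption: a non-healthy top-level status is reported as "factory".
    if body.get("status") != "healthy":
        problems.append("factory")
    for name, status in body.get("components", {}).items():
        if status != "healthy":
            problems.append(name)
    return problems


# Hypothetical response in which one component is degraded.
sample = json.dumps({
    "status": "healthy",
    "components": {
        "orchestrator": "healthy",
        "agentHost": "degraded",
        "knowledgeSystem": "healthy",
    },
})
print(unhealthy_components(sample))  # ['agentHost']
```

An empty list means every reported component is healthy, which makes the function easy to use as an alert condition.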
Key Metrics to Monitor
Factory Metrics:
- Generation success rate - Percentage of successful generation runs
- Average generation time - Time to complete generation
- Agent task completion rate - Percentage of agent tasks completed
- Knowledge reuse rate - Percentage of patterns reused
Infrastructure Metrics:
- API request rate - Requests per second
- API latency - p50, p95, p99 latency
- Error rate - Errors per second
- Resource usage - CPU, memory, disk
Agent Metrics:
- Agent execution time - Time per agent task
- Agent success rate - Percentage of successful agent tasks
- Agent queue length - Number of queued agent tasks
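To make the percentile and success-rate metrics above concrete, here is a small sketch of how they could be computed from raw samples. It uses the nearest-rank percentile definition, which is one common choice; a monitoring backend may interpolate differently. The sample data and function names are invented for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def success_rate(outcomes):
    """Percentage of True entries in a list of run outcomes."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical samples: API latencies in milliseconds and generation run outcomes.
latencies_ms = [120, 95, 300, 110, 105, 2500, 130, 98, 101, 115]
runs = [True, True, False, True, True, True, True, False, True, True]

print(percentile(latencies_ms, 50))  # 110
print(percentile(latencies_ms, 95))  # 2500
print(success_rate(runs))            # 80.0
```

The gap between p50 and p95 in the sample shows why tail percentiles are tracked alongside the median: a single slow request is invisible at p50 but dominates p95.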
Monitoring Dashboards
Factory Dashboard:
- Generation runs (success rate, duration)
- Agent performance (tasks completed, errors)
- Knowledge system (patterns stored, reuse rate)
- Infrastructure health (CPU, memory, latency)
Handling Failed Runs
Where Failures Show Up
Failure Locations:
- Factory API logs - API errors and failures
- Orchestrator logs - Workflow failures
- Agent logs - Agent execution failures
- Azure DevOps - Failed commits, pipeline failures
Inspecting Failed Runs
Steps:
- Check Run Status
- Check Orchestrator Logs
- Check Agent Logs
- Check Azure DevOps
  - Check for failed commits
  - Check pipeline failures
  - Review work items
Retry Mechanisms
Automatic Retries:
- Transient failures - Automatically retried (3 attempts)
- Agent failures - Retried with exponential backoff
- Network failures - Retried automatically
Manual Retries:
- Failed runs - Can be retried via API
- Failed agent tasks - Can be retried individually
- Failed deployments - Can be retried via Azure DevOps
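The automatic retry behaviour listed above can be sketched as a bounded retry loop with exponential backoff. This is an illustration of the general technique, not the Factory's actual implementation: the `TransientError` class, the one-second base delay, and the doubling factor are assumptions, and only the limit of three attempts comes from the list above.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def retry_with_backoff(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call task() up to max_attempts times, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # waits 1s, 2s, 4s, ...


# Usage: a task that fails twice and then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("temporary failure")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = retry_with_backoff(flaky, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

Passing the `sleep` function as a parameter keeps the example fast to run and easy to test. Errors that are not marked transient propagate immediately, so permanent failures such as bad credentials are not retried.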
Retry Process:
Common Failure Scenarios
Scenario 1: Agent Timeout
- Symptom: Agent task times out
- Cause: Long-running task, resource constraints
- Resolution: Increase timeout, scale resources, optimize task
Scenario 2: Azure DevOps Failure
- Symptom: Failed to commit to Azure DevOps
- Cause: Authentication issues, repository permissions
- Resolution: Verify PAT/service connection, check permissions
Scenario 3: Knowledge System Failure
- Symptom: Failed to query/store knowledge
- Cause: Vector DB unavailable, metadata DB issues
- Resolution: Check database connectivity, scale if needed
Scaling and Performance Considerations
Scaling Levers
Horizontal Scaling:
- Agent Host replicas - Scale agent execution capacity
- Orchestrator replicas - Scale workflow coordination
- API replicas - Scale API request handling
Vertical Scaling:
- Agent Host resources - Increase CPU/memory for agents
- Knowledge System resources - Increase resources for vector/metadata DB
Database Scaling:
- Vector DB - Scale for semantic search capacity
- Metadata DB - Scale for metadata storage
- Azure DevOps - Scale via Azure DevOps capacity
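For horizontal scaling of the Agent Host, one simple heuristic is to derive the replica count from the agent queue length. The sketch below assumes a target number of queued tasks per replica and fixed minimum and maximum bounds; all of these values, and the function name, are placeholders to be tuned for a given deployment rather than Factory settings.

```python
import math


def desired_replicas(queue_length, tasks_per_replica, min_replicas=1, max_replicas=10):
    """Replicas needed so each one handles at most tasks_per_replica queued tasks."""
    if tasks_per_replica <= 0:
        raise ValueError("tasks_per_replica must be positive")
    needed = math.ceil(queue_length / tasks_per_replica)
    # Clamp to the allowed range so the service never scales to zero or without bound.
    return max(min_replicas, min(max_replicas, needed))


print(desired_replicas(45, 10))   # 5
print(desired_replicas(0, 10))    # 1
print(desired_replicas(500, 10))  # 10
```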
Performance Optimization
Agent Performance:
- Parallel execution - Run agents in parallel when possible
- Caching - Cache frequently used patterns
- Optimization - Optimize agent execution logic
Knowledge System Performance:
- Indexing - Optimize vector indexes
- Caching - Cache frequent queries
- Partitioning - Partition knowledge by tenant/domain
API Performance:
- Caching - Cache API responses
- Rate limiting - Prevent overload
- Connection pooling - Optimize database connections
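Rate limiting, mentioned above as a way to prevent overload, is often implemented with a token bucket. The following is a generic sketch of that technique, not the Factory API's actual limiter: the class name, the rate, and the capacity are illustrative assumptions.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Consume one token and return True if available, otherwise return False."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Usage: 5 requests per second with bursts of up to 10.
bucket = TokenBucket(rate=5, capacity=10)
accepted = sum(bucket.allow() for _ in range(12))
print(accepted)  # 10 on a typical run: the burst capacity is used up before any meaningful refill
```

A request that is not allowed would typically be rejected with an HTTP 429 response or queued, depending on the caller's needs.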
Capacity Planning
Monitoring:
- Generation run rate trends
- Agent task completion trends
- Knowledge storage growth
- Resource usage trends
Planning:
- Project generation capacity needs
- Plan for peak loads
- Plan knowledge storage growth
- Plan resource scaling
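As one way to turn the growth trends above into a planning number, the sketch below fits a straight line to storage samples and estimates how many days remain before a capacity limit is reached. A linear fit is a simplifying assumption, since real growth may be seasonal or accelerating, and the sample figures and function names are invented.

```python
def linear_trend(samples):
    """Least-squares slope and intercept for a list of (day, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one day")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(samples, capacity):
    """Days after the last sample until the trend reaches capacity, or None if not growing."""
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    last_day = max(x for x, _ in samples)
    return max(0.0, (capacity - intercept) / slope - last_day)


# Hypothetical knowledge storage size in GB, sampled weekly.
storage_gb = [(0, 100), (7, 114), (14, 128), (21, 142)]
print(days_until_full(storage_gb, capacity=200))  # 29.0
```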
Upgrade and Change Management
Factory Version Upgrades
Process:
- Plan Upgrade
  - Review release notes
  - Identify breaking changes
  - Plan deployment window
- Deploy to Dev
  - Deploy new version to dev
  - Test generation runs
  - Verify functionality
- Deploy to Staging
  - Deploy to staging
  - Run integration tests
  - Verify with production-like data
- Deploy to Production
  - Deploy during maintenance window
  - Monitor metrics and logs
  - Rollback if issues occur
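The "monitor metrics and logs, rollback if issues occur" step can be made less subjective by comparing post-deployment metrics with a pre-deployment baseline. The sketch below shows one possible rule. The metric choice, the thresholds (a drop of 5 percentage points in success rate, a 1.5x increase in p95 latency), and all names are assumptions for illustration, not Factory defaults.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    success_rate: float    # percentage of generation runs that succeeded
    p95_latency_ms: float  # 95th percentile API latency in milliseconds


def should_roll_back(baseline, current, max_success_drop=5.0, max_latency_ratio=1.5):
    """Return True if the new version is clearly worse than the baseline on either metric."""
    success_dropped = baseline.success_rate - current.success_rate > max_success_drop
    latency_regressed = current.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio
    return success_dropped or latency_regressed


baseline = Snapshot(success_rate=98.0, p95_latency_ms=400.0)
print(should_roll_back(baseline, Snapshot(97.0, 450.0)))  # False: within tolerance
print(should_roll_back(baseline, Snapshot(90.0, 420.0)))  # True: success rate dropped by 8 points
print(should_roll_back(baseline, Snapshot(98.0, 700.0)))  # True: p95 latency above 600 ms
```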
Agent Updates
Process:
- Update Agent Code
  - Update agent implementations
  - Test agent execution
  - Verify agent outputs
- Deploy Agents
  - Deploy updated agents
  - Monitor agent performance
  - Verify agent outputs
- Rollback if Needed
  - Rollback to previous agent version
  - Verify system stability
Knowledge System Updates
Process:
- Update Knowledge Schema
  - Update vector/metadata schemas
  - Migrate existing knowledge
  - Verify migration
- Update Indexes
  - Rebuild indexes if needed
  - Optimize indexes
  - Monitor performance
Related Documents
- Operations Overview - Operations documentation overview
- Monitoring & Dashboards - Monitoring practices
- Incident Management - Incident response process
- Factory Overview - Factory architecture
- Runtime Overview - Business view of Factory runtime operations
- Business Continuity - Risk management and failure handling from business perspective
- Monitoring & Insights - Business metrics and customer dashboards