Factory Operations¶

This document provides operational procedures and troubleshooting guides for the ConnectSoft AI Software Factory. It is written for operations teams and SREs running the Factory.

The Factory combines multiple agents, an orchestration layer, and knowledge storage. This runbook covers monitoring, handling failed runs, scaling, and upgrades.

Note

This runbook focuses on operational procedures. For architecture details, see Factory Overview.

Architecture Overview for Ops¶

Key Components¶

Factory Services:

flowchart TD
    subgraph "Factory Services"
        ORCH[Orchestrator<br/>Workflow Management]
        AGENTS[Agent Host<br/>Agent Execution]
        KNOWLEDGE[Knowledge System<br/>Pattern Storage]
        API[Factory API<br/>REST API]
    end

    subgraph "Storage"
        VECTOR[Vector DB<br/>Semantic Search]
        METADATA[Metadata DB<br/>Structured Data]
        ADO[Azure DevOps<br/>Repos & Work Items]
    end

    subgraph "Dependencies"
        AZURE[Azure Services<br/>Key Vault, Service Bus]
    end

    API -->|Orchestrates| ORCH
    ORCH -->|Coordinates| AGENTS
    AGENTS -->|Query/Store| KNOWLEDGE
    KNOWLEDGE -->|Uses| VECTOR
    KNOWLEDGE -->|Uses| METADATA
    AGENTS -->|Reads/Writes| ADO
    AGENTS -->|Uses| AZURE

    style ORCH fill:#2563EB,color:#fff
    style AGENTS fill:#4F46E5,color:#fff
    style KNOWLEDGE fill:#10B981,color:#fff


Components:

- Orchestrator - Coordinates agent workflows
- Agent Host - Executes agents (Vision, Architect, Engineering, QA, DevOps)
- Knowledge System - Stores patterns, blueprints, code
- Factory API - REST API for Factory operations

Dependencies:

- Vector DB - Semantic search for knowledge
- Metadata DB - Structured metadata storage
- Azure DevOps - Code repositories and work items
- Azure Services - Key Vault, Service Bus, etc.

Health Checks and Monitoring¶

Health Check Endpoints¶

Factory API Health:

curl https://factory.connectsoft.ai/health

Expected Response:

{
  "status": "healthy",
  "components": {
    "orchestrator": "healthy",
    "agentHost": "healthy",
    "knowledgeSystem": "healthy"
  }
}
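A health response like the one above can be checked programmatically. The following is a minimal sketch, assuming the endpoint always returns the JSON shape shown here; any status value other than "healthy" (for example "degraded") is an assumption, since only the healthy response is documented. The function name `unhealthy_components` is hypothetical.

```python
import json


def unhealthy_components(payload: str) -> list:
    """Return the names of components whose status is not 'healthy'."""
    body = json.loads(payload)
    components = body.get("components", {})
    bad = [name for name, status in components.items() if status != "healthy"]
    # Assumption: if the top-level status is not healthy but no component is
    # flagged, report the top-level field itself so the problem is not hidden.
    if body.get("status") != "healthy" and not bad:
        bad.append("status")
    return sorted(bad)


sample = (
    '{"status": "healthy", "components": {"orchestrator": "healthy", '
    '"agentHost": "healthy", "knowledgeSystem": "healthy"}}'
)
print(unhealthy_components(sample))  # prints []
```

An empty list means every reported component is healthy; a non-empty list names what to investigate first.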

Key Metrics to Monitor¶

Factory Metrics:

- Generation success rate - Percentage of successful generation runs
- Average generation time - Time to complete generation
- Agent task completion rate - Percentage of agent tasks completed
- Knowledge reuse rate - Percentage of patterns reused

Infrastructure Metrics:

- API request rate - Requests per second
- API latency - p50, p95, p99 latency
- Error rate - Errors per second
- Resource usage - CPU, memory, disk

Agent Metrics:

- Agent execution time - Time per agent task
- Agent success rate - Percentage of successful agent tasks
- Agent queue length - Number of queued agent tasks
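To make the rate and latency metrics above concrete, here is a small sketch of how a success rate and the p50/p95/p99 latencies could be computed from raw samples. It uses the nearest-rank percentile definition, which is one of several common conventions; a monitoring backend may interpolate differently. The function names and the sample latencies are illustrative only.

```python
import math


def success_rate(succeeded: int, total: int) -> float:
    """Percentage of successful runs; 0.0 when there were no runs."""
    return 100.0 * succeeded / total if total else 0.0


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample set is undefined")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative latencies in milliseconds (not real Factory data).
latencies_ms = [120, 80, 200, 95, 150, 110, 300, 90, 105, 130]
print(success_rate(47, 50))          # prints 94.0
print(percentile(latencies_ms, 50))  # prints 110
print(percentile(latencies_ms, 95))  # prints 300
```

With only ten samples, p95 and p99 both resolve to the maximum, which is why tail percentiles are only meaningful over reasonably large windows.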

Monitoring Dashboards¶

Factory Dashboard:

- Generation runs (success rate, duration)
- Agent performance (tasks completed, errors)
- Knowledge system (patterns stored, reuse rate)
- Infrastructure health (CPU, memory, latency)

Handling Failed Runs¶

Where Failures Show Up¶

Failure Locations:

- Factory API logs - API errors and failures
- Orchestrator logs - Workflow failures
- Agent logs - Agent execution failures
- Azure DevOps - Failed commits, pipeline failures

Inspecting Failed Runs¶

Steps:

1. Check Run Status

   # Get run status via API
   curl https://factory.connectsoft.ai/api/runs/{runId}

2. Check Orchestrator Logs

   kubectl logs -n factory orchestrator-<pod-name>

3. Check Agent Logs

   kubectl logs -n factory agent-host-<pod-name>

4. Check Azure DevOps
   - Check for failed commits
   - Check pipeline failures
   - Review work items

Retry Mechanisms¶

Automatic Retries:

- Transient failures - Automatically retried (3 attempts)
- Agent failures - Retried with exponential backoff
- Network failures - Retried automatically
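The automatic retry behaviour can be illustrated with a generic exponential backoff loop. This is a sketch of the general technique, not the Factory's actual implementation: the 3-attempt limit comes from the list above, while the 1-second base delay, the doubling factor, and the `TransientError` type are assumptions made for the example.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_with_backoff(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts are exhausted.

    Waits base_delay, then 2 * base_delay, and so on between attempts.
    The final failure is re-raised to the caller.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Usage: an operation that fails twice, then succeeds on the third attempt.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporary failure")
    return "ok"


delays = []
print(retry_with_backoff(flaky, sleep=delays.append), delays)  # prints ok [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the loop testable without real waiting. Production backoff usually also adds random jitter so that many clients do not retry in lockstep.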

Manual Retries:

- Failed runs - Can be retried via API
- Failed agent tasks - Can be retried individually
- Failed deployments - Can be retried via Azure DevOps

Retry Process:

# Retry failed run
curl -X POST https://factory.connectsoft.ai/api/runs/{runId}/retry
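The same retry call can be scripted. The sketch below only builds the request with the Python standard library; the endpoint path is the one shown above, while the bearer-token header is an assumption, because this runbook does not state how the API authenticates. `build_retry_request` is a hypothetical helper name.

```python
import urllib.request
from typing import Optional
from urllib.parse import quote

BASE_URL = "https://factory.connectsoft.ai"


def build_retry_request(run_id: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) the POST request that retries a failed run."""
    url = f"{BASE_URL}/api/runs/{quote(run_id, safe='')}/retry"
    headers = {"Accept": "application/json"}
    if token:
        # Assumption: bearer-token authentication; adjust to the real scheme.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, method="POST", headers=headers)


request = build_retry_request("run-123")
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request, timeout=30)
```

Percent-encoding the run ID guards against IDs containing characters that are not safe in a URL path.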

Common Failure Scenarios¶

Scenario 1: Agent Timeout

- Symptom: Agent task times out
- Cause: Long-running task, resource constraints
- Resolution: Increase timeout, scale resources, optimize task

Scenario 2: Azure DevOps Failure

- Symptom: Failed to commit to Azure DevOps
- Cause: Authentication issues, repository permissions
- Resolution: Verify PAT/service connection, check permissions

Scenario 3: Knowledge System Failure

- Symptom: Failed to query/store knowledge
- Cause: Vector DB unavailable, metadata DB issues
- Resolution: Check database connectivity, scale if needed

Scaling and Performance Considerations¶

Scaling Levers¶

Horizontal Scaling:

- Agent Host replicas - Scale agent execution capacity
- Orchestrator replicas - Scale workflow coordination
- API replicas - Scale API request handling

Vertical Scaling:

- Agent Host resources - Increase CPU/memory for agents
- Knowledge System resources - Increase resources for vector/metadata DB

Database Scaling:

- Vector DB - Scale for semantic search capacity
- Metadata DB - Scale for metadata storage
- Azure DevOps - Scale via Azure DevOps capacity
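For the horizontal lever on Agent Host replicas, one common approach is to size the replica count from the agent queue length. The sketch below assumes a target number of queued tasks per replica and fixed minimum and maximum bounds; all of these values are hypothetical and would need tuning against real Factory load.

```python
import math


def desired_replicas(queue_length: int, tasks_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Replica count needed to keep the queue at or below the per-replica target."""
    if tasks_per_replica <= 0:
        raise ValueError("tasks_per_replica must be positive")
    needed = math.ceil(queue_length / tasks_per_replica)
    # Clamp to the configured bounds so scaling never drops to zero or runs away.
    return max(min_replicas, min(max_replicas, needed))


print(desired_replicas(45, 10))   # prints 5
print(desired_replicas(0, 10))    # prints 1
print(desired_replicas(500, 10))  # prints 10
```

The result could then be applied with a command such as `kubectl scale`, or the same rule could be expressed declaratively in an autoscaler configuration.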

Performance Optimization¶

Agent Performance:

- Parallel execution - Run agents in parallel when possible
- Caching - Cache frequently used patterns
- Optimization - Optimize agent execution logic
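Parallel execution of independent tasks can be sketched with a thread pool. This is a generic illustration that assumes the tasks do not depend on each other's output and are I/O-bound (for example, waiting on an API); the task names are placeholders, not real Factory agents or their interfaces.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(tasks: dict, max_workers: int = 4) -> dict:
    """Run independent zero-argument callables concurrently.

    Returns a dict mapping each task name to its result. If a task raised,
    the exception is re-raised when its result is collected.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in tasks.items()}
        return {name: future.result() for name, future in futures.items()}


# Placeholder tasks standing in for independent pieces of work.
results = run_in_parallel({
    "task_a": lambda: "a done",
    "task_b": lambda: "b done",
})
print(results)
```

Tasks that consume each other's output still have to run in sequence; only independent branches of a workflow benefit from this pattern.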

Knowledge System Performance:

- Indexing - Optimize vector indexes
- Caching - Cache frequent queries
- Partitioning - Partition knowledge by tenant/domain

API Performance:

- Caching - Cache API responses
- Rate limiting - Prevent overload
- Connection pooling - Optimize database connections
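Rate limiting is often implemented as a token bucket. The following is a minimal single-threaded sketch of that technique; this runbook does not say which algorithm the Factory API uses, so the class and its parameters are illustrative assumptions.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def allow(self) -> bool:
        """Consume one token if available; return whether the request may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Usage with a fake clock so the behaviour is deterministic.
now = [0.0]
bucket = TokenBucket(rate=1, capacity=2, clock=lambda: now[0])
print([bucket.allow() for _ in range(3)])  # prints [True, True, False]
now[0] = 1.0                               # one second later, one token is back
print(bucket.allow())                      # prints True
```

A request that is not allowed would typically receive an HTTP 429 response. A multi-threaded server would also need a lock around `allow`.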

Capacity Planning¶

Monitoring:

- Generation run rate trends
- Agent task completion trends
- Knowledge storage growth
- Resource usage trends

Planning:

- Project generation capacity needs
- Plan for peak loads
- Plan knowledge storage growth
- Plan resource scaling
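A simple way to turn the trends above into a projection is a least-squares line through recent measurements. The sketch below assumes evenly spaced samples (for example, one per month) and roughly linear growth; the storage figures are made up for illustration.

```python
def linear_forecast(values, periods_ahead: int) -> float:
    """Fit y = intercept + slope * x by least squares and extrapolate.

    `values` are evenly spaced measurements (x = 0, 1, 2, ...). The forecast
    is for `periods_ahead` steps after the last measurement.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two measurements to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + periods_ahead)


# Illustrative knowledge storage size in GB over four months.
storage_gb = [100, 120, 140, 160]
print(linear_forecast(storage_gb, 3))  # prints 220.0
```

A linear fit is only a starting point. Seasonal peaks or step changes, such as onboarding a large tenant, need separate headroom.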

Upgrade and Change Management¶

Factory Version Upgrades¶

Process:

1. Plan Upgrade
   - Review release notes
   - Identify breaking changes
   - Plan deployment window
2. Deploy to Dev
   - Deploy new version to dev
   - Test generation runs
   - Verify functionality
3. Deploy to Staging
   - Deploy to staging
   - Run integration tests
   - Verify with production-like data
4. Deploy to Production
   - Deploy during maintenance window
   - Monitor metrics and logs
   - Rollback if issues occur
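The "monitor, then roll back if issues occur" step is easier to apply consistently when the criteria are written down. The sketch below compares post-deployment metrics against a pre-deployment baseline; the two metrics are taken from the monitoring section of this runbook, but the thresholds (a 5-point success-rate drop and a 1.5x p95 latency increase) are hypothetical and should be replaced with agreed values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    success_rate: float    # percentage of successful generation runs
    p95_latency_ms: float  # API p95 latency in milliseconds


def rollback_reasons(baseline: Snapshot, current: Snapshot,
                     max_success_drop: float = 5.0,
                     max_latency_ratio: float = 1.5) -> list:
    """Return the reasons to roll back; an empty list means the release can stay."""
    reasons = []
    if baseline.success_rate - current.success_rate > max_success_drop:
        reasons.append("generation success rate dropped")
    if current.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        reasons.append("p95 latency regressed")
    return reasons


before = Snapshot(success_rate=96.0, p95_latency_ms=400.0)
after = Snapshot(success_rate=88.0, p95_latency_ms=450.0)
print(rollback_reasons(before, after))  # prints ['generation success rate dropped']
```

Returning the reasons rather than a bare boolean gives the on-call engineer something to record in the incident or change log.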

Agent Updates¶

Process:

1. Update Agent Code
   - Update agent implementations
   - Test agent execution
   - Verify agent outputs
2. Deploy Agents
   - Deploy updated agents
   - Monitor agent performance
   - Verify agent outputs
3. Rollback if Needed
   - Rollback to previous agent version
   - Verify system stability

Knowledge System Updates¶

Process:

1. Update Knowledge Schema
   - Update vector/metadata schemas
   - Migrate existing knowledge
   - Verify migration
2. Update Indexes
   - Rebuild indexes if needed
   - Optimize indexes
   - Monitor performance
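One way to approach the "Verify migration" step is to compare record fingerprints before and after the migration. The sketch below assumes knowledge records can be exported as an ID-to-content mapping and that the migration is not supposed to change content; both are assumptions, since this runbook does not describe the storage format. Where a schema change intentionally rewrites content, compare a normalized form instead.

```python
import hashlib


def fingerprint(records: dict) -> dict:
    """Map each record ID to a SHA-256 digest of its content."""
    return {
        record_id: hashlib.sha256(content.encode("utf-8")).hexdigest()
        for record_id, content in records.items()
    }


def migration_diff(source: dict, target: dict) -> dict:
    """Report records that went missing, appeared unexpectedly, or changed."""
    src, dst = fingerprint(source), fingerprint(target)
    return {
        "missing": sorted(set(src) - set(dst)),
        "unexpected": sorted(set(dst) - set(src)),
        "changed": sorted(rid for rid in set(src) & set(dst) if src[rid] != dst[rid]),
    }


# Illustrative records; IDs and contents are placeholders.
old = {"p1": "pattern one", "p2": "pattern two", "p3": "pattern three"}
new = {"p1": "pattern one", "p2": "pattern 2"}
print(migration_diff(old, new))
# prints {'missing': ['p3'], 'unexpected': [], 'changed': ['p2']}
```

A migration passes this check when all three lists are empty.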

Related Documents¶

- Operations Overview - Operations documentation overview
- Monitoring & Dashboards - Monitoring practices
- Incident Management - Incident response process
- Factory Overview - Factory architecture
- Runtime Overview - Business view of Factory runtime operations
- Business Continuity - Risk management and failure handling from business perspective
- Monitoring & Insights - Business metrics and customer dashboards