
Factory Operations

This document provides operational procedures and troubleshooting guides for the ConnectSoft AI Software Factory. It is written for operations teams and SREs running the Factory.

The Factory is a complex system with multiple agents, orchestration, and knowledge storage. This runbook covers Factory operations, monitoring, scaling, and incident response.

Note

This runbook focuses on operational procedures. For architecture details, see Factory Overview.

Architecture Overview for Ops

Key Components

Factory Services:

flowchart TD
    subgraph "Factory Services"
        ORCH[Orchestrator<br/>Workflow Management]
        AGENTS[Agent Host<br/>Agent Execution]
        KNOWLEDGE[Knowledge System<br/>Pattern Storage]
        API[Factory API<br/>REST API]
    end

    subgraph "Storage"
        VECTOR[Vector DB<br/>Semantic Search]
        METADATA[Metadata DB<br/>Structured Data]
        ADO[Azure DevOps<br/>Repos & Work Items]
    end

    subgraph "Dependencies"
        AZURE[Azure Services<br/>Key Vault, Service Bus]
    end

    API -->|Orchestrates| ORCH
    ORCH -->|Coordinates| AGENTS
    AGENTS -->|Query/Store| KNOWLEDGE
    KNOWLEDGE -->|Uses| VECTOR
    KNOWLEDGE -->|Uses| METADATA
    AGENTS -->|Reads/Writes| ADO
    AGENTS -->|Uses| AZURE

    style ORCH fill:#2563EB,color:#fff
    style AGENTS fill:#4F46E5,color:#fff
    style KNOWLEDGE fill:#10B981,color:#fff
Hold "Alt" / "Option" to enable pan & zoom

Components:

- Orchestrator - Coordinates agent workflows
- Agent Host - Executes agents (Vision, Architect, Engineering, QA, DevOps)
- Knowledge System - Stores patterns, blueprints, code
- Factory API - REST API for Factory operations

Dependencies:

- Vector DB - Semantic search for knowledge
- Metadata DB - Structured metadata storage
- Azure DevOps - Code repositories and work items
- Azure Services - Key Vault, Service Bus, etc.

Health Checks and Monitoring

Health Check Endpoints

Factory API Health:

curl https://factory.connectsoft.ai/health

Expected Response:

{
  "status": "healthy",
  "components": {
    "orchestrator": "healthy",
    "agentHost": "healthy",
    "knowledgeSystem": "healthy"
  }
}
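
For scripted checks (for example from a cron job or an external monitor), the response can be evaluated with jq. Only the fields shown above are documented; anything else about the response shape is an assumption.

# Fail (non-zero exit) when the overall status is not "healthy"
curl -s https://factory.connectsoft.ai/health | jq -e '.status == "healthy"' > /dev/null || echo "Factory unhealthy"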

Key Metrics to Monitor

Factory Metrics:

- Generation success rate - Percentage of successful generation runs
- Average generation time - Time to complete generation
- Agent task completion rate - Percentage of agent tasks completed
- Knowledge reuse rate - Percentage of patterns reused

Infrastructure Metrics:

- API request rate - Requests per second
- API latency - p50, p95, p99 latency
- Error rate - Errors per second
- Resource usage - CPU, memory, disk

Agent Metrics:

- Agent execution time - Time per agent task
- Agent success rate - Percentage of successful agent tasks
- Agent queue length - Number of queued agent tasks
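
A quick spot check of resource usage outside the dashboards can be done with kubectl top (requires metrics-server in the cluster). The factory namespace matches the log commands used later in this runbook; adjust if your deployment uses a different namespace.

# Current CPU/memory usage per pod in the Factory namespace
kubectl top pods -n factory

# Node-level resource usage
kubectl top nodes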

Monitoring Dashboards

Factory Dashboard:

- Generation runs (success rate, duration)
- Agent performance (tasks completed, errors)
- Knowledge system (patterns stored, reuse rate)
- Infrastructure health (CPU, memory, latency)

Handling Failed Runs

Where Failures Show Up

Failure Locations:

- Factory API logs - API errors and failures
- Orchestrator logs - Workflow failures
- Agent logs - Agent execution failures
- Azure DevOps - Failed commits, pipeline failures
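
A quick first pass over the Factory API and Orchestrator logs usually narrows down where a run failed. The deployment names below are assumptions; substitute the names used in your cluster.

# Scan recent Factory API logs for errors (deployment name is an assumption)
kubectl logs -n factory deploy/factory-api --since=1h | grep -iE "error|fail"

# Same for the orchestrator
kubectl logs -n factory deploy/orchestrator --since=1h | grep -iE "error|fail"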

Inspecting Failed Runs

Steps:

  1. Check Run Status

    # Get run status via API
    curl https://factory.connectsoft.ai/api/runs/{runId}
    

  2. Check Orchestrator Logs

    kubectl logs -n factory orchestrator-<pod-name>
    

  3. Check Agent Logs

    kubectl logs -n factory agent-host-<pod-name>
    

  4. Check Azure DevOps

     - Check for failed commits
     - Check pipeline failures
     - Review work items
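
The pod names needed in steps 2 and 3 can be looked up first; the app label shown is an assumption about how the deployments are labelled.

# List Factory pods to find orchestrator and agent-host pod names
kubectl get pods -n factory

# Or filter by label (label selector is an assumption)
kubectl get pods -n factory -l app=orchestrator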

Retry Mechanisms

Automatic Retries:

- Transient failures - Automatically retried (3 attempts)
- Agent failures - Retried with exponential backoff
- Network failures - Retried automatically

Manual Retries:

- Failed runs - Can be retried via API
- Failed agent tasks - Can be retried individually
- Failed deployments - Can be retried via Azure DevOps

Retry Process:

# Retry failed run
curl -X POST https://factory.connectsoft.ai/api/runs/{runId}/retry
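
After triggering a retry, the run can be polled until it leaves the running state. The status field and the running value are assumptions about the run API's response shape; check the actual response of /api/runs/{runId} first.

# Poll the retried run every 30 seconds ("status" field and "running" value are assumed)
while true; do
  STATUS=$(curl -s https://factory.connectsoft.ai/api/runs/{runId} | jq -r '.status')
  echo "Run status: $STATUS"
  [ "$STATUS" != "running" ] && break
  sleep 30
done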

Common Failure Scenarios

Scenario 1: Agent Timeout

- Symptom: Agent task times out
- Cause: Long-running task, resource constraints
- Resolution: Increase timeout, scale resources, optimize the task (see the resource check below)

Scenario 2: Azure DevOps Failure

- Symptom: Failed to commit to Azure DevOps
- Cause: Authentication issues, repository permissions
- Resolution: Verify PAT/service connection, check permissions

Scenario 3: Knowledge System Failure

- Symptom: Failed to query/store knowledge
- Cause: Vector DB unavailable, metadata DB issues
- Resolution: Check database connectivity, scale if needed
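
For the agent timeout scenario, it is worth confirming resource pressure before raising timeouts. The agent-host pod names are assumptions based on the log commands above.

# Check whether agent-host pods are near their CPU/memory limits
kubectl top pods -n factory | grep agent-host

# Look for OOMKilled or restart reasons on a specific pod
kubectl describe pod -n factory agent-host-<pod-name> | grep -A5 "Last State"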

Scaling and Performance Considerations

Scaling Levers

Horizontal Scaling:

- Agent Host replicas - Scale agent execution capacity
- Orchestrator replicas - Scale workflow coordination
- API replicas - Scale API request handling

Vertical Scaling:

- Agent Host resources - Increase CPU/memory for agents
- Knowledge System resources - Increase resources for vector/metadata DB

Database Scaling:

- Vector DB - Scale for semantic search capacity
- Metadata DB - Scale for metadata storage
- Azure DevOps - Scale via Azure DevOps capacity
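
Horizontal scaling of the stateless services is normally a replica-count change. The deployment names below are assumptions derived from the component list above; adjust to your cluster.

# Scale the agent host to handle more concurrent agent tasks
kubectl scale deployment/agent-host -n factory --replicas=5

# Scale the API tier
kubectl scale deployment/factory-api -n factory --replicas=3

# Optionally let Kubernetes scale the agent host on CPU usage
kubectl autoscale deployment/agent-host -n factory --min=2 --max=10 --cpu-percent=70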

Performance Optimization

Agent Performance:

- Parallel execution - Run agents in parallel when possible
- Caching - Cache frequently used patterns
- Optimization - Optimize agent execution logic

Knowledge System Performance:

- Indexing - Optimize vector indexes
- Caching - Cache frequent queries
- Partitioning - Partition knowledge by tenant/domain

API Performance:

- Caching - Cache API responses
- Rate limiting - Prevent overload
- Connection pooling - Optimize database connections

Capacity Planning

Monitoring:

- Generation run rate trends
- Agent task completion trends
- Knowledge storage growth
- Resource usage trends

Planning:

- Project generation capacity needs
- Plan for peak loads
- Plan knowledge storage growth
- Plan resource scaling

Upgrade and Change Management

Factory Version Upgrades

Process:

  1. Plan Upgrade

     - Review release notes
     - Identify breaking changes
     - Plan deployment window

  2. Deploy to Dev

     - Deploy new version to dev
     - Test generation runs
     - Verify functionality

  3. Deploy to Staging

     - Deploy to staging
     - Run integration tests
     - Verify with production-like data

  4. Deploy to Production

     - Deploy during maintenance window
     - Monitor metrics and logs
     - Rollback if issues occur (see the rollback commands below)
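
If the new version misbehaves in production, rollback can usually be done with standard Kubernetes rollout commands. The deployment name below is an assumption; repeat for each Factory deployment you upgraded.

# Watch the rollout of the new version
kubectl rollout status deployment/factory-api -n factory

# Roll back to the previous revision if metrics or logs show problems
kubectl rollout undo deployment/factory-api -n factory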

Agent Updates

Process:

  1. Update Agent Code

     - Update agent implementations
     - Test agent execution
     - Verify agent outputs

  2. Deploy Agents

     - Deploy updated agents
     - Monitor agent performance
     - Verify agent outputs

  3. Rollback if Needed

     - Rollback to previous agent version (see the example after this list)
     - Verify system stability
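
A rollback to a known-good agent version can target a specific revision; the agent-host deployment name is an assumption.

# List previous revisions of the agent host deployment
kubectl rollout history deployment/agent-host -n factory

# Roll back to a specific known-good revision
kubectl rollout undo deployment/agent-host -n factory --to-revision=2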

Knowledge System Updates

Process:

  1. Update Knowledge Schema

     - Update vector/metadata schemas
     - Migrate existing knowledge
     - Verify migration

  2. Update Indexes

     - Rebuild indexes if needed
     - Optimize indexes
     - Monitor performance