Operational Excellence

Cost Implications

How Runtime Architecture Affects Operational Costs

The Factory's control plane / data plane separation significantly impacts operational costs:

Control Plane Costs

  • Always-On Services — Orchestrator, schedulers, and API services run continuously
  • Database Costs — Run state database requires persistent storage and replication
  • Queue Costs — Message queue/bus requires persistent storage
  • Fixed Costs — Control plane costs are relatively fixed (they don't scale with run volume)

Cost Characteristics:

  • Base Cost — Fixed monthly cost for control plane infrastructure
  • Scales with Customers — Control plane costs scale with number of customers (not runs)
  • High Availability — Redundancy increases costs but ensures reliability

Data Plane Costs

  • Variable Costs — Worker costs scale with workload (number of runs)
  • Auto-Scaling — Workers scale up/down based on queue depth
  • Scale-to-Zero — Workers can scale down to zero during low usage
  • Resource Optimization — Different worker pools can use different instance types

Cost Characteristics:

  • Usage-Based — Costs scale with number of runs and job complexity
  • Peak/Low Usage — Costs vary based on customer usage patterns
  • Optimization Opportunities — Instance type selection and auto-scaling enable cost optimization

Cost Efficiency of Control/Data Plane Separation

Benefits:

  • Fixed + Variable Costs — Control plane (fixed) + data plane (variable) enables cost predictability
  • Scale-to-Zero — Data plane can scale to zero during low usage (no idle costs)
  • Resource Optimization — Different instance types for control (CPU-light) vs data (CPU-heavy)
  • Cost Allocation — Clear separation enables per-customer cost allocation

Cost Optimization:

  • Right-Sizing — Control plane sized for peak orchestration load (not execution load)
  • Instance Types — Control plane uses smaller instances (orchestration is CPU-light)
  • Auto-Scaling — Data plane auto-scales based on actual demand (no over-provisioning)

Resource Optimization Strategies

Instance Type Selection

  • Control Plane — Smaller instances (CPU-light, orchestration workload)
  • Data Plane — Larger instances (CPU-heavy, execution workload)
  • Cost Optimization — Right-size instances based on actual workload characteristics

Auto-Scaling Policies

  • Scale-Up Threshold — Scale up when queue depth exceeds threshold
  • Scale-Down Threshold — Scale down when queue depth falls below threshold
  • Cooldown Periods — Prevent rapid scale-up/down oscillations
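The threshold-plus-cooldown policy above can be sketched as a small decision function. This is an illustrative sketch, not Factory code; the threshold, cooldown, and pool-size values are placeholder assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ScalingPolicy:
    """Queue-depth auto-scaling with a cooldown to prevent oscillation."""
    scale_up_depth: int = 100      # scale up when queue depth exceeds this
    scale_down_depth: int = 10     # scale down when queue depth falls below this
    cooldown_seconds: float = 300  # minimum time between scaling actions
    min_workers: int = 0           # scale-to-zero is permitted
    max_workers: int = 50
    _last_action: float = field(default=0.0, repr=False)

    def decide(self, queue_depth: int, current_workers: int, now: float) -> int:
        """Return the desired worker count for the observed queue depth."""
        if now - self._last_action < self.cooldown_seconds:
            return current_workers  # still cooling down: hold steady
        if queue_depth > self.scale_up_depth and current_workers < self.max_workers:
            self._last_action = now
            return current_workers + 1
        if queue_depth < self.scale_down_depth and current_workers > self.min_workers:
            self._last_action = now
            return current_workers - 1
        return current_workers
```

The cooldown check runs before either threshold check, which is what stops a spiky queue from triggering rapid scale-up/down oscillations.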

Reserved Instances

  • Control Plane — Reserved instances for predictable base load
  • Data Plane — Spot instances for variable workload (cost savings)
  • Hybrid Approach — Reserved for base load, spot for peak load
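The hybrid approach reduces to simple arithmetic: reserved instances cover the steady base load for the whole month, while spot instances cover peak-only workers for the hours they actually run. The hourly rates below are placeholders, not real cloud pricing.

```python
def monthly_compute_cost(base_workers: int, peak_extra_workers: int,
                         hours_at_peak: float, reserved_hourly: float,
                         spot_hourly: float, hours_in_month: float = 730) -> float:
    """Hybrid cost model: reserved capacity for base load, spot for peaks."""
    reserved = base_workers * reserved_hourly * hours_in_month  # always on
    spot = peak_extra_workers * spot_hourly * hours_at_peak     # peak hours only
    return reserved + spot
```

Comparing this against an all-reserved or all-on-demand fleet for your actual peak profile shows where the crossover point sits.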

Cost per Run/Customer Metrics

Cost per Run

  • Components — Infrastructure costs + AI token costs
  • Variability — Costs vary by run type and complexity
  • Optimization — Track costs to identify optimization opportunities
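Cost per run is the sum of the two components above. A minimal sketch, assuming per-1K-token pricing with separate input and output rates (the rates in the test are placeholders, not real model pricing):

```python
def cost_per_run(infra_cost_usd: float, tokens_in: int, tokens_out: int,
                 price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Total run cost = infrastructure (worker time, storage) + AI tokens."""
    token_cost = (tokens_in / 1000) * price_in_per_1k \
               + (tokens_out / 1000) * price_out_per_1k
    return infra_cost_usd + token_cost
```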

Cost per Customer

  • Allocation — Allocate infrastructure and AI costs to customers
  • Usage-Based — Costs scale with customer usage
  • Profitability — Track costs vs. revenue per customer

Resource Management

Efficient Resource Utilization

Control Plane Utilization

  • Orchestration Load — Control plane handles orchestration (CPU-light workload)
  • Right-Sizing — Sized for peak orchestration load (not execution load)
  • Utilization Targets — Target 60-80% CPU utilization (headroom for spikes)

Data Plane Utilization

  • Execution Load — Data plane handles execution (CPU-heavy workload)
  • Auto-Scaling — Workers scale based on queue depth (maintain target utilization)
  • Utilization Targets — Target 70-90% CPU utilization (maximize efficiency)

Auto-Scaling Benefits

Cost Efficiency

  • Scale-to-Zero — Workers scale down to zero during low usage (no idle costs)
  • Pay for Usage — Pay only for resources used (no over-provisioning)
  • Peak Handling — Scale up during peak times, down during low usage

Performance

  • Low Latency — More workers mean faster job pickup and shorter queue wait times
  • High Throughput — Scaling out raises aggregate job-processing throughput
  • No Bottlenecks — Adding workers before the queue backs up prevents bottlenecks

Reliability

  • Redundancy — Multiple workers provide redundancy
  • Failure Isolation — Worker failures don't affect other workers
  • Graceful Degradation — System continues operating even if some workers fail

Capacity Optimization

Queue-Based Scaling

  • Queue Depth — Scale workers based on queue depth (predictable scaling trigger)
  • Target Queue Depth — Maintain target queue depth (balance latency vs. cost)
  • Scaling Policies — Scale up when queue depth high, down when low
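One way to express "maintain target queue depth" is to size the pool so the current backlog drains within a target window. This is a sketch under assumed inputs (per-worker throughput and drain target are hypothetical parameters):

```python
import math

def desired_workers(queue_depth: int, jobs_per_worker_per_min: float,
                    target_drain_minutes: float,
                    min_workers: int = 0, max_workers: int = 50) -> int:
    """Worker count needed to drain the backlog within the target window."""
    capacity_per_worker = jobs_per_worker_per_min * target_drain_minutes
    needed = math.ceil(queue_depth / capacity_per_worker)
    return max(min_workers, min(max_workers, needed))  # clamp to pool bounds
```

A deeper queue or a tighter drain target both push the desired count up; `min_workers=0` preserves scale-to-zero, and `max_workers` caps cost during extreme spikes (the latency/cost balance the bullet describes).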

Resource Allocation

  • Worker Pools — Separate pools for different job types (right-size each pool)
  • Instance Types — Different instance types for different job types (cost optimization)
  • Priority Queues — Prioritize high-value jobs (ensure SLA compliance)
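Priority queues with FIFO tie-breaking can be sketched with a heap; a monotonically increasing sequence number keeps equal-priority jobs in arrival order. This is a generic illustration, not the Factory's queue implementation.

```python
import heapq

class PriorityJobQueue:
    """Smaller priority value = more urgent; seq number breaks ties FIFO."""

    def __init__(self):
        self._heap = []
        self._seq = 0

    def push(self, priority: int, job) -> None:
        heapq.heappush(self._heap, (priority, self._seq, job))
        self._seq += 1

    def pop(self):
        """Return the most urgent job (oldest first among equals)."""
        return heapq.heappop(self._heap)[2]
```

In practice a message bus would provide this via separate high/low-priority queues or topics; the heap just makes the ordering rule concrete.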

Operational Efficiency

Reduced Manual Intervention

Automated Recovery

  • Automatic Retry — Transient failures automatically retried (no manual intervention)
  • Self-Healing — Workers automatically restart and recover from failures
  • State Preservation — Run state preserved, enabling automatic recovery
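The automatic-retry behavior can be sketched as retry-with-exponential-backoff that distinguishes transient from permanent failures. The `sleep` parameter is injected so the sketch is testable; a real worker would use `time.sleep`.

```python
import time

def retry_with_backoff(job, is_transient, max_attempts: int = 4,
                       base_delay: float = 1.0, sleep=time.sleep):
    """Retry transient failures with exponential backoff; permanent
    errors and exhausted attempts re-raise for DLQ handling."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as exc:
            if not is_transient(exc) or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
```

The key design point is the classifier: only failures known to be transient (timeouts, throttling) are retried, so permanent errors surface immediately instead of burning attempts.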

Business Impact:

  • Lower Support Costs — Fewer manual interventions reduce support overhead
  • Faster Resolution — Automatic recovery faster than manual intervention
  • Higher Reliability — Automated recovery improves reliability

Automated Scaling

  • Auto-Scaling — Workers automatically scale up/down based on demand
  • No Manual Configuration — Scaling policies configured once, applied automatically
  • Predictable Behavior — Capacity changes follow the configured policy, making scaling behavior predictable

Business Impact:

  • Lower Operational Overhead — No manual scaling decisions
  • Cost Optimization — Auto-scaling optimizes costs automatically
  • Performance — Auto-scaling maintains performance under varying load

Self-Healing Capabilities

Worker Health

  • Health Checks — Automated health checks detect unhealthy workers
  • Automatic Replacement — Unhealthy workers automatically replaced
  • No Manual Intervention — Worker replacement happens automatically
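The replace-unhealthy-workers behavior is essentially a reconciliation loop: compare desired pool size against healthy workers and spawn replacements for the gap. A minimal sketch, with `is_healthy` and `spawn` standing in for a real health-check endpoint and provisioning call:

```python
def reconcile_workers(workers: list, is_healthy, spawn) -> list:
    """Replace unhealthy workers so the pool size stays constant."""
    healthy = [w for w in workers if is_healthy(w)]
    missing = len(workers) - len(healthy)
    replacements = [spawn() for _ in range(missing)]
    return healthy + replacements
```

Orchestrators like Kubernetes run this kind of loop continuously, which is why no manual intervention is needed.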

Queue Health

  • Dead Letter Queue — Failed jobs moved to DLQ automatically
  • Alerting — DLQ entries trigger alerts for investigation
  • Manual Retry — Operations can manually retry DLQ jobs after investigation
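The DLQ flow above can be sketched in a few lines: jobs that still fail after the retry budget are set aside rather than blocking the queue. `process` here is a stand-in for the real worker handler.

```python
def drain_to_dlq(process, jobs: list, max_attempts: int = 3) -> list:
    """Process jobs; those that exhaust retries go to the dead letter queue."""
    dlq = []
    for job in jobs:
        for attempt in range(max_attempts):
            try:
                process(job)
                break  # success: next job
            except Exception:
                if attempt == max_attempts - 1:
                    dlq.append(job)  # exhausted retries: park for investigation
    return dlq
```

In production the returned DLQ would be a real queue with alerting attached; operators inspect entries and re-enqueue them after fixing the underlying cause.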

State Recovery

  • State Preservation — Run state preserved in database
  • Resume Capability — Runs can resume from last successful step
  • No Data Loss — State preservation prevents data loss
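Resume-from-last-successful-step reduces to checkpointing a step index after each step completes. In this sketch the `state` dict stands in for the run-state database; a crashed run re-invoked with the same state skips completed steps.

```python
def run_steps(steps: list, state: dict) -> None:
    """Execute steps in order, checkpointing after each success so a
    crashed run resumes where it left off instead of starting over."""
    for i in range(state.get("next_step", 0), len(steps)):
        steps[i]()
        state["next_step"] = i + 1  # persist progress before moving on
```

The checkpoint must be written only after the step succeeds; writing it first would skip a step that actually failed.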

Operational Metrics

Key Metrics for Business Operations

Run Metrics

  • Run Success Rate — Percentage of runs that complete successfully
  • Run Failure Rate — Percentage of runs that fail (by failure type)
  • Average Run Duration — Average time to complete a run
  • Run Volume — Number of runs per day/week/month
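These run metrics can be aggregated directly from run records. A sketch, assuming each record is a dict with `status` and `duration_s` fields (the schema is illustrative, not the Factory's actual run table):

```python
def run_metrics(runs: list) -> dict:
    """Aggregate success/failure rate, volume, and average duration."""
    total = len(runs)
    succeeded = sum(1 for r in runs if r["status"] == "succeeded")
    return {
        "run_volume": total,
        "success_rate": succeeded / total,
        "failure_rate": (total - succeeded) / total,
        "avg_duration_s": sum(r["duration_s"] for r in runs) / total,
    }
```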

Cost Metrics

  • Cost per Run — Average cost per run (infrastructure + AI)
  • Cost per Customer — Average cost per customer
  • Cost Trends — Cost trends over time (identify optimization opportunities)
  • AI Token Costs — AI token usage and costs (by model, by customer)

Performance Metrics

  • Queue Depth — Average queue depth (indicates system load)
  • Job Processing Rate — Jobs processed per second
  • Worker Utilization — Average worker CPU/memory utilization
  • Response Times — API response times (run creation, status queries)

Reliability Metrics

  • Uptime — Control plane uptime percentage
  • Worker Error Rate — Percentage of jobs that fail due to worker errors
  • Retry Rate — Percentage of jobs that require retries
  • Dead Letter Queue Size — Number of jobs in DLQ (indicates persistent issues)

Cost Tracking and Optimization

Cost Allocation

  • Per Customer — Allocate costs to customers (enable customer profitability analysis)
  • Per Project — Allocate costs to projects (enable project profitability analysis)
  • Per Run Type — Track costs by run type (identify high-cost run types)
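One possible allocation scheme: charge each customer their direct per-run costs, then spread the fixed control-plane cost proportionally to run count. This is one of several reasonable schemes (alternatives include splitting by compute time or token usage), not a prescribed one.

```python
from collections import defaultdict

def allocate_costs(runs: list, shared_monthly_cost: float) -> dict:
    """Direct per-run costs + proportional share of fixed control-plane cost."""
    per_customer = defaultdict(float)
    counts = defaultdict(int)
    for r in runs:  # each run: {"customer": str, "cost": float}
        per_customer[r["customer"]] += r["cost"]
        counts[r["customer"]] += 1
    total_runs = sum(counts.values())
    for customer, n in counts.items():
        per_customer[customer] += shared_monthly_cost * n / total_runs
    return dict(per_customer)
```

The same shape works per project or per run type by swapping the grouping key.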

Cost Optimization

  • Identify High-Cost Areas — Track costs to identify optimization opportunities
  • Right-Size Resources — Optimize instance types and scaling policies
  • Reserved Instances — Use reserved instances for predictable workloads
  • Spot Instances — Use spot instances for variable workloads (cost savings)

Efficiency Improvements Over Time

Learning and Optimization

  • Pattern Recognition — Identify patterns in usage and costs
  • Optimization Opportunities — Identify opportunities for cost and performance optimization
  • Continuous Improvement — Iteratively improve efficiency over time
  • Cost Trends — Track cost trends over time (identify cost increases)
  • Performance Trends — Track performance trends (identify performance degradation)
  • Efficiency Trends — Track efficiency trends (cost per run, cost per customer)