Reliability & Scalability

Business Value of Reliability

Why Reliability Matters for Customer Trust

Reliability is fundamental to customer trust and satisfaction:

  • Predictable Execution — Customers can rely on Factory runs completing successfully
  • Reduced Friction — Fewer failures mean less manual intervention and faster delivery
  • Professional Image — High reliability positions Factory as an enterprise-grade platform
  • Customer Retention — Reliable service reduces churn and increases customer lifetime value

Impact on Customer Satisfaction and Retention

High Reliability Leads To:

  • Higher Customer Satisfaction — Customers trust Factory to deliver consistently
  • Reduced Support Burden — Fewer failures mean fewer support tickets
  • Faster Time-to-Value — Successful runs mean faster project delivery
  • Customer Retention — Reliable service reduces churn

Low Reliability Leads To:

  • Customer Frustration — Failed runs require manual intervention
  • Support Overhead — High failure rates increase support costs
  • Delayed Projects — Failed runs delay customer projects
  • Customer Churn — Unreliable service drives customers away

SLAs and Service Commitments

Uptime SLA

Target: 99.9% uptime (about 8.76 hours of downtime per 365-day year)

Measurement:

  • Availability — Control plane API availability
  • Exclusions — Planned maintenance, customer-caused issues, force majeure
  • Monitoring — Continuous monitoring with automated alerting

Business Impact:

  • SLA Credits — Customers receive service credits if uptime falls below SLA
  • Customer Confidence — Published SLAs demonstrate commitment to reliability
  • Competitive Advantage — High uptime differentiates Factory from competitors
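The downtime figure quoted for the 99.9% target follows from simple arithmetic, sketched below. The function names are illustrative, not part of Factory. The way exclusions are handled (excluded minutes are subtracted from counted downtime but stay in the measurement period) is an assumption, since the SLA terms above do not spell out the formula.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_budget_hours(uptime_target: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Allowed downtime in hours for an uptime target given as a fraction (e.g. 0.999)."""
    if not 0.0 <= uptime_target <= 1.0:
        raise ValueError("uptime_target must be between 0 and 1")
    return (1.0 - uptime_target) * period_hours


def measured_uptime(total_minutes: float, downtime_minutes: float,
                    excluded_minutes: float = 0.0) -> float:
    """Uptime fraction for a period.

    Assumption: excluded downtime (planned maintenance, customer-caused
    issues, force majeure) is simply not counted against the SLA.
    """
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    counted_downtime = max(downtime_minutes - excluded_minutes, 0.0)
    return 1.0 - counted_downtime / total_minutes
```

Under these assumptions, `downtime_budget_hours(0.999)` gives roughly 8.76 hours. A 30-day month (43,200 minutes) with 60 minutes of downtime, 30 of them planned maintenance, would still meet the 99.9% target.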

Run Success Rate

Target: 95%+ run success rate (after automatic retries)

Measurement:

  • Success Rate — Percentage of runs that complete successfully
  • Exclusions — Validation errors (user input issues), cancelled runs
  • Time Period — Measured over rolling 30-day window

Business Impact:

  • Customer Satisfaction — High success rates mean fewer customer issues
  • Operational Efficiency — Fewer failed runs reduce support overhead
  • Cost Optimization — Successful runs are more cost-efficient than retries
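A minimal sketch of the success-rate measurement described above. It assumes each run record carries its final status after automatic retries. The `Run` type and the status strings are hypothetical, not Factory's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional

# Hypothetical status names; runs with these statuses are excluded from the SLA.
EXCLUDED_STATUSES = {"validation_error", "cancelled"}


@dataclass
class Run:
    finished_at: datetime
    status: str  # assumed values: "succeeded", "failed", "validation_error", "cancelled"


def success_rate(runs: Iterable[Run], now: datetime, window_days: int = 30) -> Optional[float]:
    """Fraction of counted runs that succeeded within the rolling window.

    Returns None when no runs count toward the SLA in the window.
    """
    window_start = now - timedelta(days=window_days)
    counted = [
        r for r in runs
        if window_start <= r.finished_at <= now and r.status not in EXCLUDED_STATUSES
    ]
    if not counted:
        return None
    succeeded = sum(1 for r in counted if r.status == "succeeded")
    return succeeded / len(counted)
```

Comparing the returned value against `0.95` gives a pass or fail for the target. Returning `None` for an empty window avoids reporting a misleading 0% or 100%.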

Execution Time SLA

Target: 90% of runs complete within expected time (varies by run type)

Measurement:

  • P50 Duration — Median run duration
  • P95 Duration — 95th percentile run duration
  • Run Type — Different SLAs for different run types (microservice generation, library generation, etc.)

Business Impact:

  • Customer Expectations — Clear expectations enable better planning
  • Competitive Positioning — Fast execution differentiates Factory
  • Resource Planning — Predictable execution times enable capacity planning
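The duration metrics above can be computed per run type with a few lines of code. This sketch uses the nearest-rank percentile definition, which is one of several common conventions and an assumption here. The expected-time limit passed to `meets_execution_sla` is also a hypothetical input, because the text leaves the exact limit per run type open.

```python
import math
from typing import Sequence


def percentile(durations: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not durations:
        raise ValueError("durations must not be empty")
    ordered = sorted(durations)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def fraction_within(durations: Sequence[float], limit: float) -> float:
    """Fraction of runs that finished within the given limit."""
    if not durations:
        raise ValueError("durations must not be empty")
    return sum(1 for d in durations if d <= limit) / len(durations)


def meets_execution_sla(durations: Sequence[float], expected_limit: float,
                        target_fraction: float = 0.90) -> bool:
    """True when at least target_fraction of runs finished within expected_limit."""
    return fraction_within(durations, expected_limit) >= target_fraction
```

Durations for different run types would be evaluated separately, each with its own limit.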

Scalability for Business Growth

How Factory Scales to Support Growing Customer Base

The Factory runtime is designed to scale horizontally to support business growth:

Horizontal Scaling

  • Worker Pools — Separate worker pools for different job types scale independently
  • Auto-Scaling — Workers automatically scale up/down based on queue depth and workload
  • Multi-Tenant — Supports multiple customers with resource isolation

Business Benefits:

  • No Infrastructure Redesign — Scale workers without changing architecture
  • Cost Efficiency — Pay only for resources used (scale up during peak, down during low usage)
  • Predictable Scaling — Linear scaling enables predictable cost growth

Capacity Planning

  • Queue-Based Scaling — Workers scale based on queue depth (predictable scaling trigger)
  • Resource Allocation — Different worker pools can use different instance types (cost optimization)
  • Peak Load Handling — Auto-scaling handles seasonal spikes (e.g., end-of-quarter project generation)

Business Benefits:

  • Handle Growth — Support 10x customer growth without architecture changes
  • Seasonal Spikes — Handle end-of-quarter or end-of-year project generation spikes
  • Cost Predictability — Scaling based on queue depth enables cost prediction
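Queue-based scaling as described above can be illustrated with a simple target calculation: size each pool so that the backlog per worker stays at a chosen level, then clamp the result to the pool's limits. The parameter names and the example numbers are assumptions for illustration. The actual scaling policy is not specified in this document.

```python
import math


def desired_workers(queue_depth: int, jobs_per_worker: int,
                    min_workers: int, max_workers: int) -> int:
    """Target worker count for one pool, based only on queue depth.

    jobs_per_worker is the backlog each worker is expected to absorb
    (a hypothetical tuning knob, set per pool).
    """
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    needed = math.ceil(queue_depth / jobs_per_worker)
    return max(min_workers, min(needed, max_workers))
```

With a minimum of 2 and a maximum of 20 workers at 5 queued jobs per worker, an empty queue keeps 2 workers, a backlog of 47 jobs asks for 10, and a spike of 500 jobs is capped at 20. Because each pool is evaluated on its own queue, pools scale independently, and the cap bounds the cost of a spike.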

Horizontal Scaling Benefits

Cost Efficiency

  • Pay for Usage — Scale workers up during peak times, down during low usage
  • Resource Optimization — Different worker pools can use different instance types (CPU-heavy vs. memory-heavy)
  • No Over-Provisioning — Auto-scaling releases idle workers instead of keeping peak capacity running

Performance

  • Low Latency — More workers shorten queue wait times, so jobs start sooner
  • High Throughput — Horizontal scaling enables high throughput (thousands of concurrent runs)
  • No Bottlenecks — Pools scale independently, so a backlog in one job type does not hold up the others

Reliability

  • Failure Isolation — Worker failures don't affect other workers
  • Redundancy — Multiple workers provide redundancy
  • Graceful Degradation — System continues operating even if some workers fail

High Availability

Uptime Commitments

Target: 99.9% uptime (about 8.76 hours of downtime per 365-day year)

Components:

  • Control Plane — High availability with redundancy and failover
  • Data Plane — Stateless workers enable rapid recovery
  • State Storage — Database replication ensures state availability

Redundancy and Failover Strategies

Control Plane Redundancy

  • Multiple Instances — Orchestrator and scheduler services deployed with 2-3 instances
  • Load Balancing — Load balancers distribute traffic across instances
  • Health Checks — Automated health checks detect failures and route traffic away from unhealthy instances
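The interaction between load balancing and health checks can be sketched as a round-robin selector that skips instances marked unhealthy. This is an illustrative in-memory model, not Factory's load balancer. The class and method names are hypothetical.

```python
from typing import Iterable, List, Set


class HealthAwareBalancer:
    """Round-robin over instances, routing only to those currently marked healthy."""

    def __init__(self, instances: Iterable[str]) -> None:
        self._instances: List[str] = list(instances)
        if not self._instances:
            raise ValueError("at least one instance is required")
        self._healthy: Set[str] = set(self._instances)
        self._next = 0

    def report_health(self, instance: str, healthy: bool) -> None:
        """Record the latest health-check result for an instance."""
        if healthy:
            self._healthy.add(instance)
        else:
            self._healthy.discard(instance)

    def pick(self) -> str:
        """Return the next healthy instance, or raise if none is available."""
        for _ in range(len(self._instances)):
            candidate = self._instances[self._next]
            self._next = (self._next + 1) % len(self._instances)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy instances available")
```

With three instances and one failing its health check, traffic alternates between the remaining two. Once the failed instance reports healthy again, it rejoins the rotation.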

Data Plane Redundancy

  • Worker Pools — Multiple workers in each pool provide redundancy
  • Auto-Recovery — Failed workers are automatically replaced
  • Queue Persistence — Jobs persist in queue, enabling recovery after worker failures
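One common way to let jobs survive worker failures is a lease (sometimes called a visibility timeout): a claimed job stays in the queue's records until the worker acknowledges it, and it becomes claimable again if the lease expires. The sketch below shows the idea with an in-memory queue. Whether Factory's queue works this way is an assumption, and a production queue would keep this state in durable storage.

```python
from collections import deque
from typing import Deque, Dict, Optional


class LeasedQueue:
    """Toy job queue in which unacknowledged jobs are redelivered after a lease expires."""

    def __init__(self, lease_seconds: float) -> None:
        self.lease_seconds = lease_seconds
        self._pending: Deque[str] = deque()
        self._leases: Dict[str, float] = {}  # job_id -> lease expiry time

    def enqueue(self, job_id: str) -> None:
        self._pending.append(job_id)

    def claim(self, now: float) -> Optional[str]:
        """Hand the next job to a worker, first requeueing jobs whose lease has expired."""
        for job_id, expires_at in list(self._leases.items()):
            if expires_at <= now:
                del self._leases[job_id]
                self._pending.append(job_id)
        if not self._pending:
            return None
        job_id = self._pending.popleft()
        self._leases[job_id] = now + self.lease_seconds
        return job_id

    def ack(self, job_id: str) -> None:
        """Mark a job as finished so that it is never redelivered."""
        self._leases.pop(job_id, None)
```

If a worker claims a job and then crashes without calling `ack`, a replacement worker receives the same job once the lease has run out. Jobs should therefore be safe to run more than once.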

State Storage Redundancy

  • Database Replication — Run state database replicated across availability zones
  • Backup and Recovery — Regular backups enable recovery from data corruption
  • Failover — Automatic failover to a standby database if the primary fails

Business Continuity Implications

High Availability Enables:

  • Reduced Downtime — Redundancy and failover minimize downtime
  • Customer Trust — High availability builds customer confidence
  • SLA Compliance — Redundancy helps meet uptime SLAs
  • Competitive Advantage — High availability differentiates Factory

Performance Characteristics

Expected Execution Times

Run Type Examples:

  • Microservice Generation — 15-30 minutes (depending on complexity)
  • Library Generation — 5-10 minutes
  • Pipeline Generation — 2-5 minutes
  • Documentation Generation — 3-8 minutes

Factors Affecting Duration:

  • Template Complexity — More complex templates take longer
  • External System Latency — Azure DevOps API latency affects duration
  • AI Model Response Time — LLM API response time affects agent execution
  • Queue Depth — High queue depth may delay job start

Throughput Capabilities

Current Capacity:

  • Concurrent Runs — Support hundreds of concurrent runs
  • Jobs per Second — Process hundreds of jobs per second (across all worker pools)
  • Scalability — Horizontal scaling enables linear throughput growth

Scaling Characteristics:

  • Linear Scaling — Throughput scales linearly with worker count
  • No Bottlenecks — Worker pools scale independently, so no single pool limits the others
  • Predictable Performance — Performance characteristics remain consistent as scale increases

Response Time Guarantees

API Response Times:

  • Run Creation — < 1 second (validation and queueing)
  • Run Status Query — < 100 ms (database query)
  • Run List Query — < 500 ms (paginated query)

Job Execution:

  • Job Start Time — < 30 seconds from queue to worker start (under normal load)
  • Job Completion — Varies by job type (see Expected Execution Times above)
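The response-time targets above can be expressed as a small lookup table and checked against measured values, for example in a monitoring job. The key names are hypothetical labels chosen for this sketch. The limits are the ones listed above, converted to milliseconds.

```python
from typing import Dict

# Targets from the lists above, in milliseconds. Each target is a strict upper bound.
RESPONSE_TARGETS_MS: Dict[str, float] = {
    "run_creation": 1_000,
    "run_status_query": 100,
    "run_list_query": 500,
    "job_start": 30_000,  # queue to worker start, under normal load
}


def target_violations(measured_ms: Dict[str, float]) -> Dict[str, float]:
    """Return the measurements that do not stay below their target.

    Measurements without a known target are ignored.
    """
    return {
        name: value
        for name, value in measured_ms.items()
        if name in RESPONSE_TARGETS_MS and value >= RESPONSE_TARGETS_MS[name]
    }
```

An empty result means every measured value was under its target. Which statistic is compared (for example a mean or a P95) is left open here, as the targets above do not specify it.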

Business Impact

Customer Trust

  • Reliability — High reliability builds customer trust
  • Predictability — Predictable execution times enable customer planning
  • Professional Image — High availability positions Factory as enterprise-grade

Growth Enablement

  • Scalability — Horizontal scaling enables business growth without infrastructure changes
  • Cost Efficiency — Efficient scaling enables competitive pricing
  • Performance — High throughput supports large customer base

Competitive Advantage

  • Uptime — High uptime differentiates Factory from competitors
  • Performance — Fast execution times provide competitive advantage
  • Scalability — Ability to handle growth without infrastructure changes