Business Continuity

Risk Management

How the Factory Handles Failures from a Business Perspective

The Factory is designed to handle failures gracefully to minimize business impact:

Failure Types and Business Impact

Validation Errors:

  • Business Impact: Low — Failures occur before execution starts
  • Customer Impact: User must fix input and retry
  • Resolution Time: Immediate (user fixes input)
  • Cost Impact: Minimal (no resources consumed)

Transient Failures (Network, Rate Limits):

  • Business Impact: Low — Automatically retried
  • Customer Impact: Minimal (runs complete after retry)
  • Resolution Time: Automatic (seconds to minutes)
  • Cost Impact: Low (retries consume minimal additional resources)

Agent Failures (LLM, Reasoning):

  • Business Impact: Medium — May require fallback or human intervention
  • Customer Impact: Delayed completion or manual intervention
  • Resolution Time: Minutes to hours (depending on fallback strategy)
  • Cost Impact: Medium (retries and fallbacks consume resources)

Infrastructure Failures (Workers, Database):

  • Business Impact: High — May affect multiple runs
  • Customer Impact: Runs may be delayed or fail
  • Resolution Time: Minutes to hours (depending on failure severity)
  • Cost Impact: High (may require manual intervention and recovery)
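
The four failure categories above can be read as a small lookup table. The following is a minimal sketch, not the Factory's actual data model: the category keys, the `FailureProfile` type, and the `needs_customer_communication` helper are hypothetical names introduced for illustration, while the impact levels and resolution times restate the lists above.

```python
from dataclasses import dataclass
from enum import Enum


class Impact(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class FailureProfile:
    business_impact: Impact
    resolution_time: str
    cost_impact: str


# Hypothetical category keys; the values are copied from the lists above.
FAILURE_PROFILES = {
    "validation": FailureProfile(Impact.LOW, "immediate (user fixes input)", "minimal"),
    "transient": FailureProfile(Impact.LOW, "seconds to minutes", "low"),
    "agent": FailureProfile(Impact.MEDIUM, "minutes to hours", "medium"),
    "infrastructure": FailureProfile(Impact.HIGH, "minutes to hours", "high"),
}


def needs_customer_communication(category: str) -> bool:
    """Assumption: medium- and high-impact failures trigger customer communication."""
    return FAILURE_PROFILES[category].business_impact.value >= Impact.MEDIUM.value
```

A table like this keeps triage decisions (who is notified, how urgently) in one place instead of scattering them across handlers.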

Risk Mitigation Strategies

Automatic Retry

  • Transient Failures — Automatically retried with exponential backoff
  • Business Impact: Reduces customer-visible failures
  • Cost Impact: Minimal (retries are efficient)

State Preservation

  • Run State — Preserved in database, enabling recovery
  • Business Impact: Runs can resume after failures
  • Cost Impact: Low (state storage is inexpensive)

Failure Isolation

  • Worker Failures: A failed worker does not affect other workers or the control plane
  • Business Impact: Failures are contained and do not cascade
  • Cost Impact: Minimal (only affected runs are impacted)

Redundancy

  • Control Plane — Multiple instances provide redundancy
  • Data Plane — Multiple workers provide redundancy
  • Business Impact: Redundant instances keep the Factory available when a single instance or worker fails
  • Cost Impact: Moderate (redundancy increases costs but ensures reliability)
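
To see why redundancy is worth the moderate cost, a standard back-of-the-envelope calculation helps. The sketch below assumes replicas fail independently, which real deployments only approximate, and the 0.99 figure in the usage note is an illustrative number, not a Factory measurement. The function name is hypothetical.

```python
def combined_availability(instance_availability: float, instances: int) -> float:
    """Probability that at least one of `instances` independent replicas is up."""
    if instances < 1:
        raise ValueError("at least one instance is required")
    # The service is down only if every replica is down at the same time.
    return 1.0 - (1.0 - instance_availability) ** instances
```

Under these assumptions, two replicas that are each available 99% of the time give roughly 99.99% combined availability. Correlated failures (shared network, shared database) reduce the benefit.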

Impact on Customer Projects

Project Delivery Timelines

  • Failed Runs — May delay project delivery timelines
  • Retry Delays — Automatic retries may add minutes to hours
  • Manual Intervention: Complex failures may need an operator to step in before the run can continue

Customer Communication

  • Proactive Communication — Notify customers of known issues
  • Status Updates — Provide real-time status updates during failures
  • Recovery Estimates — Provide estimates for recovery time

Failure Handling

Automatic Retry and Recovery

Retry Strategy

  • Transient Failures: Automatically retried, with a limit of 3 to 5 attempts
  • Exponential Backoff: Each retry waits longer than the last, so retries do not overwhelm the failing system
  • Business Impact: Most failures resolve automatically (no customer action required)
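
The retry strategy above can be sketched as follows. This is an illustration under stated assumptions, not the Factory's implementation: `TransientError`, `retry_with_backoff`, and the default delays are hypothetical, and only the attempt limit and the use of exponential backoff come from the text.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as timeouts or rate limits."""


def retry_with_backoff(operation, max_attempts=3, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure to the caller
            # Delay doubles each attempt: base, 2*base, 4*base, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
```

With these defaults, a call that fails twice waits 1 second, then 2 seconds, and succeeds on the third attempt. Errors that are not `TransientError` (for example validation errors) propagate immediately, matching the rule that only transient failures are retried.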

Recovery Capabilities

  • State Preservation — Run state preserved, enabling recovery
  • Resume from Checkpoint — Runs can resume from last successful step
  • Business Impact: Failed runs can be recovered without starting over

Business Impact of Different Failure Types

Low Impact Failures

  • Validation Errors — Fail immediately, user fixes input
  • Transient Network Errors — Automatically retried, typically resolve quickly
  • Business Impact: Minimal — Resolved quickly with minimal customer impact

Medium Impact Failures

  • Agent Failures — May require fallback or human intervention
  • External System Errors — May require manual investigation
  • Business Impact: Moderate (may delay completion and require customer communication)

High Impact Failures

  • Infrastructure Failures — May affect multiple runs
  • Control Plane Failures — May pause all Factory operations
  • Business Impact: High (may require an immediate response, customer communication, and SLA credits)

Customer Communication During Failures

Communication Strategy

  • Proactive Notification — Notify customers of known issues
  • Status Updates — Provide real-time status updates
  • Recovery Estimates — Provide estimates for recovery time
  • Post-Incident Communication — Provide post-incident reports and lessons learned

Communication Channels

  • Email — Email notifications for significant failures
  • Dashboard — Real-time status updates in customer dashboard
  • Support — Support team available for customer inquiries

Recovery Capabilities

Resume from Failures

Per Run Resume

  • Run State — Preserved in database
  • Resume Capability — Runs can resume from last successful step
  • Business Impact: Failed runs can be recovered without starting over (saves time and costs)

Per Step/Job Resume

  • Job State — Individual jobs can be retried independently
  • Resume Capability — Failed jobs can be retried without re-running successful jobs
  • Business Impact: Partial failures don't require full re-run (efficient recovery)
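
Per-job resume can be illustrated with a short sketch. It assumes job status is kept as a simple mapping, where the real system would read it from the run state database. The `resume_run` function and the status strings are hypothetical.

```python
def resume_run(job_order, job_status, run_job):
    """Run only the jobs that have not succeeded yet; return the jobs executed."""
    executed = []
    for job in job_order:
        if job_status.get(job) == "succeeded":
            continue  # checkpointed work is never repeated
        try:
            run_job(job)
        except Exception:
            job_status[job] = "failed"  # record the failure so a later resume retries here
            raise
        job_status[job] = "succeeded"
        executed.append(job)
    return executed
```

For a run whose first job succeeded and whose second job failed, a resume executes the second job and everything after it, and leaves the first job untouched.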

Data Protection and Backup

State Backup

  • Database Replication — Run state database replicated across availability zones
  • Regular Backups — Regular backups enable recovery from data corruption
  • Business Impact: Replication and backups limit potential loss of run state to the replication lag (see Recovery Point Objectives)

Artifact Protection

  • External Storage — Generated artifacts stored in external systems (Git, Azure DevOps)
  • Version Control — Artifacts versioned in Git (enables recovery)
  • Business Impact: Artifacts protected even if Factory state is lost

Recovery Time Objectives (RTO)

RTO Targets:

  • Control Plane Failover — < 5 minutes (automatic failover)
  • Worker Recovery — < 2 minutes (automatic worker replacement)
  • Run Resume — < 1 minute (resume from checkpoint)
  • Full Recovery — < 30 minutes (for major infrastructure failures)

Business Impact:

  • Minimal Downtime — Fast recovery minimizes customer impact
  • SLA Compliance — Fast recovery helps meet SLA commitments
  • Customer Trust — Fast recovery builds customer confidence
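
One way to make the RTO targets checkable is to compare measured recovery durations against them. The sketch below is an assumption about how such a check might look. The key names and the `rto_breaches` function are hypothetical, and the durations are the targets listed above.

```python
from datetime import timedelta

# Targets copied from the RTO list; the key names are hypothetical.
RTO_TARGETS = {
    "control_plane_failover": timedelta(minutes=5),
    "worker_recovery": timedelta(minutes=2),
    "run_resume": timedelta(minutes=1),
    "full_recovery": timedelta(minutes=30),
}


def rto_breaches(measured):
    """Return the recovery events that did not finish strictly under their target."""
    return sorted(
        name for name, duration in measured.items()
        if duration >= RTO_TARGETS[name]
    )
```

For example, a worker replaced in 90 seconds meets its target, while a run resume that takes 75 seconds is reported as a breach.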

Recovery Point Objectives (RPO)

RPO Targets:

  • Run State — < 1 minute (database replication lag)
  • Artifacts: 0 minutes (artifacts are stored in external systems, not in the Factory)
  • Business Impact: Minimal data loss ensures customer work is protected

Business Impact

Compensation and SLA Credits

SLA Credit Policy

  • Uptime SLA — Service credits if uptime falls below 99.9%
  • Run Success Rate — Service credits if success rate falls below 95%
  • Execution Time SLA — Service credits if execution times exceed the agreed SLA limit
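
The uptime and success-rate thresholds translate into simple arithmetic. The following sketch only detects which thresholds were missed. It does not compute credit amounts, and it leaves out the execution time SLA, because neither is quantified here. The function names and breach labels are hypothetical.

```python
UPTIME_THRESHOLD_PCT = 99.9        # from the policy above
SUCCESS_RATE_THRESHOLD_PCT = 95.0  # from the policy above


def uptime_pct(total_minutes: float, downtime_minutes: float) -> float:
    """Percentage of the billing period during which the service was up."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def sla_breaches(total_minutes, downtime_minutes, runs_total, runs_succeeded):
    """Return the list of SLA thresholds missed in the period."""
    breaches = []
    if uptime_pct(total_minutes, downtime_minutes) < UPTIME_THRESHOLD_PCT:
        breaches.append("uptime")
    # Skip the success-rate check when no runs were executed in the period.
    if runs_total and 100.0 * runs_succeeded / runs_total < SUCCESS_RATE_THRESHOLD_PCT:
        breaches.append("run_success_rate")
    return breaches
```

As a reference point, a 30-day month has 43,200 minutes, so a 99.9% uptime threshold allows 43.2 minutes of downtime before credits apply.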

Compensation Process

  • Automatic Calculation — SLA credits calculated automatically
  • Customer Notification — Customers notified of SLA credits
  • Credit Application — Credits applied to next billing cycle

Customer Communication Protocols

Incident Communication

  • Immediate Notification — Notify customers of significant incidents immediately
  • Status Updates — Provide regular status updates during incidents
  • Resolution Notification — Notify customers when incidents are resolved

Post-Incident Communication

  • Incident Report — Provide post-incident reports
  • Root Cause Analysis — Share root cause analysis (when appropriate)
  • Prevention Measures — Communicate measures taken to prevent recurrence

Business Resilience

Failure Recovery Flow

flowchart TD
    Failure[Failure Detected]
    AutoRetry{Automatic Retry?}
    RetrySuccess{Retry Successful?}
    ManualIntervention[Manual Intervention]
    CustomerNotify[Notify Customer]
    Resume[Resume from Checkpoint]
    Complete[Run Complete]
    SLA[Apply SLA Credits if Needed]

    Failure --> AutoRetry
    AutoRetry -->|Yes| RetrySuccess
    AutoRetry -->|No| ManualIntervention
    RetrySuccess -->|Yes| Complete
    RetrySuccess -->|No| ManualIntervention
    ManualIntervention --> CustomerNotify
    ManualIntervention --> Resume
    Resume --> Complete
    Complete --> SLA
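
The same flow can be expressed as a function that returns the path a failure takes through the chart. This is a sketch for reasoning about the flow, not Factory code: `recovery_path` and its two boolean inputs are hypothetical, and the step names are taken from the node labels above.

```python
def recovery_path(retryable: bool, retry_succeeds: bool) -> list[str]:
    """Return the ordered steps a failure passes through in the flowchart."""
    path = ["Failure Detected"]
    if retryable:
        path.append("Automatic Retry")
    if not (retryable and retry_succeeds):
        # Both the "not retryable" and "retry failed" branches lead here.
        path += ["Manual Intervention", "Notify Customer", "Resume from Checkpoint"]
    # Every path ends with completion followed by the SLA credit check.
    path += ["Run Complete", "Apply SLA Credits if Needed"]
    return path
```

A retryable failure whose retry succeeds skips manual intervention entirely. Every other case notifies the customer and resumes from the checkpoint before completing.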

Business Continuity Measures

Redundancy

  • Control Plane — Multiple instances provide redundancy
  • Data Plane — Multiple workers provide redundancy
  • State Storage — Database replication ensures state availability

Monitoring and Alerting

  • Real-Time Monitoring — Continuous monitoring detects failures quickly
  • Automated Alerting — Alerts trigger immediate response
  • Escalation — Escalation procedures ensure critical issues are addressed

Incident Response

  • Incident Response Team — Dedicated team for incident response
  • Response Procedures — Documented procedures for common failure scenarios
  • Communication Plans — Communication plans for customer notification