Monitoring & Insights

Business Metrics

Customer-Facing Metrics

Run Success Rate

  • Metric: Percentage of runs that complete successfully
  • Target: 95%+ success rate (after automatic retries)
  • Customer Impact: High success rates mean fewer customer issues
  • Business Impact: High success rates reduce support overhead and improve customer satisfaction
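
As a rough illustration of the metric above, the sketch below counts a run as successful if any of its attempts succeeded, which is one reading of "after automatic retries". The `Attempt` record and its fields are hypothetical and not part of Factory's API.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One execution attempt of a run (hypothetical record shape)."""
    run_id: str
    succeeded: bool


def success_rate_after_retries(attempts: list[Attempt]) -> float:
    """Percentage of distinct runs with at least one successful attempt."""
    outcome: dict[str, bool] = {}
    for attempt in attempts:
        # A run counts as successful once any of its attempts succeeds.
        outcome[attempt.run_id] = outcome.get(attempt.run_id, False) or attempt.succeeded
    if not outcome:
        return 0.0
    return 100.0 * sum(outcome.values()) / len(outcome)


attempts = [
    Attempt("run-1", False), Attempt("run-1", True),   # recovered by retry
    Attempt("run-2", True),
    Attempt("run-3", False), Attempt("run-3", False),  # failed after retry
    Attempt("run-4", True),
]
rate = success_rate_after_retries(attempts)
meets_target = rate >= 95.0  # the 95%+ target stated above
```

In this sample three of four runs end in success, so the rate falls below the target even though the retry rescued `run-1`.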

Execution Time

  • Metric: Average and percentile run durations
  • Target: 90% of runs complete within expected time
  • Customer Impact: Predictable execution times enable better planning
  • Business Impact: Fast execution differentiates Factory and improves customer satisfaction
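
A minimal sketch of the duration metrics above, using a nearest-rank percentile. The sample durations and the 120-second "expected time" are assumptions chosen for illustration; the source does not define an expected time.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


durations_s = [42, 55, 61, 48, 300, 52, 47, 58, 50, 45]  # sample run durations in seconds
expected_s = 120  # hypothetical "expected time" for this run type

mean_s = sum(durations_s) / len(durations_s)
p50 = percentile(durations_s, 50)
p90 = percentile(durations_s, 90)
share_within_expected = sum(d <= expected_s for d in durations_s) / len(durations_s)
meets_target = share_within_expected >= 0.90
```

The single 300-second outlier pulls the mean well above the median, which is why the metric tracks percentiles alongside the average.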

Run Volume

  • Metric: Number of runs per day/week/month
  • Customer Impact: Run volume indicates customer usage and engagement
  • Business Impact: Run volume correlates with revenue and customer value

Cost Metrics per Customer/Project

Cost per Customer

  • Metric: Total cost (infrastructure + AI) per customer
  • Components: Infrastructure costs + AI token costs
  • Business Impact: Enables customer profitability analysis
  • Optimization: Identify high-cost customers and optimization opportunities

Cost per Project

  • Metric: Total cost per project
  • Components: Infrastructure costs + AI token costs
  • Business Impact: Enables project profitability analysis
  • Optimization: Identify high-cost projects and optimization opportunities

Cost per Run

  • Metric: Average cost per run
  • Components: Infrastructure costs + AI token costs
  • Business Impact: Enables cost optimization and pricing decisions
  • Optimization: Identify high-cost run types and optimization opportunities
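
The three cost metrics above share the same components, so they can be derived from one per-run record. The sketch below assumes a hypothetical `RunCost` record; the field names and sample values are illustrative only.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunCost:
    """Cost of a single run (hypothetical record shape)."""
    customer: str
    project: str
    infra_usd: float
    ai_usd: float

    @property
    def total_usd(self) -> float:
        # Total cost = infrastructure cost + AI token cost.
        return self.infra_usd + self.ai_usd


def rollup(runs: list[RunCost], key: str) -> dict[str, float]:
    """Sum total run cost grouped by the given attribute ('customer' or 'project')."""
    totals: dict[str, float] = defaultdict(float)
    for run in runs:
        totals[getattr(run, key)] += run.total_usd
    return dict(totals)


runs = [
    RunCost("acme", "billing-api", 0.50, 1.50),
    RunCost("acme", "billing-api", 0.25, 0.75),
    RunCost("acme", "sdk", 0.25, 0.25),
    RunCost("globex", "etl", 1.00, 3.00),
]
cost_per_customer = rollup(runs, "customer")
cost_per_project = rollup(runs, "project")
avg_cost_per_run = sum(r.total_usd for r in runs) / len(runs)
```

If project names can repeat across customers, grouping by a `(customer, project)` pair would avoid merging unrelated projects.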

Usage Analytics

Run Type Distribution

  • Metric: Distribution of runs by type (microservice, library, pipeline, etc.)
  • Business Impact: Understand customer usage patterns
  • Optimization: Optimize resources based on run type distribution
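
One simple way to compute the distribution above is to count runs by type and convert the counts to percentages. The sample data is illustrative.

```python
from collections import Counter


def run_type_distribution(run_types: list[str]) -> dict[str, float]:
    """Percentage share of each run type."""
    counts = Counter(run_types)
    total = sum(counts.values())
    return {run_type: 100 * count / total for run_type, count in counts.items()}


sample_runs = ["microservice"] * 5 + ["library"] * 3 + ["pipeline"] * 2
distribution = run_type_distribution(sample_runs)
```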

Customer Usage Patterns

  • Metric: Usage patterns by customer (peak times, run frequency, etc.)
  • Business Impact: Understand customer behavior and engagement
  • Optimization: Optimize resources based on customer usage patterns

Template Usage

  • Metric: Most-used templates and patterns
  • Business Impact: Understand which templates provide most value
  • Optimization: Focus development on high-value templates

Trend Analysis

Success Rate Trends

  • Metric: Run success rate trends over time
  • Business Impact: Identify degradation or improvement in reliability
  • Action: Investigate and address declining success rates

Execution Time Trends

  • Metric: Execution time trends over time
  • Business Impact: Identify performance degradation or improvement
  • Action: Investigate and address performance issues

Cost Trends

  • Metric: Cost trends over time (per customer, per run, total)
  • Business Impact: Identify cost increases and optimization opportunities
  • Action: Optimize costs based on trends

Customer Dashboards

What Customers Can See About Their Factory Usage

Project Execution Status

  • Active Runs — Real-time status of active runs
  • Run History — Historical run data (success/failure, duration, etc.)
  • Run Details — Detailed information about individual runs (jobs, steps, errors)

Cost Tracking

  • Cost per Project — Cost breakdown by project
  • Cost per Run — Cost breakdown by run type
  • AI Token Usage — AI token usage and costs
  • Cost Trends — Cost trends over time

Usage Reports

  • Run Volume — Number of runs per day/week/month
  • Run Types — Distribution of runs by type
  • Template Usage — Most-used templates
  • Performance Metrics — Success rates, execution times

Customer-Facing Dashboard Features

Real-Time Status

  • Active Runs — Real-time status of active runs
  • Queue Status — Current queue depth and processing rate
  • System Health — Overall Factory health indicators

Historical Analytics

  • Run History — Historical run data with filtering and search
  • Cost History — Historical cost data with trends
  • Performance History — Historical performance metrics

Alerts and Notifications

  • Run Completion — Notifications when runs complete
  • Run Failures — Notifications when runs fail
  • Cost Alerts — Alerts when costs exceed thresholds

Cost Tracking

AI Token Usage and Costs

Token Usage Metrics

  • Tokens per Run — Average tokens used per run
  • Tokens per Customer — Total tokens used per customer
  • Tokens per Model — Token usage by AI model
  • Token Trends — Token usage trends over time

Cost Allocation

  • Cost per Model — AI costs by model (GPT-4, GPT-3.5, etc.)
  • Cost per Operation — AI costs by operation type (generation, reasoning, etc.)
  • Cost per Customer — AI costs allocated to customers
  • Cost per Project — AI costs allocated to projects

Cost Optimization

  • Model Selection — Optimize model selection based on cost and performance
  • Prompt Optimization — Optimize prompts to reduce token usage
  • Caching — Cache responses to reduce token usage
  • Cost Alerts — Alerts when costs exceed thresholds
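
To make the token cost and caching items above concrete, here is a small sketch. The model names, the per-1K-token prices, and the `fake_model` stand-in are all hypothetical; real prices and client calls depend on the provider.

```python
import hashlib
from decimal import Decimal

# Hypothetical prices in USD per 1,000 tokens; not real provider pricing.
PRICE_PER_1K_TOKENS = {"large-model": Decimal("0.03"), "small-model": Decimal("0.002")}


def token_cost(model: str, tokens: int) -> Decimal:
    """Cost of a token count for a model, using Decimal to avoid float rounding."""
    return PRICE_PER_1K_TOKENS[model] * tokens / 1000


class ResponseCache:
    """Caches responses by (model, prompt) so repeated prompts consume no new tokens."""

    def __init__(self) -> None:
        self._store: dict[str, tuple[str, int]] = {}
        self.tokens_saved = 0

    def get_or_call(self, model: str, prompt: str, call) -> tuple[str, int]:
        key = hashlib.sha256(f"{model}\n{prompt}".encode()).hexdigest()
        if key in self._store:
            text, tokens = self._store[key]
            self.tokens_saved += tokens
            return text, 0  # cache hit: no tokens billed
        text, tokens = call(prompt)
        self._store[key] = (text, tokens)
        return text, tokens


def fake_model(prompt: str) -> tuple[str, int]:
    """Stand-in for a real model call; returns (response, tokens used)."""
    return prompt.upper(), 1200


cache = ResponseCache()
tokens_used = 0
for prompt in ["generate service", "generate service", "generate library"]:
    _, tokens = cache.get_or_call("large-model", prompt, fake_model)
    tokens_used += tokens

cost_usd = token_cost("large-model", tokens_used)
```

The repeated prompt is served from the cache, so only two of the three requests are billed.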

Infrastructure Costs

Infrastructure Cost Components

  • Control Plane — Orchestrator, schedulers, API services
  • Data Plane — Worker pools (compute costs)
  • Storage — Run state database, queue storage
  • Network — Data transfer costs

Cost Allocation

  • Cost per Customer — Infrastructure costs allocated to customers
  • Cost per Project — Infrastructure costs allocated to projects
  • Cost per Run — Infrastructure costs allocated to runs
  • Fixed vs Variable — Fixed costs (control plane) vs variable costs (data plane)

Cost Optimization

  • Right-Sizing — Optimize instance types and sizes
  • Auto-Scaling — Optimize auto-scaling policies
  • Reserved Instances — Use reserved instances for predictable workloads
  • Spot Instances — Use spot instances for variable workloads

Cost Allocation per Project/Customer

Allocation Methods

  • Usage-Based — Allocate costs based on usage (runs, tokens, etc.)
  • Proportional — Allocate costs proportionally (e.g., by run count)
  • Fixed + Variable — Allocate fixed costs (control plane) and variable costs (data plane)
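
The allocation methods above can be combined. The sketch below splits the fixed control plane cost proportionally by run count and charges variable data plane cost by usage. The worker-hour unit, the rate, and the function name are assumptions for illustration.

```python
def allocate_costs(
    fixed_usd: float,
    runs: dict[str, int],
    worker_hours: dict[str, float],
    usd_per_worker_hour: float,
) -> dict[str, dict[str, float]]:
    """Allocate fixed cost by share of runs and variable cost by worker hours used."""
    total_runs = sum(runs.values())
    allocation = {}
    for customer, run_count in runs.items():
        # Proportional: each customer's share of the fixed cost follows its share of runs.
        fixed_share = fixed_usd * run_count / total_runs if total_runs else 0.0
        # Usage-based: variable cost follows the compute actually consumed.
        variable = worker_hours.get(customer, 0.0) * usd_per_worker_hour
        allocation[customer] = {
            "fixed": fixed_share,
            "variable": variable,
            "total": fixed_share + variable,
        }
    return allocation


allocation = allocate_costs(
    fixed_usd=1000.0,
    runs={"acme": 300, "globex": 100},
    worker_hours={"acme": 400.0, "globex": 200.0},
    usd_per_worker_hour=0.5,  # hypothetical rate
)
```

The allocated totals add up to the fixed cost plus the metered variable cost, so nothing is left unassigned.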

Cost Reporting

  • Customer Reports — Monthly cost reports per customer
  • Project Reports — Cost reports per project
  • Cost Trends — Cost trends over time
  • Cost Forecasts — Cost forecasts based on usage trends

Cost Optimization Recommendations

Automated Recommendations

  • Right-Sizing — Recommendations for instance type optimization
  • Reserved Instances — Recommendations for reserved instance usage
  • Auto-Scaling — Recommendations for auto-scaling policy optimization
  • Model Selection — Recommendations for AI model selection

Cost Alerts

  • Threshold Alerts — Alerts when costs exceed thresholds
  • Anomaly Detection — Alerts for cost anomalies
  • Trend Alerts — Alerts for cost trend changes
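
A minimal sketch of the threshold and trend alerts above. The budget, the window size, and the 20% growth limit are illustrative assumptions, not Factory defaults.

```python
from statistics import mean


def threshold_alert(cost_usd: float, budget_usd: float) -> bool:
    """Fire when the cost exceeds the configured budget."""
    return cost_usd > budget_usd


def trend_alert(daily_costs: list[float], window: int = 7, max_growth: float = 0.2) -> bool:
    """Fire when the mean of the latest window grew by more than max_growth over the previous window."""
    if len(daily_costs) < 2 * window:
        return False  # not enough history to compare two windows
    previous = mean(daily_costs[-2 * window:-window])
    recent = mean(daily_costs[-window:])
    return previous > 0 and (recent - previous) / previous > max_growth


steady = [100.0] * 14
rising = [100.0] * 7 + [130.0] * 7

over_budget = threshold_alert(cost_usd=520.0, budget_usd=500.0)
steady_fires = trend_alert(steady)
rising_fires = trend_alert(rising)
```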

Operational Insights

Factory Health Indicators

System Health

  • Control Plane Health — Orchestrator and scheduler health
  • Data Plane Health — Worker pool health
  • Queue Health — Queue depth and processing rate
  • Database Health — Run state database health

Performance Health

  • Run Success Rate — Overall run success rate
  • Execution Times — Average and percentile execution times
  • Queue Latency — Time from queue to worker start
  • API Response Times — API response times

Capacity Utilization

Resource Utilization

  • Control Plane Utilization — CPU and memory utilization
  • Data Plane Utilization — Worker CPU and memory utilization
  • Queue Utilization — Queue depth vs capacity
  • Database Utilization — Database CPU and storage utilization

Capacity Planning

  • Current Capacity — Current system capacity
  • Projected Capacity — Projected capacity based on growth trends
  • Capacity Alerts — Alerts when capacity approaches limits
  • Scaling Recommendations — Recommendations for capacity scaling

Performance Trends

Success Rate Trends

  • Trend Analysis — Success rate trends over time
  • Anomaly Detection — Detect anomalies in success rates
  • Root Cause Analysis — Analyze root causes of success rate changes

Execution Time Trends

  • Trend Analysis — Execution time trends over time
  • Performance Degradation — Detect performance degradation
  • Optimization Opportunities — Identify optimization opportunities

Predictive Insights

Usage Forecasting

  • Run Volume Forecast — Forecast run volume based on historical trends
  • Cost Forecast — Forecast costs based on usage trends
  • Capacity Forecast — Forecast capacity needs based on growth trends
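
The forecasts above are described only as "based on historical trends". One simple option, shown here as an assumption rather than Factory's actual method, is to fit a least-squares line to past values and extend it.

```python
def fit_line(ys: list[float]) -> tuple[float, float]:
    """Least-squares slope and intercept for ys observed at x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two points to fit a trend")
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def forecast(ys: list[float], periods: int) -> list[float]:
    """Extend the fitted line for the given number of future periods."""
    slope, intercept = fit_line(ys)
    n = len(ys)
    return [intercept + slope * (n + k) for k in range(periods)]


weekly_runs = [100, 110, 120, 130]
next_two_weeks = forecast(weekly_runs, 2)
```

The same function applies to cost or capacity series. A linear fit ignores seasonality, so it suits short horizons best.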

Anomaly Detection

  • Cost Anomalies — Detect unusual cost patterns
  • Usage Anomalies — Detect unusual usage patterns
  • Performance Anomalies — Detect unusual performance patterns
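
The source does not specify a detection method. As one possible approach, the sketch below flags a point when it deviates from the mean of the preceding window by more than a chosen number of standard deviations. The window size and threshold are illustrative assumptions.

```python
from statistics import mean, pstdev


def detect_anomalies(series: list[float], window: int = 5, z_threshold: float = 3.0) -> list[int]:
    """Return indices whose value is more than z_threshold deviations from the trailing-window mean."""
    flagged = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu = mean(baseline)
        sigma = pstdev(baseline)
        if sigma == 0:
            continue  # flat baseline: a z-score is undefined, so skip
        if abs(series[i] - mu) / sigma > z_threshold:
            flagged.append(i)
    return flagged


daily_cost_usd = [100, 102, 98, 101, 99, 100, 250, 101]
anomalies = detect_anomalies(daily_cost_usd)
```

Comparing each point with a trailing baseline, rather than with the whole series, keeps a single spike from inflating the statistics it is judged against. The same function works for usage and performance series.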

Monitoring & Insights Flow

graph LR
    Factory[Factory Runtime]
    Metrics[Metrics Collection]
    Storage[(Metrics Storage)]
    Dashboards[Customer Dashboards]
    Insights[Operational Insights]
    Alerts[Alerts & Notifications]

    Factory --> Metrics
    Metrics --> Storage
    Storage --> Dashboards
    Storage --> Insights
    Storage --> Alerts

Data Flow:

  1. Collection — Factory runtime emits metrics (success rates, costs, performance)
  2. Storage — Metrics stored in time-series database
  3. Visualization — Dashboards query storage for visualization
  4. Analysis — Insights analyze metrics for trends and anomalies
  5. Alerting — Alerts trigger when metrics exceed thresholds
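
The data flow above can be sketched end to end with an in-memory stand-in for the time-series database. `MetricsStore`, `check_alerts`, and the metric names are hypothetical; a real deployment would use an actual time-series store.

```python
from collections import defaultdict


class MetricsStore:
    """In-memory stand-in for a time-series database."""

    def __init__(self) -> None:
        self._series: dict[str, list[tuple[float, float]]] = defaultdict(list)

    def write(self, name: str, value: float, ts: float) -> None:
        # Steps 1 and 2: the runtime emits a metric and it is stored with its timestamp.
        self._series[name].append((ts, value))

    def query(self, name: str, since: float = 0.0) -> list[tuple[float, float]]:
        # Step 3: dashboards and insights read points back by metric name and time range.
        return [(ts, v) for ts, v in self._series[name] if ts >= since]


def check_alerts(store: MetricsStore, thresholds: dict[str, float]) -> list[str]:
    """Step 5: return the metrics whose latest value exceeds its threshold."""
    fired = []
    for name, limit in thresholds.items():
        points = store.query(name)
        if points and points[-1][1] > limit:
            fired.append(name)
    return fired


store = MetricsStore()
store.write("run_cost_usd", 1.5, ts=1.0)
store.write("run_cost_usd", 4.0, ts=2.0)
store.write("queue_depth", 12, ts=2.0)

recent_costs = store.query("run_cost_usd", since=2.0)
fired = check_alerts(store, {"run_cost_usd": 3.0, "queue_depth": 50})
```

Step 4 (analysis) would sit between query and alerting, for example by running trend or anomaly checks over the queried points.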