Monitoring & Insights¶

Business Metrics¶

Customer-Facing Metrics¶

Run Success Rate¶

Metric: Percentage of runs that complete successfully
Target: At least 95% of runs succeed, counted after automatic retries
Customer Impact: High success rates mean fewer customer issues
Business Impact: High success rates reduce support overhead and improve customer satisfaction
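
As an illustration of how this metric could be computed, the sketch below counts a run as successful if any attempt, including a retry, succeeded. The `Run` record and its `attempts` field are hypothetical names chosen for this example, not part of Factory's API.

```python
from dataclasses import dataclass


@dataclass
class Run:
    run_id: str
    attempts: list[bool]  # hypothetical field: outcome of each attempt, True = succeeded


def run_succeeded(run: Run) -> bool:
    # A run counts as successful if any attempt (initial or retry) succeeded.
    return any(run.attempts)


def success_rate(runs: list[Run]) -> float:
    """Percentage of runs that eventually succeeded (0.0 when there are no runs)."""
    if not runs:
        return 0.0
    return 100.0 * sum(run_succeeded(r) for r in runs) / len(runs)


runs = [
    Run("r1", [True]),
    Run("r2", [False, True]),          # failed once, succeeded on retry
    Run("r3", [False, False, False]),  # exhausted its retries
    Run("r4", [True]),
]
rate = success_rate(runs)
meets_target = rate >= 95.0  # the 95% target stated above
```

Here three of four runs succeed, so the rate is 75% and the target is missed.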

Execution Time¶

Metric: Average and percentile run durations
Target: 90% of runs complete within the expected time
Customer Impact: Predictable execution times enable better planning
Business Impact: Fast execution differentiates Factory and improves customer satisfaction
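
A minimal sketch of the duration metrics, assuming run durations are available in seconds. It uses the nearest-rank percentile definition, and the 120-second expected time is a made-up value for illustration only.

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def fraction_within(values: list[float], limit: float) -> float:
    """Fraction of runs whose duration is at or below the limit."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(v <= limit for v in values) / len(values)


durations_s = [42, 55, 61, 48, 300, 52, 58, 47, 50, 63]
expected_s = 120  # hypothetical expected time, in seconds

p50 = percentile(durations_s, 50)
p90 = percentile(durations_s, 90)
within = fraction_within(durations_s, expected_s)
meets_target = within >= 0.90  # the 90% target stated above
```

Reporting percentiles next to the average matters here: the single 300-second run raises the average noticeably but leaves the median untouched.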

Run Volume¶

Metric: Number of runs per day/week/month
Customer Impact: Run volume indicates customer usage and engagement
Business Impact: Run volume correlates with revenue and customer value

Cost Metrics per Customer/Project¶

Cost per Customer¶

Metric: Total cost (infrastructure + AI) per customer
Components: Infrastructure costs + AI token costs
Business Impact: Enables customer profitability analysis
Optimization: Identify high-cost customers and optimization opportunities

Cost per Project¶

Metric: Total cost per project
Components: Infrastructure costs + AI token costs
Business Impact: Enables project profitability analysis
Optimization: Identify high-cost projects and optimization opportunities

Cost per Run¶

Metric: Average cost per run
Components: Infrastructure costs + AI token costs
Business Impact: Enables cost optimization and pricing decisions
Optimization: Identify high-cost run types and optimization opportunities
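
The three cost metrics above share one formula, infrastructure cost plus AI token cost, rolled up at different levels. The sketch below assumes each run already carries both cost components; the `RunCost` record and the sample figures are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunCost:
    customer: str
    project: str
    infra_cost: float     # infrastructure cost attributed to this run
    ai_token_cost: float  # AI token cost attributed to this run

    @property
    def total(self) -> float:
        return self.infra_cost + self.ai_token_cost


def total_by(records: list[RunCost], key: str) -> dict[str, float]:
    """Sum total run cost grouped by the named attribute ('customer' or 'project')."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[getattr(record, key)] += record.total
    return dict(totals)


records = [
    RunCost("acme", "billing", 0.50, 1.50),
    RunCost("acme", "search", 0.25, 0.75),
    RunCost("globex", "portal", 1.00, 2.00),
]
per_customer = total_by(records, "customer")
per_project = total_by(records, "project")
avg_per_run = sum(r.total for r in records) / len(records)
```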

Usage Analytics¶

Run Type Distribution¶

Metric: Distribution of runs by type (microservice, library, pipeline, etc.)
Business Impact: Understand customer usage patterns
Optimization: Optimize resources based on run type distribution
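
One simple way to derive this distribution, assuming each run is tagged with its type, is to count the tags and convert the counts to percentages. The sample data is invented for illustration.

```python
from collections import Counter


def run_type_distribution(run_types: list[str]) -> dict[str, float]:
    """Percentage share of each run type (empty dict when there are no runs)."""
    counts = Counter(run_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {run_type: 100.0 * n / total for run_type, n in counts.items()}


run_types = ["microservice"] * 5 + ["library"] * 3 + ["pipeline"] * 2
distribution = run_type_distribution(run_types)
```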

Customer Usage Patterns¶

Metric: Usage patterns by customer (peak times, run frequency, etc.)
Business Impact: Understand customer behavior and engagement
Optimization: Optimize resources based on customer usage patterns

Template Usage¶

Metric: Most-used templates and patterns
Business Impact: Understand which templates provide the most value
Optimization: Focus development on high-value templates

Performance Trends¶

Success Rate Trends¶

Metric: Run success rate trends over time
Business Impact: Identify degradation or improvement in reliability
Action: Investigate and address declining success rates

Execution Time Trends¶

Metric: Execution time trends over time
Business Impact: Identify performance degradation or improvement
Action: Investigate and address performance issues

Cost Trends¶

Metric: Cost trends over time (per customer, per run, total)
Business Impact: Identify cost increases and optimization opportunities
Action: Investigate cost increases and act on the optimization opportunities the trends reveal

Customer Dashboards¶

What Customers Can See About Their Factory Usage¶

Project Execution Status¶

Active Runs — Real-time status of active runs
Run History — Historical run data (success/failure, duration, etc.)
Run Details — Detailed information about individual runs (jobs, steps, errors)

Cost Tracking¶

Cost per Project — Cost breakdown by project
Cost per Run — Cost breakdown by run type
AI Token Usage — AI token usage and costs
Cost Trends — Cost trends over time

Usage Reports¶

Run Volume — Number of runs per day/week/month
Run Types — Distribution of runs by type
Template Usage — Most-used templates
Performance Metrics — Success rates, execution times

Customer-Facing Dashboard Features¶

Real-Time Status¶

Active Runs — Real-time status of active runs
Queue Status — Current queue depth and processing rate
System Health — Overall Factory health indicators

Historical Analytics¶

Run History — Historical run data with filtering and search
Cost History — Historical cost data with trends
Performance History — Historical performance metrics

Alerts and Notifications¶

Run Completion — Notifications when runs complete
Run Failures — Notifications when runs fail
Cost Alerts — Alerts when costs exceed thresholds

Cost Tracking¶

AI Token Usage and Costs¶

Token Usage Metrics¶

Tokens per Run — Average tokens used per run
Tokens per Customer — Total tokens used per customer
Tokens per Model — Token usage by AI model
Token Trends — Token usage trends over time

Cost Allocation¶

Cost per Model — AI costs by model (GPT-4, GPT-3.5, etc.)
Cost per Operation — AI costs by operation type (generation, reasoning, etc.)
Cost per Customer — AI costs allocated to customers
Cost per Project — AI costs allocated to projects
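
The sketch below shows one way AI costs could be allocated by model and by customer from raw token counts. The model names and per-1,000-token prices are placeholders, not real vendor pricing, and `Decimal` is used to avoid floating-point rounding in monetary sums.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical prices in USD per 1,000 tokens; replace with actual contract rates.
PRICE_PER_1K_TOKENS = {
    "model-large": Decimal("0.03"),
    "model-small": Decimal("0.002"),
}


def token_cost(model: str, tokens: int) -> Decimal:
    """Cost of a given number of tokens on a given model."""
    return PRICE_PER_1K_TOKENS[model] * tokens / 1000


# (customer, model, tokens used) tuples; invented sample data
usage = [
    ("acme", "model-large", 10_000),
    ("acme", "model-small", 50_000),
    ("globex", "model-large", 20_000),
]

cost_per_model: dict[str, Decimal] = defaultdict(Decimal)
cost_per_customer: dict[str, Decimal] = defaultdict(Decimal)
for customer, model, tokens in usage:
    cost = token_cost(model, tokens)
    cost_per_model[model] += cost
    cost_per_customer[customer] += cost
```

The same loop extends to per-project or per-operation allocation by adding the corresponding field to each usage record.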

Cost Optimization¶

Model Selection — Optimize model selection based on cost and performance
Prompt Optimization — Optimize prompts to reduce token usage
Caching — Cache responses to reduce token usage
Cost Alerts — Alerts when costs exceed thresholds

Infrastructure Costs¶

Infrastructure Cost Components¶

Control Plane — Orchestrator, schedulers, API services
Data Plane — Worker pools (compute costs)
Storage — Run state database, queue storage
Network — Data transfer costs

Cost Allocation¶

Cost per Customer — Infrastructure costs allocated to customers
Cost per Project — Infrastructure costs allocated to projects
Cost per Run — Infrastructure costs allocated to runs
Fixed vs Variable — Fixed costs (control plane) vs variable costs (data plane)

Cost Optimization¶

Right-Sizing — Optimize instance types and sizes
Auto-Scaling — Optimize auto-scaling policies
Reserved Instances — Use reserved instances for predictable workloads
Spot Instances — Use spot instances for variable, interruption-tolerant workloads

Cost Allocation per Project/Customer¶

Allocation Methods¶

Usage-Based — Allocate costs based on usage (runs, tokens, etc.)
Proportional — Allocate costs proportionally (e.g., by run count)
Fixed + Variable — Allocate fixed costs (control plane) and variable costs (data plane)
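
As a worked example of the fixed + variable method, the sketch below splits the fixed (control plane) cost in proportion to run count and adds each customer's own variable (data plane) cost. The proportional-by-run-count split is one possible choice, and all figures are invented.

```python
def allocate_costs(
    fixed_cost: float,
    variable_cost_by_customer: dict[str, float],
    runs_by_customer: dict[str, int],
) -> dict[str, float]:
    """Share fixed cost by run count, then add each customer's directly measured variable cost."""
    total_runs = sum(runs_by_customer.values())
    if total_runs == 0:
        raise ValueError("cannot allocate fixed cost with zero runs")
    allocation = {}
    for customer, runs in runs_by_customer.items():
        fixed_share = fixed_cost * runs / total_runs
        allocation[customer] = fixed_share + variable_cost_by_customer.get(customer, 0.0)
    return allocation


allocation = allocate_costs(
    fixed_cost=1000.0,
    variable_cost_by_customer={"acme": 300.0, "globex": 100.0},
    runs_by_customer={"acme": 750, "globex": 250},
)
```

A useful sanity check for any allocation method is that the allocated amounts sum to the total cost, here 1000 + 300 + 100 = 1400.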

Cost Reporting¶

Customer Reports — Monthly cost reports per customer
Project Reports — Cost reports per project
Cost Trends — Cost trends over time
Cost Forecasts — Cost forecasts based on usage trends

Cost Optimization Recommendations¶

Automated Recommendations¶

Right-Sizing — Recommendations for instance type optimization
Reserved Instances — Recommendations for reserved instance usage
Auto-Scaling — Recommendations for auto-scaling policy optimization
Model Selection — Recommendations for AI model selection

Cost Alerts¶

Threshold Alerts — Alerts when costs exceed thresholds
Anomaly Detection — Alerts for cost anomalies
Trend Alerts — Alerts for cost trend changes
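
Threshold and trend alerts can be expressed as simple comparisons. The sketch below is illustrative only: the function names, thresholds, and the 20% period-over-period limit are assumptions, not Factory settings.

```python
def threshold_alerts(costs: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Names whose current cost exceeds their configured threshold."""
    return sorted(
        name
        for name, cost in costs.items()
        if name in thresholds and cost > thresholds[name]
    )


def trend_alert(previous: float, current: float, max_increase_pct: float) -> bool:
    """True when cost grew by more than max_increase_pct versus the previous period."""
    if previous <= 0:
        return False  # no meaningful baseline to compare against
    return (current - previous) / previous * 100 > max_increase_pct


costs = {"acme": 1200.0, "globex": 400.0}
thresholds = {"acme": 1000.0, "globex": 500.0}  # hypothetical monthly budgets

over_budget = threshold_alerts(costs, thresholds)
rising_fast = trend_alert(previous=1000.0, current=1300.0, max_increase_pct=20.0)
```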

Operational Insights¶

Factory Health Indicators¶

System Health¶

Control Plane Health — Orchestrator and scheduler health
Data Plane Health — Worker pool health
Queue Health — Queue depth and processing rate
Database Health — Run state database health

Performance Health¶

Run Success Rate — Overall run success rate
Execution Times — Average and percentile execution times
Queue Latency — Time from when a run is enqueued until a worker starts processing it
API Response Times — Latency of Factory API requests

Capacity Utilization¶

Resource Utilization¶

Control Plane Utilization — CPU and memory utilization
Data Plane Utilization — Worker CPU and memory utilization
Queue Utilization — Queue depth vs capacity
Database Utilization — Database CPU and storage utilization

Capacity Planning¶

Current Capacity — Maximum load the system can handle as currently provisioned
Projected Capacity — Projected capacity based on growth trends
Capacity Alerts — Alerts when capacity approaches limits
Scaling Recommendations — Recommendations for capacity scaling
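
A rough sketch of how utilization and projected capacity might be estimated, assuming load grows at a constant compound rate per period. The 80% alert level and 10% monthly growth are illustrative assumptions.

```python
import math
from typing import Optional


def utilization(current_load: float, capacity: float) -> float:
    """Current load as a fraction of capacity."""
    return current_load / capacity


def periods_until_limit(current_load: float, capacity: float, growth_rate: float) -> Optional[int]:
    """Whole periods until load first reaches capacity under compound growth.

    Returns 0 if already at capacity and None if load is not growing.
    """
    if current_load >= capacity:
        return 0
    if growth_rate <= 0:
        return None
    return math.ceil(math.log(capacity / current_load) / math.log(1 + growth_rate))


current_load, capacity = 500, 1000  # e.g. concurrent runs; invented figures
used = utilization(current_load, capacity)
capacity_alert = used >= 0.80  # hypothetical alert level
months_left = periods_until_limit(current_load, capacity, growth_rate=0.10)
```

With 10% monthly growth, load doubles in a little over seven months, so capacity is reached during the eighth month.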

Performance Trends¶

Success Rate Trends¶

Trend Analysis — Success rate trends over time
Anomaly Detection — Detect anomalies in success rates
Root Cause Analysis — Analyze root causes of success rate changes

Execution Time Trends¶

Trend Analysis — Execution time trends over time
Performance Degradation — Detect performance degradation
Optimization Opportunities — Identify optimization opportunities

Predictive Insights¶

Usage Forecasting¶

Run Volume Forecast — Forecast run volume based on historical trends
Cost Forecast — Forecast costs based on usage trends
Capacity Forecast — Forecast capacity needs based on growth trends
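
One of the simplest ways to forecast from historical trends is a least-squares straight line extrapolated forward. The sketch below assumes evenly spaced observations and a roughly linear trend; it is a baseline technique for illustration, not a description of Factory's forecasting method.

```python
def linear_forecast(history: list[float], steps_ahead: int) -> list[float]:
    """Fit y = intercept + slope * x by least squares and extrapolate steps_ahead points."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two observations")
    x_mean = (n - 1) / 2
    y_mean = sum(history) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return [intercept + slope * (n - 1 + k) for k in range(1, steps_ahead + 1)]


weekly_runs = [100, 110, 120, 130]  # invented history
forecast = linear_forecast(weekly_runs, steps_ahead=2)
```

The same function applies to cost or capacity series; seasonal or bursty usage would need a richer model.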

Anomaly Detection¶

Cost Anomalies — Detect unusual cost patterns
Usage Anomalies — Detect unusual usage patterns
Performance Anomalies — Detect unusual performance patterns
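
A common baseline for anomaly detection is a z-score test: flag a new value when it lies more than a chosen number of standard deviations from the mean of recent history. The sketch below uses that approach with an assumed threshold of 3; it is one option among many.

```python
from statistics import mean, stdev


def is_anomaly(history: list[float], value: float, z_threshold: float = 3.0) -> bool:
    """True when value is more than z_threshold standard deviations from the history mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu  # any deviation from a perfectly flat history is unusual
    return abs(value - mu) / sigma > z_threshold


daily_cost = [100.0, 102.0, 98.0, 101.0, 99.0]  # invented baseline
spike = is_anomaly(daily_cost, 150.0)
normal = is_anomaly(daily_cost, 101.0)
```

The new value is compared against a separate baseline window so that an outlier does not inflate the standard deviation used to judge it.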

Monitoring & Insights Flow¶

```mermaid
graph LR
    Factory[Factory Runtime]
    Metrics[Metrics Collection]
    Storage[(Metrics Storage)]
    Dashboards[Customer Dashboards]
    Insights[Operational Insights]
    Alerts[Alerts & Notifications]

    Factory --> Metrics
    Metrics --> Storage
    Storage --> Dashboards
    Storage --> Insights
    Storage --> Alerts
```

Data Flow:

Collection — Factory runtime emits metrics (success rates, costs, performance)
Storage — Metrics stored in time-series database
Visualization — Dashboards query storage for visualization
Analysis — Insights analyze metrics for trends and anomalies
Alerting — Alerts trigger when metrics exceed thresholds

Related Documentation¶

Reliability & Scalability — SLAs and performance characteristics
Operational Excellence — Cost and efficiency considerations
Business Continuity — Risk management and failure handling
Factory Operations — Operational procedures and runbooks
Technical Runtime Documentation — Technical observability implementation (external documentation)
