Monitoring & Insights¶
Business Metrics¶
Customer-Facing Metrics¶
Run Success Rate¶
- Metric: Percentage of runs that complete successfully
- Target: 95%+ success rate (after automatic retries)
- Customer Impact: High success rates mean fewer customer issues
- Business Impact: High success rates reduce support overhead and improve customer satisfaction
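A minimal sketch of how a success rate "after automatic retries" could be computed. It assumes each attempt is logged as a hypothetical `(run_id, succeeded)` pair and that a run counts as successful if any of its attempts succeeded. The record shape and function name are illustrative, not Factory's actual schema.

```python
from collections import defaultdict


def success_rate_after_retries(attempts):
    """Return the percentage of runs with at least one successful attempt.

    attempts: iterable of (run_id, succeeded) tuples, one per attempt
    (hypothetical record shape). Returns None when there are no runs.
    """
    outcome = defaultdict(bool)
    for run_id, succeeded in attempts:
        # A run is successful as soon as any attempt succeeds.
        outcome[run_id] = outcome[run_id] or succeeded
    if not outcome:
        return None
    return 100.0 * sum(outcome.values()) / len(outcome)


attempts = [
    ("run-1", False), ("run-1", True),   # failed once, succeeded on retry
    ("run-2", True),
    ("run-3", False), ("run-3", False),  # retries exhausted
    ("run-4", True),
]
rate = success_rate_after_retries(attempts)
meets_target = rate is not None and rate >= 95.0
```

Counting per run rather than per attempt keeps retried runs from being counted as failures, which matches the wording of the target above.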
Execution Time¶
- Metric: Average and percentile run durations
- Target: 90% of runs complete within the expected time (p90 duration at or below the expected duration)
- Customer Impact: Predictable execution times enable better planning
- Business Impact: Fast execution differentiates Factory and improves customer satisfaction
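The average and percentile durations can be sketched as below, assuming durations are available as a plain list of seconds and using the nearest-rank percentile definition. The function names and the 120-second expected duration are illustrative assumptions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct percent of the data is at or below it."""
    if not values:
        raise ValueError("no durations recorded")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_duration_target(durations_s, expected_s, pct=90):
    """True when pct percent of runs finished within expected_s seconds."""
    return percentile(durations_s, pct) <= expected_s


durations_s = [42, 55, 61, 48, 300, 52, 58, 47, 63, 50]
average_s = sum(durations_s) / len(durations_s)
p50_s = percentile(durations_s, 50)
p90_s = percentile(durations_s, 90)
on_target = meets_duration_target(durations_s, expected_s=120)
```

In this sample the single 300-second outlier pulls the average well above the median, which is why percentiles are reported alongside the average.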
Run Volume¶
- Metric: Number of runs per day/week/month
- Customer Impact: Run volume indicates customer usage and engagement
- Business Impact: Run volume correlates with revenue and customer value
Cost Metrics per Customer/Project¶
Cost per Customer¶
- Metric: Total cost (infrastructure + AI) per customer
- Components: Infrastructure costs + AI token costs
- Business Impact: Enables customer profitability analysis
- Optimization: Identify high-cost customers and optimization opportunities
Cost per Project¶
- Metric: Total cost per project
- Components: Infrastructure costs + AI token costs
- Business Impact: Enables project profitability analysis
- Optimization: Identify high-cost projects and optimization opportunities
Cost per Run¶
- Metric: Average cost per run
- Components: Infrastructure costs + AI token costs
- Business Impact: Enables cost optimization and pricing decisions
- Optimization: Identify high-cost run types and optimization opportunities
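The three cost views above share one calculation: sum infrastructure and AI token costs, then group by a different key. A minimal sketch, assuming a hypothetical `RunCost` record with per-run cost components already attributed:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunCost:
    """Hypothetical per-run cost record (field names are assumptions)."""
    customer: str
    project: str
    infrastructure_usd: float
    ai_tokens_usd: float

    @property
    def total_usd(self):
        return self.infrastructure_usd + self.ai_tokens_usd


def rollup(runs, key):
    """Sum total run cost grouped by key(run)."""
    totals = defaultdict(float)
    for run in runs:
        totals[key(run)] += run.total_usd
    return dict(totals)


runs = [
    RunCost("acme", "api", 0.50, 1.50),
    RunCost("acme", "web", 0.25, 0.75),
    RunCost("globex", "etl", 1.00, 2.00),
]
per_customer = rollup(runs, key=lambda r: r.customer)
per_project = rollup(runs, key=lambda r: (r.customer, r.project))
average_per_run = sum(r.total_usd for r in runs) / len(runs)
```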
Usage Analytics¶
Run Type Distribution¶
- Metric: Distribution of runs by type (microservice, library, pipeline, etc.)
- Business Impact: Understand customer usage patterns
- Optimization: Optimize resources based on run type distribution
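As an illustration, the run type distribution reduces to a count per type divided by the total. The sketch below assumes run types arrive as plain strings. The type names come from the list above and the counts are made up.

```python
from collections import Counter


def run_type_distribution(run_types):
    """Return {run_type: percentage of runs}, most common first."""
    counts = Counter(run_types)
    total = sum(counts.values())
    return {t: 100.0 * n / total for t, n in counts.most_common()}


observed = ["microservice"] * 5 + ["library"] * 3 + ["pipeline"] * 2
distribution = run_type_distribution(observed)
```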
Customer Usage Patterns¶
- Metric: Usage patterns by customer (peak times, run frequency, etc.)
- Business Impact: Understand customer behavior and engagement
- Optimization: Optimize resources based on customer usage patterns
Template Usage¶
- Metric: Most-used templates and patterns
- Business Impact: Understand which templates provide most value
- Optimization: Focus development on high-value templates
Performance Trends¶
Success Rate Trends¶
- Metric: Run success rate trends over time
- Business Impact: Identify degradation or improvement in reliability
- Action: Investigate and address declining success rates
Execution Time Trends¶
- Metric: Execution time trends over time
- Business Impact: Identify performance degradation or improvement
- Action: Investigate and address performance issues
Cost Trends¶
- Metric: Cost trends over time (per customer, per run, total)
- Business Impact: Identify cost increases and optimization opportunities
- Action: Optimize costs based on trends
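One simple way to turn any of these series into a trend signal is to compare the mean of the most recent window with the mean of the window before it. This is a sketch under that assumption, not necessarily how Factory computes trends. The 7-day window and the 1-point decline limit are illustrative.

```python
from statistics import mean


def trend_delta(series, window=7):
    """Mean of the last `window` points minus the mean of the
    `window` points immediately before them."""
    if len(series) < 2 * window:
        raise ValueError("need at least 2 * window data points")
    recent = series[-window:]
    previous = series[-2 * window:-window]
    return mean(recent) - mean(previous)


# Daily success rate in percent, oldest first (made-up sample data).
daily_success_pct = [97, 96, 98, 97, 96, 97, 98, 95, 94, 93, 94, 92, 93, 90]
delta = trend_delta(daily_success_pct)
declining = delta < -1.0  # flag a drop of more than one percentage point
```

The same function applies to execution time or cost series. Only the sign that counts as a regression changes.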
Customer Dashboards¶
What Customers Can See About Their Factory Usage¶
Project Execution Status¶
- Active Runs — Real-time status of active runs
- Run History — Historical run data (success/failure, duration, etc.)
- Run Details — Detailed information about individual runs (jobs, steps, errors)
Cost Tracking¶
- Cost per Project — Cost breakdown by project
- Cost per Run — Cost breakdown by run type
- AI Token Usage — AI token usage and costs
- Cost Trends — Cost trends over time
Usage Reports¶
- Run Volume — Number of runs per day/week/month
- Run Types — Distribution of runs by type
- Template Usage — Most-used templates
- Performance Metrics — Success rates, execution times
Customer-Facing Dashboard Features¶
Real-Time Status¶
- Active Runs — Real-time status of active runs
- Queue Status — Current queue depth and processing rate
- System Health — Overall Factory health indicators
Historical Analytics¶
- Run History — Historical run data with filtering and search
- Cost History — Historical cost data with trends
- Performance History — Historical performance metrics
Alerts and Notifications¶
- Run Completion — Notifications when runs complete
- Run Failures — Notifications when runs fail
- Cost Alerts — Alerts when costs exceed thresholds
Cost Tracking¶
AI Token Usage and Costs¶
Token Usage Metrics¶
- Tokens per Run — Average tokens used per run
- Tokens per Customer — Total tokens used per customer
- Tokens per Model — Token usage by AI model
- Token Trends — Token usage trends over time
Cost Allocation¶
- Cost per Model — AI costs by model (GPT-4, GPT-3.5, etc.)
- Cost per Operation — AI costs by operation type (generation, reasoning, etc.)
- Cost per Customer — AI costs allocated to customers
- Cost per Project — AI costs allocated to projects
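The allocation dimensions above can all be derived from the same token usage records. The sketch below assumes a hypothetical record shape and uses placeholder per-1,000-token rates that are not real vendor pricing. `Decimal` is used so currency sums stay exact.

```python
from collections import defaultdict
from decimal import Decimal

# Placeholder rates in USD per 1,000 tokens (assumed, not real pricing).
RATE_PER_1K = {"gpt-4": Decimal("0.03"), "gpt-3.5": Decimal("0.002")}


def ai_cost_by(usage, dimension):
    """Sum AI token cost grouped by one field of the usage records.

    usage: list of dicts with 'model', 'operation', 'customer',
    'project', and 'tokens' keys (hypothetical schema).
    """
    totals = defaultdict(Decimal)
    for record in usage:
        cost = RATE_PER_1K[record["model"]] * record["tokens"] / 1000
        totals[record[dimension]] += cost
    return dict(totals)


usage = [
    {"model": "gpt-4", "operation": "generation", "customer": "acme",
     "project": "api", "tokens": 12000},
    {"model": "gpt-3.5", "operation": "reasoning", "customer": "acme",
     "project": "api", "tokens": 50000},
    {"model": "gpt-4", "operation": "reasoning", "customer": "globex",
     "project": "etl", "tokens": 4000},
]
by_model = ai_cost_by(usage, "model")
by_operation = ai_cost_by(usage, "operation")
by_customer = ai_cost_by(usage, "customer")
```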
Cost Optimization¶
- Model Selection — Optimize model selection based on cost and performance
- Prompt Optimization — Optimize prompts to reduce token usage
- Caching — Cache responses to reduce token usage
- Cost Alerts — Alerts when costs exceed thresholds
Infrastructure Costs¶
Infrastructure Cost Components¶
- Control Plane — Orchestrator, schedulers, API services
- Data Plane — Worker pools (compute costs)
- Storage — Run state database, queue storage
- Network — Data transfer costs
Cost Allocation¶
- Cost per Customer — Infrastructure costs allocated to customers
- Cost per Project — Infrastructure costs allocated to projects
- Cost per Run — Infrastructure costs allocated to runs
- Fixed vs Variable — Fixed costs (control plane) vs variable costs (data plane)
Cost Optimization¶
- Right-Sizing — Optimize instance types and sizes
- Auto-Scaling — Optimize auto-scaling policies
- Reserved Instances — Use reserved instances for predictable workloads
- Spot Instances — Use spot instances for variable workloads
Cost Allocation per Project/Customer¶
Allocation Methods¶
- Usage-Based — Allocate costs based on usage (runs, tokens, etc.)
- Proportional — Allocate costs proportionally (e.g., by run count)
- Fixed + Variable — Allocate fixed costs (control plane) and variable costs (data plane) separately, then combine them per customer or project
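A sketch of the fixed + variable method. It assumes fixed control plane cost is split in proportion to run count and variable data plane cost in proportion to worker-seconds. Both weighting choices and all figures are illustrative assumptions.

```python
def allocate(total, weights):
    """Split `total` across keys in proportion to their weights."""
    weight_sum = sum(weights.values())
    if weight_sum == 0:
        raise ValueError("cannot allocate with zero total weight")
    return {key: total * w / weight_sum for key, w in weights.items()}


def fixed_plus_variable(fixed_cost, variable_cost, run_counts, worker_seconds):
    """Allocate fixed cost by run count and variable cost by worker-seconds."""
    fixed = allocate(fixed_cost, run_counts)
    variable = allocate(variable_cost, worker_seconds)
    customers = run_counts.keys() | worker_seconds.keys()
    return {c: fixed.get(c, 0) + variable.get(c, 0) for c in customers}


allocation = fixed_plus_variable(
    fixed_cost=1000.0,     # control plane, monthly
    variable_cost=3000.0,  # data plane, monthly
    run_counts={"acme": 30, "globex": 10},
    worker_seconds={"acme": 1000, "globex": 3000},
)
```

Here `acme` runs more often but `globex` consumes more compute, so the two weightings pull in opposite directions. The allocated amounts always sum to the total cost.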
Cost Reporting¶
- Customer Reports — Monthly cost reports per customer
- Project Reports — Cost reports per project
- Cost Trends — Cost trends over time
- Cost Forecasts — Cost forecasts based on usage trends
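A cost forecast "based on usage trends" can be as simple as extrapolating a least-squares line through recent monthly totals. The sketch below assumes that approach. A production forecast would likely also account for seasonality and growth curves.

```python
def linear_forecast(history, periods_ahead=1):
    """Extrapolate a least-squares line fitted to equally spaced history."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two data points")
    x_mean = (n - 1) / 2
    y_mean = sum(history) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(history))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    slope = numerator / denominator
    intercept = y_mean - slope * x_mean
    return intercept + slope * (n - 1 + periods_ahead)


# Monthly cost in USD, oldest first (made-up sample data).
monthly_cost_usd = [1000, 1100, 1200, 1300]
next_month = linear_forecast(monthly_cost_usd)
in_three_months = linear_forecast(monthly_cost_usd, periods_ahead=3)
```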
Cost Optimization Recommendations¶
Automated Recommendations¶
- Right-Sizing — Recommendations for instance type optimization
- Reserved Instances — Recommendations for reserved instance usage
- Auto-Scaling — Recommendations for auto-scaling policy optimization
- Model Selection — Recommendations for AI model selection
Cost Alerts¶
- Threshold Alerts — Alerts when costs exceed thresholds
- Anomaly Detection — Alerts for cost anomalies
- Trend Alerts — Alerts for cost trend changes
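The threshold and anomaly alert types can be sketched as follows. The anomaly check assumes a z-score against recent history with a limit of 3 standard deviations. That is a common heuristic, not a documented Factory setting.

```python
from statistics import mean, stdev


def threshold_alert(cost, threshold):
    """Fire when cost is strictly above the configured threshold."""
    return cost > threshold


def anomaly_alert(history, latest, z_limit=3.0):
    """Fire when `latest` is more than z_limit sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    sigma = stdev(history)
    if sigma == 0:
        return latest != history[0]
    return abs(latest - mean(history)) / sigma > z_limit


# Daily cost in USD (made-up sample data).
daily_cost_usd = [100, 102, 98, 101, 99]
spike_is_anomaly = anomaly_alert(daily_cost_usd, latest=140)
normal_is_anomaly = anomaly_alert(daily_cost_usd, latest=101)
over_budget = threshold_alert(140, threshold=120)
```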
Operational Insights¶
Factory Health Indicators¶
System Health¶
- Control Plane Health — Orchestrator and scheduler health
- Data Plane Health — Worker pool health
- Queue Health — Queue depth and processing rate
- Database Health — Run state database health
Performance Health¶
- Run Success Rate — Overall run success rate
- Execution Times — Average and percentile execution times
- Queue Latency — Time from enqueue to worker start
- API Response Times — Average and percentile API response times
Capacity Utilization¶
Resource Utilization¶
- Control Plane Utilization — CPU and memory utilization
- Data Plane Utilization — Worker CPU and memory utilization
- Queue Utilization — Queue depth vs capacity
- Database Utilization — Database CPU and storage utilization
Capacity Planning¶
- Current Capacity — Current system capacity
- Projected Capacity — Projected capacity based on growth trends
- Capacity Alerts — Alerts when capacity approaches limits
- Scaling Recommendations — Recommendations for capacity scaling
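For projected capacity, one rough estimate is how many periods remain before load reaches a limit under compound growth. The sketch below assumes a constant growth rate per period. The 80% alert level and all figures are illustrative.

```python
def periods_until(load, limit, growth_rate, max_periods=520):
    """Number of periods until `load` first reaches `limit` under
    compound growth, or None if it does not within max_periods."""
    periods = 0
    while load < limit:
        if growth_rate <= 0 or periods >= max_periods:
            return None
        load *= 1 + growth_rate
        periods += 1
    return periods


current_runs_per_day = 500
capacity_runs_per_day = 1000
monthly_growth = 0.10

months_to_limit = periods_until(
    current_runs_per_day, capacity_runs_per_day, monthly_growth)
months_to_alert = periods_until(
    current_runs_per_day, 0.8 * capacity_runs_per_day, monthly_growth)
```

The gap between the alert level and the hard limit is the lead time available for acting on a scaling recommendation.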
Performance Trends¶
Success Rate Trends¶
- Trend Analysis — Success rate trends over time
- Anomaly Detection — Detect anomalies in success rates
- Root Cause Analysis — Analyze root causes of success rate changes
Execution Time Trends¶
- Trend Analysis — Execution time trends over time
- Performance Degradation — Detect performance degradation
- Optimization Opportunities — Identify optimization opportunities
Predictive Insights¶
Usage Forecasting¶
- Run Volume Forecast — Forecast run volume based on historical trends
- Cost Forecast — Forecast costs based on usage trends
- Capacity Forecast — Forecast capacity needs based on growth trends
Anomaly Detection¶
- Cost Anomalies — Detect unusual cost patterns
- Usage Anomalies — Detect unusual usage patterns
- Performance Anomalies — Detect unusual performance patterns
Monitoring & Insights Flow¶
graph LR
    Factory[Factory Runtime]
    Metrics[Metrics Collection]
    Storage[(Metrics Storage)]
    Dashboards[Customer Dashboards]
    Insights[Operational Insights]
    Alerts[Alerts & Notifications]
    Factory --> Metrics
    Metrics --> Storage
    Storage --> Dashboards
    Storage --> Insights
    Storage --> Alerts
Data Flow:
- Collection — Factory runtime emits metrics (success rates, costs, performance)
- Storage — Metrics are stored in a time-series database
- Visualization — Dashboards query storage for visualization
- Analysis — Insights analyze metrics for trends and anomalies
- Alerting — Alerts trigger when metrics exceed thresholds
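The data flow above can be sketched end to end with an in-memory stand-in for the time-series database. The class, metric names, and thresholds are all hypothetical. A real deployment would use an actual time-series store and alerting service.

```python
import time
from collections import defaultdict


class MetricsStore:
    """In-memory stand-in for a time-series database."""

    def __init__(self):
        self._series = defaultdict(list)

    def write(self, name, value, timestamp=None):
        """Collection and storage: append one (timestamp, value) point."""
        ts = timestamp if timestamp is not None else time.time()
        self._series[name].append((ts, value))

    def query(self, name, since=0.0):
        """Visualization and analysis: read points at or after `since`."""
        return [(ts, v) for ts, v in self._series.get(name, []) if ts >= since]


def check_alerts(store, thresholds):
    """Alerting: names of metrics whose latest value exceeds its threshold."""
    fired = []
    for name, limit in thresholds.items():
        points = store.query(name)
        if points and points[-1][1] > limit:
            fired.append(name)
    return fired


store = MetricsStore()
store.write("run_cost_usd", 1.2, timestamp=1)
store.write("run_cost_usd", 4.8, timestamp=2)
store.write("run_duration_s", 50, timestamp=2)

fired = check_alerts(store, {"run_cost_usd": 3.0, "run_duration_s": 120})
recent_costs = store.query("run_cost_usd", since=2)
```

Dashboards, insights, and alerts all read from the same store, which mirrors the fan-out from Metrics Storage in the diagram.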
Related Documentation¶
- Reliability & Scalability — SLAs and performance characteristics
- Operational Excellence — Cost and efficiency considerations
- Business Continuity — Risk management and failure handling
- Factory Operations — Operational procedures and runbooks
- Technical Runtime Documentation — Technical observability implementation