FinOps Scaling Policies¶
This document defines scaling policies and cost/latency trade-offs for ConnectSoft's infrastructure. It is written for operations engineers, SREs, and architects who need to understand how scaling decisions balance performance requirements, cost efficiency, and budget constraints.
Different scaling policies apply to different workload types, product tiers, and tenant plans, enabling cost-optimized scaling while maintaining service quality.
Important
Cost-Aware Scaling: Scaling decisions must consider both performance requirements (latency SLOs) and cost constraints (budget limits). Unlimited scaling is not acceptable—scaling must be bounded by cost budgets and efficiency targets.
General Principles¶
Horizontal Scaling Preferred¶
Primary Approach:
- Scale horizontally (add more instances/pods) rather than vertically (increase instance size)
- Horizontal scaling provides better cost efficiency and resilience
- Use vertical scaling only when horizontal scaling is not feasible
Benefits:
- Better cost efficiency (smaller instances are often more cost-effective)
- Improved resilience (failure of one instance doesn't affect others)
- More granular scaling (scale in smaller increments)
Scale Decision Drivers¶
Scaling decisions are driven by multiple factors:
Latency SLOs:
- Scale when latency exceeds SLO targets
- Maintain latency within defined SLO bounds
- Different SLOs for different product tiers and tenant plans
Queue Lengths / Message Age:
- Scale when queue length exceeds thresholds
- Scale when message age exceeds acceptable limits
- Prevent queue backlog from growing unbounded
Cost Ceilings:
- Scale decisions bounded by budget limits
- Cost-saving mode may be enabled when approaching budget limits
- Different scaling behavior for different cost tiers
See: Operations Overview for SLO definitions.
See: FinOps Budgets & Alerts for budget limits and cost-saving modes.
Workload Types & Metrics¶
Interactive APIs (User-Facing)¶
Interactive APIs serve user-facing requests and require low latency.
Metrics:
- p95 Latency - 95th percentile request latency
- CPU Utilization - CPU usage percentage
- RPS - Requests per second
Scaling Rules:
- Scale Out: When p95 latency > target SLO for Y minutes, or CPU > X% for Y minutes
- Scale In: When p95 latency < target SLO and CPU < lower threshold for Y minutes
- Maximum Replicas: Hard upper bound per environment (prevents runaway scaling)
- Minimum Replicas: Minimum instances to maintain availability
Example Policies:
- Scale out when p95 latency > 500ms for 5 minutes
- Scale out when CPU > 70% for 5 minutes
- Maximum 20 replicas in production
- Minimum 2 replicas in production
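The example policy above can be sketched as a single evaluation step. This is a minimal illustration, not ConnectSoft's actual autoscaler: `InteractiveApiPolicy`, `desired_replicas`, and the assumption that the scale-in lower threshold is half the scale-out CPU threshold are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class InteractiveApiPolicy:
    """Hypothetical encoding of the example interactive-API policy."""
    p95_slo_ms: float = 500   # scale out when p95 latency exceeds this
    cpu_high_pct: float = 70  # ...or when CPU exceeds this
    sustain_minutes: int = 5  # either condition must hold this long
    min_replicas: int = 2     # minimum instances for availability
    max_replicas: int = 20    # hard upper bound (prevents runaway scaling)

def desired_replicas(policy: InteractiveApiPolicy, current: int,
                     p95_ms: float, cpu_pct: float,
                     sustained_min: int) -> int:
    """Return the replica count after one evaluation cycle."""
    if sustained_min >= policy.sustain_minutes and (
            p95_ms > policy.p95_slo_ms or cpu_pct > policy.cpu_high_pct):
        return min(current + 1, policy.max_replicas)  # bounded scale-out
    if (sustained_min >= policy.sustain_minutes
            and p95_ms < policy.p95_slo_ms
            and cpu_pct < policy.cpu_high_pct / 2):  # assumed lower threshold
        return max(current - 1, policy.min_replicas)  # conservative scale-in
    return current
```

Note that both scale-out and scale-in are clamped to the per-environment bounds, so a sustained latency breach can never push the fleet past the hard replica ceiling.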
Cost Considerations:
- Higher maximum replicas for enterprise tenants
- Lower maximum replicas for free-tier tenants
- Cost-saving mode may relax latency targets for low-tier users
Background Workers (Factory Jobs, ETL, Multi-Agent Workflows)¶
Background workers process asynchronous jobs and can tolerate higher latency.
Metrics:
- Queue Length - Number of pending jobs
- Job Age - Age of oldest pending job
- Throughput - Jobs processed per second
Scaling Rules:
- Scale Out: When queue length > threshold or job age > threshold
- Scale In: When queue length < lower threshold for Y minutes
- Cost-Saving Mode: Allow larger latency windows in low-cost modes (off-peak hours, free-tier tenants)
Example Policies:
- Scale out when queue length > 100 jobs
- Scale out when oldest job age > 10 minutes
- Cost-saving mode: Allow job age up to 30 minutes during off-peak hours
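As a sketch, the queue-based rules above reduce to a threshold check whose job-age window widens in cost-saving mode. The function name and default values mirror the example policy but are illustrative only.

```python
def worker_scale_out_needed(queue_len: int, oldest_job_age_min: float,
                            cost_saving: bool,
                            queue_threshold: int = 100,
                            job_age_limit_min: float = 10,
                            relaxed_age_limit_min: float = 30) -> bool:
    """Decide whether background workers should scale out.

    In cost-saving mode (off-peak hours, free-tier tenants) the job-age
    window is relaxed from 10 to 30 minutes before triggering scale-out.
    """
    age_limit = relaxed_age_limit_min if cost_saving else job_age_limit_min
    return queue_len > queue_threshold or oldest_job_age_min > age_limit
```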
Cost Considerations:
- Background workers can accept higher latency to reduce costs
- Off-peak hours may use cost-saving mode with relaxed latency
- Free-tier tenants may have more aggressive cost-saving policies
AI-Heavy Workloads¶
AI-heavy workloads balance concurrency, latency, and cost through model selection and batching.
Trade-Offs:
- Higher Concurrency / Lower Latency - More parallel AI calls, faster responses, higher cost
- Fewer Calls / Smaller Models - Lower concurrency and cheaper models, lower cost, higher latency
Tier-Based Behavior:
Free Tier:
- More aggressive batching (batch multiple requests)
- Slower responses (acceptable latency windows)
- Smaller/cheaper models where possible
- Lower concurrency limits
Paid Tiers (Pro, Family, Enterprise):
- More generous concurrency (faster responses)
- Lower latency targets
- Access to more expensive/higher-quality models
- Priority processing
Scaling Rules:
- Scale Out: When AI request queue length > threshold or p95 latency > target
- Model Selection: May switch to cheaper models when cost budget is constrained
- Batching: Increase batching in cost-saving mode
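The tier-based behavior and cost-saving batching can be illustrated with a small selection function. The tier names, model labels, and knob values here are hypothetical, not a real ConnectSoft API.

```python
def ai_request_plan(tier: str, budget_constrained: bool) -> dict:
    """Pick model class, concurrency, and batch size for an AI request.

    Free-tier requests use smaller models and aggressive batching; paid
    tiers keep latency targets even when the budget is constrained.
    """
    if tier == "free":
        plan = {"model": "small", "max_concurrency": 2, "batch_size": 8}
        if budget_constrained:
            # cost-saving mode: batch harder, keep the cheaper model
            plan["batch_size"] *= 2
    else:  # pro / family / enterprise: SLA-backed latency is maintained
        plan = {"model": "large", "max_concurrency": 16, "batch_size": 1}
    return plan
```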
Cost Considerations:
- Model selection significantly impacts cost (cheap vs. expensive models)
- Batching reduces API call costs but increases latency
- Concurrency limits prevent cost spikes from runaway scaling
See: FinOps Cost Model for AI token cost formulas.
Product-Specific Policies¶
connectsoft.me¶
Different scaling policies by plan tier:
Free Plan:
- Accept slightly higher latency to control costs
- More aggressive cost-saving mode
- Lower maximum replicas
- Higher batching for AI requests
Pro Plan:
- Standard latency SLOs
- Standard scaling policies
- Moderate maximum replicas
- Standard AI concurrency
Family Plan:
- Standard latency SLOs
- Standard scaling policies
- Moderate maximum replicas
- Standard AI concurrency
Enterprise Plan:
- Strict latency SLOs
- More generous scaling (higher maximum replicas)
- Priority processing
- Higher AI concurrency limits
connectsoft.io and Vertical Suites¶
Stricter SLOs apply, especially for enterprise tenants:
Standard Tenants:
- Standard latency SLOs
- Standard scaling policies
- Cost-aware scaling with budget limits
Enterprise Tenants:
- Strict latency SLOs (higher availability and lower latency requirements)
- More generous scaling (higher maximum replicas)
- Priority processing
- Cost-saving mode less likely to be enabled
Vertical Suites (Insurance, AdTech, HR):
- Industry-specific SLOs
- Compliance-driven scaling (maintain availability for regulatory requirements)
- Cost optimization balanced with compliance needs
Cost vs Latency Trade-offs¶
SLOs, SLA Commitments, and Budget Limits¶
Scaling policies balance three competing concerns:
SLOs (Service Level Objectives):
- Internal targets for latency, availability, error rate
- Different SLOs for different product tiers and tenant plans
- SLOs guide scaling decisions (scale to meet SLOs)
SLA Commitments:
- Contractual commitments to customers
- Must be maintained regardless of cost
- Enterprise customers have stricter SLA commitments
Budget Limits:
- Cost ceilings that constrain scaling
- When approaching budget limits, cost-saving mode may be enabled
- Cost-saving mode may relax SLO targets for low-tier users (but not SLA commitments)
Cost-Saving Mode¶
When budget limits are approached, cost-saving mode may be enabled:
For Free/Low-Tier Users:
- Relaxed latency targets (within acceptable bounds)
- More aggressive batching
- Smaller/cheaper AI models
- Lower maximum replicas
For Paid/Enterprise Users:
- SLA commitments maintained (no relaxation)
- Cost-saving mode less likely to be enabled
- Alternative cost optimizations (efficiency improvements, not SLO relaxation)
See: FinOps Budgets & Alerts for budget limits and cost-saving mode triggers.
Decision Flow¶
The following diagram illustrates the scaling decision flow, balancing latency requirements and cost budgets:
```mermaid
flowchart TD
    A[Monitor Metrics] --> B{Latency > Target SLO?}
    B -->|Yes| C{Cost Budget Near Limit?}
    B -->|No| D[No Scaling Needed]
    C -->|No| E[Scale Out]
    C -->|Yes| F{Tenant Tier?}
    F -->|Free/Low-Tier| G[Apply Cost-Saving Mode]
    F -->|Paid/Enterprise| H{Can Optimize Efficiency?}
    G --> I[Relax Latency Target]
    G --> J[Increase Batching]
    G --> K[Use Cheaper Models]
    H -->|Yes| L[Optimize Efficiency]
    H -->|No| E
    I --> M[Monitor Impact]
    J --> M
    K --> M
    L --> M
    E --> M
    M --> N{SLO Met?}
    N -->|Yes| O[Continue Monitoring]
    N -->|No| P{Escalate?}
    P -->|Yes| Q[Manual Intervention]
    P -->|No| E
    O --> A
    Q --> A
```
Flow Description:
- Monitor metrics (latency, CPU, queue length, etc.)
- Check if latency exceeds target SLO
- If latency is high, check if cost budget is near limit
- If budget allows, scale out
- If budget is constrained, check tenant tier
- For free/low-tier tenants, apply cost-saving mode (relax latency, increase batching, use cheaper models)
- For paid/enterprise tenants, try efficiency optimizations first; scale out if needed to maintain SLA
- Monitor impact and continue adjusting
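One pass of the flow above could look like the following sketch; the tier names and returned action labels are illustrative, not actual system states.

```python
def scaling_decision(p95_ms: float, slo_ms: float,
                     budget_near_limit: bool, tier: str) -> str:
    """Evaluate one cycle of the latency/cost scaling decision flow."""
    if p95_ms <= slo_ms:
        return "no-scaling-needed"
    if not budget_near_limit:
        return "scale-out"
    if tier in ("free", "low"):
        # relax latency target, increase batching, use cheaper models
        return "cost-saving-mode"
    # paid/enterprise: try efficiency first, scale out to protect the SLA
    return "optimize-efficiency-or-scale-out"
```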
See: Operations Overview for SLO definitions.
See: FinOps Budgets & Alerts for budget evaluation and cost-saving mode.
Related Documents¶
Governance & Overview¶
- FinOps Overview - High-level FinOps principles and ownership model
Operations FinOps Documents¶
- FinOps Cost Model - Detailed cost modeling and attribution
- FinOps Budgets & Alerts - Budget definitions and alerting rules
Operations & Observability¶
- Operations Overview - Operations and SRE overview (includes SLO definitions)
- Observability – Dashboards and Alerts - Monitoring and alerting
- Support & SLA Policy - SLA definitions