FinOps Scaling Policies¶
This document defines scaling policies and cost/latency trade-offs for ConnectSoft's infrastructure. It is written for operations engineers, SREs, and architects who need to understand how scaling decisions balance performance requirements, cost efficiency, and budget constraints.
Different scaling policies apply to different workload types, product tiers, and tenant plans, enabling cost-optimized scaling while maintaining service quality.
Important
Cost-Aware Scaling: Scaling decisions must consider both performance requirements (latency SLOs) and cost constraints (budget limits). Unlimited scaling is not acceptable—scaling must be bounded by cost budgets and efficiency targets.
General Principles¶
Horizontal Scaling Preferred¶
Primary Approach:
- Scale horizontally (add more instances/pods) rather than vertically (increase instance size)
- Horizontal scaling provides better cost efficiency and resilience
- Use vertical scaling only when horizontal scaling is not feasible
Benefits:
- Better cost efficiency (smaller instances are often more cost-effective)
- Improved resilience (failure of one instance doesn't affect others)
- More granular scaling (scale in smaller increments)
Scale Decision Drivers¶
Scaling decisions are driven by multiple factors:
Latency SLOs:
- Scale when latency exceeds SLO targets
- Maintain latency within defined SLO bounds
- Different SLOs for different product tiers and tenant plans
Queue Lengths / Message Age:
- Scale when queue length exceeds thresholds
- Scale when message age exceeds acceptable limits
- Prevent queue backlog from growing unbounded
Cost Ceilings:
- Scale decisions bounded by budget limits
- Cost-saving mode may be enabled when approaching budget limits
- Different scaling behavior for different cost tiers
See: Operations Overview for SLO definitions.
See: FinOps Budgets & Alerts for budget limits and cost-saving modes.
Workload Types & Metrics¶
Interactive APIs (User-Facing)¶
Interactive APIs serve user-facing requests and require low latency.
Metrics:
- p95 Latency - 95th percentile request latency
- CPU Utilization - CPU usage percentage
- RPS - Requests per second
Scaling Rules:
- Scale Out: When p95 latency > target SLO for Y minutes, or CPU > X% for Y minutes
- Scale In: When p95 latency < target SLO and CPU < lower threshold for Y minutes
- Maximum Replicas: Hard upper bound per environment (prevents runaway scaling)
- Minimum Replicas: Minimum instances to maintain availability
Example Policies:
- Scale out when p95 latency > 500ms for 5 minutes
- Scale out when CPU > 70% for 5 minutes
- Maximum 20 replicas in production
- Minimum 2 replicas in production
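The example policy above can be sketched as a single evaluation step. This is a minimal illustration, not ConnectSoft's actual autoscaler: `InteractiveApiPolicy`, `desired_replicas`, and the assumption that the scale-in lower threshold is half the scale-out CPU threshold are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class InteractiveApiPolicy:
    """Hypothetical encoding of the example interactive-API policy."""
    p95_slo_ms: float = 500   # scale out when p95 latency exceeds this
    cpu_high_pct: float = 70  # ...or when CPU exceeds this
    sustain_minutes: int = 5  # either condition must hold this long
    min_replicas: int = 2     # minimum instances for availability
    max_replicas: int = 20    # hard upper bound (prevents runaway scaling)

def desired_replicas(policy: InteractiveApiPolicy, current: int,
                     p95_ms: float, cpu_pct: float,
                     sustained_min: int) -> int:
    """Return the replica count after one evaluation cycle."""
    if sustained_min >= policy.sustain_minutes and (
            p95_ms > policy.p95_slo_ms or cpu_pct > policy.cpu_high_pct):
        return min(current + 1, policy.max_replicas)  # bounded scale-out
    if (sustained_min >= policy.sustain_minutes
            and p95_ms < policy.p95_slo_ms
            and cpu_pct < policy.cpu_high_pct / 2):  # assumed lower threshold
        return max(current - 1, policy.min_replicas)  # conservative scale-in
    return current
```

Note that both scale-out and scale-in are clamped to the per-environment bounds, so a sustained latency breach can never push the fleet past the hard replica ceiling.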
Cost Considerations:
- Higher maximum replicas for enterprise tenants
- Lower maximum replicas for free-tier tenants
- Cost-saving mode may relax latency targets for low-tier users
Background Workers (Factory Jobs, ETL, Multi-Agent Workflows)¶
Background workers process asynchronous jobs and can tolerate higher latency.
Metrics:
- Queue Length - Number of pending jobs
- Job Age - Age of oldest pending job
- Throughput - Jobs processed per second
Scaling Rules:
- Scale Out: When queue length > threshold or job age > threshold
- Scale In: When queue length < lower threshold for Y minutes
- Cost-Saving Mode: Allow larger latency windows in low-cost modes (off-peak hours, free-tier tenants)
Example Policies:
- Scale out when queue length > 100 jobs
- Scale out when oldest job age > 10 minutes
- Cost-saving mode: Allow job age up to 30 minutes during off-peak hours
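As a sketch, the queue-based rules above reduce to a threshold check whose job-age window widens in cost-saving mode. The function name and default values mirror the example policy but are illustrative only.

```python
def worker_scale_out_needed(queue_len: int, oldest_job_age_min: float,
                            cost_saving: bool,
                            queue_threshold: int = 100,
                            job_age_limit_min: float = 10,
                            relaxed_age_limit_min: float = 30) -> bool:
    """Decide whether background workers should scale out.

    In cost-saving mode (off-peak hours, free-tier tenants) the job-age
    window is relaxed from 10 to 30 minutes before triggering scale-out.
    """
    age_limit = relaxed_age_limit_min if cost_saving else job_age_limit_min
    return queue_len > queue_threshold or oldest_job_age_min > age_limit
```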
Cost Considerations:
- Background workers can accept higher latency to reduce costs
- Off-peak hours may use cost-saving mode with relaxed latency
- Free-tier tenants may have more aggressive cost-saving policies
AI-Heavy Workloads¶
AI-heavy workloads balance concurrency, latency, and cost through model selection and batching.
Trade-Offs:
- Higher Concurrency / Lower Latency - More parallel AI calls, faster responses, higher cost
- Fewer Calls / Smaller Models - Lower concurrency and cheaper models, lower cost, higher latency
Tier-Based Behavior:
Free Tier:
- More aggressive batching (batch multiple requests)
- Slower responses (acceptable latency windows)
- Smaller/cheaper models where possible
- Lower concurrency limits
Paid Tiers (Pro, Family, Enterprise):
- More generous concurrency (faster responses)
- Lower latency targets
- Access to more expensive/higher-quality models
- Priority processing
Scaling Rules:
- Scale Out: When AI request queue length > threshold or p95 latency > target
- Model Selection: May switch to cheaper models when cost budget is constrained
- Batching: Increase batching in cost-saving mode
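The tier-based behavior and cost-saving batching can be illustrated with a small selection function. The tier names, model labels, and knob values here are hypothetical, not a real ConnectSoft API.

```python
def ai_request_plan(tier: str, budget_constrained: bool) -> dict:
    """Pick model class, concurrency, and batch size for an AI request.

    Free-tier requests use smaller models and aggressive batching; paid
    tiers keep latency targets even when the budget is constrained.
    """
    if tier == "free":
        plan = {"model": "small", "max_concurrency": 2, "batch_size": 8}
        if budget_constrained:
            # cost-saving mode: batch harder, keep the cheaper model
            plan["batch_size"] *= 2
    else:  # pro / family / enterprise: SLA-backed latency is maintained
        plan = {"model": "large", "max_concurrency": 16, "batch_size": 1}
    return plan
```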
Cost Considerations:
- Model selection significantly impacts cost (cheap vs. expensive models)
- Batching reduces API call costs but increases latency
- Concurrency limits prevent cost spikes from runaway scaling
See: FinOps Cost Model for AI token cost formulas.
Product-Specific Policies¶
connectsoft.me¶
Different scaling policies by plan tier:
Free Plan:
- Accept slightly higher latency to control costs
- More aggressive cost-saving mode
- Lower maximum replicas
- Higher batching for AI requests
Pro Plan:
- Standard latency SLOs
- Standard scaling policies
- Moderate maximum replicas
- Standard AI concurrency
Family Plan:
- Standard latency SLOs
- Standard scaling policies
- Moderate maximum replicas
- Standard AI concurrency
Enterprise Plan:
- Strict latency SLOs
- More generous scaling (higher maximum replicas)
- Priority processing
- Higher AI concurrency limits
connectsoft.io and Vertical Suites¶
Stricter SLOs apply, especially for enterprise tenants:
Standard Tenants:
- Standard latency SLOs
- Standard scaling policies
- Cost-aware scaling with budget limits
Enterprise Tenants:
- Strict latency SLOs (higher availability and lower latency requirements)
- More generous scaling (higher maximum replicas)
- Priority processing
- Cost-saving mode less likely to be enabled
Vertical Suites (Insurance, AdTech, HR):
- Industry-specific SLOs
- Compliance-driven scaling (maintain availability for regulatory requirements)
- Cost optimization balanced with compliance needs
Cost vs Latency Trade-offs¶
SLOs, SLA Commitments, and Budget Limits¶
Scaling policies balance three competing concerns:
SLOs (Service Level Objectives):
- Internal targets for latency, availability, error rate
- Different SLOs for different product tiers and tenant plans
- SLOs guide scaling decisions (scale to meet SLOs)
SLA Commitments:
- Contractual commitments to customers
- Must be maintained regardless of cost
- Enterprise customers have stricter SLA commitments
Budget Limits:
- Cost ceilings that constrain scaling
- When approaching budget limits, cost-saving mode may be enabled
- Cost-saving mode may relax SLO targets for low-tier users (but not SLA commitments)
Cost-Saving Mode¶
When budget limits are approached, cost-saving mode may be enabled:
For Free/Low-Tier Users:
- Relaxed latency targets (within acceptable bounds)
- More aggressive batching
- Smaller/cheaper AI models
- Lower maximum replicas
For Paid/Enterprise Users:
- SLA commitments maintained (no relaxation)
- Cost-saving mode less likely to be enabled
- Alternative cost optimizations (efficiency improvements, not SLO relaxation)
See: FinOps Budgets & Alerts for budget limits and cost-saving mode triggers.
Decision Flow¶
The following diagram illustrates the scaling decision flow, balancing latency requirements and cost budgets:
```mermaid
flowchart TD
    A[Monitor Metrics] --> B{Latency > Target SLO?}
    B -->|Yes| C{Cost Budget Near Limit?}
    B -->|No| D[No Scaling Needed]
    C -->|No| E[Scale Out]
    C -->|Yes| F{Tenant Tier?}
    F -->|Free/Low-Tier| G[Apply Cost-Saving Mode]
    F -->|Paid/Enterprise| H{Can Optimize Efficiency?}
    G --> I[Relax Latency Target]
    G --> J[Increase Batching]
    G --> K[Use Cheaper Models]
    H -->|Yes| L[Optimize Efficiency]
    H -->|No| E
    I --> M[Monitor Impact]
    J --> M
    K --> M
    L --> M
    E --> M
    M --> N{SLO Met?}
    N -->|Yes| O[Continue Monitoring]
    N -->|No| P{Escalate?}
    P -->|Yes| Q[Manual Intervention]
    P -->|No| E
    O --> A
    Q --> A
```
Flow Description:
- Monitor metrics (latency, CPU, queue length, etc.)
- Check if latency exceeds target SLO
- If latency is high, check if cost budget is near limit
- If budget allows, scale out
- If budget is constrained, check tenant tier
- For free/low-tier tenants, apply cost-saving mode (relax latency, increase batching, use cheaper models)
- For paid/enterprise tenants, try efficiency optimizations first; scale out if needed to maintain SLA
- Monitor impact and continue adjusting
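One pass of the flow above could look like the following sketch; the tier names and returned action labels are illustrative, not actual system states.

```python
def scaling_decision(p95_ms: float, slo_ms: float,
                     budget_near_limit: bool, tier: str) -> str:
    """Evaluate one cycle of the latency/cost scaling decision flow."""
    if p95_ms <= slo_ms:
        return "no-scaling-needed"
    if not budget_near_limit:
        return "scale-out"
    if tier in ("free", "low"):
        # relax latency target, increase batching, use cheaper models
        return "cost-saving-mode"
    # paid/enterprise: try efficiency first, scale out to protect the SLA
    return "optimize-efficiency-or-scale-out"
```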
See: Operations Overview for SLO definitions.
See: FinOps Budgets & Alerts for budget evaluation and cost-saving mode.
Related Documents¶
Governance & Overview¶
- FinOps Overview - High-level FinOps principles and ownership model
Operations FinOps Documents¶
- FinOps Cost Model - Detailed cost modeling and attribution
- FinOps Budgets & Alerts - Budget definitions and alerting rules
Operations & Observability¶
- Operations Overview - Operations and SRE overview (includes SLO definitions)
- Observability – Dashboards and Alerts - Monitoring and alerting
- Support & SLA Policy - SLA definitions