Identity Platform Runbook¶
This document provides operational procedures and troubleshooting guides for the ConnectSoft Identity & Access Platform. It is written for operations teams and SREs running the Identity Platform.
The Identity Platform provides authentication and authorization services for ConnectSoft systems. This runbook covers common operations, troubleshooting, and incident response procedures.
Note
This runbook focuses on operational procedures. For architecture and design details, see Identity Platform.
System Overview¶
What the Identity Platform Does¶
Core Functions:
- User authentication - OAuth2/OpenID Connect authentication
- Token management - Issue and validate access tokens, refresh tokens
- User management - Create, update, delete users
- Tenant management - Multi-tenant user isolation
- Authorization - Role-based and resource-based authorization
Key Components¶
Services:
- Identity API - REST API for authentication and user management
- Token Service - Issues and validates tokens
- User Store - Database for users and credentials
- Key Vault - Stores signing keys and secrets
Dependencies:
- Azure Cosmos DB - User data storage
- Azure Key Vault - Secrets and signing keys
- Azure Service Bus - Event publishing (user created, etc.)
Common Symptoms and Checks¶
Symptom: Login Failures¶
Diagnosis Steps:
- Check Error Rate
  - Open Identity Platform dashboard
  - Check error rate metric
  - Identify error spike time
- Check Error Logs
- Check Token Service Health
  - Check /health endpoint
  - Verify token signing keys are accessible
  - Check Key Vault connectivity
Common Causes:
- Invalid credentials
- Token service unavailable
- Key Vault connectivity issues
- Database connectivity issues
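The health-endpoint check above can be scripted so it gives the same answer every time it is run during an incident. The following is a minimal Python sketch using only the standard library. It assumes that a healthy instance answers `GET <base_url>/health` with HTTP 200; the runbook does not specify the response format, and the function names `http_status` and `is_healthy` as well as the example host are hypothetical.

```python
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request, or 0 if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with a 4xx/5xx status
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, or timeout


def is_healthy(base_url: str, fetch=http_status) -> bool:
    # Assumption: healthy means GET <base_url>/health returns HTTP 200.
    return fetch(base_url.rstrip("/") + "/health") == 200
```

A call such as `is_healthy("https://identity.example.test")` would then return `True` or `False`; the `fetch` parameter exists only so the check can be exercised without network access.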
Symptom: Token Issues¶
Diagnosis Steps:
- Check Token Validation Errors
  - Check logs for token validation failures
  - Verify token expiration times
  - Check token signature validation
- Check Key Vault
  - Verify signing keys are accessible
  - Check key rotation status
  - Verify key permissions
- Check Token Service
  - Verify token service is running
  - Check token service health endpoint
  - Verify token service can access Key Vault
Common Causes:
- Expired tokens
- Invalid token signatures
- Key Vault connectivity issues
- Token service unavailable
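When verifying expiration times, it can help to look inside a failing token directly. The sketch below decodes a token's header and payload without verifying the signature and reports whether the `exp` claim has passed. It assumes the platform issues standard JWT access tokens, which this runbook does not state explicitly; `inspect_jwt` and `_b64url_decode` are hypothetical helper names. Treat it as a diagnostic aid only, and run it locally rather than pasting production tokens into shared tools.

```python
import base64
import json
import time
from typing import Optional


def _b64url_decode(segment: str) -> bytes:
    # JWT segments are base64url without padding; restore padding before decoding.
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def inspect_jwt(token: str, now: Optional[float] = None) -> dict:
    """Decode header and payload WITHOUT verifying the signature (diagnosis only)."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected three dot-separated segments")
    header = json.loads(_b64url_decode(parts[0]))
    payload = json.loads(_b64url_decode(parts[1]))
    now = time.time() if now is None else now
    exp = payload.get("exp")
    return {
        "alg": header.get("alg"),
        "kid": header.get("kid"),  # key id, useful when checking key rotation
        "exp": exp,
        "expired": exp is not None and exp <= now,
    }
```

If `expired` is `False` but validation still fails, the `kid` value is the next thing to compare against the signing keys currently available in Key Vault.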
Symptom: High Latency¶
Diagnosis Steps:
- Check Response Times
  - Check p95/p99 latency metrics
  - Identify slow endpoints
  - Check database query performance
- Check Database Performance
  - Check Cosmos DB RU consumption
  - Check query performance
  - Check connection pool usage
- Check External Dependencies
  - Check Key Vault response times
  - Check Service Bus latency
  - Check network latency
Common Causes:
- Database performance issues
- High RU consumption
- Network latency
- Resource constraints
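If the dashboard is unavailable, p95/p99 can be computed from raw request timings exported from logs. The following is a minimal sketch using the nearest-rank method; dashboards may interpolate differently, so small differences from the dashboard values are expected. The function name `percentile` and the sample values are illustrative, and the samples are assumed to be latencies in milliseconds.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(len(ordered) * pct / 100)  # 1-based rank
    return ordered[rank - 1]


# Illustrative latencies in milliseconds (not real measurements).
latencies_ms = [120, 80, 95, 110, 400, 90, 85, 100, 105, 700]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

With only ten samples the p95 is simply the slowest request, which shows why tail percentiles need a reasonably large sample window before they are compared against a threshold.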
Incident Scenarios¶
Scenario 1: Authentication Outage¶
Symptoms:
- All login attempts failing
- High error rate (100%)
- Token service returning 500 errors
Diagnosis:
- Check Token Service
- Check Key Vault
- Check Database
Resolution Steps:
- If Token Service Down:
  - Check pod status: kubectl get pods -n identity
  - Check logs: kubectl logs -n identity <pod-name>
  - Restart if needed: kubectl rollout restart deployment/identity-api -n identity
- If Key Vault Issue:
  - Verify managed identity permissions
  - Check Key Vault firewall rules
  - Verify key exists and is accessible
- If Database Issue:
  - Check Cosmos DB status in Azure portal
  - Verify connection string
  - Check RU limits and throttling
Verification:
- Login attempts succeed
- Error rate returns to normal
- Health checks pass
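A single passing health check right after a restart can be misleading, because a pod may pass once and then fail again. One way to make the verification step more robust is to require several consecutive successes before declaring recovery. The sketch below shows that idea; `wait_until_stable` and its parameters are hypothetical, and `check` stands for whatever probe the team already uses (for example a health endpoint request or a test login).

```python
import time


def wait_until_stable(check, required=3, attempts=20, delay=5.0, sleep=time.sleep):
    """Return True once check() has succeeded `required` times in a row."""
    streak = 0
    for _ in range(attempts):
        # Any failure resets the streak, so intermittent recovery does not count.
        streak = streak + 1 if check() else 0
        if streak >= required:
            return True
        sleep(delay)
    return False
```

With the defaults this polls for up to roughly 100 seconds; the values are placeholders and should be tuned to how long the service normally takes to become ready.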
Scenario 2: Performance Degradation¶
Symptoms:
- Increased latency (p95 > 500ms)
- Timeout errors
- Slow token generation
Diagnosis:
- Check Metrics
  - Check latency trends
  - Check request rate
  - Check error rate
- Check Database
  - Check Cosmos DB RU consumption
  - Check query performance
  - Check throttling
- Check Resources
  - Check CPU/memory usage
  - Check pod resource limits
  - Check autoscaling status
Resolution Steps:
- Scale Up
- Increase Database RU
  - Azure Portal → Cosmos DB → Scale
  - Increase RU/s if throttled
- Optimize Queries
  - Review slow query logs
  - Add indexes if needed
  - Optimize query patterns
Verification:
- Latency returns to normal
- No timeout errors
- Health checks pass
Scenario 3: Token Validation Failures¶
Symptoms:
- Token validation errors
- "Invalid token" errors
- Token signature verification failures
Diagnosis:
- Check Token Service Logs
  - Check for token validation errors
  - Verify token format
  - Check signature validation
- Check Key Vault
  - Verify signing keys are accessible
  - Check key rotation status
  - Verify key permissions
Resolution Steps:
- Verify Key Rotation
  - Check if keys were rotated recently
  - Verify new keys are accessible
  - Update token service configuration if needed
- Check Key Permissions
Verification:
- Token validation succeeds
- No token errors in logs
- Health checks pass
Maintenance Tasks¶
Rotating Signing Keys¶
Frequency: Quarterly or as needed for security
Steps:
- Generate New Key
- Update Configuration
  - Update token service configuration to use new key
  - Keep old key for token validation during transition
- Deploy Update
  - Deploy updated configuration
  - Verify new tokens use new key
  - Monitor for issues
- Remove Old Key (after transition period)
  - Remove old key after all tokens expire
  - Update configuration to remove old key reference
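The reason for keeping the old key during the transition is easiest to see in a small model. The sketch below is illustrative only: it uses HMAC-SHA256 from the Python standard library as a stand-in for the platform's real signing keys (which live in Key Vault and are likely asymmetric; the runbook does not say), and the `KeyRing` class and the key ids `2024-q4` and `2025-q1` are hypothetical. It shows that new tokens are signed with the active key while tokens signed with the old key still validate until that key is removed.

```python
import hashlib
import hmac


class KeyRing:
    """Holds signing keys by key id; exactly one key is active for signing."""

    def __init__(self, keys: dict, active_kid: str):
        if active_kid not in keys:
            raise KeyError(active_kid)
        self.keys = dict(keys)
        self.active_kid = active_kid

    def sign(self, message: bytes):
        # New tokens are always signed with the active key.
        mac = hmac.new(self.keys[self.active_kid], message, hashlib.sha256)
        return self.active_kid, mac.hexdigest()

    def verify(self, kid: str, message: bytes, signature: str) -> bool:
        # Any key still in the ring may validate, which covers the transition period.
        key = self.keys.get(kid)
        if key is None:
            return False
        expected = hmac.new(key, message, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, signature)


# Before rotation: only the old key exists.
ring = KeyRing({"2024-q4": b"old-secret"}, active_kid="2024-q4")
old_kid, old_sig = ring.sign(b"token-a")

# During the transition: the new key signs, the old key is kept for validation.
ring = KeyRing({"2024-q4": b"old-secret", "2025-q1": b"new-secret"}, active_kid="2025-q1")
new_kid, new_sig = ring.sign(b"token-b")

# After the transition: the old key is removed, so old tokens no longer validate.
after = KeyRing({"2025-q1": b"new-secret"}, active_kid="2025-q1")
```

In this model, removing the old key too early is exactly what produces the signature failures described in Scenario 3, which is why the final step waits until all tokens signed with the old key have expired.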
Patching and Updates¶
Process:
- Plan Update
  - Review release notes
  - Identify breaking changes
  - Plan deployment window
- Deploy to Dev
  - Deploy update to dev environment
  - Run tests
  - Verify functionality
- Deploy to Staging
  - Deploy update to staging
  - Run integration tests
  - Verify with production-like data
- Deploy to Production
  - Deploy during maintenance window
  - Monitor metrics and logs
  - Rollback if issues occur
Scaling and Capacity Planning¶
Monitoring:
- Request rate trends
- Latency trends
- Database RU consumption
- Resource usage
Scaling Triggers:
- CPU usage > 70%
- Memory usage > 80%
- Request rate increasing
- Latency increasing
Scaling Actions:
- Horizontal scaling (add pods)
- Vertical scaling (increase resources)
- Database scaling (increase RU)
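The two numeric triggers above can be expressed as a small check that takes a resource snapshot and returns the reasons to scale. This is a sketch only: the `Snapshot` type and `scaling_reasons` function are hypothetical, and the trend-based triggers (request rate and latency increasing) are left out because the runbook gives no thresholds for them.

```python
from dataclasses import dataclass

CPU_LIMIT = 70.0     # from "CPU usage > 70%"
MEMORY_LIMIT = 80.0  # from "Memory usage > 80%"


@dataclass
class Snapshot:
    cpu_percent: float
    memory_percent: float


def scaling_reasons(snapshot: Snapshot) -> list:
    """Return which numeric scaling triggers the snapshot exceeds."""
    reasons = []
    if snapshot.cpu_percent > CPU_LIMIT:
        reasons.append("cpu")
    if snapshot.memory_percent > MEMORY_LIMIT:
        reasons.append("memory")
    return reasons
```

An empty list means neither threshold is exceeded; a value exactly at the limit does not trigger, matching the strict "greater than" wording of the triggers.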
Related Documents¶
- Operations Overview - Operations documentation overview
- Monitoring & Dashboards - Monitoring practices
- Incident Management - Incident response process
- Identity Platform - Platform architecture