Identity Platform Runbook

This runbook provides operational procedures, troubleshooting guides, and incident response steps for the ConnectSoft Identity & Access Platform, which provides authentication and authorization services for ConnectSoft systems. It is written for the operations teams and SREs who run the platform.

Note

This runbook focuses on operational procedures. For architecture and design details, see Identity Platform.

System Overview

What the Identity Platform Does

Core Functions:

- User authentication: OAuth2/OpenID Connect authentication
- Token management: issue and validate access tokens and refresh tokens
- User management: create, update, and delete users
- Tenant management: multi-tenant user isolation
- Authorization: role-based and resource-based authorization

Key Components

Services:

- Identity API: REST API for authentication and user management
- Token Service: issues and validates tokens
- User Store: database for users and credentials
- Key Vault: stores signing keys and secrets

Dependencies:

- Azure Cosmos DB: user data storage
- Azure Key Vault: secrets and signing keys
- Azure Service Bus: event publishing (user created, etc.)

Common Symptoms and Checks

Symptom: Login Failures

Diagnosis Steps:

  1. Check Error Rate

     - Open Identity Platform dashboard
     - Check error rate metric
     - Identify error spike time

  2. Check Error Logs

    # Query logs for authentication errors
    az monitor log-analytics query \
      --workspace <workspace-id> \
      --analytics-query "IdentityPlatform_CL | where Level == 'Error' | where Message contains 'authentication'"

  3. Check Token Service Health

     - Check /health endpoint
     - Verify token signing keys are accessible
     - Check Key Vault connectivity
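
The health check in the steps above can be scripted. The following is a minimal sketch, assuming the service exposes `/health` over HTTPS and that only an HTTP 200 response means healthy; `classify_status` and `check_health` are illustrative names, not part of the platform.

```shell
# Map an HTTP status code to a coarse health verdict.
# Assumption: only 200 counts as healthy.
classify_status() {
  if [ "$1" = "200" ]; then
    echo "healthy"
  else
    echo "unhealthy"
  fi
}

# Probe a health endpoint (10 s timeout) and print the verdict.
# curl reports 000 as the status code when the connection itself fails.
check_health() {
  local code
  code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$1" || true)
  classify_status "$code"
}
```

For example, `check_health https://identity.connectsoft.io/health` prints `healthy` or `unhealthy`, which is convenient to call in a loop while an incident is in progress.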

Common Causes:

- Invalid credentials
- Token service unavailable
- Key Vault connectivity issues
- Database connectivity issues

Symptom: Token Issues

Diagnosis Steps:

  1. Check Token Validation Errors

     - Check logs for token validation failures
     - Verify token expiration times
     - Check token signature validation

  2. Check Key Vault

     - Verify signing keys are accessible
     - Check key rotation status
     - Verify key permissions

  3. Check Token Service

     - Verify token service is running
     - Check token service health endpoint
     - Verify token service can access Key Vault
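
To verify a token's expiration time by hand, the payload can be decoded locally. This is a sketch that assumes the platform issues standard JWTs (three base64url segments with a numeric `exp` claim) and that GNU `base64` is available; the function names are hypothetical. It does not verify the signature, so it is only useful for inspecting claims.

```shell
# Decode the payload (second segment) of a JWT without verifying the signature.
jwt_payload() {
  local seg
  # Convert base64url characters back to the standard base64 alphabet.
  seg=$(printf '%s' "$1" | cut -d '.' -f 2 | tr '_-' '/+')
  # Restore the padding that base64url encoding strips.
  while [ $(( ${#seg} % 4 )) -ne 0 ]; do seg="${seg}="; done
  printf '%s' "$seg" | base64 -d
}

# Print the numeric "exp" claim (Unix seconds) from a token.
jwt_exp() {
  jwt_payload "$1" | sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Succeed when the token's exp claim is at or before the given time.
# Arguments: token, current Unix time (defaults to now).
jwt_is_expired() {
  local exp now
  exp=$(jwt_exp "$1")
  now=${2:-$(date +%s)}
  [ -n "$exp" ] && [ "$exp" -le "$now" ]
}
```

Running `jwt_is_expired "$TOKEN" && echo expired` against a token taken from a failing request quickly separates expiry problems from signature or key problems.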

Common Causes:

- Expired tokens
- Invalid token signatures
- Key Vault connectivity issues
- Token service unavailable

Symptom: High Latency

Diagnosis Steps:

  1. Check Response Times

     - Check p95/p99 latency metrics
     - Identify slow endpoints
     - Check database query performance

  2. Check Database Performance

     - Check Cosmos DB RU consumption
     - Check query performance
     - Check connection pool usage

  3. Check External Dependencies

     - Check Key Vault response times
     - Check Service Bus latency
     - Check network latency
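
When dashboard metrics are unavailable or in doubt, latency can be sampled from a client. The sketch below times repeated requests with `curl` and computes a percentile with the nearest-rank method. It is a rough client-side cross-check rather than a replacement for the platform's p95/p99 metrics, and both function names are illustrative.

```shell
# Print the p-th percentile (nearest-rank method) of numbers read from stdin.
# p is an integer from 1 to 100.
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# Time n sequential requests and print one total time (seconds) per line.
# Arguments: URL, number of requests.
sample_latency() {
  local i=0
  while [ "$i" -lt "$2" ]; do
    curl -s -o /dev/null -m 10 -w '%{time_total}\n' "$1"
    i=$((i + 1))
  done
}
```

For example, `sample_latency https://identity.connectsoft.io/health 50 | percentile 95` prints an approximate p95 in seconds as seen from the machine running the command, which includes network latency to the service.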

Common Causes:

- Database performance issues
- High RU consumption
- Network latency
- Resource constraints

Incident Scenarios

Scenario 1: Authentication Outage

Symptoms:

- All login attempts failing
- High error rate (100%)
- Token service returning 500 errors

Diagnosis:

  1. Check Token Service

    curl https://identity.connectsoft.io/health
    

  2. Check Key Vault

    az keyvault secret show --vault-name <vault-name> --name SigningKey
    

  3. Check Database

    az cosmosdb sql container show \
      --resource-group <resource-group> \
      --account-name <account-name> \
      --database-name IdentityDB \
      --name Users
    

Resolution Steps:

  1. If Token Service Down:

     - Check pod status: kubectl get pods -n identity
     - Check logs: kubectl logs -n identity <pod-name>
     - Restart if needed: kubectl rollout restart deployment/identity-api -n identity

  2. If Key Vault Issue:

     - Verify managed identity permissions
     - Check Key Vault firewall rules
     - Verify key exists and is accessible

  3. If Database Issue:

     - Check Cosmos DB status in Azure portal
     - Verify connection string
     - Check RU limits and throttling
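
A restart only helps if the new pods actually become ready, so it is worth waiting on the rollout before moving to verification. The following sketch wraps the restart command from the steps above with `kubectl rollout status`; the function name and the 120-second default timeout are assumptions.

```shell
# Restart a deployment and block until the rollout completes or times out.
# Arguments: deployment name, namespace, timeout (default 120s, an assumed value).
restart_and_wait() {
  kubectl rollout restart "deployment/$1" -n "$2" || return 1
  kubectl rollout status "deployment/$1" -n "$2" --timeout="${3:-120s}"
}
```

For example, `restart_and_wait identity-api identity` returns a non-zero exit code if the pods do not become ready in time, which signals that the problem is not fixed by a restart alone.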

Verification:

- Login attempts succeed
- Error rate returns to normal
- Health checks pass

Scenario 2: Performance Degradation

Symptoms:

- Increased latency (p95 > 500ms)
- Timeout errors
- Slow token generation

Diagnosis:

  1. Check Metrics

     - Check latency trends
     - Check request rate
     - Check error rate

  2. Check Database

     - Check Cosmos DB RU consumption
     - Check query performance
     - Check throttling

  3. Check Resources

     - Check CPU/memory usage
     - Check pod resource limits
     - Check autoscaling status

Resolution Steps:

  1. Scale Out

    kubectl scale deployment identity-api --replicas=5 -n identity

  2. Increase Database RU

     - Azure Portal → Cosmos DB → Scale
     - Increase RU/s if throttled

  3. Optimize Queries

     - Review slow query logs
     - Add indexes if needed
     - Optimize query patterns
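
The RU increase can also be made from the Azure CLI instead of the portal. This sketch assumes manual (not autoscale) throughput provisioned at the container level, for example on the `Users` container in `IdentityDB` referenced in Scenario 1; if throughput is provisioned on the database instead, the equivalent `az cosmosdb sql database throughput` commands apply. The function names and argument order are illustrative.

```shell
# Show the current throughput settings for a container.
# Arguments: resource group, account name, database name, container name.
show_container_ru() {
  az cosmosdb sql container throughput show \
    --resource-group "$1" --account-name "$2" \
    --database-name "$3" --name "$4"
}

# Set a new provisioned RU/s value for a container.
# Arguments: resource group, account name, database name, container name, RU/s.
set_container_ru() {
  az cosmosdb sql container throughput update \
    --resource-group "$1" --account-name "$2" \
    --database-name "$3" --name "$4" \
    --throughput "$5"
}
```

For example, `set_container_ru <resource-group> <account-name> IdentityDB Users 2000` raises the container to 2000 RU/s; the value 2000 is only a placeholder and should come from the observed consumption and throttling rate.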

Verification:

- Latency returns to normal
- No timeout errors
- Health checks pass

Scenario 3: Token Validation Failures

Symptoms:

- Token validation errors
- "Invalid token" errors
- Token signature verification failures

Diagnosis:

  1. Check Token Service Logs

     - Check for token validation errors
     - Verify token format
     - Check signature validation

  2. Check Key Vault

     - Verify signing keys are accessible
     - Check key rotation status
     - Verify key permissions

Resolution Steps:

  1. Verify Key Rotation

     - Check if keys were rotated recently
     - Verify new keys are accessible
     - Update token service configuration if needed

  2. Check Key Permissions

    az keyvault set-policy \
      --name <vault-name> \
      --object-id <managed-identity-id> \
      --secret-permissions get list
    

Verification:

- Token validation succeeds
- No token errors in logs
- Health checks pass

Maintenance Tasks

Rotating Signing Keys

Frequency: Quarterly or as needed for security

Steps:

  1. Generate New Key

    az keyvault key create \
      --vault-name <vault-name> \
      --name SigningKey-v2 \
      --kty RSA \
      --size 2048

  2. Update Configuration

     - Update token service configuration to use new key
     - Keep old key for token validation during transition

  3. Deploy Update

     - Deploy updated configuration
     - Verify new tokens use new key
     - Monitor for issues

  4. Remove Old Key (after transition period)

     - Remove old key after all tokens expire
     - Update configuration to remove old key reference
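
The length of the transition period follows from token lifetimes: the old key must stay available for validation until the last token signed with it has expired. A minimal sketch of that arithmetic is below, assuming the cutover time and the longest lifetime of any token signed with the old key are known; both function names are hypothetical.

```shell
# Earliest Unix time at which the old signing key can be removed:
# the moment the service stopped signing with it plus the longest token lifetime.
# Arguments: cutover time (Unix seconds), max token lifetime (seconds).
old_key_removal_epoch() {
  echo $(( $1 + $2 ))
}

# Succeed when the transition period has elapsed.
# Arguments: cutover time, max token lifetime, current time (defaults to now).
can_remove_old_key() {
  local now
  now=${3:-$(date +%s)}
  [ "$now" -ge "$(old_key_removal_epoch "$1" "$2")" ]
}
```

If refresh tokens are also signed with this key, their lifetime is the one to use, since it is typically much longer than that of access tokens.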

Patching and Updates

Process:

  1. Plan Update

     - Review release notes
     - Identify breaking changes
     - Plan deployment window

  2. Deploy to Dev

     - Deploy update to dev environment
     - Run tests
     - Verify functionality

  3. Deploy to Staging

     - Deploy update to staging
     - Run integration tests
     - Verify with production-like data

  4. Deploy to Production

     - Deploy during maintenance window
     - Monitor metrics and logs
     - Rollback if issues occur
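
The production rollback step can be made mechanical. The sketch below assumes the service runs as the Kubernetes deployment `identity-api` in the `identity` namespace (as in Scenario 1) and that updates are applied as rolling deployments; the function name and the 300-second default timeout are assumptions.

```shell
# Wait for a rollout to finish; undo it if it fails or times out.
# Arguments: deployment name, namespace, timeout (default 300s, an assumed value).
# Returns 0 when the new version is live, 1 when a rollback was triggered.
verify_or_rollback() {
  if kubectl rollout status "deployment/$1" -n "$2" --timeout="${3:-300s}"; then
    return 0
  fi
  kubectl rollout undo "deployment/$1" -n "$2"
  return 1
}
```

Calling `verify_or_rollback identity-api identity` right after applying the update covers failed rollouts only. Regressions that show up later in metrics or logs still need a manual `kubectl rollout undo`.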

Scaling and Capacity Planning

Monitoring:

- Request rate trends
- Latency trends
- Database RU consumption
- Resource usage

Scaling Triggers:

- CPU usage > 70%
- Memory usage > 80%
- Request rate increasing
- Latency increasing

Scaling Actions:

- Horizontal scaling (add pods)
- Vertical scaling (increase resources)
- Database scaling (increase RU)
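
The two numeric triggers can be expressed as a simple check, for example in an alerting or capacity-review script. This sketch uses the CPU and memory thresholds from this section; the function name is illustrative, and the inputs are assumed to be integer percentages obtained from the monitoring system.

```shell
# Succeed when either numeric scaling trigger from this runbook is exceeded.
# Arguments: CPU usage and memory usage as integer percentages.
should_scale() {
  [ "$1" -gt 70 ] || [ "$2" -gt 80 ]
}
```

For example, `should_scale 75 60 && echo "scale out"` prints `scale out` because CPU usage is above 70%. To let Kubernetes act on the CPU trigger automatically, a command such as `kubectl autoscale deployment identity-api -n identity --cpu-percent=70 --min=3 --max=10` creates a horizontal autoscaler; the replica bounds shown are placeholders, not values from this runbook.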