Audit Platform Runbook

This document provides operational procedures and troubleshooting guides for the ConnectSoft Audit Trail Platform. It is written for operations teams and SREs running the Audit Platform.

The Audit Platform provides centralized, tamper-evident audit logging for compliance-driven systems. This runbook covers common operations, troubleshooting, and incident response procedures.

Note

This runbook focuses on operational procedures. For architecture and design details, see Audit Platform.

System Overview

What the Audit Platform Does

Core Functions:

- Event ingestion - Receive and store audit events from services
- Event storage - Store events in tamper-evident storage
- Event querying - Query events by tenant, time range, and event type
- Compliance - Support compliance requirements (retention, export, legal hold)

Key Components

Services:

- Audit API - REST API for ingesting and querying events
- Event Processor - Processes and stores events
- Event Store - Cosmos DB for event storage
- Query Service - Handles event queries

Dependencies:

- Azure Cosmos DB - Event storage
- Azure Service Bus - Event ingestion (optional)
- Azure Blob Storage - Event export and archival

Common Symptoms and Checks

Symptom: Missing Audit Logs

Diagnosis Steps:

  1. Check Ingestion Rate
     - Check the events-ingested-per-second metric
     - Compare it to the expected rate
     - Identify any drop in ingestion

  2. Check Ingestion Logs

    # Query logs for ingestion errors
    az monitor log-analytics query \
      --workspace <workspace-id> \
      --analytics-query "AuditPlatform_CL | where Level == 'Error' | where Message contains 'ingestion'"

  3. Check Event Processor
     - Check event processor health
     - Verify the event processor can access the database
     - Check for processing errors

Common Causes:

- Event processor unavailable
- Database connectivity issues
- Service Bus connectivity issues
- Throttling (RU limits exceeded)
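The ingestion-rate comparison in the diagnosis steps can be made concrete with a small calculation. The following is a minimal sketch, not part of the platform: the function name `ingestion_drop`, the 20% tolerance, and the sample values are all assumptions chosen for illustration, and the expected rate is assumed to come from your own historical metrics.

```python
from statistics import mean


def ingestion_drop(recent_rates, expected_rate, tolerance=0.2):
    """Return the fractional shortfall against the expected rate, or 0.0 if within tolerance.

    recent_rates:  events/second samples from the most recent window.
    expected_rate: baseline events/second (assumed to come from historical metrics).
    tolerance:     shortfall fraction still treated as normal (assumed default: 20%).
    """
    if expected_rate <= 0:
        raise ValueError("expected_rate must be positive")
    if not recent_rates:
        raise ValueError("recent_rates must not be empty")
    shortfall = 1.0 - mean(recent_rates) / expected_rate
    return shortfall if shortfall > tolerance else 0.0


# Hypothetical samples: ingestion falls off partway through the window.
samples = [120, 110, 40, 35, 30]
drop = ingestion_drop(samples, expected_rate=115)
if drop:
    print(f"Ingestion is {drop:.0%} below the expected rate")
```

A non-zero result suggests moving on to the ingestion logs and the event processor checks.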

Symptom: Ingestion Backlog

Symptoms:

- Events queued but not processed
- Increasing queue length
- Delayed event storage

Diagnosis Steps:

  1. Check Queue Length
     - Check Service Bus queue length
     - Compare processing rate to ingestion rate
     - Identify the bottleneck

  2. Check Event Processor
     - Check event processor health
     - Check processing rate
     - Check for errors

  3. Check Database Performance
     - Check Cosmos DB RU consumption
     - Check throttling
     - Check query performance

Common Causes:

- Event processor not keeping up
- Database throttling
- Resource constraints
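Comparing processing rate to ingestion rate tells you whether the backlog will clear on its own. A rough sketch of that arithmetic follows; `drain_eta_seconds` is a hypothetical helper, and the rates are assumed to be averages read from your metrics over the same window.

```python
import math


def drain_eta_seconds(queue_length, ingest_rate, process_rate):
    """Estimate seconds until the queue is empty.

    Returns 0.0 for an already empty queue and math.inf when the processor
    is not outpacing ingestion (the backlog will not shrink).
    Rates are in events/second.
    """
    if queue_length < 0 or ingest_rate < 0 or process_rate < 0:
        raise ValueError("inputs must be non-negative")
    if queue_length == 0:
        return 0.0
    net_rate = process_rate - ingest_rate  # events/second removed from the queue
    if net_rate <= 0:
        return math.inf
    return queue_length / net_rate


# Hypothetical numbers: 12,000 queued events, 300 ev/s in, 500 ev/s out.
eta = drain_eta_seconds(12_000, ingest_rate=300, process_rate=500)
print("backlog will not drain" if math.isinf(eta) else f"drains in ~{eta:.0f} s")
```

An infinite estimate means the event processor or the database is the bottleneck and needs scaling or tuning rather than waiting.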

Symptom: Query Latency

Symptoms:

- Slow query responses (> 1 second)
- Timeout errors
- High query latency

Diagnosis Steps:

  1. Check Query Performance
     - Check query latency metrics
     - Identify slow queries
     - Check query patterns

  2. Check Database
     - Check Cosmos DB RU consumption
     - Check query RU usage
     - Check indexing

  3. Check Query Service
     - Check query service health
     - Check resource usage
     - Check for errors

Common Causes:

- Missing indexes
- Large time range queries
- High RU consumption
- Resource constraints
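Identifying slow queries usually means filtering query samples by latency and then ranking them by RU cost. The sketch below assumes you have already exported samples with a latency and a request charge; the `QuerySample` type, its field names, and the sample data are hypothetical, and the 1-second threshold is taken from the symptom description above.

```python
from dataclasses import dataclass


@dataclass
class QuerySample:
    name: str              # hypothetical label for the query pattern
    latency_ms: float      # observed end-to-end latency
    request_charge: float  # RU consumed by the query


def slow_queries(samples, latency_threshold_ms=1000.0):
    """Return samples slower than the threshold, most expensive (RU) first."""
    slow = [s for s in samples if s.latency_ms > latency_threshold_ms]
    return sorted(slow, key=lambda s: s.request_charge, reverse=True)


# Hypothetical samples for illustration only.
samples = [
    QuerySample("by-tenant", 120, 4.2),
    QuerySample("wide-time-range", 3400, 950.0),
    QuerySample("by-event-type", 1500, 60.5),
]
for q in slow_queries(samples):
    print(f"{q.name}: {q.latency_ms:.0f} ms, {q.request_charge:.1f} RU")
```

Queries that are both slow and RU-heavy are the first candidates for index or time-range changes.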

Incident Scenarios

Scenario 1: Ingestion Backlog

Symptoms:

- Events queued but not processed
- Queue length increasing
- Delayed event storage

Diagnosis:

  1. Check Queue Length

    az servicebus queue show \
      --namespace-name <namespace> \
      --resource-group <rg> \
      --name audit-events \
      --query "countDetails.activeMessageCount"
    

  2. Check Event Processor

    kubectl get pods -n audit
    kubectl logs -n audit <event-processor-pod>
    

  3. Check Database

    az cosmosdb sql container show \
      --account-name <account-name> \
      --resource-group <rg> \
      --database-name AuditDB \
      --name Events
    

Resolution Steps:

  1. Scale Event Processor

    kubectl scale deployment audit-event-processor --replicas=5 -n audit
    

  2. Increase Database RU
     - Azure Portal → Cosmos DB → Scale
     - Increase RU/s to handle load

  3. Optimize Processing
     - Review processing logic
     - Use batch processing if possible
     - Optimize database writes

Verification:

- Queue length decreases
- Processing rate matches ingestion rate
- Events stored within SLA
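When scaling the event processor, the replica count can be estimated rather than guessed. This is a sketch under stated assumptions: `replicas_needed` is a hypothetical helper, per-replica throughput is assumed to be measured from your own pods, and scaling is assumed to be roughly linear (which stops holding once Cosmos DB throttles).

```python
import math


def replicas_needed(ingest_rate, per_replica_rate, backlog=0, drain_seconds=600):
    """Estimate replicas required to keep up with ingestion and clear a backlog.

    ingest_rate:      current events/second arriving.
    per_replica_rate: events/second one replica can process (measured, assumed linear).
    backlog:          queued events to clear.
    drain_seconds:    time allowed to clear the backlog (assumed default: 10 minutes).
    """
    if per_replica_rate <= 0 or drain_seconds <= 0:
        raise ValueError("per_replica_rate and drain_seconds must be positive")
    required_rate = ingest_rate + backlog / drain_seconds
    return max(1, math.ceil(required_rate / per_replica_rate))


# Hypothetical load: 400 ev/s arriving, 30,000 queued, 100 ev/s per replica.
replicas = replicas_needed(ingest_rate=400, per_replica_rate=100, backlog=30_000)
print(f"kubectl scale deployment audit-event-processor --replicas={replicas} -n audit")
```

Once the queue has drained, rerun the estimate with `backlog=0` to find the steady-state replica count.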

Scenario 2: Database Performance Issues

Symptoms:

- High query latency
- Throttling errors
- Timeout errors

Diagnosis:

  1. Check RU Consumption
     - Check Cosmos DB metrics
     - Identify high RU operations
     - Check throttling

  2. Check Query Patterns
     - Review slow queries
     - Check query RU usage
     - Identify inefficient queries

Resolution Steps:

  1. Scale Database
     - Increase RU/s
     - Enable autoscaling if available

  2. Optimize Queries
     - Add indexes
     - Optimize query patterns
     - Use partition keys effectively

  3. Cache Queries
     - Cache frequent queries
     - Use read replicas if available

Verification:

- Query latency returns to normal
- No throttling errors
- RU consumption within limits
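When increasing RU/s manually, it helps to derive the target from observed peak consumption instead of picking a round number. The sketch below is illustrative only: `target_throughput` is a hypothetical helper, the 30% headroom is an arbitrary assumption, and the 100 RU/s step and 400 RU/s floor are assumed defaults that should be checked against the limits of your own Cosmos DB account and container.

```python
import math


def target_throughput(peak_ru_per_sec, headroom=0.3, step=100, minimum=400):
    """Return a provisioned RU/s value covering the peak plus headroom.

    peak_ru_per_sec: highest consumed RU/s observed in the metrics window.
    headroom:        extra fraction kept free (assumed default: 30%).
    step, minimum:   rounding increment and floor (assumptions; verify for your account).
    """
    if peak_ru_per_sec < 0 or headroom < 0:
        raise ValueError("peak_ru_per_sec and headroom must be non-negative")
    needed = peak_ru_per_sec * (1 + headroom)
    rounded = math.ceil(needed / step) * step
    return max(minimum, rounded)


# Hypothetical peak of 2,350 RU/s taken from Cosmos DB metrics.
new_ru = target_throughput(2350)
print(f"Suggested provisioned throughput: {new_ru} RU/s")
```

If throttling persists at the new value, the cause is more likely a hot partition or an inefficient query than total throughput.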

Scenario 3: Missing Events

Symptoms:

- Events not appearing in queries
- Gaps in event timeline
- Missing events for specific tenants

Diagnosis:

  1. Check Ingestion
     - Verify events are being ingested
     - Check ingestion logs
     - Verify event format

  2. Check Storage
     - Verify events are stored
     - Check database queries
     - Verify tenant filtering

  3. Check Processing
     - Check event processor logs
     - Review processing errors
     - Check for dropped events

Resolution Steps:

  1. Fix Ingestion
     - Fix event format issues
     - Resolve connectivity issues
     - Restart ingestion if needed

  2. Reprocess Events
     - Reprocess queued events
     - Verify events are stored
     - Check for data corruption

Verification:

- Events appear in queries
- No gaps in timeline
- All tenants have events
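Checking for gaps in the timeline can be done per tenant by looking at the spacing between consecutive event timestamps. The following is a minimal sketch: `find_gaps`, the 15-minute threshold, and the tenant data are hypothetical, and a quiet tenant can legitimately show gaps, so treat the result as a lead rather than proof of loss.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, max_gap):
    """Return (start, end) pairs where consecutive events are more than max_gap apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]


# Hypothetical event timestamps, as if returned by a per-tenant query.
t0 = datetime(2025, 1, 1, 9, 0)
events_by_tenant = {
    "tenant-a": [t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=50)],
    "tenant-b": [t0, t0 + timedelta(minutes=10)],
}
gaps = {
    tenant: find_gaps(stamps, max_gap=timedelta(minutes=15))
    for tenant, stamps in events_by_tenant.items()
}
for tenant, tenant_gaps in gaps.items():
    for start, end in tenant_gaps:
        print(f"{tenant}: no events between {start} and {end}")
```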

Maintenance Tasks

Retention Policy Checks

Frequency: Monthly

Process:

  1. Review Retention Policies
     - Check retention policies per tenant
     - Verify compliance requirements
     - Identify events to archive/delete

  2. Archive Old Events
     - Export events older than retention period
     - Archive to Blob Storage
     - Delete from Cosmos DB

  3. Verify Compliance
     - Verify retention policies are enforced
     - Check for legal holds
     - Document retention actions

Index Maintenance

Frequency: Quarterly or as needed

Process:

  1. Review Index Usage
     - Check index usage statistics
     - Identify unused indexes
     - Identify missing indexes

  2. Optimize Indexes
     - Add indexes for common queries
     - Remove unused indexes
     - Update index definitions

  3. Monitor Performance
     - Monitor query performance
     - Check RU consumption
     - Verify improvements

Capacity Planning

Monitoring:

- Event ingestion rate trends
- Storage growth trends
- Query volume trends
- RU consumption trends

Planning:

- Project storage needs (6-12 months)
- Plan RU increases
- Plan for peak loads
- Plan for compliance requirements

Actions:

- Scale storage capacity
- Scale RU capacity
- Optimize storage usage
- Plan archival strategy
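A 6-12 month storage projection can start from a simple linear extrapolation of recent growth. The sketch below assumes one storage reading per month, oldest first; `project_storage_gb` and the sample history are hypothetical, and a linear model will underestimate if ingestion itself is accelerating.

```python
def average_monthly_growth(history_gb):
    """Average month-over-month growth in GB from a list of monthly readings."""
    if len(history_gb) < 2:
        raise ValueError("need at least two monthly readings")
    return (history_gb[-1] - history_gb[0]) / (len(history_gb) - 1)


def project_storage_gb(history_gb, months):
    """Linearly project storage for the next `months` months (index 0 = next month)."""
    growth = average_monthly_growth(history_gb)
    return [history_gb[-1] + growth * m for m in range(1, months + 1)]


# Hypothetical monthly storage readings in GB, oldest first.
history = [400, 430, 470, 520]
projection = project_storage_gb(history, months=12)
print(f"6 months: {projection[5]:.0f} GB, 12 months: {projection[11]:.0f} GB")
```

The same approach applies to RU consumption trends, with peak rather than average readings as input.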