Audit Platform Runbook

This document provides operational procedures and troubleshooting guides for the ConnectSoft Audit Trail Platform. It is written for operations teams and SREs running the Audit Platform.

The Audit Platform provides centralized, tamper-evident audit logging for compliance-driven systems. This runbook covers common operations, troubleshooting, and incident response procedures.

Note

This runbook focuses on operational procedures. For architecture and design details, see Audit Platform.

System Overview

What the Audit Platform Does

Core Functions:

- Event ingestion - Receive and store audit events from services
- Event storage - Store events in tamper-evident storage
- Event querying - Query events by tenant, time range, and event type
- Compliance - Support compliance requirements (retention, export, legal hold)

Key Components

Services:

- Audit API - REST API for ingesting and querying events
- Event Processor - Processes and stores events
- Event Store - Cosmos DB for event storage
- Query Service - Handles event queries

Dependencies:

- Azure Cosmos DB - Event storage
- Azure Service Bus - Event ingestion (optional)
- Azure Blob Storage - Event export and archival

Common Symptoms and Checks

Symptom: Missing Audit Logs

Diagnosis Steps:

Check Ingestion Rate
- Check the events-ingested-per-second metric
- Compare it to the expected rate
- Identify any drop in ingestion

Check Ingestion Logs

# Query logs for ingestion errors
az monitor log-analytics query \
  --workspace <workspace-id> \
  --analytics-query "AuditPlatform_CL | where Level == 'Error' | where Message contains 'ingestion'"

Check Event Processor
- Check event processor health
- Verify the event processor can access the database
- Check for processing errors

Common Causes:

- Event processor unavailable
- Database connectivity issues
- Service Bus connectivity issues
- Throttling (RU limits exceeded)
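The "compare to expected rate" check can be made concrete with a small helper. This is a minimal sketch, not part of the platform tooling: the function name `ingestion_drop_pct` is hypothetical, and both rates are assumed to be read manually from your own ingestion metrics.

```shell
# Percentage drop of the observed ingestion rate relative to the expected rate.
# Both arguments are events per second, taken from your own metrics (assumption).
ingestion_drop_pct() {
  local expected="$1" observed="$2"
  awk -v e="$expected" -v o="$observed" 'BEGIN {
    if (e <= 0) { print "expected rate must be > 0" > "/dev/stderr"; exit 1 }
    printf "%.1f\n", (e - o) / e * 100
  }'
}

# Example: 1200 events/s expected, 300 events/s observed
ingestion_drop_pct 1200 300   # prints 75.0
```

A large positive value points to lost or delayed events; what counts as "large" depends on how much the rate normally varies for your workload.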

Symptom: Ingestion Backlog

Symptoms:

- Events queued but not processed
- Increasing queue length
- Delayed event storage

Diagnosis Steps:

Check Queue Length
- Check the Service Bus queue length
- Compare the processing rate with the ingestion rate
- Identify the bottleneck

Check Event Processor
- Check event processor health
- Check the processing rate
- Check for errors

Check Database Performance
- Check Cosmos DB RU consumption
- Check for throttling
- Check query performance

Common Causes:

- Event processor not keeping up
- Database throttling
- Resource constraints
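Comparing the processing rate with the ingestion rate also shows roughly how long a backlog will take to clear. The helper below is a sketch under stated assumptions: the name `backlog_drain_seconds` is hypothetical, and all three inputs are whole numbers (messages, and messages per second) read from your own queue and processor metrics.

```shell
# Estimate how long a backlog takes to drain, in seconds.
# Arguments: backlog (messages), ingest rate and processing rate (messages/s).
backlog_drain_seconds() {
  local backlog="$1" ingest_rate="$2" process_rate="$3"
  if [ "$process_rate" -le "$ingest_rate" ]; then
    echo "never"   # the queue grows or stays flat at these rates
    return 0
  fi
  # Net drain rate is process_rate - ingest_rate; round the result up.
  local net=$(( process_rate - ingest_rate ))
  echo $(( (backlog + net - 1) / net ))
}

# Example: 90000 queued messages, 500 msg/s arriving, 800 msg/s processed
backlog_drain_seconds 90000 500 800   # prints 300
```

A result of `never` means the event processor is not keeping up at current rates, so the backlog will not clear without more processing capacity or less load.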

Symptom: Query Latency

Symptoms:

- Slow query responses (> 1 second)
- Timeout errors
- High query latency

Diagnosis Steps:

Check Query Performance
- Check query latency metrics
- Identify slow queries
- Check query patterns

Check Database
- Check Cosmos DB RU consumption
- Check query RU usage
- Check indexing

Check Query Service
- Check query service health
- Check resource usage
- Check for errors

Common Causes:

- Missing indexes
- Large time range queries
- High RU consumption
- Resource constraints
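When judging query latency against the 1-second mark, a percentile is usually more informative than an average. The sketch below assumes you have exported query latencies in milliseconds, one value per line, from your own logs; the function name `p95_latency_ms` is hypothetical, and it uses the nearest-rank method.

```shell
# Read one latency value (milliseconds) per line on stdin and print the
# 95th percentile using the nearest-rank method.
p95_latency_ms() {
  sort -n | awk '{ v[NR] = $1 } END {
    if (NR == 0) exit 1
    idx = int((NR * 95 + 99) / 100)   # ceil(0.95 * NR) with integer math
    print v[idx]
  }'
}

# Example: latencies 1..100 ms
seq 1 100 | p95_latency_ms   # prints 95
```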

Incident Scenarios

Scenario 1: Ingestion Backlog

Symptoms:

- Events queued but not processed
- Queue length increasing
- Delayed event storage

Diagnosis:

Check Queue Length

az servicebus queue show \
  --namespace-name <namespace> \
  --resource-group <rg> \
  --name audit-events \
  --query "countDetails.activeMessageCount"

Check Event Processor

kubectl get pods -n audit
kubectl logs -n audit <event-processor-pod>

Check Database

# Show the container definition (partition key, indexing policy)
az cosmosdb sql container show \
  --resource-group <rg> \
  --account-name <account-name> \
  --database-name AuditDB \
  --name Events

Resolution Steps:

Scale Event Processor

kubectl scale deployment audit-event-processor --replicas=5 -n audit

Increase Database RU
- Azure Portal → Cosmos DB → Scale
- Increase RU/s to handle the load

Optimize Processing
- Review processing logic
- Batch processing where possible
- Optimize database writes
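To choose a target RU/s before scaling, a rough estimate can be derived from the write rate. This is a sketch only: the function name `required_ru_per_sec` is hypothetical, and the RU charge per write is an input you should measure for your own events rather than assume. The result is rounded up to a multiple of 100 RU/s, which is the increment for manually provisioned throughput.

```shell
# Estimate provisioned RU/s for a write load.
# Arguments: events per second, measured RU charge per write, headroom percent.
required_ru_per_sec() {
  local events_per_sec="$1" ru_per_write="$2" headroom_pct="$3"
  awk -v e="$events_per_sec" -v r="$ru_per_write" -v h="$headroom_pct" 'BEGIN {
    need = e * r * (1 + h / 100)
    # Round up to the next multiple of 100 RU/s
    printf "%d\n", int((need + 99) / 100) * 100
  }'
}

# Example: 800 events/s, 6 RU per write, 25% headroom
required_ru_per_sec 800 6 25   # prints 6000

# The value can then be applied from the CLI instead of the portal
# (fill in the placeholders for your environment):
# az cosmosdb sql container throughput update \
#   --resource-group <rg> --account-name <account-name> \
#   --database-name AuditDB --name Events --throughput 6000
```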

Verification:

- Queue length decreases
- Processing rate matches ingestion rate
- Events stored within SLA

Scenario 2: Database Performance Issues

Symptoms:

- High query latency
- Throttling errors
- Timeout errors

Diagnosis:

Check RU Consumption
- Check Cosmos DB metrics
- Identify high-RU operations
- Check for throttling

Check Query Patterns
- Review slow queries
- Check query RU usage
- Identify inefficient queries

Resolution Steps:

Scale Database
- Increase RU/s
- Enable autoscaling if available

Optimize Queries
- Add indexes
- Optimize query patterns
- Use partition keys effectively

Cache Queries
- Cache frequent queries
- Use read replicas if available

Verification:

- Query latency returns to normal
- No throttling errors
- RU consumption within limits

Scenario 3: Missing Events

Symptoms:

- Events not appearing in queries
- Gaps in event timeline
- Missing events for specific tenants

Diagnosis:

Check Ingestion
- Verify events are being ingested
- Check ingestion logs
- Verify event format

Check Storage
- Verify events are stored
- Check database queries
- Verify tenant filtering

Check Processing
- Check event processor logs
- Look for processing errors
- Check for dropped events

Resolution Steps:

Fix Ingestion
- Fix event format issues
- Resolve connectivity issues
- Restart ingestion if needed

Reprocess Events
- Reprocess queued events
- Verify events are stored
- Check for data corruption

Verification:

- Events appear in queries
- No gaps in timeline
- All tenants have events
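The "no gaps in timeline" check can be approximated by scanning exported event timestamps for unusually long silences. The sketch below assumes you have exported timestamps for a tenant as epoch seconds, one per line; the function name `find_timeline_gaps` and the gap threshold are hypothetical and should be tuned to the tenant's normal event rate.

```shell
# Read event timestamps (epoch seconds, one per line) on stdin and print every
# gap longer than max_gap seconds as "<gap_start> <gap_end> <gap_seconds>".
find_timeline_gaps() {
  local max_gap="$1"
  sort -n | awk -v g="$max_gap" '
    NR > 1 && $1 - prev > g { print prev, $1, $1 - prev }
    { prev = $1 }'
}

# Example: one 240-second silence between 160 and 400
printf '100\n160\n400\n430\n' | find_timeline_gaps 120   # prints 160 400 240
```

An empty result means no gap exceeded the threshold. It does not prove that no individual events were dropped.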

Maintenance Tasks

Retention Policy Checks

Frequency: Monthly

Process:

Review Retention Policies
- Check retention policies per tenant
- Verify compliance requirements
- Identify events to archive or delete

Archive Old Events
- Export events older than the retention period
- Archive to Blob Storage
- Delete from Cosmos DB

Verify Compliance
- Verify retention policies are enforced
- Check for legal holds
- Document retention actions
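The archive decision combines two conditions from the process above: the event is older than the retention period, and it is not under legal hold. The helpers below are a sketch of that rule only. The function names are hypothetical, timestamps are assumed to be epoch seconds, and how a legal hold is recorded per tenant or event is an assumption to adapt to your own data.

```shell
# Cutoff timestamp: events older than this are past retention.
retention_cutoff_epoch() {
  local now_epoch="$1" retention_days="$2"
  echo $(( now_epoch - retention_days * 86400 ))
}

# Succeeds (exit 0) only when the event is past retention and not under legal hold.
# legal_hold is the string "true" or "false" (assumed representation).
should_archive() {
  local event_epoch="$1" cutoff_epoch="$2" legal_hold="$3"
  [ "$legal_hold" != "true" ] && [ "$event_epoch" -lt "$cutoff_epoch" ]
}

# Example: 365-day retention evaluated at a fixed "now"
cutoff=$(retention_cutoff_epoch 1700000000 365)
if should_archive 1600000000 "$cutoff" false; then
  echo "archive"
fi
```

Checking the legal hold first means a held event is never archived or deleted, regardless of its age.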

Index Maintenance

Frequency: Quarterly or as needed

Process:

Review Index Usage
- Check index usage statistics
- Identify unused indexes
- Identify missing indexes

Optimize Indexes
- Add indexes for common queries
- Remove unused indexes
- Update index definitions

Monitor Performance
- Monitor query performance
- Check RU consumption
- Verify improvements

Capacity Planning

Monitoring:

- Event ingestion rate trends
- Storage growth trends
- Query volume trends
- RU consumption trends

Planning:

- Project storage needs (6-12 months)
- Plan RU increases
- Plan for peak loads
- Plan for compliance requirements

Actions:

- Scale storage capacity
- Scale RU capacity
- Optimize storage usage
- Plan archival strategy
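For the 6-12 month storage projection, a simple linear estimate is often enough as a starting point. The sketch below assumes steady monthly growth in whole gigabytes, read from your own storage trend; the function name `project_storage_gb` is hypothetical, and the model ignores archival and seasonal peaks.

```shell
# Linear storage projection.
# Arguments: current size (GB), observed growth per month (GB), months ahead.
project_storage_gb() {
  local current_gb="$1" monthly_growth_gb="$2" months="$3"
  echo $(( current_gb + monthly_growth_gb * months ))
}

# Example: 500 GB today, growing 40 GB per month
project_storage_gb 500 40 6    # prints 740
project_storage_gb 500 40 12   # prints 980
```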

Related Documents

- Operations Overview - Operations documentation overview
- Monitoring & Dashboards - Monitoring practices
- Incident Management - Incident response process
- Audit Platform - Platform architecture
