Audit Platform Runbook
This runbook provides operational procedures, troubleshooting guides, and incident response steps for the ConnectSoft Audit Trail Platform. It is written for operations teams and SREs running the platform.
The Audit Platform provides centralized, tamper-evident audit logging for compliance-driven systems.
Note
This runbook focuses on operational procedures. For architecture and design details, see Audit Platform.
System Overview
What the Audit Platform Does
Core Functions:
- Event ingestion - Receive and store audit events from services
- Event storage - Store events in tamper-evident storage
- Event querying - Query events by tenant, time range, event type
- Compliance - Support compliance requirements (retention, export, legal hold)
Key Components
Services:
- Audit API - REST API for ingesting and querying events
- Event Processor - Processes and stores events
- Event Store - Cosmos DB for event storage
- Query Service - Handles event queries
Dependencies:
- Azure Cosmos DB - Event storage
- Azure Service Bus - Event ingestion (optional)
- Azure Blob Storage - Event export and archival
Common Symptoms and Checks
Symptom: Missing Audit Logs
Diagnosis Steps:
- Check Ingestion Rate
  - Check events ingested per second metric
  - Compare to expected rate
  - Identify drop in ingestion
- Check Ingestion Logs
- Check Event Processor
  - Check event processor health
  - Verify event processor can access database
  - Check for processing errors
Common Causes:
- Event processor unavailable
- Database connectivity issues
- Service Bus connectivity issues
- Throttling (RU limits exceeded)
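The "compare to expected rate" check can be sketched as a small helper. This is a minimal illustration, not the platform's actual alerting logic: the function name, the sample list, and the 50% threshold are all assumptions, and the samples would come from whatever metric source exposes events ingested per second.

```python
from statistics import mean


def ingestion_drop(samples, expected_rate, threshold=0.5):
    """Return True when the mean of recent samples (events/sec) falls
    below threshold * expected_rate. Names and threshold are hypothetical."""
    if not samples:
        # No samples at all is treated as a drop: nothing is being ingested
        # or the metric pipeline itself is down.
        return True
    return mean(samples) < threshold * expected_rate


# Hypothetical recent samples, oldest first.
recent = [120.0, 95.0, 40.0, 12.0, 8.0]
print(ingestion_drop(recent, expected_rate=150.0))
```

A mean over a short window smooths out single-sample dips; a stricter check could also compare the latest sample on its own.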
Symptom: Ingestion Backlog
Symptoms:
- Events queued but not processed
- Increasing queue length
- Delayed event storage
Diagnosis Steps:
- Check Queue Length
  - Check Service Bus queue length
  - Check processing rate vs ingestion rate
  - Identify bottleneck
- Check Event Processor
  - Check event processor health
  - Check processing rate
  - Check for errors
- Check Database Performance
  - Check Cosmos DB RU consumption
  - Check throttling
  - Check query performance
Common Causes:
- Event processor not keeping up
- Database throttling
- Resource constraints
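Comparing processing rate against ingestion rate also tells you whether the backlog will clear on its own. The sketch below is an illustration under assumed inputs: the function name is hypothetical, and the three values would be read from the Service Bus queue length and the rate metrics.

```python
import math


def drain_time_seconds(queue_length, ingest_rate, process_rate):
    """Estimate seconds until the queue is empty.

    Rates are in events/sec. Returns math.inf when the processor is not
    outpacing ingestion, i.e. the backlog is flat or growing.
    """
    if queue_length <= 0:
        return 0.0
    net_rate = process_rate - ingest_rate
    if net_rate <= 0:
        return math.inf
    return queue_length / net_rate


# Hypothetical reading: 60,000 queued, 500 events/sec in, 700 events/sec out.
print(drain_time_seconds(60_000, ingest_rate=500, process_rate=700))
```

An infinite result means the bottleneck must be removed (scale the processor or relieve database throttling) before the queue can shrink.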
Symptom: Query Latency
Symptoms:
- Slow query responses (> 1 second)
- Timeout errors
- High query latency
Diagnosis Steps:
- Check Query Performance
  - Check query latency metrics
  - Identify slow queries
  - Check query patterns
- Check Database
  - Check Cosmos DB RU consumption
  - Check query RU usage
  - Check indexing
- Check Query Service
  - Check query service health
  - Check resource usage
  - Check for errors
Common Causes:
- Missing indexes
- Large time range queries
- High RU consumption
- Resource constraints
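To make "identify slow queries" concrete, the following sketch filters query records against the 1 second threshold named above and computes a nearest-rank percentile. The record shape (`query_id`, `latency_ms`) and both function names are assumptions for illustration; real records would come from the Query Service logs or metrics.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct in (0, 100]."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def slow_queries(records, threshold_ms=1000):
    """Return records slower than threshold_ms, slowest first."""
    slow = [r for r in records if r["latency_ms"] > threshold_ms]
    return sorted(slow, key=lambda r: r["latency_ms"], reverse=True)


# Hypothetical query log entries.
records = [
    {"query_id": "q1", "latency_ms": 120},
    {"query_id": "q2", "latency_ms": 2400},
    {"query_id": "q3", "latency_ms": 1500},
    {"query_id": "q4", "latency_ms": 1000},
]
for r in slow_queries(records):
    print(r["query_id"], r["latency_ms"])
```

Sorting slowest first puts the queries most worth inspecting for missing indexes or wide time ranges at the top.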
Incident Scenarios
Scenario 1: Ingestion Backlog
Symptoms:
- Events queued but not processed
- Queue length increasing
- Delayed event storage
Diagnosis:
- Check Queue Length
- Check Event Processor
- Check Database
Resolution Steps:
- Scale Event Processor
- Increase Database RU
  - Azure Portal → Cosmos DB → Scale
  - Increase RU/s to handle load
- Optimize Processing
  - Review processing logic
  - Batch processing if possible
  - Optimize database writes
Verification:
- Queue length decreases
- Processing rate meets or exceeds ingestion rate
- Events stored within SLA
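When deciding how far to increase RU/s, a rough sizing estimate can help. The sketch below is only arithmetic under stated assumptions: `ru_per_write` should be the request charge you actually observe for a typical event write, while the 25% headroom and the rounding step of 100 RU/s are illustrative defaults, not values taken from this platform.

```python
import math


def required_ru_per_sec(events_per_sec, ru_per_write, headroom=0.25, step=100):
    """Estimate provisioned RU/s for a target write rate.

    headroom adds spare capacity for bursts and queries; the result is
    rounded up to a multiple of step (assumed provisioning granularity).
    """
    raw = events_per_sec * ru_per_write * (1 + headroom)
    return math.ceil(raw / step) * step


# Hypothetical: 700 writes/sec at 10 RU each.
print(required_ru_per_sec(700, 10))
```

This covers writes only. Query load shares the same throughput, so treat the result as a lower bound.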
Scenario 2: Database Performance Issues
Symptoms:
- High query latency
- Throttling errors
- Timeout errors
Diagnosis:
- Check RU Consumption
  - Check Cosmos DB metrics
  - Identify high RU operations
  - Check throttling
- Check Query Patterns
  - Review slow queries
  - Check query RU usage
  - Identify inefficient queries
Resolution Steps:
- Scale Database
  - Increase RU/s
  - Enable autoscaling if available
- Optimize Queries
  - Add indexes
  - Optimize query patterns
  - Use partition keys effectively
- Cache Queries
  - Cache frequent queries
  - Use read replicas if available
Verification:
- Query latency returns to normal
- No throttling errors
- RU consumption within limits
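One way to "cache frequent queries" is a small time-to-live cache in front of the query path. This is a minimal in-process sketch, assuming results may be served slightly stale for the TTL window; the class name, the key format, and the loader callback are hypothetical and not part of the Query Service.

```python
import time


class TTLCache:
    """Cache query results for a fixed number of seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries = {}   # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        """Return the cached value for key, calling loader() on miss or expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader()
        self._entries[key] = (now + self._ttl, value)
        return value


# Hypothetical usage: the loader stands in for the real database query.
cache = TTLCache(ttl_seconds=30)
rows = cache.get_or_load(("tenant-a", "login", "2024-06-01"), lambda: ["row1", "row2"])
print(rows)
```

Because audit queries often back compliance reviews, keep the TTL short and bypass the cache where a reader needs the latest events.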
Scenario 3: Missing Events
Symptoms:
- Events not appearing in queries
- Gaps in event timeline
- Missing events for specific tenants
Diagnosis:
- Check Ingestion
  - Verify events are being ingested
  - Check ingestion logs
  - Verify event format
- Check Storage
  - Verify events are stored
  - Check database queries
  - Verify tenant filtering
- Check Processing
  - Check event processor logs
  - Look for processing errors
  - Check for dropped events
Resolution Steps:
- Fix Ingestion
  - Fix event format issues
  - Resolve connectivity issues
  - Restart ingestion if needed
- Reprocess Events
  - Reprocess queued events
  - Verify events are stored
  - Check for data corruption
Verification:
- Events appear in queries
- No gaps in timeline
- All tenants have events
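The "no gaps in timeline" check can be approximated by scanning event timestamps for intervals longer than expected. The sketch below is an assumption-laden illustration: the function name and the 5 minute threshold are hypothetical, and a quiet tenant can legitimately produce long intervals, so a reported gap is a lead to investigate rather than proof of loss.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, max_gap):
    """Return (start, end) pairs of consecutive timestamps that are
    more than max_gap apart. Input order does not matter."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]


# Hypothetical event timestamps for one tenant.
base = datetime(2024, 1, 1, 10, 0)
stamps = [
    base,
    base + timedelta(minutes=1),
    base + timedelta(minutes=20),
    base + timedelta(minutes=21),
]
for start, end in find_gaps(stamps, max_gap=timedelta(minutes=5)):
    print(start.isoformat(), end.isoformat())
```

Running this per tenant helps separate a platform-wide ingestion outage from a problem limited to specific tenants.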
Maintenance Tasks
Retention Policy Checks
Frequency: Monthly
Process:
- Review Retention Policies
  - Check retention policies per tenant
  - Verify compliance requirements
  - Identify events to archive/delete
- Archive Old Events
  - Export events older than retention period
  - Archive to Blob Storage
  - Delete from Cosmos DB
- Verify Compliance
  - Verify retention policies are enforced
  - Check for legal holds
  - Document retention actions
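The selection step (identify events to archive while respecting legal holds) can be sketched as a filter. Everything here is an assumption for illustration: the event shape, a per-tenant retention table in days, and legal holds tracked at tenant level. The real platform may track holds per event or per case.

```python
from datetime import datetime, timedelta, timezone


def eligible_for_archive(events, retention_days_by_tenant, tenants_on_hold, now):
    """Return events older than their tenant's retention period,
    skipping tenants under legal hold or without a known policy."""
    eligible = []
    for event in events:
        tenant = event["tenant_id"]
        if tenant in tenants_on_hold:
            continue  # legal hold overrides retention
        days = retention_days_by_tenant.get(tenant)
        if days is None:
            continue  # no policy known: keep the event rather than guess
        if event["timestamp"] < now - timedelta(days=days):
            eligible.append(event)
    return eligible


# Hypothetical data: tenant "b" is under legal hold.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
events = [
    {"id": "e1", "tenant_id": "a", "timestamp": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": "e2", "tenant_id": "b", "timestamp": datetime(2024, 1, 1, tzinfo=timezone.utc)},
]
print([e["id"] for e in eligible_for_archive(events, {"a": 30, "b": 30}, {"b"}, now)])
```

Applying the hold check before selection, rather than after deletion, avoids removing events that must be preserved.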
Index Maintenance
Frequency: Quarterly or as needed
Process:
- Review Index Usage
  - Check index usage statistics
  - Identify unused indexes
  - Identify missing indexes
- Optimize Indexes
  - Add indexes for common queries
  - Remove unused indexes
  - Update index definitions
- Monitor Performance
  - Monitor query performance
  - Check RU consumption
  - Verify improvements
Capacity Planning
Monitoring:
- Event ingestion rate trends
- Storage growth trends
- Query volume trends
- RU consumption trends
Planning:
- Project storage needs (6-12 months)
- Plan RU increases
- Plan for peak loads
- Plan for compliance requirements
Actions:
- Scale storage capacity
- Scale RU capacity
- Optimize storage usage
- Plan archival strategy
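Projecting storage needs 6-12 months out can start from a simple linear trend over monthly measurements. The sketch below fits a least-squares line and extrapolates it. The function name and sample figures are hypothetical, and a linear fit is an assumed model that will understate growth if ingestion is accelerating.

```python
def project_storage_gb(monthly_gb, months_ahead):
    """Fit a least-squares line through monthly storage samples (oldest
    first) and return the projected size months_ahead past the last sample."""
    n = len(monthly_gb)
    if n < 2:
        raise ValueError("need at least two monthly samples")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(monthly_gb) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, monthly_gb)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + months_ahead)


# Hypothetical: four months of storage measurements in GB.
history = [100.0, 110.0, 120.0, 130.0]
print(project_storage_gb(history, months_ahead=6))
print(project_storage_gb(history, months_ahead=12))
```

The same function can be applied to RU consumption or query volume samples to plan RU increases.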
Related Documents
- Operations Overview - Operations documentation overview
- Monitoring & Dashboards - Monitoring practices
- Incident Management - Incident response process
- Audit Platform - Platform architecture