Audience: On-call engineers, SREs, platform operators
Last updated: 2026-02-23
Escalation chain: On-call engineer -> Team lead -> VP Engineering -> CTO
Table of Contents
- Decision Latency Spike (>200ms P99)
- Worker Backlog (Queue Depth >100)
- Database Connection Exhaustion
- Redis OOM / Eviction
- API 5xx Error Rate >1%
- DLQ Overflow (>500)
- Certificate Expiry
- Node Capacity Exhaustion
- Database Disk Full
- Auth Failures Spike
General Incident Workflow
- Acknowledge the alert within 5 minutes.
- Assess severity using the matrix below.
- Communicate status in the #kaireon-incidents Slack channel.
- Mitigate using the relevant playbook.
- Resolve and confirm metrics return to normal.
- Post-mortem within 48 hours for SEV1/SEV2 incidents.
| Severity | Criteria | Response Time | Update Cadence |
|---|---|---|---|
| SEV1 | Full outage, data loss risk | 5 min | Every 15 min |
| SEV2 | Degraded service, >10% users affected | 15 min | Every 30 min |
| SEV3 | Minor degradation, <10% users affected | 30 min | Every 1 hour |
| SEV4 | Cosmetic, no user impact | Next business day | As needed |
1. Decision Latency Spike (>200ms P99)
Severity: SEV2 (SEV1 if sustained >10 minutes or >500ms P99)
Detection
- Alert: kaireon_decision_latency_ms P99 > 200ms for 3 consecutive minutes.
- Dashboard: Grafana > KaireonAI Decisions > Latency panel.
- Metrics:
Diagnosis
- Identify the bottleneck layer:
- Check for resource contention (see the sketch after this list):
- Check for slow queries:
- Check for model loading delays:
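A minimal diagnosis sketch for the contention, slow-query, and cache checks above. The `kaireon` namespace, the `$DATABASE_URL` connection string, and direct `redis-cli` access are assumptions, not names confirmed by this runbook:

```bash
# Resource contention: CPU/memory pressure on decision-path pods (namespace is an assumption).
kubectl top pods -n kaireon --sort-by=cpu | head -20

# Slow queries: statements running longer than 1 second on the primary.
psql "$DATABASE_URL" -c "
  SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
  FROM pg_stat_activity
  WHERE state <> 'idle' AND now() - query_start > interval '1 second'
  ORDER BY runtime DESC;"

# Cache layer: sample Redis latency over 5-second windows.
redis-cli --latency-history -i 5
```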
Resolution
- Immediate (if >500ms P99): Scale up API replicas (see the sketch after this list):
- If database-bound: Kill long-running queries and investigate:
- If Redis-bound: Flush stale caches:
- If scoring-bound: Enable score caching or disable non-critical rankers:
Prevention
- Set HPA target CPU to 60% (not 80%) for headroom.
- Enable scoring result caching with a 30-second TTL for repeat customer lookups.
- Run weekly load tests against the decision endpoint with production-like traffic.
- Maintain database query performance baselines; alert on 50% regression.
- Pre-warm model caches on pod startup using an init container.
2. Worker Backlog (Queue Depth >100)
Severity: SEV2 (SEV1 if depth >500 or growing >50/min)
Detection
- Alert: kaireon_active_worker_jobs > 100 for 5 minutes.
- Dashboard: Grafana > KaireonAI Workers > Queue Depth panel.
- Metrics:
- Queue names: batch-jobs, pipeline-jobs, dsar-jobs, retrain-jobs, journey-jobs
Diagnosis
- Determine which queue is backed up:
- Check worker health:
- Check if workers are stuck on a specific job:
- Check for upstream rate changes:
Resolution
- Scale workers immediately (see the sketch after this list):
- If workers are crash-looping, restart them:
- If a single job type is stuck, isolate it:
- If the queue is Redis-backed and Redis is the bottleneck:
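A sketch of the scale-and-inspect steps, assuming a `kaireon-worker` deployment and a BullMQ-style Redis key layout; both the deployment name and the `bull:<queue>:wait` key pattern are assumptions to verify against your installation:

```bash
# Scale workers to drain the backlog, or restart them if they are crash-looping.
kubectl scale deployment kaireon-worker -n kaireon --replicas=10
kubectl rollout restart deployment kaireon-worker -n kaireon

# Per-queue depth, assuming BullMQ's default "bull:<queue>:wait" list naming.
for q in batch-jobs pipeline-jobs dsar-jobs retrain-jobs journey-jobs; do
  printf '%s: ' "$q"
  redis-cli LLEN "bull:$q:wait"
done
```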
Prevention
- Configure KEDA autoscaler to trigger at queue depth 50 (not 100).
- Set per-job-type timeouts (default 60s) so stuck jobs do not block workers.
- Implement circuit breakers for downstream dependencies (connectors, external APIs).
- Monitor queue depth trend, not just threshold; alert on sustained growth rate.
- Set max retry count to 3 with exponential backoff to prevent retry storms.
3. Database Connection Exhaustion
Severity: SEV1
Detection
- Alert: pgbouncer_active_connections / pgbouncer_max_connections > 0.9 for 2 minutes.
- Alert: Application logs contain "remaining connection slots are reserved" or "too many connections".
- Dashboard: Grafana > PostgreSQL > Connection Pool panel.
- Metrics:
Diagnosis
- Check PgBouncer status (see the sketch after this list):
- Check for connection leaks in the application:
- Check PostgreSQL directly:
- Check for long-running transactions holding connections:
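A minimal sketch of these checks. The PgBouncer admin connection string and `$DATABASE_URL` are assumptions; point them at your own PgBouncer admin database and primary:

```bash
# PgBouncer pool status (connect to the special "pgbouncer" admin database).
psql "postgres://admin@pgbouncer:6432/pgbouncer" -c "SHOW POOLS;"

# Connections per application and state; a large idle-in-transaction count is the usual leak signature.
psql "$DATABASE_URL" -c "
  SELECT application_name, state, count(*)
  FROM pg_stat_activity
  GROUP BY 1, 2
  ORDER BY count(*) DESC;"

# Long-running transactions holding connections open.
psql "$DATABASE_URL" -c "
  SELECT pid, now() - xact_start AS xact_age, state, left(query, 60) AS query
  FROM pg_stat_activity
  WHERE xact_start IS NOT NULL
  ORDER BY xact_age DESC
  LIMIT 10;"
```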
Resolution
- Kill idle-in-transaction connections (>5 min):
- Reload PgBouncer if it is stuck (see the sketch after this list):
- Temporarily increase PgBouncer pool size:
- If a specific service is leaking, restart it:
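A sketch of the kill, reload, and restart steps under the same assumptions as above (admin connection string, namespace, and deployment name are illustrative):

```bash
# Kill sessions that have been idle in a transaction for more than 5 minutes.
psql "$DATABASE_URL" -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state = 'idle in transaction'
    AND now() - state_change > interval '5 minutes';"

# Reload PgBouncer without dropping existing client connections.
psql "postgres://admin@pgbouncer:6432/pgbouncer" -c "RELOAD;"

# Restart the suspected leaking service (deployment name is an assumption).
kubectl rollout restart deployment kaireon-api -n kaireon
```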
Prevention
- Set idle_in_transaction_session_timeout = 30s in PostgreSQL.
- Configure PgBouncer server_idle_timeout = 600 to reclaim idle server connections.
- Use connection pooling in the application layer (Prisma connection limit).
- Set max_client_conn in PgBouncer to 2x expected peak.
- Add connection acquisition timeout (5s) in the application to fail fast.
- Audit code for missing finally blocks that release connections.
4. Redis OOM / Eviction or Quota Exhaustion
Severity: SEV2 (SEV1 if the decision cache is fully evicted or all writes are blocked by quota)
Free-tier quota note: On Upstash / Fly Redis / similar pay-by-ops services, this incident may present as ERR max requests limit exceeded. Limit: 500000 rather than OOM. The most common cause is idle BullMQ worker polling with WORKER_INPROCESS=1: five workers polling Redis with the BRPOPLPUSH command burn ~1.3M ops/month with zero queued work. The fix is WORKER_INPROCESS=0 plus a cron-driven drain; see the worker-mode-and-cron-drain runbook §8 for the full diagnosis and resolution. Real /recommend traffic on a typical playground is ~75-120K ops/month, well within the 500K free tier if the worker isn't burning it idle.
Detection
- Alert: redis_memory_used_bytes / redis_memory_max_bytes > 0.9 for 5 minutes.
- Alert: redis_evicted_keys_total rate > 0 for 3 minutes.
- Dashboard: Grafana > Redis > Memory panel.
- Metrics:
Diagnosis
- Check memory breakdown (see the sketch after this list):
- Identify largest keys:
- Check which key patterns consume the most memory:
- Check for abnormal client connections:
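A minimal sketch of these checks against a directly reachable Redis instance; the `decision:*` key pattern is an assumption, substitute your own prefixes:

```bash
# Memory breakdown and eviction counters.
redis-cli INFO memory | grep -E 'used_memory_human|maxmemory_human|mem_fragmentation_ratio'
redis-cli INFO stats | grep evicted_keys

# Sample the largest keys (uses SCAN internally, so it is incremental rather than blocking).
redis-cli --bigkeys

# Inspect a suspected key pattern.
redis-cli --scan --pattern 'decision:*' | head -20

# Client connection count.
redis-cli CLIENT LIST | wc -l
```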
Resolution
- Flush non-critical caches first (see the sketch after this list):
- If specific key patterns are bloated, expire them:
- Scale Redis vertically (if managed):
- Switch eviction policy if needed:
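A sketch of the flush, expire, and eviction-policy steps. The `feature:*` and `session:stale:*` patterns are assumptions; confirm what a pattern matches before deleting anything:

```bash
# Delete non-critical cache keys in batches via SCAN (never use KEYS in production).
redis-cli --scan --pattern 'feature:*' | xargs -r -n 500 redis-cli DEL

# Or expire a bloated pattern instead of deleting outright.
redis-cli --scan --pattern 'session:stale:*' | while read -r key; do
  redis-cli EXPIRE "$key" 60
done

# Switch the eviction policy so only TTL-bearing keys are evicted.
redis-cli CONFIG SET maxmemory-policy volatile-lru
```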
Prevention
- Set TTLs on all cache keys (decision cache: 60s, session: 24h, feature: 300s).
- Use the volatile-lru eviction policy so only keys with TTLs are evicted.
- Monitor the memory usage trend and scale proactively at 70%.
- Separate cache Redis from session/queue Redis to isolate failure domains.
- Implement key size limits in the application layer (reject values >1MB).
5. API 5xx Error Rate >1%
Severity: SEV2 (SEV1 if >5% or sustained >10 minutes)
Detection
- Alert: rate(http_responses_total{status=~"5.."}[5m]) / rate(http_responses_total[5m]) > 0.01 for 3 minutes.
- Dashboard: Grafana > KaireonAI API > Error Rate panel.
- Metrics:
Diagnosis
- Identify which endpoints are failing:
- Check application logs for errors (see the sketch after this list):
- Check if it correlates with a deployment:
- Check downstream dependencies:
- Check for resource pressure:
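A sketch of the log, deployment-correlation, and resource-pressure checks. The namespace, deployment name, label selector, and log format are assumptions:

```bash
# Recent error logs from the API pods.
kubectl logs -n kaireon deployment/kaireon-api --since=15m | grep -iE 'error|status=5' | tail -50

# Did a rollout land right before the spike?
kubectl rollout history deployment kaireon-api -n kaireon

# Restart counts and last termination reason (OOMKilled points at resource pressure, not code).
kubectl get pods -n kaireon -l app=kaireon-api \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount,REASON:.status.containerStatuses[0].lastState.terminated.reason
```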
Resolution
- If caused by a bad deployment, roll back (see the sketch after this list):
- If caused by downstream failure, enable circuit breaker:
- If caused by OOM kills, increase memory:
- If a specific route is the problem, disable it temporarily:
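A hedged sketch of the rollback and memory-increase steps; the deployment name and the memory values are illustrative:

```bash
# Roll back a bad deployment to the previous revision and watch it settle.
kubectl rollout undo deployment kaireon-api -n kaireon
kubectl rollout status deployment kaireon-api -n kaireon

# If OOM kills are the cause, raise the container memory request/limit (values are illustrative).
kubectl set resources deployment kaireon-api -n kaireon \
  --limits=memory=2Gi --requests=memory=1Gi
```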
Prevention
- Implement canary deployments (10% traffic for 5 minutes before full rollout).
- Add structured error logging with request IDs for traceability.
- Set memory limits with 20% headroom above observed peak.
- Run integration tests in staging before promoting to production.
- Implement retry with backoff for transient downstream failures.
6. DLQ Overflow (>500)
Severity: SEV2 (SEV1 if the DLQ contains decision requests)
Detection
- Alert: kaireon_dlq_depth > 500 for 10 minutes.
- Dashboard: Grafana > KaireonAI Workers > Dead Letter Queue panel.
- Metrics:
Diagnosis
- Sample DLQ messages to identify the failure pattern (see the sketch after this list):
- Check if a specific error dominates:
- Check if the DLQ growth correlates with a deployment or config change:
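A sampling sketch that assumes the DLQ is a Redis list named `kaireon:dlq` holding JSON payloads with an `error` field; none of those details are confirmed by this runbook, so verify the layout first:

```bash
# Sample the first 10 messages.
redis-cli LRANGE kaireon:dlq 0 9

# Count the dominant error types (assumes JSON payloads and jq installed).
redis-cli LRANGE kaireon:dlq 0 499 | jq -r '.error' | sort | uniq -c | sort -rn | head
```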
Resolution
- If messages are retryable (transient errors), replay them:
- If messages are poison pills (schema errors), archive and purge (see the sketch after this list):
- If caused by a downstream outage, wait for recovery then replay:
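An archive-then-purge sketch under the same assumptions as the sampling sketch above; the S3 bucket name is illustrative:

```bash
# Archive the full DLQ before purging poison pills.
redis-cli LRANGE kaireon:dlq 0 -1 > "dlq-archive-$(date +%F).jsonl"
aws s3 cp "dlq-archive-$(date +%F).jsonl" s3://kaireon-incident-archives/dlq/

# Purge only after the archive upload is confirmed.
redis-cli DEL kaireon:dlq
```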
Prevention
- Set DLQ alert threshold at 100 (not 500) for earlier detection.
- Implement automatic DLQ replay with exponential backoff (max 3 retries).
- Add DLQ message classification (transient vs. permanent failure).
- Archive DLQ messages to S3 daily for audit and forensics.
- Add schema validation before enqueue to reject malformed messages early.
7. Certificate Expiry
Severity: SEV1 (if <24 hours to expiry), SEV2 (if <7 days)
Detection
- Alert: cert_expiry_seconds < 604800 (7 days) for warning.
- Alert: cert_expiry_seconds < 86400 (24 hours) for critical.
- Dashboard: Grafana > Infrastructure > Certificate Expiry panel.
- Manual check (see the sketch below):
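One way to run the manual check; the hostname `api.kaireonai.com` is an assumption, substitute the endpoint you are investigating:

```bash
# Print the certificate's expiry date for the public endpoint.
echo | openssl s_client -connect api.kaireonai.com:443 -servername api.kaireonai.com 2>/dev/null \
  | openssl x509 -noout -enddate

# Exit non-zero if the certificate expires within 7 days (604800 seconds).
echo | openssl s_client -connect api.kaireonai.com:443 -servername api.kaireonai.com 2>/dev/null \
  | openssl x509 -noout -checkend 604800
```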
Diagnosis
- Identify which certificate is expiring:
- Check cert-manager status (if using cert-manager):
- Check if auto-renewal failed:
Resolution
- If cert-manager is installed, force renewal (see the sketch after this list):
- If cert-manager renewal is stuck, delete and recreate:
- If manual certificate, replace it:
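A cert-manager renewal sketch; the certificate/secret name `kaireon-tls` and the namespace are assumptions, and `cmctl` must be installed for the renew command:

```bash
# Find the certificate and its namespace.
kubectl get certificates -A

# Force a renewal with the cert-manager CLI.
cmctl renew -n kaireon kaireon-tls

# If renewal is stuck, deleting the TLS secret makes cert-manager re-issue the certificate.
kubectl delete secret kaireon-tls -n kaireon
kubectl describe certificate kaireon-tls -n kaireon   # watch for the Issuing/Ready conditions
```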
Prevention
- Use cert-manager with Let’s Encrypt for automatic renewal.
- Set alerts at 30, 14, 7, 3, and 1 day(s) before expiry.
- Run a weekly certificate audit job that scans all namespaces.
- Maintain a certificate inventory spreadsheet with owners and expiry dates.
- Test certificate renewal in staging monthly.
8. Node Capacity Exhaustion
Severity: SEV2 (SEV1 if pods are evicted or cannot schedule)
Detection
- Alert: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1.
- Alert: kube_node_status_condition{condition="DiskPressure",status="true"} == 1.
- Alert: Pending pods count > 0 for more than 5 minutes.
- Dashboard: Grafana > Kubernetes > Node Resources panel.
- Metrics:
Diagnosis
- Check node resource usage (see the sketch after this list):
- Check for pending pods:
- Check for resource hogs:
- Check cluster autoscaler status:
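A minimal sketch of these checks; the cluster-autoscaler location in `kube-system` is an assumption that varies by installation:

```bash
# Node usage and pressure conditions.
kubectl top nodes
kubectl describe nodes | grep -A5 'Conditions:'

# Pods that cannot schedule, and why.
kubectl get pods -A --field-selector=status.phase=Pending
kubectl describe pod <pending-pod> -n <namespace> | grep -A5 'Events:'

# Cluster-autoscaler activity (deployment name/namespace depend on your installation).
kubectl -n kube-system logs deployment/cluster-autoscaler --tail=50
```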
Resolution
- If autoscaler is enabled, check if it is working:
- Manually add nodes if autoscaler is stuck:
- Evict non-critical workloads:
- If disk pressure, clean up:
Prevention
- Set cluster autoscaler min/max to allow 30% headroom.
- Use PodDisruptionBudgets to protect critical workloads during eviction.
- Set resource requests and limits on all pods (no unbounded pods).
- Schedule non-critical batch jobs during off-peak hours.
- Run monthly capacity planning reviews based on growth trends.
9. Database Disk Full
Severity: SEV1
Detection
- Alert: pg_database_size_bytes / pg_disk_total_bytes > 0.85 for 10 minutes.
- Alert (RDS): FreeStorageSpace < 5GB for 5 minutes.
- Dashboard: Grafana > PostgreSQL > Disk Usage panel.
- Metrics:
Diagnosis
- Check current disk usage (see the sketch after this list):
- Check for bloat:
- Check for WAL accumulation:
- Check for orphaned temp files:
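A sketch of the size, bloat, and WAL checks, assuming psql access via `$DATABASE_URL`:

```bash
# Largest tables by total size (heap + indexes + TOAST).
psql "$DATABASE_URL" -c "
  SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS total_size
  FROM pg_class
  WHERE relkind = 'r'
  ORDER BY pg_total_relation_size(oid) DESC
  LIMIT 15;"

# Dead-tuple counts as a rough bloat signal.
psql "$DATABASE_URL" -c "
  SELECT relname, n_dead_tup, n_live_tup
  FROM pg_stat_user_tables
  ORDER BY n_dead_tup DESC
  LIMIT 10;"

# Replication slots that may be holding back WAL removal.
psql "$DATABASE_URL" -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"
```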
Resolution
- Immediate: Increase disk (if RDS with storage autoscaling disabled):
- Run emergency VACUUM on bloated tables (see the sketch after this list):
- Purge old data (if retention policy allows):
- Drop unused replication slots (WAL accumulation):
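A hedged sketch of the disk-increase, vacuum, and slot-drop steps; the instance identifier, storage size, and slot name are illustrative:

```bash
# Grow the RDS volume immediately (identifier and size are assumptions).
aws rds modify-db-instance \
  --db-instance-identifier kaireon-prod \
  --allocated-storage 500 \
  --apply-immediately

# Emergency vacuum on a bloated table (plain VACUUM does not take an exclusive lock; VACUUM FULL does).
psql "$DATABASE_URL" -c "VACUUM (VERBOSE, ANALYZE) decision_logs;"

# Drop an abandoned replication slot so WAL can be recycled.
psql "$DATABASE_URL" -c "SELECT pg_drop_replication_slot('stale_slot_name');"
```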
Prevention
- Enable RDS storage autoscaling with a maximum limit.
- Implement data retention policies with automated purge jobs (daily cron).
- Run VACUUM ANALYZE on large tables nightly.
- Partition large tables (decision_logs, audit_events) by month.
- Monitor disk growth rate and project exhaustion date weekly.
10. Auth Failures Spike
Severity: SEV2 (SEV1 if suspected credential compromise)
Detection
- Alert: rate(kaireon_auth_failures_total[5m]) > 10 for 3 minutes.
- Alert: rate(kaireon_auth_failures_total[5m]) / rate(kaireon_auth_attempts_total[5m]) > 0.1 for 5 minutes.
- Dashboard: Grafana > KaireonAI Security > Auth Failures panel.
- Metrics:
Diagnosis
- Identify the failure reason:
- Check if failures are from a single IP (brute force):
- Check if the OIDC/SAML provider is down:
- Check if tokens are expired due to clock skew:
- Check for recent secret/key rotation:
Resolution
- If brute force, block the source IP:
- If the OIDC provider is down, enable fallback auth:
- If key rotation broke auth, roll back the secret:
- If clock skew, sync NTP (see the sketch after this list):
- If credential compromise is suspected:
  - Rotate all API keys and service account tokens immediately.
  - Revoke all active sessions.
  - Enable enhanced logging.
  - Notify the security team.
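A sketch of the brute-force triage and clock-skew fix. The namespace, deployment name, and log format are assumptions, and the NTP commands assume chrony-based hosts:

```bash
# Top source IPs among recent auth failures (log format is an assumption).
kubectl logs -n kaireon deployment/kaireon-api --since=30m \
  | grep -i 'auth' | grep -i 'fail' \
  | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' \
  | sort | uniq -c | sort -rn | head

# Check and correct clock skew on a node.
timedatectl status
chronyc tracking
sudo chronyc makestep
```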
Prevention
- Implement rate limiting on auth endpoints (10 failures per IP per minute).
- Enable account lockout after 5 consecutive failures (30-minute window).
- Use short-lived tokens (15 minutes) with refresh token rotation.
- Monitor for credential stuffing patterns (many IPs, same usernames).
- Run quarterly penetration tests on auth flows.
- Sync NTP on all nodes; alert if clock drift >1 second.