Audience: On-call engineers, SREs, platform operators
Last updated: 2026-02-23
Escalation chain: On-call engineer -> Team lead -> VP Engineering -> CTO
Table of Contents
- Decision Latency Spike (>200ms P99)
- Worker Backlog (Queue Depth >100)
- Database Connection Exhaustion
- Redis OOM / Eviction
- API 5xx Error Rate >1%
- DLQ Overflow (>500)
- Certificate Expiry
- Node Capacity Exhaustion
- Database Disk Full
- Auth Failures Spike
General Incident Workflow
- Acknowledge the alert within 5 minutes.
- Assess severity using the matrix below.
- Communicate status in the #kaireon-incidents Slack channel.
- Mitigate using the relevant playbook.
- Resolve and confirm metrics return to normal.
- Post-mortem within 48 hours for SEV1/SEV2 incidents.
| Severity | Criteria | Response Time | Update Cadence |
|---|---|---|---|
| SEV1 | Full outage, data loss risk | 5 min | Every 15 min |
| SEV2 | Degraded service, >10% users affected | 15 min | Every 30 min |
| SEV3 | Minor degradation, <10% users affected | 30 min | Every 1 hour |
| SEV4 | Cosmetic, no user impact | Next business day | As needed |
1. Decision Latency Spike (>200ms P99)
Severity: SEV2 (SEV1 if sustained >10 minutes or >500ms P99)
Detection
- Alert: kaireon_decision_latency_ms P99 > 200ms for 3 consecutive minutes.
- Dashboard: Grafana > KaireonAI Decisions > Latency panel.
- Metrics:
Diagnosis
- Identify the bottleneck layer:
- Check for resource contention (see the sketch after this list):
- Check for slow queries:
- Check for model loading delays:
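A minimal diagnosis sketch for the contention, slow-query, and cache checks above. The `kaireon` namespace, the `$DATABASE_URL` connection string, and direct `redis-cli` access are assumptions, not names confirmed by this runbook:

```bash
# Resource contention: CPU/memory pressure on decision-path pods (namespace is an assumption).
kubectl top pods -n kaireon --sort-by=cpu | head -20

# Slow queries: statements running longer than 1 second on the primary.
psql "$DATABASE_URL" -c "
  SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
  FROM pg_stat_activity
  WHERE state <> 'idle' AND now() - query_start > interval '1 second'
  ORDER BY runtime DESC;"

# Cache layer: sample Redis latency over 5-second windows.
redis-cli --latency-history -i 5
```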
Resolution
- Immediate (if >500ms P99): Scale up API replicas (see the sketch after this list):
- If database-bound: Kill long-running queries and investigate:
- If Redis-bound: Flush stale caches:
- If scoring-bound: Enable score caching or disable non-critical rankers:
Prevention
- Set HPA target CPU to 60% (not 80%) for headroom.
- Enable scoring result caching with a 30-second TTL for repeat customer lookups.
- Run weekly load tests against the decision endpoint with production-like traffic.
- Maintain database query performance baselines; alert on 50% regression.
- Pre-warm model caches on pod startup using an init container.
2. Worker Backlog (Queue Depth >100)
Severity: SEV2 (SEV1 if depth >500 or growing >50/min)
Detection
- Alert: kaireon_active_worker_jobs > 100 for 5 minutes.
- Dashboard: Grafana > KaireonAI Workers > Queue Depth panel.
- Metrics:
- Queue names: batch-jobs, pipeline-jobs, dsar-jobs, retrain-jobs, journey-jobs
Diagnosis
- Determine which queue is backed up:
- Check worker health:
- Check if workers are stuck on a specific job:
- Check for upstream rate changes:
Resolution
- Scale workers immediately (see the sketch after this list):
- If workers are crash-looping, restart them:
- If a single job type is stuck, isolate it:
- If the queue is Redis-backed and Redis is the bottleneck:
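A sketch of the scale-and-inspect steps, assuming a `kaireon-worker` deployment and a BullMQ-style Redis key layout; both the deployment name and the `bull:<queue>:wait` key pattern are assumptions to verify against your installation:

```bash
# Scale workers to drain the backlog, or restart them if they are crash-looping.
kubectl scale deployment kaireon-worker -n kaireon --replicas=10
kubectl rollout restart deployment kaireon-worker -n kaireon

# Per-queue depth, assuming BullMQ's default "bull:<queue>:wait" list naming.
for q in batch-jobs pipeline-jobs dsar-jobs retrain-jobs journey-jobs; do
  printf '%s: ' "$q"
  redis-cli LLEN "bull:$q:wait"
done
```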
Prevention
- Configure KEDA autoscaler to trigger at queue depth 50 (not 100).
- Set per-job-type timeouts (default 60s) so stuck jobs do not block workers.
- Implement circuit breakers for downstream dependencies (connectors, external APIs).
- Monitor queue depth trend, not just threshold; alert on sustained growth rate.
- Set max retry count to 3 with exponential backoff to prevent retry storms.
3. Database Connection Exhaustion
Severity: SEV1
Detection
- Alert: pgbouncer_active_connections / pgbouncer_max_connections > 0.9 for 2 minutes.
- Alert: Application logs contain "remaining connection slots are reserved" or "too many connections".
- Dashboard: Grafana > PostgreSQL > Connection Pool panel.
- Metrics:
Diagnosis
- Check PgBouncer status (see the sketch after this list):
- Check for connection leaks in the application:
- Check PostgreSQL directly:
- Check for long-running transactions holding connections:
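A minimal sketch of these checks. The PgBouncer admin connection string and `$DATABASE_URL` are assumptions; point them at your own PgBouncer admin database and primary:

```bash
# PgBouncer pool status (connect to the special "pgbouncer" admin database).
psql "postgres://admin@pgbouncer:6432/pgbouncer" -c "SHOW POOLS;"

# Connections per application and state; a large idle-in-transaction count is the usual leak signature.
psql "$DATABASE_URL" -c "
  SELECT application_name, state, count(*)
  FROM pg_stat_activity
  GROUP BY 1, 2
  ORDER BY count(*) DESC;"

# Long-running transactions holding connections open.
psql "$DATABASE_URL" -c "
  SELECT pid, now() - xact_start AS xact_age, state, left(query, 60) AS query
  FROM pg_stat_activity
  WHERE xact_start IS NOT NULL
  ORDER BY xact_age DESC
  LIMIT 10;"
```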
Resolution
- Kill idle-in-transaction connections (>5 min):
- Reload PgBouncer if it is stuck (see the sketch after this list):
- Temporarily increase PgBouncer pool size:
- If a specific service is leaking, restart it:
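A sketch of the kill, reload, and restart steps under the same assumptions as above (admin connection string, namespace, and deployment name are illustrative):

```bash
# Kill sessions that have been idle in a transaction for more than 5 minutes.
psql "$DATABASE_URL" -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state = 'idle in transaction'
    AND now() - state_change > interval '5 minutes';"

# Reload PgBouncer without dropping existing client connections.
psql "postgres://admin@pgbouncer:6432/pgbouncer" -c "RELOAD;"

# Restart the suspected leaking service (deployment name is an assumption).
kubectl rollout restart deployment kaireon-api -n kaireon
```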
Prevention
- Set idle_in_transaction_session_timeout = 30s in PostgreSQL.
- Configure PgBouncer server_idle_timeout = 600 to reclaim idle server connections.
- Use connection pooling in the application layer (Prisma connection limit).
- Set max_client_conn in PgBouncer to 2x expected peak.
- Add connection acquisition timeout (5s) in the application to fail fast.
- Audit code for missing finally blocks that release connections.
4. Redis OOM / Eviction or Quota Exhaustion
Severity: SEV2 (SEV1 if the decision cache is fully evicted or all writes are blocked by quota)
Free-tier quota note: On Upstash / Fly Redis / similar pay-by-ops services, this incident may present as ERR max requests limit exceeded. Limit: 500000 rather than OOM. The most common cause is idle BullMQ worker polling with WORKER_INPROCESS=1: five workers polling Redis with the BRPOPLPUSH command burn ~1.3M ops/month with zero queued work. The fix is WORKER_INPROCESS=0 plus a cron-driven drain; see the worker-mode-and-cron-drain runbook §8 for the full diagnosis and resolution. Real /recommend traffic on a typical playground is ~75-120K ops/month, well within the 500K free tier if the worker isn't burning it idle.
Detection
- Alert: redis_memory_used_bytes / redis_memory_max_bytes > 0.9 for 5 minutes.
- Alert: redis_evicted_keys_total rate > 0 for 3 minutes.
- Dashboard: Grafana > Redis > Memory panel.
- Metrics:
Diagnosis
- Check memory breakdown (see the sketch after this list):
- Identify largest keys:
- Check which key patterns consume the most memory:
- Check for abnormal client connections:
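A minimal sketch of these checks against a directly reachable Redis instance; the `decision:*` key pattern is an assumption, substitute your own prefixes:

```bash
# Memory breakdown and eviction counters.
redis-cli INFO memory | grep -E 'used_memory_human|maxmemory_human|mem_fragmentation_ratio'
redis-cli INFO stats | grep evicted_keys

# Sample the largest keys (uses SCAN internally, so it is incremental rather than blocking).
redis-cli --bigkeys

# Inspect a suspected key pattern.
redis-cli --scan --pattern 'decision:*' | head -20

# Client connection count.
redis-cli CLIENT LIST | wc -l
```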
Resolution
- Flush non-critical caches first (see the sketch after this list):
- If specific key patterns are bloated, expire them:
- Scale Redis vertically (if managed):
- Switch eviction policy if needed:
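A sketch of the flush, expire, and eviction-policy steps. The `feature:*` and `session:stale:*` patterns are assumptions; confirm what a pattern matches before deleting anything:

```bash
# Delete non-critical cache keys in batches via SCAN (never use KEYS in production).
redis-cli --scan --pattern 'feature:*' | xargs -r -n 500 redis-cli DEL

# Or expire a bloated pattern instead of deleting outright.
redis-cli --scan --pattern 'session:stale:*' | while read -r key; do
  redis-cli EXPIRE "$key" 60
done

# Switch the eviction policy so only TTL-bearing keys are evicted.
redis-cli CONFIG SET maxmemory-policy volatile-lru
```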
Prevention
- Set TTLs on all cache keys (decision cache: 60s, session: 24h, feature: 300s).
- Use the volatile-lru eviction policy so only keys with TTLs are evicted.
- Monitor the memory usage trend and scale proactively at 70%.
- Separate cache Redis from session/queue Redis to isolate failure domains.
- Implement key size limits in the application layer (reject values >1MB).
5. API 5xx Error Rate >1%
Severity: SEV2 (SEV1 if >5% or sustained >10 minutes)
Detection
- Alert: rate(http_responses_total{status=~"5.."}[5m]) / rate(http_responses_total[5m]) > 0.01 for 3 minutes.
- Dashboard: Grafana > KaireonAI API > Error Rate panel.
- Metrics:
Diagnosis
- Identify which endpoints are failing:
- Check application logs for errors (see the sketch after this list):
- Check if it correlates with a deployment:
- Check downstream dependencies:
- Check for resource pressure:
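A sketch of the log, deployment-correlation, and resource-pressure checks. The namespace, deployment name, label selector, and log format are assumptions:

```bash
# Recent error logs from the API pods.
kubectl logs -n kaireon deployment/kaireon-api --since=15m | grep -iE 'error|status=5' | tail -50

# Did a rollout land right before the spike?
kubectl rollout history deployment kaireon-api -n kaireon

# Restart counts and last termination reason (OOMKilled points at resource pressure, not code).
kubectl get pods -n kaireon -l app=kaireon-api \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount,REASON:.status.containerStatuses[0].lastState.terminated.reason
```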
Resolution
- If caused by a bad deployment, roll back (see the sketch after this list):
- If caused by downstream failure, enable circuit breaker:
- If caused by OOM kills, increase memory:
- If a specific route is the problem, disable it temporarily:
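A hedged sketch of the rollback and memory-increase steps; the deployment name and the memory values are illustrative:

```bash
# Roll back a bad deployment to the previous revision and watch it settle.
kubectl rollout undo deployment kaireon-api -n kaireon
kubectl rollout status deployment kaireon-api -n kaireon

# If OOM kills are the cause, raise the container memory request/limit (values are illustrative).
kubectl set resources deployment kaireon-api -n kaireon \
  --limits=memory=2Gi --requests=memory=1Gi
```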
Prevention
- Implement canary deployments (10% traffic for 5 minutes before full rollout).
- Add structured error logging with request IDs for traceability.
- Set memory limits with 20% headroom above observed peak.
- Run integration tests in staging before promoting to production.
- Implement retry with backoff for transient downstream failures.
6. DLQ Overflow (>500)
Severity: SEV2 (SEV1 if the DLQ contains decision requests)
Detection
- Alert: kaireon_dlq_depth > 500 for 10 minutes.
- Dashboard: Grafana > KaireonAI Workers > Dead Letter Queue panel.
- Metrics:
Diagnosis
- Sample DLQ messages to identify the failure pattern (see the sketch after this list):
- Check if a specific error dominates:
- Check if the DLQ growth correlates with a deployment or config change:
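A sampling sketch that assumes the DLQ is a Redis list named `kaireon:dlq` holding JSON payloads with an `error` field; none of those details are confirmed by this runbook, so verify the layout first:

```bash
# Sample the first 10 messages.
redis-cli LRANGE kaireon:dlq 0 9

# Count the dominant error types (assumes JSON payloads and jq installed).
redis-cli LRANGE kaireon:dlq 0 499 | jq -r '.error' | sort | uniq -c | sort -rn | head
```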
Resolution
- If messages are retryable (transient errors), replay them:
- If messages are poison pills (schema errors), archive and purge (see the sketch after this list):
- If caused by a downstream outage, wait for recovery then replay:
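An archive-then-purge sketch under the same assumptions as the sampling sketch above; the S3 bucket name is illustrative:

```bash
# Archive the full DLQ before purging poison pills.
redis-cli LRANGE kaireon:dlq 0 -1 > "dlq-archive-$(date +%F).jsonl"
aws s3 cp "dlq-archive-$(date +%F).jsonl" s3://kaireon-incident-archives/dlq/

# Purge only after the archive upload is confirmed.
redis-cli DEL kaireon:dlq
```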
Prevention
- Set DLQ alert threshold at 100 (not 500) for earlier detection.
- Implement automatic DLQ replay with exponential backoff (max 3 retries).
- Add DLQ message classification (transient vs. permanent failure).
- Archive DLQ messages to S3 daily for audit and forensics.
- Add schema validation before enqueue to reject malformed messages early.
7. Certificate Expiry
Severity: SEV1 (if <24 hours to expiry), SEV2 (if <7 days)
Detection
- Alert: cert_expiry_seconds < 604800 (7 days) for warning.
- Alert: cert_expiry_seconds < 86400 (24 hours) for critical.
- Dashboard: Grafana > Infrastructure > Certificate Expiry panel.
- Manual check (see the sketch below):
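One way to run the manual check; the hostname `api.kaireonai.com` is an assumption, substitute the endpoint you are investigating:

```bash
# Print the certificate's expiry date for the public endpoint.
echo | openssl s_client -connect api.kaireonai.com:443 -servername api.kaireonai.com 2>/dev/null \
  | openssl x509 -noout -enddate

# Exit non-zero if the certificate expires within 7 days (604800 seconds).
echo | openssl s_client -connect api.kaireonai.com:443 -servername api.kaireonai.com 2>/dev/null \
  | openssl x509 -noout -checkend 604800
```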
Diagnosis
- Identify which certificate is expiring:
- Check cert-manager status (if using cert-manager):
- Check if auto-renewal failed:
Resolution
- If cert-manager is installed, force renewal (see the sketch after this list):
- If cert-manager renewal is stuck, delete and recreate:
- If manual certificate, replace it:
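A cert-manager renewal sketch; the certificate/secret name `kaireon-tls` and the namespace are assumptions, and `cmctl` must be installed for the renew command:

```bash
# Find the certificate and its namespace.
kubectl get certificates -A

# Force a renewal with the cert-manager CLI.
cmctl renew -n kaireon kaireon-tls

# If renewal is stuck, deleting the TLS secret makes cert-manager re-issue the certificate.
kubectl delete secret kaireon-tls -n kaireon
kubectl describe certificate kaireon-tls -n kaireon   # watch for the Issuing/Ready conditions
```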
Prevention
- Use cert-manager with Let’s Encrypt for automatic renewal.
- Set alerts at 30, 14, 7, 3, and 1 day(s) before expiry.
- Run a weekly certificate audit job that scans all namespaces.
- Maintain a certificate inventory spreadsheet with owners and expiry dates.
- Test certificate renewal in staging monthly.
8. Node Capacity Exhaustion
Severity: SEV2 (SEV1 if pods are evicted or cannot schedule)
Detection
- Alert: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1.
- Alert: kube_node_status_condition{condition="DiskPressure",status="true"} == 1.
- Alert: Pending pods count > 0 for more than 5 minutes.
- Dashboard: Grafana > Kubernetes > Node Resources panel.
- Metrics:
Diagnosis
- Check node resource usage (see the sketch after this list):
- Check for pending pods:
- Check for resource hogs:
- Check cluster autoscaler status:
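A minimal sketch of these checks; the cluster-autoscaler location in `kube-system` is an assumption that varies by installation:

```bash
# Node usage and pressure conditions.
kubectl top nodes
kubectl describe nodes | grep -A5 'Conditions:'

# Pods that cannot schedule, and why.
kubectl get pods -A --field-selector=status.phase=Pending
kubectl describe pod <pending-pod> -n <namespace> | grep -A5 'Events:'

# Cluster-autoscaler activity (deployment name/namespace depend on your installation).
kubectl -n kube-system logs deployment/cluster-autoscaler --tail=50
```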
Resolution
- If autoscaler is enabled, check if it is working:
- Manually add nodes if autoscaler is stuck:
- Evict non-critical workloads:
- If disk pressure, clean up:
Prevention
- Set cluster autoscaler min/max to allow 30% headroom.
- Use PodDisruptionBudgets to protect critical workloads during eviction.
- Set resource requests and limits on all pods (no unbounded pods).
- Schedule non-critical batch jobs during off-peak hours.
- Run monthly capacity planning reviews based on growth trends.
9. Database Disk Full
Severity: SEV1
Detection
- Alert: pg_database_size_bytes / pg_disk_total_bytes > 0.85 for 10 minutes.
- Alert (RDS): FreeStorageSpace < 5GB for 5 minutes.
- Dashboard: Grafana > PostgreSQL > Disk Usage panel.
- Metrics:
Diagnosis
- Check current disk usage (see the sketch after this list):
- Check for bloat:
- Check for WAL accumulation:
- Check for orphaned temp files:
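A sketch of the size, bloat, and WAL checks, assuming psql access via `$DATABASE_URL`:

```bash
# Largest tables by total size (heap + indexes + TOAST).
psql "$DATABASE_URL" -c "
  SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS total_size
  FROM pg_class
  WHERE relkind = 'r'
  ORDER BY pg_total_relation_size(oid) DESC
  LIMIT 15;"

# Dead-tuple counts as a rough bloat signal.
psql "$DATABASE_URL" -c "
  SELECT relname, n_dead_tup, n_live_tup
  FROM pg_stat_user_tables
  ORDER BY n_dead_tup DESC
  LIMIT 10;"

# Replication slots that may be holding back WAL removal.
psql "$DATABASE_URL" -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"
```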
Resolution
- Immediate: Increase disk (if RDS with storage autoscaling disabled):
- Run emergency VACUUM on bloated tables (see the sketch after this list):
- Purge old data (if retention policy allows):
- Drop unused replication slots (WAL accumulation):
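A hedged sketch of the disk-increase, vacuum, and slot-drop steps; the instance identifier, storage size, and slot name are illustrative:

```bash
# Grow the RDS volume immediately (identifier and size are assumptions).
aws rds modify-db-instance \
  --db-instance-identifier kaireon-prod \
  --allocated-storage 500 \
  --apply-immediately

# Emergency vacuum on a bloated table (plain VACUUM does not take an exclusive lock; VACUUM FULL does).
psql "$DATABASE_URL" -c "VACUUM (VERBOSE, ANALYZE) decision_logs;"

# Drop an abandoned replication slot so WAL can be recycled.
psql "$DATABASE_URL" -c "SELECT pg_drop_replication_slot('stale_slot_name');"
```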
Prevention
- Enable RDS storage autoscaling with a maximum limit.
- Implement data retention policies with automated purge jobs (daily cron).
- Run VACUUM ANALYZE on large tables nightly.
- Partition large tables (decision_logs, audit_events) by month.
- Monitor disk growth rate and project exhaustion date weekly.
10. Auth Failures Spike
Severity: SEV2 (SEV1 if suspected credential compromise)
Detection
- Alert: rate(kaireon_auth_failures_total[5m]) > 10 for 3 minutes.
- Alert: rate(kaireon_auth_failures_total[5m]) / rate(kaireon_auth_attempts_total[5m]) > 0.1 for 5 minutes.
- Dashboard: Grafana > KaireonAI Security > Auth Failures panel.
- Metrics:
Diagnosis
- Identify the failure reason:
- Check if failures are from a single IP (brute force):
- Check if the OIDC/SAML provider is down:
- Check if tokens are expired due to clock skew:
- Check for recent secret/key rotation:
Resolution
- If brute force, block the source IP:
- If the OIDC provider is down, enable fallback auth:
- If key rotation broke auth, roll back the secret:
- If clock skew, sync NTP (see the sketch after this list):
- If credential compromise is suspected:
  - Rotate all API keys and service account tokens immediately.
  - Revoke all active sessions.
  - Enable enhanced logging.
  - Notify the security team.
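A sketch of the brute-force triage and clock-skew fix. The namespace, deployment name, and log format are assumptions, and the NTP commands assume chrony-based hosts:

```bash
# Top source IPs among recent auth failures (log format is an assumption).
kubectl logs -n kaireon deployment/kaireon-api --since=30m \
  | grep -i 'auth' | grep -i 'fail' \
  | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' \
  | sort | uniq -c | sort -rn | head

# Check and correct clock skew on a node.
timedatectl status
chronyc tracking
sudo chronyc makestep
```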
Prevention
- Implement rate limiting on auth endpoints (10 failures per IP per minute).
- Enable account lockout after 5 consecutive failures (30-minute window).
- Use short-lived tokens (15 minutes) with refresh token rotation.
- Monitor for credential stuffing patterns (many IPs, same usernames).
- Run quarterly penetration tests on auth flows.
- Sync NTP on all nodes; alert if clock drift >1 second.