

Audience: On-call engineers, SREs, platform operators
Last updated: 2026-02-23
Escalation chain: On-call engineer -> Team lead -> VP Engineering -> CTO

Table of Contents

  1. Decision Latency Spike (>200ms P99)
  2. Worker Backlog (Queue Depth >100)
  3. Database Connection Exhaustion
  4. Redis OOM / Eviction
  5. API 5xx Error Rate >1%
  6. DLQ Overflow (>500)
  7. Certificate Expiry
  8. Node Capacity Exhaustion
  9. Database Disk Full
  10. Auth Failures Spike

General Incident Workflow

  1. Acknowledge the alert within 5 minutes.
  2. Assess severity using the matrix below.
  3. Communicate status in #kaireon-incidents Slack channel.
  4. Mitigate using the relevant playbook.
  5. Resolve and confirm metrics return to normal.
  6. Post-mortem within 48 hours for SEV1/SEV2 incidents.

Severity | Criteria                               | Response Time     | Update Cadence
SEV1     | Full outage, data loss risk            | 5 min             | Every 15 min
SEV2     | Degraded service, >10% users affected  | 15 min            | Every 30 min
SEV3     | Minor degradation, <10% users affected | 30 min            | Every 1 hour
SEV4     | Cosmetic, no user impact               | Next business day | As needed

1. Decision Latency Spike (>200ms P99)

Severity: SEV2 (SEV1 if sustained >10 minutes or >500ms P99)

Detection

  • Alert: kaireon_decision_latency_ms P99 > 200ms for 3 consecutive minutes.
  • Dashboard: Grafana > KaireonAI Decisions > Latency panel.
  • Metrics:
    histogram_quantile(0.99, sum(rate(kaireon_decision_latency_ms_bucket[5m])) by (le))
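
To spot-check the same P99 expression outside Grafana, you can query the Prometheus HTTP API directly. A minimal sketch; the Prometheus service address shown here is an assumption, adjust it to your monitoring namespace:

# Ad-hoc P99 query against Prometheus (assumed address prometheus.monitoring.svc:9090)
curl -sG http://prometheus.monitoring.svc:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(kaireon_decision_latency_ms_bucket[5m])) by (le))' \
  | jq '.data.result[0].value[1]'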
    

Diagnosis

  1. Identify the bottleneck layer:
    # Check overall API request latency (compare with the Redis and scoring timings below)
    kubectl exec -it deploy/kaireon-api -- curl -s localhost:9090/metrics | grep kaireon_http_request_duration_seconds
    
    # Check Redis latency
    kubectl exec -it deploy/kaireon-redis -- redis-cli --latency-history -i 1
    
    # Check scoring engine time
    kubectl logs deploy/kaireon-api --since=5m | grep "scoring_duration" | tail -20
    
  2. Check for resource contention:
    # CPU throttling on decision pods
    kubectl top pods -l app=kaireon-api
    
    # Check if HPA is maxed out
    kubectl get hpa kaireon-api-hpa
    
  3. Check for slow queries:
    SELECT pid, now() - pg_stat_activity.query_start AS duration, query
    FROM pg_stat_activity
    WHERE state != 'idle'
    AND now() - pg_stat_activity.query_start > interval '1 second'
    ORDER BY duration DESC;
    
  4. Check for model loading delays:
    kubectl logs deploy/kaireon-api --since=5m | grep -E "model_load|cache_miss"
    

Resolution

  1. Immediate (if >500ms P99): Scale up API replicas:
    kubectl scale deploy/kaireon-api --replicas=10
    
  2. If database-bound: Kill long-running queries and investigate:
    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE state != 'idle'
    AND now() - pg_stat_activity.query_start > interval '30 seconds'
    AND query NOT LIKE '%pg_stat_activity%';
    
  3. If Redis-bound: Flush stale caches:
    # Flush only the cache database (DB 1); session data lives in DB 0 (see playbook 4)
    kubectl exec -it deploy/kaireon-redis -- redis-cli -n 1 FLUSHDB ASYNC
    
  4. If scoring-bound: Enable score caching or disable non-critical rankers:
    kubectl set env deploy/kaireon-api SCORING_CACHE_ENABLED=true
    kubectl set env deploy/kaireon-api SCORING_TIMEOUT_MS=150
    

Prevention

  • Set HPA target CPU to 60% (not 80%) for headroom (see the sketch after this list).
  • Enable scoring result caching with a 30-second TTL for repeat customer lookups.
  • Run weekly load tests against the decision endpoint with production-like traffic.
  • Maintain database query performance baselines; alert on 50% regression.
  • Pre-warm model caches on pod startup using an init container.
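
A sketch of applying the 60% CPU target to the existing HPA. This assumes an autoscaling/v2 HPA whose only metric is CPU; a merge patch replaces the metrics list wholesale, so adjust if other metrics are configured:

# Lower the CPU utilization target on the existing HPA (assumed to be CPU-only)
kubectl patch hpa kaireon-api-hpa --type=merge -p \
  '{"spec":{"metrics":[{"type":"Resource","resource":{"name":"cpu","target":{"type":"Utilization","averageUtilization":60}}}]}}'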

2. Worker Backlog (Queue Depth >100)

Severity: SEV2 (SEV1 if depth >500 or growing >50/min)

Detection

  • Alert: kaireon_active_worker_jobs > 100 for 5 minutes.
  • Dashboard: Grafana > KaireonAI Workers > Queue Depth panel.
  • Metrics:
    kaireon_active_worker_jobs{queue="batch-jobs"}
    deriv(kaireon_active_worker_jobs[5m])
    
  • Queue names: batch-jobs, pipeline-jobs, dsar-jobs, retrain-jobs, journey-jobs

Diagnosis

  1. Determine which queue is backed up:
    kubectl exec -it deploy/kaireon-api -- curl -s localhost:9090/metrics | grep kaireon_active_worker_jobs
    
  2. Check worker health:
    kubectl get pods -l app=kaireon-worker
    kubectl logs -l app=kaireon-worker --since=5m --tail=50 | grep -E "error|panic|OOM"
    
  3. Check if workers are stuck on a specific job:
    kubectl logs -l app=kaireon-worker --since=5m | grep "processing_duration" | sort -t= -k2 -rn | head -10
    
  4. Check for upstream rate changes:
    rate(kaireon_worker_jobs_enqueued_total[5m])
    

Resolution

  1. Scale workers immediately:
    kubectl scale deploy/kaireon-worker --replicas=20
    
  2. If workers are crash-looping, restart them:
    kubectl rollout restart deploy/kaireon-worker
    
  3. If a single job type is stuck, isolate it:
    # Pause the problematic job type
    kubectl set env deploy/kaireon-worker SKIP_JOB_TYPES=pipeline_execute
    
  4. If the queue is Redis-backed and Redis is the bottleneck:
    kubectl exec -it deploy/kaireon-redis -- redis-cli INFO memory
    kubectl exec -it deploy/kaireon-redis -- redis-cli SLOWLOG GET 10
    

Prevention

  • Configure KEDA autoscaler to trigger at queue depth 50 (not 100); see the sketch after this list.
  • Set per-job-type timeouts (default 60s) so stuck jobs do not block workers.
  • Implement circuit breakers for downstream dependencies (connectors, external APIs).
  • Monitor queue depth trend, not just threshold; alert on sustained growth rate.
  • Set max retry count to 3 with exponential backoff to prevent retry storms.
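
To illustrate the KEDA trigger mentioned above, a minimal ScaledObject sketch. It assumes KEDA is installed, the queue is a Redis list, and the names and address shown here; adjust to your environment (add passwordFromEnv if Redis auth is enabled):

cat <<'EOF' | kubectl apply -f -
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kaireon-worker-scaler
  namespace: kaireon
spec:
  scaleTargetRef:
    name: kaireon-worker
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: redis
      metadata:
        address: "kaireon-redis.kaireon.svc:6379"   # assumed service address
        listName: batch-jobs
        listLength: "50"                            # scale out once depth exceeds 50
EOF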

3. Database Connection Exhaustion

Severity: SEV1

Detection

  • Alert: pgbouncer_active_connections / pgbouncer_max_connections > 0.9 for 2 minutes.
  • Alert: Application logs contain remaining connection slots are reserved or too many connections.
  • Dashboard: Grafana > PostgreSQL > Connection Pool panel.
  • Metrics:
    pgbouncer_pools_server_active_connections
    pgbouncer_pools_server_idle_connections
    pgbouncer_pools_client_waiting_connections
    

Diagnosis

  1. Check PgBouncer status:
    kubectl exec -it deploy/kaireon-pgbouncer -- psql -p 6432 pgbouncer -c "SHOW POOLS;"
    kubectl exec -it deploy/kaireon-pgbouncer -- psql -p 6432 pgbouncer -c "SHOW CLIENTS;"
    
  2. Check for connection leaks in the application:
    kubectl logs -l app=kaireon-api --since=10m | grep -E "connection|pool" | tail -30
    
  3. Check PostgreSQL directly:
    SELECT count(*), state, usename, application_name
    FROM pg_stat_activity
    GROUP BY state, usename, application_name
    ORDER BY count DESC;
    
  4. Check for long-running transactions holding connections:
    SELECT pid, now() - xact_start AS duration, state, query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
    ORDER BY xact_start ASC
    LIMIT 20;
    

Resolution

  1. Kill idle-in-transaction connections (>5 min):
    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE state = 'idle in transaction'
    AND now() - xact_start > interval '5 minutes';
    
  2. Reload PgBouncer if it is stuck:
    kubectl exec -it deploy/kaireon-pgbouncer -- psql -p 6432 pgbouncer -c "RELOAD;"
    
  3. Temporarily increase PgBouncer pool size:
    kubectl edit configmap kaireon-pgbouncer-config
    # Increase default_pool_size from 20 to 40
    kubectl rollout restart deploy/kaireon-pgbouncer
    
  4. If a specific service is leaking, restart it:
    kubectl rollout restart deploy/kaireon-api
    

Prevention

  • Set idle_in_transaction_session_timeout = 30s in PostgreSQL (see the sketch after this list).
  • Configure PgBouncer server_idle_timeout = 600 to reclaim idle server connections.
  • Use connection pooling in the application layer (Prisma connection limit).
  • Set max_client_conn in PgBouncer to 2x expected peak.
  • Add connection acquisition timeout (5s) in the application to fail fast.
  • Audit code for missing finally blocks that release connections.
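
A sketch of the idle-in-transaction timeout above, applied at the database level. It assumes a psql connection string in DATABASE_URL; on RDS, ALTER SYSTEM is unavailable, so set it per database (as here) or via the parameter group:

# New sessions pick this up automatically; existing sessions keep their old setting
psql "$DATABASE_URL" -c \
  "ALTER DATABASE kaireon SET idle_in_transaction_session_timeout = '30s';"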

4. Redis OOM / Eviction

Severity: SEV2 (SEV1 if decision cache is fully evicted)

Detection

  • Alert: redis_memory_used_bytes / redis_memory_max_bytes > 0.9 for 5 minutes.
  • Alert: redis_evicted_keys_total rate > 0 for 3 minutes.
  • Dashboard: Grafana > Redis > Memory panel.
  • Metrics:
    redis_memory_used_bytes
    rate(redis_evicted_keys_total[5m])
    redis_connected_clients
    

Diagnosis

  1. Check memory breakdown:
    kubectl exec -it deploy/kaireon-redis -- redis-cli INFO memory
    kubectl exec -it deploy/kaireon-redis -- redis-cli MEMORY DOCTOR
    
  2. Identify largest keys:
    kubectl exec -it deploy/kaireon-redis -- redis-cli --bigkeys
    
  3. Check which key patterns consume the most memory:
    kubectl exec -it deploy/kaireon-redis -- redis-cli --memkeys --pattern "decision:*" | tail -5
    kubectl exec -it deploy/kaireon-redis -- redis-cli --memkeys --pattern "session:*" | tail -5
    
  4. Check for abnormal client connections:
    kubectl exec -it deploy/kaireon-redis -- redis-cli CLIENT LIST | wc -l
    

Resolution

  1. Flush non-critical caches first:
    # Flush only the cache database (DB 1), not the session database (DB 0)
    kubectl exec -it deploy/kaireon-redis -- redis-cli -n 1 FLUSHDB ASYNC
    
  2. If specific key patterns are bloated, expire them:
    kubectl exec -it deploy/kaireon-redis -- redis-cli EVAL "
      local keys = redis.call('KEYS', ARGV[1])
      for _, key in ipairs(keys) do
        if redis.call('TTL', key) == -1 then
          redis.call('EXPIRE', key, 300)
        end
      end
      return #keys
    " 0 "pipeline:result:*"
    
  3. Scale Redis vertically (if managed):
    # For AWS ElastiCache — change instance type via console or Terraform
    # For self-managed, increase maxmemory:
    kubectl exec -it deploy/kaireon-redis -- redis-cli CONFIG SET maxmemory 8gb
    
  4. Switch eviction policy if needed:
    kubectl exec -it deploy/kaireon-redis -- redis-cli CONFIG SET maxmemory-policy allkeys-lru
    

Prevention

  • Set TTLs on all cache keys (decision cache: 60s, session: 24h, feature: 300s).
  • Use volatile-lru eviction policy so only keys with TTLs are evicted (see the sketch after this list).
  • Monitor memory usage trend and scale proactively at 70%.
  • Separate cache Redis from session/queue Redis to isolate failure domains.
  • Implement key size limits in the application layer (reject values >1MB).
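
A minimal sketch of the eviction-policy and TTL recommendations above. CONFIG REWRITE only persists the change when Redis was started from a config file; the sample key and value are purely illustrative:

# Evict only keys that carry a TTL
kubectl exec -it deploy/kaireon-redis -- redis-cli CONFIG SET maxmemory-policy volatile-lru
kubectl exec -it deploy/kaireon-redis -- redis-cli CONFIG REWRITE

# Always write cache entries with an explicit TTL (decision cache: 60s); hypothetical key/value
kubectl exec -it deploy/kaireon-redis -- redis-cli SET "decision:example-key" '{"score":0.92}' EX 60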

5. API 5xx Error Rate >1%

Severity: SEV2 (SEV1 if >5% or sustained >10 minutes)

Detection

  • Alert: rate(http_responses_total{status=~"5.."}[5m]) / rate(http_responses_total[5m]) > 0.01 for 3 minutes.
  • Dashboard: Grafana > KaireonAI API > Error Rate panel.
  • Metrics:
    sum(rate(http_responses_total{status=~"5.."}[5m])) by (route)
    / sum(rate(http_responses_total[5m])) by (route)
    

Diagnosis

  1. Identify which endpoints are failing:
    topk(5, sum(rate(http_responses_total{status=~"5.."}[5m])) by (route))
    
  2. Check application logs for errors:
    kubectl logs -l app=kaireon-api --since=5m | grep -E '"status":5[0-9]{2}' | head -20
    kubectl logs -l app=kaireon-api --since=5m | grep -i "error\|exception\|panic" | tail -30
    
  3. Check if it correlates with a deployment:
    kubectl rollout history deploy/kaireon-api
    
  4. Check downstream dependencies:
    # Database connectivity
    kubectl exec -it deploy/kaireon-api -- curl -s localhost:9090/healthz
    
    # Redis connectivity
    kubectl exec -it deploy/kaireon-redis -- redis-cli PING
    
  5. Check for resource pressure:
    kubectl top pods -l app=kaireon-api
    kubectl describe pods -l app=kaireon-api | grep -A5 "Last State"
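
If you suspect OOM kills specifically, this one-liner lists each pod's last termination reason (OOMKilled shows up here):

kubectl get pods -l app=kaireon-api -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'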
    

Resolution

  1. If caused by a bad deployment, rollback:
    kubectl rollout undo deploy/kaireon-api
    
  2. If caused by downstream failure, enable circuit breaker:
    kubectl set env deploy/kaireon-api CIRCUIT_BREAKER_ENABLED=true
    
  3. If caused by OOM kills, increase memory:
    kubectl set resources deploy/kaireon-api --limits=memory=2Gi --requests=memory=1Gi
    
  4. If a specific route is the problem, disable it temporarily:
    # Add the route to the maintenance blocklist
    kubectl set env deploy/kaireon-api BLOCKED_ROUTES="/api/v1/pipelines/execute"
    

Prevention

  • Implement canary deployments (10% traffic for 5 minutes before full rollout).
  • Add structured error logging with request IDs for traceability.
  • Set memory limits with 20% headroom above observed peak.
  • Run integration tests in staging before promoting to production.
  • Implement retry with backoff for transient downstream failures.

6. DLQ Overflow (>500)

Severity: SEV2 (SEV1 if DLQ contains decision requests)

Detection

  • Alert: kaireon_dlq_depth > 500 for 10 minutes.
  • Dashboard: Grafana > KaireonAI Workers > Dead Letter Queue panel.
  • Metrics:
    kaireon_dlq_depth{queue="decisions"}
    deriv(kaireon_dlq_depth[5m])
    

Diagnosis

  1. Sample DLQ messages to identify the failure pattern:
    kubectl exec -it deploy/kaireon-worker -- node -e "
      const Redis = require('ioredis');
      const r = new Redis(process.env.REDIS_URL);
      r.lrange('dlq:decisions', 0, 5).then(msgs => {
        msgs.forEach(m => console.log(JSON.parse(m).error));
        r.quit();
      });
    "
    
  2. Check if a specific error dominates:
    kubectl logs -l app=kaireon-worker --since=30m | grep "moved_to_dlq" | \
      jq -r '.error_type' | sort | uniq -c | sort -rn
    
  3. Check if the DLQ growth correlates with a deployment or config change:
    kubectl rollout history deploy/kaireon-worker
    kubectl get events --sort-by=.lastTimestamp | tail -20
    

Resolution

  1. If messages are retryable (transient errors), replay them:
    kubectl exec -it deploy/kaireon-worker -- node scripts/replay-dlq.js --queue=decisions --batch=50
    
  2. If messages are poison pills (schema errors), archive and purge:
    # Archive to S3 first
    kubectl exec -it deploy/kaireon-worker -- node scripts/archive-dlq.js --queue=decisions --dest=s3://kaireon-dlq-archive/
    
    # Then purge
    kubectl exec -it deploy/kaireon-redis -- redis-cli DEL dlq:decisions
    
  3. If caused by a downstream outage, wait for recovery then replay:
    # Verify downstream is healthy
    kubectl exec -it deploy/kaireon-api -- curl -s localhost:9090/healthz
    
    # Replay in controlled batches
    kubectl exec -it deploy/kaireon-worker -- node scripts/replay-dlq.js --queue=decisions --batch=10 --delay=1000
    

Prevention

  • Set DLQ alert threshold at 100 (not 500) for earlier detection.
  • Implement automatic DLQ replay with exponential backoff (max 3 retries).
  • Add DLQ message classification (transient vs. permanent failure).
  • Archive DLQ messages to S3 daily for audit and forensics.
  • Add schema validation before enqueue to reject malformed messages early.
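
A sketch of the daily S3 archive job mentioned above. The worker image name is an assumption; scripts/archive-dlq.js and the destination bucket come from the resolution steps:

# Daily DLQ archive at 03:00 UTC (assumed image name; adjust to your registry)
kubectl create cronjob kaireon-dlq-archive \
  --image=kaireon/worker:latest \
  --schedule="0 3 * * *" \
  -- node scripts/archive-dlq.js --queue=decisions --dest=s3://kaireon-dlq-archive/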

7. Certificate Expiry

Severity: SEV1 (if <24 hours to expiry), SEV2 (if <7 days)

Detection

  • Alert: cert_expiry_seconds < 604800 (7 days) for warning.
  • Alert: cert_expiry_seconds < 86400 (24 hours) for critical.
  • Dashboard: Grafana > Infrastructure > Certificate Expiry panel.
  • Manual check:
    # Check TLS cert for the API endpoint
    echo | openssl s_client -connect api.kaireon.com:443 2>/dev/null | openssl x509 -noout -dates
    
    # Check all Kubernetes TLS secrets
    kubectl get secrets --all-namespaces -o json | \
      jq -r '.items[] | select(.type=="kubernetes.io/tls") |
      "\(.metadata.namespace)/\(.metadata.name)"' | while read secret; do
        ns=$(echo $secret | cut -d/ -f1)
        name=$(echo $secret | cut -d/ -f2)
        expiry=$(kubectl get secret -n $ns $name -o jsonpath='{.data.tls\.crt}' | \
          base64 -d | openssl x509 -noout -enddate 2>/dev/null)
        echo "$secret: $expiry"
      done
    

Diagnosis

  1. Identify which certificate is expiring:
    kubectl get certificate --all-namespaces
    kubectl describe certificate -n kaireon kaireon-tls
    
  2. Check cert-manager status (if using cert-manager):
    kubectl get challenges --all-namespaces
    kubectl get orders --all-namespaces
    kubectl logs -n cert-manager deploy/cert-manager --since=1h | grep -i error
    
  3. Check if auto-renewal failed:
    kubectl describe certificaterequest -n kaireon | grep -A5 "Status"
    

Resolution

  1. If cert-manager is installed, force renewal:
    kubectl cert-manager renew -n kaireon kaireon-tls
    
  2. If cert-manager renewal is stuck, delete and recreate:
    kubectl delete certificate -n kaireon kaireon-tls
    kubectl apply -f k8s/certificates/kaireon-tls.yaml
    
  3. If manual certificate, replace it:
    kubectl create secret tls kaireon-tls \
      --cert=new-cert.pem \
      --key=new-key.pem \
      -n kaireon \
      --dry-run=client -o yaml | kubectl apply -f -
    
    # Restart ingress controller to pick up new cert
    kubectl rollout restart deploy/ingress-nginx-controller -n ingress-nginx
    

Prevention

  • Use cert-manager with Let’s Encrypt for automatic renewal (see the sketch after this list).
  • Set alerts at 30, 14, 7, 3, and 1 day(s) before expiry.
  • Run a weekly certificate audit job that scans all namespaces.
  • Maintain a certificate inventory spreadsheet with owners and expiry dates.
  • Test certificate renewal in staging monthly.
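
A minimal cert-manager Certificate sketch with early renewal. The ClusterIssuer name is an assumption; duration and renewBefore encode a 30-day renewal window:

cat <<'EOF' | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: kaireon-tls
  namespace: kaireon
spec:
  secretName: kaireon-tls
  dnsNames:
    - api.kaireon.com
  duration: 2160h     # 90 days
  renewBefore: 720h   # renew 30 days before expiry
  issuerRef:
    name: letsencrypt-prod   # assumed ClusterIssuer name
    kind: ClusterIssuer
EOF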

8. Node Capacity Exhaustion

Severity: SEV2 (SEV1 if pods are evicted or cannot schedule)

Detection

  • Alert: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1.
  • Alert: kube_node_status_condition{condition="DiskPressure",status="true"} == 1.
  • Alert: Pending pods count > 0 for more than 5 minutes.
  • Dashboard: Grafana > Kubernetes > Node Resources panel.
  • Metrics:
    sum(kube_pod_status_phase{phase="Pending"}) > 0
    node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
    

Diagnosis

  1. Check node resource usage:
    kubectl top nodes
    kubectl describe nodes | grep -A10 "Allocated resources"
    
  2. Check for pending pods:
    kubectl get pods --all-namespaces --field-selector=status.phase=Pending
    kubectl describe pod <pending-pod-name> | grep -A5 "Events"
    
  3. Check for resource hogs:
    kubectl top pods --all-namespaces --sort-by=memory | head -20
    kubectl top pods --all-namespaces --sort-by=cpu | head -20
    
  4. Check cluster autoscaler status:
    kubectl logs -n kube-system deploy/cluster-autoscaler --since=10m | grep -E "scale|unschedulable"
    

Resolution

  1. If autoscaler is enabled, check if it is working:
    kubectl get configmap -n kube-system cluster-autoscaler-status -o yaml
    
  2. Manually add nodes if autoscaler is stuck:
    # For EKS
    aws eks update-nodegroup-config \
      --cluster-name kaireon-prod \
      --nodegroup-name kaireon-workers \
      --scaling-config minSize=3,maxSize=20,desiredSize=10
    
  3. Evict non-critical workloads:
    kubectl scale deploy/kaireon-batch-processor --replicas=0
    kubectl scale deploy/kaireon-analytics --replicas=0
    
  4. If disk pressure, clean up:
    # On the affected node
    docker system prune -af
    crictl rmi --prune
    

Prevention

  • Set cluster autoscaler min/max to allow 30% headroom.
  • Use PodDisruptionBudgets to protect critical workloads during eviction (see the sketch after this list).
  • Set resource requests and limits on all pods (no unbounded pods).
  • Schedule non-critical batch jobs during off-peak hours.
  • Run monthly capacity planning reviews based on growth trends.
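
A sketch of the PodDisruptionBudget recommendation above. The namespace and minimum count are assumptions; size --min-available to your replica count:

# Keep at least 2 API pods available during node drains and evictions
kubectl create poddisruptionbudget kaireon-api-pdb \
  -n kaireon \
  --selector=app=kaireon-api \
  --min-available=2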

9. Database Disk Full

Severity: SEV1

Detection

  • Alert: pg_database_size_bytes / pg_disk_total_bytes > 0.85 for 10 minutes.
  • Alert (RDS): FreeStorageSpace < 5GB for 5 minutes.
  • Dashboard: Grafana > PostgreSQL > Disk Usage panel.
  • Metrics:
    pg_database_size_bytes{datname="kaireon"}
    node_filesystem_avail_bytes{mountpoint="/var/lib/postgresql"}
    

Diagnosis

  1. Check current disk usage:
    SELECT pg_size_pretty(pg_database_size('kaireon')) AS db_size;
    
    SELECT schemaname, tablename,
      pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS total_size
    FROM pg_tables
    WHERE schemaname = 'public'
    ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC
    LIMIT 20;
    
  2. Check for bloat:
    SELECT schemaname, relname,
      pg_size_pretty(pg_relation_size(relid)) AS table_size,
      n_dead_tup,
      n_live_tup,
      ROUND(n_dead_tup::numeric / NULLIF(n_live_tup, 0) * 100, 2) AS dead_pct
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 20;
    
  3. Check for WAL accumulation:
    SELECT pg_size_pretty(sum(size)) AS wal_size FROM pg_ls_waldir();
    SELECT * FROM pg_replication_slots;
    
  4. Check for orphaned temp files:
    SELECT pg_size_pretty(temp_bytes) AS temp_usage, datname
    FROM pg_stat_database
    WHERE temp_bytes > 0;
    

Resolution

  1. Immediate: Increase disk (if RDS with storage autoscaling disabled):
    aws rds modify-db-instance \
      --db-instance-identifier kaireon-prod \
      --allocated-storage 500 \
      --apply-immediately
    
  2. Run emergency VACUUM on bloated tables:
    VACUUM (VERBOSE) decision_logs;
    VACUUM (VERBOSE) audit_events;
    
  3. Purge old data (if retention policy allows):
    DELETE FROM decision_logs WHERE created_at < NOW() - INTERVAL '90 days';
    DELETE FROM audit_events WHERE created_at < NOW() - INTERVAL '180 days';
    -- VACUUM FULL rewrites the table under an ACCESS EXCLUSIVE lock; run in a maintenance window
    VACUUM FULL decision_logs;
    
  4. Drop unused replication slots (WAL accumulation):
    SELECT pg_drop_replication_slot('unused_slot_name');
    

Prevention

  • Enable RDS storage autoscaling with a maximum limit.
  • Implement data retention policies with automated purge jobs (daily cron).
  • Run VACUUM ANALYZE on large tables nightly.
  • Partition large tables (decision_logs, audit_events) by month.
  • Monitor disk growth rate and project exhaustion date weekly.
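
A sketch of the monthly partitioning recommendation above. It assumes decision_logs has already been converted to a declarative range-partitioned table on created_at, and a psql connection string in DATABASE_URL:

# Create next month's partition ahead of time (repeat monthly, e.g. from a scheduled job)
psql "$DATABASE_URL" -c "
  CREATE TABLE IF NOT EXISTS decision_logs_2026_03
    PARTITION OF decision_logs
    FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');
"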

10. Auth Failures Spike

Severity: SEV2 (SEV1 if suspected credential compromise)

Detection

  • Alert: rate(kaireon_auth_failures_total[5m]) > 10 for 3 minutes.
  • Alert: rate(kaireon_auth_failures_total[5m]) / rate(kaireon_auth_attempts_total[5m]) > 0.1 for 5 minutes.
  • Dashboard: Grafana > KaireonAI Security > Auth Failures panel.
  • Metrics:
    rate(kaireon_auth_failures_total[5m])
    topk(5, sum(rate(kaireon_auth_failures_total[5m])) by (reason))
    topk(5, sum(rate(kaireon_auth_failures_total[5m])) by (source_ip))
    

Diagnosis

  1. Identify the failure reason:
    kubectl logs -l app=kaireon-api --since=10m | grep "auth_failure" | \
      jq -r '.reason' | sort | uniq -c | sort -rn
    
  2. Check if failures are from a single IP (brute force):
    kubectl logs -l app=kaireon-api --since=10m | grep "auth_failure" | \
      jq -r '.source_ip' | sort | uniq -c | sort -rn | head -10
    
  3. Check if OIDC/SAML provider is down:
    curl -s https://auth.kaireon.com/.well-known/openid-configuration | jq .
    curl -s https://auth.kaireon.com/health
    
  4. Check if tokens are expired due to clock skew:
    # Check API server time
    kubectl exec -it deploy/kaireon-api -- date -u
    
    # Compare with auth server
    curl -sI https://auth.kaireon.com | grep Date
    
  5. Check for recent secret/key rotation:
    kubectl get secret kaireon-auth-config -o yaml | grep -E "last-applied|resourceVersion"
    

Resolution

  1. If brute force, block the source IP:
    # Add to WAF blocklist
    aws wafv2 update-ip-set \
      --name kaireon-blocklist \
      --scope REGIONAL \
      --addresses "1.2.3.4/32" \
      --id <ip-set-id> \
      --lock-token <lock-token>
    
  2. If OIDC provider is down, enable fallback auth:
    kubectl set env deploy/kaireon-api AUTH_FALLBACK_ENABLED=true
    
  3. If key rotation broke auth, rollback the secret:
    # Secrets have no rollout history; re-apply the last known-good manifest (e.g. from version control)
    kubectl apply -f <previous kaireon-auth-config manifest>
    kubectl rollout restart deploy/kaireon-api
    
  4. If clock skew, sync NTP:
    # On affected nodes
    sudo systemctl restart chronyd
    
  5. If credential compromise is suspected:
    • Rotate all API keys and service account tokens immediately.
    • Revoke all active sessions.
    • Enable enhanced logging.
    • Notify the security team.

Prevention

  • Implement rate limiting on auth endpoints (10 failures per IP per minute); see the sketch after this list.
  • Enable account lockout after 5 consecutive failures (30-minute window).
  • Use short-lived tokens (15 minutes) with refresh token rotation.
  • Monitor for credential stuffing patterns (many IPs, same usernames).
  • Run quarterly penetration tests on auth flows.
  • Sync NTP on all nodes; alert if clock drift >1 second.
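
A sketch of coarse rate limiting at the edge. It assumes ingress-nginx and an Ingress named kaireon-api in the kaireon namespace; the annotation limits all requests per client IP, so failure-based lockout still needs application logic:

# Limit each client IP to 120 requests per minute on the ingress fronting the auth routes
kubectl annotate ingress kaireon-api -n kaireon \
  nginx.ingress.kubernetes.io/limit-rpm="120" --overwrite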

Appendix: Useful Commands

Quick Health Check

# All-in-one health check
kubectl get pods -l app.kubernetes.io/part-of=kaireon | grep -v Running
kubectl top pods -l app.kubernetes.io/part-of=kaireon
kubectl exec -it deploy/kaireon-api -- curl -s localhost:9090/healthz | jq .
kubectl exec -it deploy/kaireon-redis -- redis-cli PING

Log Aggregation

# Tail all KaireonAI logs
kubectl logs -l app.kubernetes.io/part-of=kaireon --since=5m -f --max-log-requests=10

# Search for errors across all pods
kubectl logs -l app.kubernetes.io/part-of=kaireon --since=30m | grep -i error | sort | uniq -c | sort -rn

Metrics Quick Reference

# Decision throughput
sum(rate(kaireon_decisions_total[5m]))

# Error rate over 1h (divide by the SLO's allowed error rate to get the burn rate)
1 - (sum(rate(http_responses_total{status!~"5.."}[1h])) / sum(rate(http_responses_total[1h])))

# P99 latency
histogram_quantile(0.99, sum(rate(kaireon_decision_latency_ms_bucket[5m])) by (le))