KaireonAI Incident Response Runbook
Audience: On-call engineers, SREs, platform operators
Last updated: 2026-02-23
Escalation chain: On-call engineer -> Team lead -> VP Engineering -> CTO
Table of Contents
- Decision Latency Spike (>200ms P99)
- Worker Backlog (Queue Depth >100)
- Database Connection Exhaustion
- Redis OOM / Eviction
- API 5xx Error Rate >1%
- DLQ Overflow (>500)
- Certificate Expiry
- Node Capacity Exhaustion
- Database Disk Full
- Auth Failures Spike
General Incident Workflow
- Acknowledge the alert within 5 minutes.
- Assess severity using the matrix below.
- Communicate status in the #kaireon-incidents Slack channel.
- Mitigate using the relevant playbook.
- Resolve and confirm metrics return to normal.
- Post-mortem within 48 hours for SEV1/SEV2 incidents.
| Severity | Criteria | Response Time | Update Cadence |
|---|---|---|---|
| SEV1 | Full outage, data loss risk | 5 min | Every 15 min |
| SEV2 | Degraded service, >10% users affected | 15 min | Every 30 min |
| SEV3 | Minor degradation, <10% users affected | 30 min | Every 1 hour |
| SEV4 | Cosmetic, no user impact | Next business day | As needed |
1. Decision Latency Spike (>200ms P99)
Severity: SEV2 (SEV1 if sustained >10 minutes or >500ms P99)
Detection
- Alert: kaireon_decision_latency_ms P99 > 200ms for 3 consecutive minutes.
- Dashboard: Grafana > KaireonAI Decisions > Latency panel.
- Metric: histogram_quantile(0.99, rate(kaireon_decision_latency_ms_bucket[5m]))
Diagnosis
- Identify the bottleneck layer:
# Check request durations at the API layer (database slowness surfaces here)
kubectl exec -it deploy/kaireon-api -- curl -s localhost:9090/metrics | grep kaireon_http_request_duration_seconds
# Check Redis latency
kubectl exec -it deploy/kaireon-redis -- redis-cli --latency-history -i 1
# Check scoring engine time
kubectl logs deploy/kaireon-api --since=5m | grep "scoring_duration" | tail -20
- Check for resource contention:
# CPU throttling on decision pods
kubectl top pods -l app=kaireon-api
# Check if HPA is maxed out
kubectl get hpa kaireon-api-hpa
- Check for slow queries:
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle'
AND now() - pg_stat_activity.query_start > interval '1 second'
ORDER BY duration DESC;
- Check for model loading delays:
kubectl logs deploy/kaireon-api --since=5m | grep -E "model_load|cache_miss"
Resolution
- Immediate (if >500ms P99): Scale up API replicas:
kubectl scale deploy/kaireon-api --replicas=10
- If database-bound: Kill long-running queries and investigate:
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state != 'idle'
AND now() - pg_stat_activity.query_start > interval '30 seconds'
AND query NOT LIKE '%pg_stat_activity%';
- If Redis-bound: Flush stale caches (flush only the cache database, DB 1 — DB 0 holds sessions; see playbook 4):
kubectl exec -it deploy/kaireon-redis -- redis-cli -n 1 FLUSHDB ASYNC
- If scoring-bound: Enable score caching or disable non-critical rankers:
kubectl set env deploy/kaireon-api SCORING_CACHE_ENABLED=true
kubectl set env deploy/kaireon-api SCORING_TIMEOUT_MS=150
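kubectl set env triggers a rolling update of the deployment; before judging the mitigation, confirm the rollout completed and pods are healthy (same names and labels as above):
kubectl rollout status deploy/kaireon-api --timeout=120s
kubectl get pods -l app=kaireon-api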
Prevention
- Set HPA target CPU to 60% (not 80%) for headroom; see the patch sketch after this list.
- Enable scoring result caching with a 30-second TTL for repeat customer lookups.
- Run weekly load tests against the decision endpoint with production-like traffic.
- Maintain database query performance baselines; alert on 50% regression.
- Pre-warm model caches on pod startup using an init container.
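A minimal sketch of the HPA headroom change from the first item, assuming kaireon-api-hpa is an autoscaling/v2 HPA with a CPU utilization target (verify against your manifest; a merge patch replaces the whole metrics list, so include any other metrics the HPA defines):
kubectl patch hpa kaireon-api-hpa --type merge -p \
  '{"spec":{"metrics":[{"type":"Resource","resource":{"name":"cpu","target":{"type":"Utilization","averageUtilization":60}}}]}}'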
2. Worker Backlog (Queue Depth >100)
Severity: SEV2 (SEV1 if depth >500 or growing >50/min)
Detection
- Alert: kaireon_active_worker_jobs > 100 for 5 minutes.
- Dashboard: Grafana > KaireonAI Workers > Queue Depth panel.
- Metrics:
kaireon_active_worker_jobs{queue="batch-jobs"}
rate(kaireon_active_worker_jobs[5m])
- Queue names: batch-jobs, pipeline-jobs, dsar-jobs, retrain-jobs, journey-jobs
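A quick per-queue depth readout, sketched under the assumption that each queue is a Redis list keyed queue:<name> — verify the key prefix in your deployment before relying on it:
# Key naming below is an assumption; adjust to match your Redis schema
for q in batch-jobs pipeline-jobs dsar-jobs retrain-jobs journey-jobs; do
  echo -n "$q: "
  kubectl exec deploy/kaireon-redis -- redis-cli LLEN "queue:$q"
done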
Diagnosis
- Determine which queue is backed up:
kubectl exec -it deploy/kaireon-api -- curl -s localhost:9090/metrics | grep queue_depth
- Check worker health:
kubectl get pods -l app=kaireon-worker
kubectl logs -l app=kaireon-worker --since=5m --tail=50 | grep -E "error|panic|OOM"
- Check if workers are stuck on a specific job:
kubectl logs -l app=kaireon-worker --since=5m | grep "processing_duration" | sort -t= -k2 -rn | head -10
- Check for upstream rate changes:
rate(kaireon_worker_jobs_enqueued_total[5m])
Resolution
- Scale workers immediately:
kubectl scale deploy/kaireon-worker --replicas=20
- If workers are crash-looping, restart them:
kubectl rollout restart deploy/kaireon-worker
- If a single job type is stuck, isolate it:
# Pause the problematic job type
kubectl set env deploy/kaireon-worker SKIP_JOB_TYPES=pipeline_execute
- If the queue is Redis-backed and Redis is the bottleneck:
kubectl exec -it deploy/kaireon-redis -- redis-cli INFO memory
kubectl exec -it deploy/kaireon-redis -- redis-cli SLOWLOG GET 10
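To confirm the backlog is draining after any of the above, poll the same metric used in diagnosis:
watch -n 10 'kubectl exec deploy/kaireon-api -- curl -s localhost:9090/metrics | grep queue_depth'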
Prevention
- Configure KEDA autoscaler to trigger at queue depth 50 (not 100).
- Set per-job-type timeouts (default 60s) so stuck jobs do not block workers.
- Implement circuit breakers for downstream dependencies (connectors, external APIs).
- Monitor queue depth trend, not just threshold; alert on sustained growth rate.
- Set max retry count to 3 with exponential backoff to prevent retry storms.
3. Database Connection Exhaustion
Severity: SEV1
Detection
- Alert: pgbouncer_active_connections / pgbouncer_max_connections > 0.9 for 2 minutes.
- Alert: Application logs contain "remaining connection slots are reserved" or "too many connections".
- Dashboard: Grafana > PostgreSQL > Connection Pool panel.
- Metrics:
pgbouncer_pools_server_active_connections
pgbouncer_pools_server_idle_connections
pgbouncer_pools_client_waiting_connections
Diagnosis
- Check PgBouncer status:
kubectl exec -it deploy/kaireon-pgbouncer -- psql -p 6432 pgbouncer -c "SHOW POOLS;"
kubectl exec -it deploy/kaireon-pgbouncer -- psql -p 6432 pgbouncer -c "SHOW CLIENTS;"
- Check for connection leaks in the application:
kubectl logs -l app=kaireon-api --since=10m | grep -E "connection|pool" | tail -30
- Check PostgreSQL directly:
SELECT count(*), state, usename, application_name
FROM pg_stat_activity
GROUP BY state, usename, application_name
ORDER BY count DESC;
- Check for long-running transactions holding connections:
SELECT pid, now() - xact_start AS duration, state, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC
LIMIT 20;
Resolution
- Kill idle-in-transaction connections (>5 min):
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND now() - xact_start > interval '5 minutes';
- Reload PgBouncer if it is stuck:
kubectl exec -it deploy/kaireon-pgbouncer -- psql -p 6432 pgbouncer -c "RELOAD;"
- Temporarily increase PgBouncer pool size:
kubectl edit configmap kaireon-pgbouncer-config
# Increase default_pool_size from 20 to 40
kubectl rollout restart deploy/kaireon-pgbouncer
- If a specific service is leaking, restart it:
kubectl rollout restart deploy/kaireon-api
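After any of the above, re-run SHOW POOLS and confirm cl_waiting is falling back toward 0:
kubectl exec -it deploy/kaireon-pgbouncer -- psql -p 6432 pgbouncer -c "SHOW POOLS;"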
Prevention
- Set idle_in_transaction_session_timeout = 30s in PostgreSQL (see the sketch after this list).
- Configure PgBouncer server_idle_timeout = 600 to reclaim idle server connections.
- Use connection pooling in the application layer (Prisma connection limit).
- Set max_client_conn in PgBouncer to 2x expected peak.
- Add connection acquisition timeout (5s) in the application to fail fast.
- Audit code for missing finally blocks that release connections.
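A sketch for applying the PostgreSQL timeout above (the PgBouncer settings live in its configmap instead); it assumes superuser psql access to the primary via a DATABASE_URL variable, which is a placeholder for your own connection method:
# DATABASE_URL is hypothetical; substitute your connection string
psql "$DATABASE_URL" -c "ALTER SYSTEM SET idle_in_transaction_session_timeout = '30s';"
psql "$DATABASE_URL" -c "SELECT pg_reload_conf();"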
4. Redis OOM / Eviction
Severity: SEV2 (SEV1 if decision cache is fully evicted)
Detection
- Alert: redis_memory_used_bytes / redis_memory_max_bytes > 0.9 for 5 minutes.
- Alert: redis_evicted_keys_total rate > 0 for 3 minutes.
- Dashboard: Grafana > Redis > Memory panel.
- Metrics:
redis_memory_used_bytes
rate(redis_evicted_keys_total[5m])
redis_connected_clients
Diagnosis
- Check memory breakdown:
kubectl exec -it deploy/kaireon-redis -- redis-cli INFO memory
kubectl exec -it deploy/kaireon-redis -- redis-cli MEMORY DOCTOR
- Identify largest keys:
kubectl exec -it deploy/kaireon-redis -- redis-cli --bigkeys
- Check which key patterns consume the most memory:
kubectl exec -it deploy/kaireon-redis -- redis-cli --memkeys --pattern "decision:*" | tail -5
kubectl exec -it deploy/kaireon-redis -- redis-cli --memkeys --pattern "session:*" | tail -5
- Check for abnormal client connections:
kubectl exec -it deploy/kaireon-redis -- redis-cli CLIENT LIST | wc -l
Resolution
- Flush non-critical caches first:
# Flush only the cache database (DB 1), not the session database (DB 0)
kubectl exec -it deploy/kaireon-redis -- redis-cli -n 1 FLUSHDB ASYNC
- If specific key patterns are bloated, expire them (note: KEYS blocks Redis while it scans, so expect a brief pause on large keyspaces — emergency use only):
kubectl exec -it deploy/kaireon-redis -- redis-cli EVAL "
local keys = redis.call('KEYS', ARGV[1])
for _, key in ipairs(keys) do
if redis.call('TTL', key) == -1 then
redis.call('EXPIRE', key, 300)
end
end
return #keys
" 0 "pipeline:result:*"
- Scale Redis vertically (if managed):
# For AWS ElastiCache — change instance type via console or Terraform
# For self-managed, increase maxmemory:
kubectl exec -it deploy/kaireon-redis -- redis-cli CONFIG SET maxmemory 8gb
- Switch eviction policy if needed:
kubectl exec -it deploy/kaireon-redis -- redis-cli CONFIG SET maxmemory-policy allkeys-lru
Prevention
- Set TTLs on all cache keys (decision cache: 60s, session: 24h, feature: 300s); a spot-check sketch follows this list.
- Use the volatile-lru eviction policy so only keys with TTLs are evicted.
- Monitor memory usage trend and scale proactively at 70%.
- Separate cache Redis from session/queue Redis to isolate failure domains.
- Implement key size limits in the application layer (reject values >1MB).
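A spot-check sketch for the TTL rule in the first item: sample 100 decision-cache keys and flag any without an expiry. SCAN-based, so safe against production, but the per-key exec round-trip is slow — sampling only:
kubectl exec deploy/kaireon-redis -- redis-cli --scan --pattern 'decision:*' | head -100 | \
while read -r key; do
  # TTL returns -1 for keys that exist but have no expiry set
  ttl=$(kubectl exec deploy/kaireon-redis -- redis-cli TTL "$key")
  [ "$ttl" = "-1" ] && echo "no TTL: $key"
done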
5. API 5xx Error Rate >1%
Severity: SEV2 (SEV1 if >5% or sustained >10 minutes)
Detection
- Alert: rate(http_responses_total{status=~"5.."}[5m]) / rate(http_responses_total[5m]) > 0.01 for 3 minutes.
- Dashboard: Grafana > KaireonAI API > Error Rate panel.
- Metrics:
sum(rate(http_responses_total{status=~"5.."}[5m])) by (route)
/ sum(rate(http_responses_total[5m])) by (route)
Diagnosis
- Identify which endpoints are failing:
topk(5, sum(rate(http_responses_total{status=~"5.."}[5m])) by (route))
- Check application logs for errors:
kubectl logs -l app=kaireon-api --since=5m | grep -E '"status":5[0-9]{2}' | head -20
kubectl logs -l app=kaireon-api --since=5m | grep -i "error\|exception\|panic" | tail -30
- Check if it correlates with a deployment:
kubectl rollout history deploy/kaireon-api
- Check downstream dependencies:
# Database connectivity
kubectl exec -it deploy/kaireon-api -- curl -s localhost:9090/healthz
# Redis connectivity
kubectl exec -it deploy/kaireon-redis -- redis-cli PING
- Check for resource pressure:
kubectl top pods -l app=kaireon-api
kubectl describe pods -l app=kaireon-api | grep -A5 "Last State"
Resolution
- If caused by a bad deployment, roll back:
kubectl rollout undo deploy/kaireon-api
- If caused by downstream failure, enable circuit breaker:
kubectl set env deploy/kaireon-api CIRCUIT_BREAKER_ENABLED=true
- If caused by OOM kills, increase memory:
kubectl set resources deploy/kaireon-api --limits=memory=2Gi --requests=memory=1Gi
- If a specific route is the problem, disable it temporarily:
# Add the route to the maintenance blocklist
kubectl set env deploy/kaireon-api BLOCKED_ROUTES="/api/v1/pipelines/execute"
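After mitigating, confirm 5xx volume is actually falling — a quick count using the same log filter as diagnosis:
kubectl logs -l app=kaireon-api --since=2m | grep -cE '"status":5[0-9]{2}'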
Prevention
- Implement canary deployments (10% traffic for 5 minutes before full rollout).
- Add structured error logging with request IDs for traceability.
- Set memory limits with 20% headroom above observed peak.
- Run integration tests in staging before promoting to production.
- Implement retry with backoff for transient downstream failures.
6. DLQ Overflow (>500)
Severity: SEV2 (SEV1 if DLQ contains decision requests)
Detection
- Alert: kaireon_dlq_depth > 500 for 10 minutes.
- Dashboard: Grafana > KaireonAI Workers > Dead Letter Queue panel.
- Metrics:
kaireon_dlq_depth{queue="decisions"}
rate(kaireon_dlq_depth[5m])
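For a direct reading, the DLQ is a Redis list (the same dlq:decisions key used in the steps below):
kubectl exec deploy/kaireon-redis -- redis-cli LLEN dlq:decisions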
Diagnosis
- Sample DLQ messages to identify the failure pattern:
kubectl exec -it deploy/kaireon-worker -- node -e "
const Redis = require('ioredis');
const r = new Redis(process.env.REDIS_URL);
r.lrange('dlq:decisions', 0, 5).then(msgs => {
msgs.forEach(m => console.log(JSON.parse(m).error));
r.quit();
});
"
- Check if a specific error dominates:
kubectl logs -l app=kaireon-worker --since=30m | grep "moved_to_dlq" | \
jq -r '.error_type' | sort | uniq -c | sort -rn
- Check if the DLQ growth correlates with a deployment or config change:
kubectl rollout history deploy/kaireon-worker
kubectl get events --sort-by=.lastTimestamp | tail -20
Resolution
- If messages are retryable (transient errors), replay them:
kubectl exec -it deploy/kaireon-worker -- node scripts/replay-dlq.js --queue=decisions --batch=50
- If messages are poison pills (schema errors), archive and purge:
# Archive to S3 first
kubectl exec -it deploy/kaireon-worker -- node scripts/archive-dlq.js --queue=decisions --dest=s3://kaireon-dlq-archive/
# Then purge
kubectl exec -it deploy/kaireon-redis -- redis-cli DEL dlq:decisions
- If caused by a downstream outage, wait for recovery, then replay:
# Verify downstream is healthy
kubectl exec -it deploy/kaireon-api -- curl -s localhost:9090/healthz
# Replay in controlled batches
kubectl exec -it deploy/kaireon-worker -- node scripts/replay-dlq.js --queue=decisions --batch=10 --delay=1000
Prevention
- Set DLQ alert threshold at 100 (not 500) for earlier detection.
- Implement automatic DLQ replay with exponential backoff (max 3 retries).
- Add DLQ message classification (transient vs. permanent failure).
- Archive DLQ messages to S3 daily for audit and forensics.
- Add schema validation before enqueue to reject malformed messages early.
7. Certificate Expiry
Severity: SEV1 (if <24 hours to expiry), SEV2 (if <7 days)
Detection
- Alert: cert_expiry_seconds < 604800 (7 days) for warning.
- Alert: cert_expiry_seconds < 86400 (24 hours) for critical.
- Dashboard: Grafana > Infrastructure > Certificate Expiry panel.
- Manual check:
# Check TLS cert for the API endpoint
echo | openssl s_client -connect api.kaireon.com:443 2>/dev/null | openssl x509 -noout -dates
# Check all Kubernetes TLS secrets
kubectl get secrets --all-namespaces -o json | \
jq -r '.items[] | select(.type=="kubernetes.io/tls") |
"\(.metadata.namespace)/\(.metadata.name)"' | while read secret; do
ns=$(echo $secret | cut -d/ -f1)
name=$(echo $secret | cut -d/ -f2)
expiry=$(kubectl get secret -n $ns $name -o jsonpath='{.data.tls\.crt}' | \
base64 -d | openssl x509 -noout -enddate 2>/dev/null)
echo "$secret: $expiry"
done
Diagnosis
- Identify which certificate is expiring:
kubectl get certificate --all-namespaces
kubectl describe certificate -n kaireon kaireon-tls
- Check cert-manager status (if using cert-manager):
kubectl get challenges --all-namespaces
kubectl get orders --all-namespaces
kubectl logs -n cert-manager deploy/cert-manager --since=1h | grep -i error
- Check if auto-renewal failed:
kubectl describe certificaterequest -n kaireon | grep -A5 "Status"
Resolution
- If cert-manager is installed, force renewal (uses the cert-manager kubectl plugin, also packaged as cmctl):
kubectl cert-manager renew -n kaireon kaireon-tls
- If cert-manager renewal is stuck, delete and recreate:
kubectl delete certificate -n kaireon kaireon-tls
kubectl apply -f k8s/certificates/kaireon-tls.yaml
- If the certificate is managed manually, replace it:
kubectl create secret tls kaireon-tls \
--cert=new-cert.pem \
--key=new-key.pem \
-n kaireon \
--dry-run=client -o yaml | kubectl apply -f -
# Restart ingress controller to pick up new cert
kubectl rollout restart deploy/ingress-nginx-controller -n ingress-nginx
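Then confirm the endpoint serves the new certificate (the same openssl check as in Detection):
echo | openssl s_client -connect api.kaireon.com:443 2>/dev/null | openssl x509 -noout -dates -issuer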
Prevention
- Use cert-manager with Let’s Encrypt for automatic renewal.
- Set alerts at 30, 14, 7, 3, and 1 day(s) before expiry.
- Run a weekly certificate audit job that scans all namespaces.
- Maintain a certificate inventory spreadsheet with owners and expiry dates.
- Test certificate renewal in staging monthly.
8. Node Capacity Exhaustion
Severity: SEV2 (SEV1 if pods are evicted or cannot schedule)
Detection
- Alert: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1.
- Alert: kube_node_status_condition{condition="DiskPressure",status="true"} == 1.
- Alert: Pending pods count > 0 for more than 5 minutes.
- Dashboard: Grafana > Kubernetes > Node Resources panel.
- Metrics:
sum(kube_pod_status_phase{phase="Pending"}) > 0
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
Diagnosis
- Check node resource usage:
kubectl top nodes
kubectl describe nodes | grep -A10 "Allocated resources"
- Check for pending pods:
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
kubectl describe pod <pending-pod-name> | grep -A5 "Events"
- Check for resource hogs:
kubectl top pods --all-namespaces --sort-by=memory | head -20
kubectl top pods --all-namespaces --sort-by=cpu | head -20
- Check cluster autoscaler status:
kubectl logs -n kube-system deploy/cluster-autoscaler --since=10m | grep -E "scale|unschedulable"
Resolution
- If autoscaler is enabled, check if it is working:
kubectl get configmap -n kube-system cluster-autoscaler-status -o yaml
- Manually add nodes if autoscaler is stuck:
# For EKS
aws eks update-nodegroup-config \
--cluster-name kaireon-prod \
--nodegroup-name kaireon-workers \
--scaling-config minSize=3,maxSize=20,desiredSize=10
- Evict non-critical workloads:
kubectl scale deploy/kaireon-batch-processor --replicas=0
kubectl scale deploy/kaireon-analytics --replicas=0
- If the node is under disk pressure, clean up:
# On the affected node
docker system prune -af
crictl rmi --prune
Prevention
- Set cluster autoscaler min/max to allow 30% headroom.
- Use PodDisruptionBudgets to protect critical workloads during eviction.
- Set resource requests and limits on all pods (no unbounded pods); a sketch for finding offenders follows this list.
- Schedule non-critical batch jobs during off-peak hours.
- Run monthly capacity planning reviews based on growth trends.
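A sketch for finding unbounded pods per the third item above; it checks only the first container in each pod, so extend it to all containers as needed:
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[] | select(.spec.containers[0].resources.limits.memory == null) |
    "\(.metadata.namespace)/\(.metadata.name)"'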
9. Database Disk Full
Severity: SEV1
Detection
- Alert: pg_database_size_bytes / pg_disk_total_bytes > 0.85 for 10 minutes.
- Alert (RDS): FreeStorageSpace < 5GB for 5 minutes.
- Dashboard: Grafana > PostgreSQL > Disk Usage panel.
- Metrics:
pg_database_size_bytes{datname="kaireon"}
node_filesystem_avail_bytes{mountpoint="/var/lib/postgresql"}
Diagnosis
- Check current disk usage:
SELECT pg_size_pretty(pg_database_size('kaireon')) AS db_size;
SELECT schemaname, tablename,
pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS total_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC
LIMIT 20;
- Check for bloat:
SELECT schemaname, tablename,
pg_size_pretty(pg_relation_size(schemaname || '.' || tablename)) AS table_size,
n_dead_tup,
n_live_tup,
ROUND(n_dead_tup::numeric / NULLIF(n_live_tup, 0) * 100, 2) AS dead_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
- Check for WAL accumulation:
SELECT pg_size_pretty(sum(size)) AS wal_size FROM pg_ls_waldir();
SELECT * FROM pg_replication_slots;
- Check temp file usage (temp_bytes is cumulative since the last stats reset):
SELECT pg_size_pretty(temp_bytes) AS temp_usage, datname
FROM pg_stat_database
WHERE temp_bytes > 0;
Resolution
- Immediate: Increase disk (if RDS with storage autoscaling disabled):
aws rds modify-db-instance \
--db-instance-identifier kaireon-prod \
--allocated-storage 500 \
--apply-immediately
- Run emergency VACUUM on bloated tables:
VACUUM (VERBOSE) decision_logs;
VACUUM (VERBOSE) audit_events;
- Purge old data (if retention policy allows):
DELETE FROM decision_logs WHERE created_at < NOW() - INTERVAL '90 days';
DELETE FROM audit_events WHERE created_at < NOW() - INTERVAL '180 days';
-- Caution: VACUUM FULL takes an ACCESS EXCLUSIVE lock and needs free disk for a
-- full copy of the table; on a nearly full disk, prefer plain VACUUM.
VACUUM FULL decision_logs;
- Drop unused replication slots (WAL accumulation):
SELECT pg_drop_replication_slot('unused_slot_name');
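Before dropping, identify which slot is actually retaining WAL — a sketch assuming psql access to the primary (DATABASE_URL is a placeholder for your connection method; the slot name above is likewise a placeholder):
# DATABASE_URL is hypothetical; substitute your connection string
psql "$DATABASE_URL" -c "
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;"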
Prevention
- Enable RDS storage autoscaling with a maximum limit.
- Implement data retention policies with automated purge jobs (daily cron).
- Run VACUUM ANALYZE on large tables nightly.
- Partition large tables (decision_logs, audit_events) by month.
- Monitor disk growth rate and project exhaustion date weekly.
10. Auth Failures Spike
Severity: SEV2 (SEV1 if suspected credential compromise)
Detection
- Alert: rate(kaireon_auth_failures_total[5m]) > 10 for 3 minutes.
- Alert: rate(kaireon_auth_failures_total[5m]) / rate(kaireon_auth_attempts_total[5m]) > 0.1 for 5 minutes.
- Dashboard: Grafana > KaireonAI Security > Auth Failures panel.
- Metrics:
rate(kaireon_auth_failures_total[5m])
topk(5, sum(rate(kaireon_auth_failures_total[5m])) by (reason))
topk(5, sum(rate(kaireon_auth_failures_total[5m])) by (source_ip))
Diagnosis
- Identify the failure reason:
kubectl logs -l app=kaireon-api --since=10m | grep "auth_failure" | \
jq -r '.reason' | sort | uniq -c | sort -rn
- Check if failures are from a single IP (brute force):
kubectl logs -l app=kaireon-api --since=10m | grep "auth_failure" | \
jq -r '.source_ip' | sort | uniq -c | sort -rn | head -10
- Check if the OIDC/SAML provider is down:
curl -s https://auth.kaireon.com/.well-known/openid-configuration | jq .
curl -s https://auth.kaireon.com/health
- Check whether tokens are being rejected due to clock skew:
# Check API server time
kubectl exec -it deploy/kaireon-api -- date -u
# Compare with auth server
curl -sI https://auth.kaireon.com | grep Date
- Check for recent secret/key rotation:
kubectl get secret kaireon-auth-config -o yaml | grep -E "last-applied|resourceVersion"
Resolution
- If brute force, block the source IP:
# Add to the WAF blocklist. Note: update-ip-set replaces the entire address list,
# so fetch current entries with get-ip-set and include them alongside the new IP.
aws wafv2 update-ip-set \
--name kaireon-blocklist \
--scope REGIONAL \
--addresses "1.2.3.4/32" \
--id <ip-set-id> \
--lock-token <lock-token>
- If OIDC provider is down, enable fallback auth:
kubectl set env deploy/kaireon-api AUTH_FALLBACK_ENABLED=true
- If key rotation broke auth, restore the previous secret (secrets have no rollout history, so re-apply the last known-good manifest) and restart:
kubectl apply -f <previous-known-good-secret-manifest>.yaml
kubectl rollout restart deploy/kaireon-api
- If clock skew, sync NTP:
# On affected nodes
sudo systemctl restart chronyd
- If credential compromise is suspected:
- Rotate all API keys and service account tokens immediately.
- Revoke all active sessions.
- Enable enhanced logging.
- Notify the security team.
Prevention
- Implement rate limiting on auth endpoints (10 failures per IP per minute).
- Enable account lockout after 5 consecutive failures (30-minute window).
- Use short-lived tokens (15 minutes) with refresh token rotation.
- Monitor for credential stuffing patterns (many IPs, same usernames).
- Run quarterly penetration tests on auth flows.
- Sync NTP on all nodes; alert if clock drift >1 second.
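For the clock-drift alert, a node-level check sketch — assumes chronyd (as in the NTP resolution step above) and shell access to the node via SSH or a debug pod:
# "System time" reports the offset between the node clock and NTP
chronyc tracking | grep "System time"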
Appendix: Useful Commands
Quick Health Check
# All-in-one health check
kubectl get pods -l app.kubernetes.io/part-of=kaireon | grep -v Running
kubectl top pods -l app.kubernetes.io/part-of=kaireon
kubectl exec -it deploy/kaireon-api -- curl -s localhost:9090/healthz | jq .
kubectl exec -it deploy/kaireon-redis -- redis-cli PING
Log Aggregation
# Tail all KaireonAI logs
kubectl logs -l app.kubernetes.io/part-of=kaireon --since=5m -f --max-log-requests=10
# Search for errors across all pods
kubectl logs -l app.kubernetes.io/part-of=kaireon --since=30m | grep -i error | sort | uniq -c | sort -rn
Metrics Quick Reference
# Decision throughput
sum(rate(kaireon_decisions_total[5m]))
# Error ratio over the last hour (divide by the SLO error budget to get burn rate)
1 - (sum(rate(http_responses_total{status!~"5.."}[1h])) / sum(rate(http_responses_total[1h])))
# P99 latency
histogram_quantile(0.99, sum(rate(kaireon_decision_latency_ms_bucket[5m])) by (le))