Disaster Recovery Runbook
Audience: SREs, platform operators, engineering leadership
Last updated: 2026-02-23
Infrastructure: AWS (EKS, RDS, ElastiCache, S3)
Classification: Internal — Confidential
Table of Contents
- RTO/RPO Targets
- Backup Strategy
- Failover Procedures
- Recovery Steps
- DR Testing Schedule
1. RTO/RPO Targets
Definitions
- RTO (Recovery Time Objective): Maximum acceptable time from failure to full service restoration.
- RPO (Recovery Point Objective): Maximum acceptable data loss measured in time.
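Concretely, RPO exposure at any moment is the time elapsed since the last recoverable write. A tiny helper illustrates the arithmetic (illustrative only, not part of any repo; requires GNU date):

```shell
# Seconds of potential data loss between the last backup and a failure time.
rpo_seconds() {
  echo $(( $(date -ud "$2" +%s) - $(date -ud "$1" +%s) ))
}
rpo_seconds "2026-02-23T11:59:00Z" "2026-02-23T12:00:00Z"   # → 60
```

At 60 seconds this is exactly within the 1-minute RPO target for RDS in the table below.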
Targets by Tier
| Component | Tier | RTO | RPO | Strategy |
|---|---|---|---|---|
| Decision API | Critical | 5 min | 0 (zero data loss) | Multi-AZ, auto-failover |
| PostgreSQL (RDS) | Critical | 10 min | 1 min | Multi-AZ, continuous backup |
| Redis (ElastiCache) | High | 5 min | 5 min | Multi-AZ replica, auto-failover |
| Worker queues | High | 15 min | 5 min | Redis persistence + DLQ replay |
| Pipeline execution | Medium | 30 min | 15 min | Re-run from last checkpoint |
| Dashboard / UI | Medium | 15 min | N/A (stateless) | Multi-replica, ALB health checks |
| Batch analytics | Low | 4 hours | 1 hour | Re-run from source data |
SLA Commitments
| SLA Metric | Target | Measurement Window |
|---|---|---|
| Availability | 99.95% | Monthly |
| Decision API uptime | 99.99% | Monthly |
| Data durability | 99.999999999% (11 nines) | Annual (S3 backed) |
| Planned maintenance downtime | < 30 min/month | Monthly |
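The availability targets translate into hard downtime budgets. A quick sketch of the arithmetic (30-day month assumed):

```shell
# Allowed downtime per month for a given availability percentage.
budget() { awk -v sla="$1" 'BEGIN { printf "%.1f", 30*24*60*(100-sla)/100 }'; }
echo "99.95%: $(budget 99.95) min/month"   # overall availability
echo "99.99%: $(budget 99.99) min/month"   # Decision API
```

Note that at 99.99%, the Decision API budget (about 4.3 min/month) is smaller than its 5-minute RTO, so a single failover event can consume the entire month's budget.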
2. Backup Strategy
Overview
┌─────────────────────────────────────────┐
│ Backup Architecture │
├─────────────────────────────────────────┤
│ │
│ PostgreSQL (RDS) │
│ ├── Automated snapshots (daily, 7d) │
│ ├── Continuous WAL archiving (5m RPO) │
│ └── Manual snapshots (pre-migration) │
│ │
│ Redis (ElastiCache) │
│ ├── Daily snapshots (retained 7d) │
│ └── AOF persistence (1s sync) │
│ │
│ Application State │
│ ├── Git (infrastructure as code) │
│ ├── S3 (pipeline artifacts, exports) │
│ └── Secrets Manager (credentials) │
│ │
│ All backups replicated to us-west-2 │
└─────────────────────────────────────────┘
PostgreSQL Backups
Automated (RDS):
- Automated daily snapshots at 02:00 UTC.
- Retention: 7 days (automated), 90 days (manual).
- Continuous WAL archiving enables point-in-time recovery (PITR) to any second within the retention window.
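PITR is driven by `aws rds restore-db-instance-to-point-in-time`; a sketch (the target instance name here is illustrative):

```shell
# Restore to a NEW instance at a chosen second; the source is never overwritten.
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier kaireon-prod \
  --target-db-instance-identifier kaireon-pitr-check \
  --restore-time "2026-02-23T11:55:00Z"
# Or take the newest recoverable point instead of an explicit time:
#   --use-latest-restorable-time
```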
Manual snapshots:
# Pre-migration snapshot
aws rds create-db-snapshot \
--db-instance-identifier kaireon-prod \
--db-snapshot-identifier "kaireon-pre-migration-$(date +%Y%m%d-%H%M%S)"
# Cross-region copy for DR (run against the destination region; for
# cross-region copies the source identifier must be the full snapshot ARN)
aws rds copy-db-snapshot \
--source-db-snapshot-identifier "arn:aws:rds:us-east-1:<account-id>:snapshot:kaireon-pre-migration-20260223-120000" \
--target-db-snapshot-identifier "kaireon-pre-migration-20260223-120000" \
--source-region us-east-1 \
--region us-west-2
Logical backups (pg_dump):
# Daily logical backup (in addition to RDS snapshots)
make backup
# Stored in: s3://kaireon-backups/daily/
# Cross-region replicated to: s3://kaireon-backups-dr/daily/ (us-west-2)
Redis Backups
ElastiCache snapshots:
# Manual snapshot
aws elasticache create-snapshot \
--replication-group-id kaireon-redis \
--snapshot-name "kaireon-redis-$(date +%Y%m%d-%H%M%S)"
# Export to S3 for cross-region access
aws elasticache copy-snapshot \
--source-snapshot-name "kaireon-redis-20260223-120000" \
--target-snapshot-name "kaireon-redis-20260223-120000-dr" \
--target-bucket "kaireon-backups-dr"
Redis persistence configuration:
# AOF: log every write, fsync once per second (bounds loss to ~1 s)
appendonly yes
appendfsync everysec
# RDB snapshots: dump after <seconds> if at least <changes> keys changed
save 900 1
save 300 10
save 60 10000
Application and Configuration Backups
| Asset | Backup Location | Frequency | Retention |
|---|---|---|---|
| Infrastructure code (Terraform) | Git (GitHub) | Every commit | Indefinite |
| Kubernetes manifests | Git (GitHub) | Every commit | Indefinite |
| Application code | Git (GitHub) | Every commit | Indefinite |
| Secrets | AWS Secrets Manager | On change | 30 days versioned |
| SSL certificates | AWS ACM / Secrets Manager | On renewal | Previous + current |
| PgBouncer config | Git + ConfigMap | On change | Indefinite |
| Pipeline artifacts | S3 (kaireon-artifacts) | Per execution | 90 days |
Backup Verification
Backup existence is verified monthly by the script below; full restore tests follow the DR Testing Schedule.
#!/bin/bash
# Automated backup verification script
set -euo pipefail
echo "=== KaireonAI Backup Verification ==="
echo "Date: $(date -u)"
# 1. Verify RDS snapshot exists
LATEST_SNAPSHOT=$(aws rds describe-db-snapshots \
--db-instance-identifier kaireon-prod \
--query 'DBSnapshots | sort_by(@, &SnapshotCreateTime) | [-1].DBSnapshotIdentifier' \
--output text)
echo "Latest RDS snapshot: $LATEST_SNAPSHOT"
# 2. Verify S3 backup exists
LATEST_S3=$(aws s3 ls s3://kaireon-backups/daily/ --recursive | sort | tail -1)
echo "Latest S3 backup: $LATEST_S3"
# 3. Verify cross-region replica
LATEST_DR=$(aws s3 ls s3://kaireon-backups-dr/daily/ --recursive --region us-west-2 | sort | tail -1)
echo "Latest DR backup: $LATEST_DR"
# 4. Verify Redis snapshot
LATEST_REDIS=$(aws elasticache describe-snapshots \
--replication-group-id kaireon-redis \
--query 'Snapshots | sort_by(@, &NodeSnapshots[0].SnapshotCreateTime) | [-1].SnapshotName' \
--output text)
echo "Latest Redis snapshot: $LATEST_REDIS"
echo "=== Verification Complete ==="
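The script above proves backups exist, but not that they are recent. A freshness gate is a natural extension (hypothetical addition, not in the script; requires GNU date):

```shell
# Fail verification when the newest backup is older than a maximum age.
fresh_enough() {  # args: <backup-epoch> <now-epoch> <max-age-seconds>
  [ $(( $2 - $1 )) -le "$3" ]
}
BACKUP_TS=$(date -ud "2026-02-23T02:00:00Z" +%s)   # from the snapshot metadata
NOW_TS=$(date -ud "2026-02-23T12:00:00Z" +%s)
fresh_enough "$BACKUP_TS" "$NOW_TS" $((26*3600)) && echo "fresh" || echo "STALE"   # → fresh
```

A 26-hour threshold gives the 02:00 UTC daily snapshot a 2-hour grace window.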
3. Failover Procedures
3.1 RDS Multi-AZ Failover
Automatic failover occurs when:
- The primary instance becomes unreachable.
- The primary AZ experiences an outage.
- The primary instance is rebooted with failover.
- The instance type is modified (with apply-immediately).
Expected downtime: 60-120 seconds.
Manual failover (for testing or planned maintenance):
# Trigger failover
aws rds reboot-db-instance \
--db-instance-identifier kaireon-prod \
--force-failover
# Monitor failover progress
watch -n 5 "aws rds describe-db-instances \
--db-instance-identifier kaireon-prod \
--query 'DBInstances[0].[DBInstanceStatus,AvailabilityZone]'"
Post-failover verification:
# 1. Check the new AZ
aws rds describe-db-instances \
--db-instance-identifier kaireon-prod \
--query 'DBInstances[0].AvailabilityZone'
# 2. Verify application connectivity
kubectl exec -it deploy/kaireon-api -n kaireon -- curl -s localhost:9090/healthz
# 3. Check PgBouncer reconnected
kubectl exec -it deploy/kaireon-pgbouncer -n kaireon -- psql -p 6432 pgbouncer -c "SHOW POOLS;"
# 4. Monitor error rate for 10 minutes
kubectl logs -l app=kaireon-api -n kaireon --since=10m | grep -ci "error" || true
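The checks above can be gated on connectivity actually returning, rather than on fixed sleeps. A small polling helper (hypothetical, not part of the repo):

```shell
# Retry a command until it succeeds or attempts are exhausted.
wait_healthy() {  # usage: wait_healthy <attempts> <delay-seconds> <command...>
  local tries=$1 delay=$2 i
  shift 2
  for i in $(seq 1 "$tries"); do
    "$@" && return 0
    sleep "$delay"
  done
  return 1
}
# e.g. allow up to 2 minutes for PgBouncer to answer after the failover:
# wait_healthy 24 5 kubectl exec deploy/kaireon-pgbouncer -n kaireon -- \
#   psql -p 6432 pgbouncer -c "SHOW POOLS;" >/dev/null
```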
3.2 Redis Failover
ElastiCache Multi-AZ automatic failover:
- Promotes a read replica to primary within 60 seconds.
- Application reconnects automatically via the primary endpoint.
Manual failover:
# Trigger failover
aws elasticache test-failover \
--replication-group-id kaireon-redis \
--node-group-id 0001
# Monitor
watch -n 5 "aws elasticache describe-replication-groups \
--replication-group-id kaireon-redis \
--query 'ReplicationGroups[0].Status'"
Post-failover verification:
# 1. Verify Redis is responding
kubectl exec -it deploy/kaireon-api -n kaireon -- redis-cli -h kaireon-redis PING
# 2. Check for data loss (compare key count)
kubectl exec -it deploy/kaireon-api -n kaireon -- redis-cli -h kaireon-redis DBSIZE
# 3. Verify application is using the new primary
kubectl logs -l app=kaireon-api -n kaireon --since=2m | grep -i redis
3.3 EKS Multi-Node Failover
Node failure handling:
- Kubernetes automatically reschedules pods from failed nodes.
- PodDisruptionBudgets ensure minimum replicas remain available.
Current PDB configuration:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: kaireon-api-pdb
namespace: kaireon
spec:
minAvailable: 2
selector:
matchLabels:
app: kaireon-api
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: kaireon-worker-pdb
namespace: kaireon
spec:
minAvailable: 1
selector:
matchLabels:
app: kaireon-worker
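Before any drain, it is worth confirming the budgets actually permit a disruption (standard kubectl; cluster access assumed):

```shell
# ALLOWED DISRUPTIONS of 0 means a drain will block until more replicas are Ready.
kubectl get pdb -n kaireon
```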
Node failure response:
# 1. Check node status
kubectl get nodes
kubectl describe node <failed-node>
# 2. Check affected pods
kubectl get pods --all-namespaces --field-selector spec.nodeName=<failed-node>
# 3. If node is NotReady for >5 minutes, drain it
kubectl drain <failed-node> --ignore-daemonsets --delete-emptydir-data --force
# 4. Verify pods rescheduled
kubectl get pods -n kaireon -o wide
# 5. If cluster autoscaler is not adding nodes, add manually
aws eks update-nodegroup-config \
--cluster-name kaireon-prod \
--nodegroup-name kaireon-workers \
--scaling-config desiredSize=5
Full AZ failure response:
# 1. Verify which AZ is affected
kubectl get nodes -o custom-columns=NAME:.metadata.name,ZONE:.metadata.labels."topology\.kubernetes\.io/zone"
# 2. Ensure node group spans multiple AZs
aws eks describe-nodegroup \
--cluster-name kaireon-prod \
--nodegroup-name kaireon-workers \
--query 'nodegroup.subnets'
# 3. Scale up to compensate for lost capacity
kubectl patch hpa kaireon-api-hpa -n kaireon --type merge -p \
'{"spec":{"minReplicas":6}}'
3.4 Full Region Failure
This is the most severe scenario. Follow these steps in order.
Prerequisites:
- DR region (us-west-2) has a standby EKS cluster (provisioned via Terraform).
- Database snapshots are replicated cross-region.
- Container images are in ECR with cross-region replication.
- DNS is managed via Route 53 with health checks.
Failover procedure:
# === STEP 1: Confirm primary region is down ===
# Verify from multiple sources before declaring a regional failure.
# Note: the AWS Health API is itself hosted in us-east-1; cross-check the
# AWS status page and external monitoring as well.
aws health describe-events --region us-east-1
curl -s -o /dev/null -w "%{http_code}" https://api.kaireon.com/healthz
# === STEP 2: Activate DR database ===
# Restore from the latest cross-region snapshot
LATEST_DR_SNAPSHOT=$(aws rds describe-db-snapshots \
--region us-west-2 \
--query 'DBSnapshots | sort_by(@, &SnapshotCreateTime) | [-1].DBSnapshotIdentifier' \
--output text)
aws rds restore-db-instance-from-db-snapshot \
--region us-west-2 \
--db-instance-identifier kaireon-dr \
--db-snapshot-identifier "$LATEST_DR_SNAPSHOT" \
--db-instance-class db.r6g.xlarge \
--multi-az \
--vpc-security-group-ids sg-dr-xxxxxxxx
aws rds wait db-instance-available \
--region us-west-2 \
--db-instance-identifier kaireon-dr
# === STEP 3: Activate DR Redis ===
LATEST_REDIS_SNAPSHOT=$(aws elasticache describe-snapshots \
--region us-west-2 \
--query 'Snapshots | sort_by(@, &NodeSnapshots[0].SnapshotCreateTime) | [-1].SnapshotName' \
--output text)
aws elasticache create-replication-group \
--region us-west-2 \
--replication-group-id kaireon-redis-dr \
--replication-group-description "KaireonAI Redis DR" \
--snapshot-name "$LATEST_REDIS_SNAPSHOT" \
--cache-node-type cache.r6g.large \
--num-cache-clusters 2 \
--automatic-failover-enabled \
--multi-az-enabled
# (two nodes: automatic failover requires at least one replica)
# === STEP 4: Update application configuration ===
kubectl config use-context kaireon-dr
kubectl set env deploy/kaireon-api -n kaireon \
DATABASE_URL="postgresql://$DB_USER:$DB_PASSWORD@kaireon-dr.xxxxxxxx.us-west-2.rds.amazonaws.com:5432/kaireon" \
REDIS_URL="redis://kaireon-redis-dr.xxxxxxxx.usw2.cache.amazonaws.com:6379"
kubectl set env deploy/kaireon-worker -n kaireon \
DATABASE_URL="postgresql://$DB_USER:$DB_PASSWORD@kaireon-dr.xxxxxxxx.us-west-2.rds.amazonaws.com:5432/kaireon" \
REDIS_URL="redis://kaireon-redis-dr.xxxxxxxx.usw2.cache.amazonaws.com:6379"
# === STEP 5: Scale up DR workloads ===
kubectl --context kaireon-dr scale deploy/kaireon-api -n kaireon --replicas=5
kubectl --context kaireon-dr scale deploy/kaireon-worker -n kaireon --replicas=3
# === STEP 6: Switch DNS ===
aws route53 change-resource-record-sets \
--hosted-zone-id ZXXXXXXXXXXXXX \
--change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "api.kaireon.com",
"Type": "A",
"AliasTarget": {
"HostedZoneId": "Z1H1FL5HABSF5",
"DNSName": "kaireon-dr-alb.us-west-2.elb.amazonaws.com",
"EvaluateTargetHealth": true
}
}
}]
}'
# === STEP 7: Verify ===
curl -s https://api.kaireon.com/healthz
echo "DR failover complete at $(date -u)"
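Rather than polling DNS by hand, the Route 53 change in Step 6 can be awaited explicitly. A sketch (the `file://failover-change.json` change batch is a hypothetical stand-in for the inline JSON above):

```shell
# Capture the change ID, then block until the change has propagated to all
# authoritative name servers (typically under 60 s).
CHANGE_ID=$(aws route53 change-resource-record-sets \
  --hosted-zone-id ZXXXXXXXXXXXXX \
  --change-batch file://failover-change.json \
  --query 'ChangeInfo.Id' --output text)
aws route53 wait resource-record-sets-changed --id "$CHANGE_ID"
```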
4. Recovery Steps
4.1 Post-Incident Recovery Checklist
After any disaster recovery event, complete the following steps:
Immediate (0-1 hour):
- Confirm health checks pass and error rates return to baseline.
- Notify stakeholders that service is restored.
Short-term (1-4 hours):
- Run the data reconciliation queries (Section 4.3) to quantify any data loss.
- Replay the DLQ and resume any paused pipelines.
Medium-term (4-24 hours):
- Take fresh backups of the currently active environment.
- If running in the DR region, begin failback planning (Section 4.2).
Long-term (1-7 days):
- Hold a blameless post-incident review and capture runbook updates.
- Execute failback and decommission temporary DR resources.
4.2 Failback to Primary Region
After a regional failover, fail back to the primary region once it is stable.
# === STEP 1: Verify primary region is stable ===
aws health describe-events --region us-east-1
# Wait at least 24 hours after region recovery before failing back
# === STEP 2: Sync data from DR to primary ===
# Create a snapshot of the DR database
aws rds create-db-snapshot \
--region us-west-2 \
--db-instance-identifier kaireon-dr \
--db-snapshot-identifier "kaireon-failback-$(date +%Y%m%d-%H%M%S)"
# Copy to primary region (for cross-region copies the source identifier
# must be the full snapshot ARN)
aws rds copy-db-snapshot \
--region us-east-1 \
--source-region us-west-2 \
--source-db-snapshot-identifier "arn:aws:rds:us-west-2:<account-id>:snapshot:kaireon-failback-20260223-120000" \
--target-db-snapshot-identifier "kaireon-failback-20260223-120000"
# Restore in primary region
aws rds restore-db-instance-from-db-snapshot \
--region us-east-1 \
--db-instance-identifier kaireon-prod-restored \
--db-snapshot-identifier "kaireon-failback-20260223-120000" \
--db-instance-class db.r6g.xlarge \
--multi-az
# === STEP 3: Switch traffic back ===
# Follow the same DNS update procedure as failover but point back to us-east-1
# === STEP 4: Decommission DR resources ===
# After 48 hours of stable operation on primary:
aws rds delete-db-instance --region us-west-2 --db-instance-identifier kaireon-dr --skip-final-snapshot
aws elasticache delete-replication-group --region us-west-2 --replication-group-id kaireon-redis-dr
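One wrinkle in the Step 2 restore above: the restored instance keeps the identifier kaireon-prod-restored, so its endpoint differs from what application config expects. One option (a sketch; note that the per-instance hash in the RDS endpoint still changes, so DATABASE_URL must be updated either way) is an identifier swap after validation:

```shell
# Retire the old instance's name, then take it over with the restored one.
aws rds modify-db-instance \
  --db-instance-identifier kaireon-prod \
  --new-db-instance-identifier kaireon-prod-old \
  --apply-immediately
aws rds modify-db-instance \
  --db-instance-identifier kaireon-prod-restored \
  --new-db-instance-identifier kaireon-prod \
  --apply-immediately
```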
4.3 Data Reconciliation
After any failover that may have caused data divergence:
-- Compare record counts between backup and current
-- (Run against a restored copy of the pre-failure backup)
-- Decision logs gap analysis
SELECT
date_trunc('hour', created_at) AS hour,
count(*) AS decisions
FROM decision_logs
WHERE created_at > NOW() - INTERVAL '48 hours'
GROUP BY hour
ORDER BY hour;
-- Identify the gap window
SELECT
min(created_at) AS first_record,
max(created_at) AS last_record,
count(*) AS total_records
FROM decision_logs
WHERE created_at > NOW() - INTERVAL '48 hours';
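The queries above are run against both the live database and a restored pre-failure copy; a shell wrapper to produce the two counts side by side (the DSN variables are hypothetical):

```shell
# Compare 48 h decision_logs counts between live and restored copies.
for url in "$LIVE_DATABASE_URL" "$RESTORED_DATABASE_URL"; do
  psql "$url" -Atc \
    "SELECT count(*) FROM decision_logs WHERE created_at > now() - interval '48 hours'"
done
# A lower count on live than on the restored copy indicates lost writes.
```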
5. DR Testing Schedule
Annual DR Testing Calendar
| Month | Test Type | Scope | Duration | Impact |
|---|---|---|---|---|
| January | Backup restore | Restore RDS snapshot to test instance | 2 hours | None |
| February | Redis failover | ElastiCache failover test | 30 min | Brief cache miss |
| March | RDS failover | Multi-AZ failover test | 1 hour | 1-2 min downtime |
| April | Backup restore | Full pg_dump restore verification | 2 hours | None |
| May | Node failure | Drain a worker node | 1 hour | None (PDB protected) |
| June | Full DR drill | Regional failover to us-west-2 | 4 hours | Planned downtime |
| July | Backup restore | Restore RDS snapshot to test instance | 2 hours | None |
| August | Redis failover | ElastiCache failover test | 30 min | Brief cache miss |
| September | RDS failover | Multi-AZ failover test | 1 hour | 1-2 min downtime |
| October | Backup restore | Full pg_dump restore verification | 2 hours | None |
| November | Node failure | Simulate AZ failure | 2 hours | None (multi-AZ) |
| December | Full DR drill | Regional failover to us-west-2 | 4 hours | Planned downtime |
DR Test Procedure
Pre-test checklist:
- Announce the test window to stakeholders and on-call.
- Run the backup verification script and confirm current backups exist.
- Agree on abort criteria and rollback steps before starting.
During the test:
- Keep a timestamped log of every action (it feeds the report timeline).
- Measure actual RTO and RPO against the Section 1 targets.
Post-test checklist:
- Restore normal configuration and verify service health.
- Complete the DR test report; file action items and runbook updates.
DR Test Report Template
# DR Test Report - [Date]
## Summary
- **Test type:** [Regional failover / RDS failover / Redis failover / etc.]
- **Date:** [YYYY-MM-DD]
- **Duration:** [HH:MM]
- **Participants:** [Names]
## Results
| Metric | Target | Actual | Pass/Fail |
|--------|--------|--------|-----------|
| RTO | [target] | [actual] | [P/F] |
| RPO | [target] | [actual] | [P/F] |
| Data integrity | 100% | [actual]% | [P/F] |
## Timeline
| Time | Action | Result |
|------|--------|--------|
| HH:MM | [action] | [result] |
## Issues Found
1. [Issue description, severity, owner, deadline]
## Runbook Updates Required
1. [Update description]
## Action Items
- [ ] [Action] - Owner: [name] - Due: [date]
Escalation Contacts
| Role | Name | Contact | Escalation Time |
|---|---|---|---|
| On-call engineer | PagerDuty rotation | PagerDuty | Immediate |
| Team lead | [TBD] | [TBD] | 5 min (SEV1), 15 min (SEV2) |
| VP Engineering | [TBD] | [TBD] | 15 min (SEV1) |
| AWS TAM | [TBD] | [TBD] | 30 min (SEV1) |
Appendix: Key AWS Resources
| Resource | Identifier | Region |
|---|---|---|
| EKS Cluster (primary) | kaireon-prod | us-east-1 |
| EKS Cluster (DR) | kaireon-dr | us-west-2 |
| RDS Instance (primary) | kaireon-prod | us-east-1 |
| ElastiCache (primary) | kaireon-redis | us-east-1 |
| S3 Backups (primary) | kaireon-backups | us-east-1 |
| S3 Backups (DR) | kaireon-backups-dr | us-west-2 |
| Route 53 Hosted Zone | ZXXXXXXXXXXXXX | Global |
| ACM Certificate | arn:aws:acm:… | us-east-1 |