Audience: SREs, platform operators, engineering leadership
Last updated: 2026-02-23
Infrastructure: AWS (EKS, RDS, ElastiCache, S3)
Classification: Internal — Confidential

Table of Contents

  1. RTO/RPO Targets
  2. Backup Strategy
  3. Failover Procedures
  4. Recovery Steps
  5. DR Testing Schedule

1. RTO/RPO Targets

Definitions

  • RTO (Recovery Time Objective): Maximum acceptable time from failure to full service restoration.
  • RPO (Recovery Point Objective): Maximum acceptable data loss measured in time.

Targets by Tier

| Component | Tier | RTO | RPO | Strategy |
|-----------|------|-----|-----|----------|
| Decision API | Critical | 5 min | 0 (zero data loss) | Multi-AZ, auto-failover |
| PostgreSQL (RDS) | Critical | 10 min | 1 min | Multi-AZ, continuous backup |
| Redis (ElastiCache) | High | 5 min | 5 min | Multi-AZ replica, auto-failover |
| Worker queues | High | 15 min | 5 min | Redis persistence + DLQ replay |
| Pipeline execution | Medium | 30 min | 15 min | Re-run from last checkpoint |
| Dashboard / UI | Medium | 15 min | N/A (stateless) | Multi-replica, ALB health checks |
| Batch analytics | Low | 4 hours | 1 hour | Re-run from source data |
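When assessing an incident against these targets, the achieved RPO is simply the gap between the failure time and the last recoverable backup or WAL segment. A minimal sketch (GNU `date` assumed; timestamps are illustrative):

```shell
# Compute achieved RPO in seconds from two ISO-8601 UTC timestamps.
# Assumes GNU date (Linux); on macOS install coreutils and use gdate.
achieved_rpo_seconds() {
  local last_backup="$1" failure_time="$2"
  echo $(( $(date -u -d "$failure_time" +%s) - $(date -u -d "$last_backup" +%s) ))
}

# Example: WAL archived at 11:59:20, failure at 12:00:00 -> 40s, within the 1 min target
achieved_rpo_seconds "2026-02-23T11:59:20Z" "2026-02-23T12:00:00Z"
```

Record this number in the incident timeline so the post-mortem can compare it against the tier target above.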

SLA Commitments

| SLA Metric | Target | Measurement Window |
|------------|--------|--------------------|
| Availability | 99.95% | Monthly |
| Decision API uptime | 99.99% | Monthly |
| Data durability | 99.999999999% (11 nines) | Annual (S3 backed) |
| Planned maintenance downtime | < 30 min/month | Monthly |
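For reference, an availability target translates into a monthly downtime budget of (1 - SLA) times the minutes in the month. A quick sketch (a 30-day month is assumed for round numbers):

```shell
# Monthly downtime budget in minutes implied by an availability target.
# Assumes a 30-day month (43200 minutes).
downtime_budget_minutes() {
  awk -v sla="$1" 'BEGIN { printf "%.1f\n", (1 - sla/100) * 30*24*60 }'
}

downtime_budget_minutes 99.95   # 0.05% of 43200 min = 21.6
downtime_budget_minutes 99.99   # 0.01% of 43200 min = 4.3
```

Note that the 99.99% Decision API target leaves only about four minutes a month, which is why its failover must be automatic.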

2. Backup Strategy

Overview

                    ┌─────────────────────────────────────────┐
                    │           Backup Architecture           │
                    ├─────────────────────────────────────────┤
                    │                                         │
                    │  PostgreSQL (RDS)                       │
                    │  ├── Automated snapshots (daily, 7d)    │
                    │  ├── Continuous WAL archiving (1m RPO)  │
                    │  └── Manual snapshots (pre-migration)   │
                    │                                         │
                    │  Redis (ElastiCache)                    │
                    │  ├── Daily snapshots (retained 7d)      │
                    │  └── AOF persistence (1s sync)          │
                    │                                         │
                    │  Application State                      │
                    │  ├── Git (infrastructure as code)       │
                    │  ├── S3 (pipeline artifacts, exports)   │
                    │  └── Secrets Manager (credentials)      │
                    │                                         │
                    │  All backups replicated to us-west-2    │
                    └─────────────────────────────────────────┘

PostgreSQL Backups

Automated (RDS):
  • Automated daily snapshots at 02:00 UTC.
  • Retention: 7 days (automated), 90 days (manual).
  • Continuous WAL archiving enables point-in-time recovery (PITR) to any second within the retention window.
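Point-in-time restores go to a new instance, never over the live one. A sketch that prints the PITR command for operator review before running it; the target identifier kaireon-pitr-restore is a hypothetical name, not an existing resource:

```shell
# Build (but do not execute) a PITR restore command for review.
# kaireon-pitr-restore is a hypothetical target identifier.
build_pitr_restore_cmd() {
  local restore_time="$1"   # ISO-8601 UTC, e.g. 2026-02-23T11:45:00Z
  printf 'aws rds restore-db-instance-to-point-in-time \\\n'
  printf '  --source-db-instance-identifier kaireon-prod \\\n'
  printf '  --target-db-instance-identifier kaireon-pitr-restore \\\n'
  printf '  --restore-time %s \\\n' "$restore_time"
  printf '  --db-instance-class db.r6g.xlarge --multi-az\n'
}

build_pitr_restore_cmd "2026-02-23T11:45:00Z"
```

Paste the printed command into the shell only after a second operator has confirmed the restore time.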
Manual snapshots:
# Pre-migration snapshot
aws rds create-db-snapshot \
  --db-instance-identifier kaireon-prod \
  --db-snapshot-identifier "kaireon-pre-migration-$(date +%Y%m%d-%H%M%S)"

# Cross-region copy for DR (the source snapshot must be referenced by its full ARN
# when copying across regions; <ACCOUNT_ID> is a placeholder)
aws rds copy-db-snapshot \
  --source-db-snapshot-identifier "arn:aws:rds:us-east-1:<ACCOUNT_ID>:snapshot:kaireon-pre-migration-20260223-120000" \
  --target-db-snapshot-identifier "kaireon-pre-migration-20260223-120000" \
  --source-region us-east-1 \
  --region us-west-2
Logical backups (pg_dump):
# Daily logical backup (in addition to RDS snapshots)
make backup

# Stored in: s3://kaireon-backups/daily/
# Cross-region replicated to: s3://kaireon-backups-dr/daily/ (us-west-2)
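The contents of the `make backup` target are not shown in this runbook; a typical implementation of such a daily logical backup would look something like the sketch below (bucket and key convention match the paths above; the pipeline is printed rather than executed, and the exact pg_dump flags are illustrative):

```shell
# Illustrative sketch of the daily logical backup job; the real `make backup`
# target may differ. Prints the pipeline for review instead of executing it.
backup_object_key() {
  # Assumed object key convention: daily/kaireon-YYYYMMDD.sql.gz
  printf 'daily/kaireon-%s.sql.gz\n' "$(date -u -d "$1" +%Y%m%d)"
}

build_backup_pipeline() {
  local key="$1"
  printf 'pg_dump "$DATABASE_URL" | gzip | aws s3 cp - s3://kaireon-backups/%s\n' "$key"
}

build_backup_pipeline "$(backup_object_key 2026-02-23)"
```

Streaming through `aws s3 cp -` avoids holding the full dump on local disk; cross-region replication to the DR bucket is handled by S3, not by this job.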

Redis Backups

ElastiCache snapshots:
# Manual snapshot
aws elasticache create-snapshot \
  --replication-group-id kaireon-redis \
  --snapshot-name "kaireon-redis-$(date +%Y%m%d-%H%M%S)"

# Export to S3 for cross-region access
aws elasticache copy-snapshot \
  --source-snapshot-name "kaireon-redis-20260223-120000" \
  --target-snapshot-name "kaireon-redis-20260223-120000-dr" \
  --target-bucket "kaireon-backups-dr"
Redis persistence configuration:
appendonly yes          # Enable AOF persistence
appendfsync everysec    # fsync the AOF once per second (at most ~1s of writes lost)
save 900 1              # RDB snapshot if >= 1 key changed in 15 min
save 300 10             # RDB snapshot if >= 10 keys changed in 5 min
save 60 10000           # RDB snapshot if >= 10000 keys changed in 1 min

Application and Configuration Backups

| Asset | Backup Location | Frequency | Retention |
|-------|-----------------|-----------|-----------|
| Infrastructure code (Terraform) | Git (GitHub) | Every commit | Indefinite |
| Kubernetes manifests | Git (GitHub) | Every commit | Indefinite |
| Application code | Git (GitHub) | Every commit | Indefinite |
| Secrets | AWS Secrets Manager | On change | 30 days versioned |
| SSL certificates | AWS ACM / Secrets Manager | On renewal | Previous + current |
| PgBouncer config | Git + ConfigMap | On change | Indefinite |
| Pipeline artifacts | S3 (kaireon-artifacts) | Per execution | 90 days |

Backup Verification

Backups are verified monthly. See DR Testing Schedule.
#!/bin/bash
# Automated backup verification script (shebang must be the first line)
set -euo pipefail

echo "=== KaireonAI Backup Verification ==="
echo "Date: $(date -u)"

# 1. Verify RDS snapshot exists
LATEST_SNAPSHOT=$(aws rds describe-db-snapshots \
  --db-instance-identifier kaireon-prod \
  --query 'DBSnapshots | sort_by(@, &SnapshotCreateTime) | [-1].DBSnapshotIdentifier' \
  --output text)
echo "Latest RDS snapshot: $LATEST_SNAPSHOT"

# 2. Verify S3 backup exists
LATEST_S3=$(aws s3 ls s3://kaireon-backups/daily/ --recursive | sort | tail -1)
echo "Latest S3 backup: $LATEST_S3"

# 3. Verify cross-region replica
LATEST_DR=$(aws s3 ls s3://kaireon-backups-dr/daily/ --recursive --region us-west-2 | sort | tail -1)
echo "Latest DR backup: $LATEST_DR"

# 4. Verify Redis snapshot
LATEST_REDIS=$(aws elasticache describe-snapshots \
  --replication-group-id kaireon-redis \
  --query 'Snapshots | sort_by(@, &NodeSnapshots[0].SnapshotCreateTime) | [-1].SnapshotName' \
  --output text)
echo "Latest Redis snapshot: $LATEST_REDIS"

echo "=== Verification Complete ==="

3. Failover Procedures

3.1 RDS Multi-AZ Failover

Automatic failover occurs when:
  • The primary instance becomes unreachable.
  • The primary AZ experiences an outage.
  • The primary instance is rebooted with failover.
  • The instance type is modified (with apply-immediately).
Expected downtime: 60-120 seconds.

Manual failover (for testing or planned maintenance):
# Trigger failover
aws rds reboot-db-instance \
  --db-instance-identifier kaireon-prod \
  --force-failover

# Monitor failover progress
watch -n 5 "aws rds describe-db-instances \
  --db-instance-identifier kaireon-prod \
  --query 'DBInstances[0].[DBInstanceStatus,AvailabilityZone]'"
Post-failover verification:
# 1. Check the new AZ
aws rds describe-db-instances \
  --db-instance-identifier kaireon-prod \
  --query 'DBInstances[0].AvailabilityZone'

# 2. Verify application connectivity
kubectl exec -it deploy/kaireon-api -n kaireon -- curl -s localhost:9090/healthz

# 3. Check PgBouncer reconnected
kubectl exec -it deploy/kaireon-pgbouncer -n kaireon -- psql -p 6432 pgbouncer -c "SHOW POOLS;"

# 4. Monitor error rate for 10 minutes
kubectl logs -l app=kaireon-api -n kaireon --since=5m | grep -c "error" || true
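Step 4's raw error count is easier to act on as a pass/fail gate. A sketch; the threshold of 10 errors per 5-minute window is an assumed baseline, tune it to yours:

```shell
# Gate the post-failover check on an error-count threshold.
# The default of 10 errors per window is an assumption, not a measured baseline.
check_error_rate() {
  local count="$1" threshold="${2:-10}"
  if [ "$count" -le "$threshold" ]; then
    echo "PASS ($count errors <= $threshold)"
  else
    echo "FAIL ($count errors > $threshold)"
  fi
}

# Against live logs:
#   count=$(kubectl logs -l app=kaireon-api -n kaireon --since=5m | grep -c "error" || true)
#   check_error_rate "$count"
check_error_rate 3
```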

3.2 Redis Failover

ElastiCache Multi-AZ automatic failover:
  • Promotes a read replica to primary within 60 seconds.
  • Application reconnects automatically via the primary endpoint.
Manual failover:
# Trigger failover
aws elasticache test-failover \
  --replication-group-id kaireon-redis \
  --node-group-id 0001

# Monitor
watch -n 5 "aws elasticache describe-replication-groups \
  --replication-group-id kaireon-redis \
  --query 'ReplicationGroups[0].Status'"
Post-failover verification:
# 1. Verify Redis is responding
kubectl exec -it deploy/kaireon-api -n kaireon -- redis-cli -h kaireon-redis PING

# 2. Check for data loss (compare key count)
kubectl exec -it deploy/kaireon-api -n kaireon -- redis-cli -h kaireon-redis DBSIZE

# 3. Verify application is using the new primary
kubectl logs -l app=kaireon-api -n kaireon --since=2m | grep -i redis
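For step 2, comparing DBSIZE before and after the failover gives a rough data-loss estimate. A small helper; the percentage is only indicative, since keys also disappear naturally through TTL expiry:

```shell
# Rough Redis data-loss estimate from key counts before/after failover.
# Treats any negative delta (more keys after) as zero loss.
check_key_loss() {
  local before="$1" after="$2"
  awk -v b="$before" -v a="$after" 'BEGIN {
    loss = (b > 0) ? (b - a) * 100.0 / b : 0
    printf "%.2f%% keys lost\n", (loss > 0) ? loss : 0
  }'
}

check_key_loss 100000 99850   # 0.15% keys lost
```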

3.3 EKS Multi-Node Failover

Node failure handling:
  • Kubernetes automatically reschedules pods from failed nodes.
  • PodDisruptionBudgets ensure minimum replicas remain available.
Current PDB configuration:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kaireon-api-pdb
  namespace: kaireon
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: kaireon-api
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kaireon-worker-pdb
  namespace: kaireon
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: kaireon-worker
Node failure response:
# 1. Check node status
kubectl get nodes
kubectl describe node <failed-node>

# 2. Check affected pods
kubectl get pods --all-namespaces --field-selector spec.nodeName=<failed-node>

# 3. If node is NotReady for >5 minutes, drain it
kubectl drain <failed-node> --ignore-daemonsets --delete-emptydir-data --force

# 4. Verify pods rescheduled
kubectl get pods -n kaireon -o wide

# 5. If cluster autoscaler is not adding nodes, add manually
aws eks update-nodegroup-config \
  --cluster-name kaireon-prod \
  --nodegroup-name kaireon-workers \
  --scaling-config desiredSize=5
Full AZ failure response:
# 1. Verify which AZ is affected
kubectl get nodes -o custom-columns='NAME:.metadata.name,ZONE:.metadata.labels.topology\.kubernetes\.io/zone'

# 2. Ensure node group spans multiple AZs
aws eks describe-nodegroup \
  --cluster-name kaireon-prod \
  --nodegroup-name kaireon-workers \
  --query 'nodegroup.subnets'

# 3. Scale up to compensate for lost capacity
kubectl patch hpa kaireon-api-hpa -n kaireon --type merge -p \
  '{"spec":{"minReplicas":6}}'

3.4 Full Region Failure

This is the most severe scenario. Follow these steps in order.

Prerequisites:
  • DR region (us-west-2) has a standby EKS cluster (provisioned via Terraform).
  • Database snapshots are replicated cross-region.
  • Container images are in ECR with cross-region replication.
  • DNS is managed via Route 53 with health checks.
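Before declaring the prerequisites satisfied, check how stale the cross-region snapshot is, since its age bounds the data loss of a regional failover. A sketch (GNU `date` assumed; the live timestamp comes from the same describe-db-snapshots query used in Step 2):

```shell
# Snapshot staleness check: hours between SnapshotCreateTime and now.
# Assumes GNU date; the second argument overrides "now" for testing.
snapshot_age_hours() {
  local created="$1"
  local now="${2:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
  echo $(( ( $(date -u -d "$now" +%s) - $(date -u -d "$created" +%s) ) / 3600 ))
}

# Live usage (timestamp from the DR region):
#   created=$(aws rds describe-db-snapshots --region us-west-2 \
#     --query 'DBSnapshots | sort_by(@, &SnapshotCreateTime) | [-1].SnapshotCreateTime' \
#     --output text)
#   snapshot_age_hours "$created"
snapshot_age_hours "2026-02-23T02:00:00Z" "2026-02-23T12:00:00Z"   # 10
```

An age much greater than 24 hours suggests the cross-region copy job is broken and should be escalated before the failover proceeds.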
Failover procedure:
# === STEP 1: Confirm primary region is down ===
# Verify from multiple sources before declaring a regional failure
aws health describe-events --region us-east-1
curl -s -o /dev/null -w "%{http_code}" https://api.kaireon.com/healthz

# === STEP 2: Activate DR database ===
# Restore from the latest cross-region snapshot
LATEST_DR_SNAPSHOT=$(aws rds describe-db-snapshots \
  --region us-west-2 \
  --query 'DBSnapshots | sort_by(@, &SnapshotCreateTime) | [-1].DBSnapshotIdentifier' \
  --output text)

aws rds restore-db-instance-from-db-snapshot \
  --region us-west-2 \
  --db-instance-identifier kaireon-dr \
  --db-snapshot-identifier "$LATEST_DR_SNAPSHOT" \
  --db-instance-class db.r6g.xlarge \
  --multi-az \
  --vpc-security-group-ids sg-dr-xxxxxxxx

aws rds wait db-instance-available \
  --region us-west-2 \
  --db-instance-identifier kaireon-dr

# === STEP 3: Activate DR Redis ===
LATEST_REDIS_SNAPSHOT=$(aws elasticache describe-snapshots \
  --region us-west-2 \
  --query 'Snapshots | sort_by(@, &NodeSnapshots[0].SnapshotCreateTime) | [-1].SnapshotName' \
  --output text)

aws elasticache create-replication-group \
  --region us-west-2 \
  --replication-group-id kaireon-redis-dr \
  --replication-group-description "KaireonAI Redis DR" \
  --snapshot-name "$LATEST_REDIS_SNAPSHOT" \
  --cache-node-type cache.r6g.large \
  --automatic-failover-enabled \
  --multi-az-enabled

# === STEP 4: Update application configuration ===
kubectl config use-context kaireon-dr

kubectl set env deploy/kaireon-api -n kaireon \
  DATABASE_URL="postgresql://$DB_USER:$DB_PASSWORD@kaireon-dr.xxxxxxxx.us-west-2.rds.amazonaws.com:5432/kaireon" \
  REDIS_URL="redis://kaireon-redis-dr.xxxxxxxx.usw2.cache.amazonaws.com:6379"

kubectl set env deploy/kaireon-worker -n kaireon \
  DATABASE_URL="postgresql://$DB_USER:$DB_PASSWORD@kaireon-dr.xxxxxxxx.us-west-2.rds.amazonaws.com:5432/kaireon" \
  REDIS_URL="redis://kaireon-redis-dr.xxxxxxxx.usw2.cache.amazonaws.com:6379"

# === STEP 5: Scale up DR workloads ===
kubectl --context kaireon-dr scale deploy/kaireon-api -n kaireon --replicas=5
kubectl --context kaireon-dr scale deploy/kaireon-worker -n kaireon --replicas=3

# === STEP 6: Switch DNS ===
aws route53 change-resource-record-sets \
  --hosted-zone-id ZXXXXXXXXXXXXX \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.kaireon.com",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "Z1H1FL5HABSF5",
          "DNSName": "kaireon-dr-alb.us-west-2.elb.amazonaws.com",
          "EvaluateTargetHealth": true
        }
      }
    }]
  }'

# === STEP 7: Verify ===
curl -s https://api.kaireon.com/healthz
echo "DR failover complete at $(date -u)"

4. Recovery Steps

4.1 Post-Incident Recovery Checklist

After any disaster recovery event, complete the following steps.

Immediate (0-1 hour):
  • Verify all services are healthy (/healthz returns 200).
  • Confirm decision API is processing requests.
  • Check error rates are at baseline levels.
  • Verify database connectivity from all pods.
  • Confirm Redis connectivity and cache warming.
  • Validate worker queues are draining normally.
Short-term (1-4 hours):
  • Run data integrity checks:
    -- Check for orphaned records
    SELECT count(*) FROM decision_logs dl
    LEFT JOIN customers c ON dl.customer_id = c.id
    WHERE c.id IS NULL;
    
    -- Check sequence integrity
    SELECT schemaname, sequencename, last_value
    FROM pg_sequences
    WHERE schemaname = 'public';
    
  • Replay any messages from the DLQ.
  • Verify all scheduled jobs are running (cron, batch).
  • Check replication lag if read replicas are active.
  • Review and resolve any stuck pipelines.
Medium-term (4-24 hours):
  • Run full application test suite against production.
  • Verify dashboard data is complete and accurate.
  • Confirm backup jobs are running in the new configuration.
  • Update monitoring and alerting for the new infrastructure.
  • Notify stakeholders of resolution and any data impact.
Long-term (1-7 days):
  • Conduct a post-mortem and publish findings.
  • Update this runbook with lessons learned.
  • Plan failback to the primary region (if regional failover occurred).
  • Review and update RTO/RPO targets based on actual recovery time.
  • Reprovision DR infrastructure to be ready for the next event.

4.2 Failback to Primary Region

After a regional failover, fail back to the primary region once it is stable.
# === STEP 1: Verify primary region is stable ===
aws health describe-events --region us-east-1
# Wait at least 24 hours after region recovery before failing back

# === STEP 2: Sync data from DR to primary ===
# Create a snapshot of the DR database
aws rds create-db-snapshot \
  --region us-west-2 \
  --db-instance-identifier kaireon-dr \
  --db-snapshot-identifier "kaireon-failback-$(date +%Y%m%d-%H%M%S)"

# Copy to primary region (the source snapshot must be referenced by its full ARN
# when copying across regions; <ACCOUNT_ID> is a placeholder)
aws rds copy-db-snapshot \
  --region us-east-1 \
  --source-region us-west-2 \
  --source-db-snapshot-identifier "arn:aws:rds:us-west-2:<ACCOUNT_ID>:snapshot:kaireon-failback-20260223-120000" \
  --target-db-snapshot-identifier "kaireon-failback-20260223-120000"

# Restore in primary region
aws rds restore-db-instance-from-db-snapshot \
  --region us-east-1 \
  --db-instance-identifier kaireon-prod-restored \
  --db-snapshot-identifier "kaireon-failback-20260223-120000" \
  --db-instance-class db.r6g.xlarge \
  --multi-az

# === STEP 3: Switch traffic back ===
# Follow the same DNS update procedure as failover but point back to us-east-1

# === STEP 4: Decommission DR resources ===
# After 48 hours of stable operation on primary:
aws rds delete-db-instance --region us-west-2 --db-instance-identifier kaireon-dr --skip-final-snapshot
aws elasticache delete-replication-group --region us-west-2 --replication-group-id kaireon-redis-dr

4.3 Data Reconciliation

After any failover that may have caused data divergence:
-- Compare record counts between backup and current
-- (Run against a restored copy of the pre-failure backup)

-- Decision logs gap analysis
SELECT
  date_trunc('hour', created_at) AS hour,
  count(*) AS decisions
FROM decision_logs
WHERE created_at > NOW() - INTERVAL '48 hours'
GROUP BY hour
ORDER BY hour;

-- Identify the gap window
SELECT
  min(created_at) AS first_record,
  max(created_at) AS last_record,
  count(*) AS total_records
FROM decision_logs
WHERE created_at > NOW() - INTERVAL '48 hours';
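Once the gap window is identified, the reconciliation burden can be sized by multiplying its duration by the baseline decision rate. A small helper; the 120 decisions/min figure is an assumed example, take yours from the hourly counts returned by the gap-analysis query above:

```shell
# Estimate records lost in the gap window from a baseline rate.
# 120 decisions/min is an illustrative figure, not a measured baseline.
estimated_missing() {
  local gap_min="$1" rate_per_min="$2" recovered="$3"
  echo $(( gap_min * rate_per_min - recovered ))
}

estimated_missing 16 120 0   # 16 min gap at 120/min, nothing recovered: 1920
```

Compare this estimate against the DLQ replay count; a large unexplained difference means decisions were lost rather than queued.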

5. DR Testing Schedule

Annual DR Testing Calendar

| Month | Test Type | Scope | Duration | Impact |
|-------|-----------|-------|----------|--------|
| January | Backup restore | Restore RDS snapshot to test instance | 2 hours | None |
| February | Redis failover | ElastiCache failover test | 30 min | Brief cache miss |
| March | RDS failover | Multi-AZ failover test | 1 hour | 1-2 min downtime |
| April | Backup restore | Full pg_dump restore verification | 2 hours | None |
| May | Node failure | Drain a worker node | 1 hour | None (PDB protected) |
| June | Full DR drill | Regional failover to us-west-2 | 4 hours | Planned downtime |
| July | Backup restore | Restore RDS snapshot to test instance | 2 hours | None |
| August | Redis failover | ElastiCache failover test | 30 min | Brief cache miss |
| September | RDS failover | Multi-AZ failover test | 1 hour | 1-2 min downtime |
| October | Backup restore | Full pg_dump restore verification | 2 hours | None |
| November | Node failure | Simulate AZ failure | 2 hours | None (multi-AZ) |
| December | Full DR drill | Regional failover to us-west-2 | 4 hours | Planned downtime |

DR Test Procedure

Pre-test checklist:
  • Announce the test in #kaireon-incidents and #engineering at least 48 hours in advance.
  • Create a fresh RDS snapshot.
  • Verify DR region infrastructure is provisioned.
  • Assign roles: Incident Commander, Communications Lead, Technical Lead.
  • Prepare the rollback plan.
During the test:
  • Start a timer when the test begins.
  • Follow the relevant failover procedure from Section 3.
  • Record actual times for each step.
  • Monitor error rates, latency, and data integrity throughout.
  • Document any deviations from the runbook.
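Starting the timer and recording per-step times can be done with two small functions kept in the drill terminal. A minimal sketch:

```shell
# Minimal drill timer: record a start epoch, then log elapsed time per step.
drill_start() { DRILL_T0=$(date -u +%s); echo "Drill started at $(date -u)"; }

# Second argument overrides "now" (epoch seconds) when backfilling times from notes.
drill_mark() {
  local now=${2:-$(date -u +%s)} elapsed
  elapsed=$(( now - DRILL_T0 ))
  printf '[+%02d:%02d] %s\n' $((elapsed / 60)) $((elapsed % 60)) "$1"
}

# Usage during the drill:
#   drill_start
#   drill_mark "DNS switched to DR"
```

The `[+MM:SS]` prefixes paste directly into the Timeline table of the test report below.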
Post-test checklist:
  • Fail back to the primary region (if regional test).
  • Verify all services are healthy.
  • Record actual RTO and RPO achieved.
  • Write a DR test report including:
    • Actual RTO vs. target RTO.
    • Actual RPO vs. target RPO.
    • Issues encountered.
    • Runbook updates needed.
    • Action items with owners and deadlines.
  • Update this runbook with findings.

DR Test Report Template

# DR Test Report - [Date]

## Summary
- **Test type:** [Regional failover / RDS failover / Redis failover / etc.]
- **Date:** [YYYY-MM-DD]
- **Duration:** [HH:MM]
- **Participants:** [Names]

## Results
| Metric | Target | Actual | Pass/Fail |
|--------|--------|--------|-----------|
| RTO | [target] | [actual] | [P/F] |
| RPO | [target] | [actual] | [P/F] |
| Data integrity | 100% | [actual]% | [P/F] |

## Timeline
| Time | Action | Result |
|------|--------|--------|
| HH:MM | [action] | [result] |

## Issues Found
1. [Issue description, severity, owner, deadline]

## Runbook Updates Required
1. [Update description]

## Action Items
- [ ] [Action] - Owner: [name] - Due: [date]

Appendix: Emergency Contacts

| Role | Name | Contact | Escalation Time |
|------|------|---------|-----------------|
| On-call engineer | PagerDuty rotation | PagerDuty | Immediate |
| Team lead | [TBD] | [TBD] | 5 min (SEV1), 15 min (SEV2) |
| VP Engineering | [TBD] | [TBD] | 15 min (SEV1) |
| AWS TAM | [TBD] | [TBD] | 30 min (SEV1) |

Appendix: Key AWS Resources

| Resource | Identifier | Region |
|----------|------------|--------|
| EKS Cluster (primary) | kaireon-prod | us-east-1 |
| EKS Cluster (DR) | kaireon-dr | us-west-2 |
| RDS Instance (primary) | kaireon-prod | us-east-1 |
| ElastiCache (primary) | kaireon-redis | us-east-1 |
| S3 Backups (primary) | kaireon-backups | us-east-1 |
| S3 Backups (DR) | kaireon-backups-dr | us-west-2 |
| Route 53 Hosted Zone | ZXXXXXXXXXXXXX | Global |
| ACM Certificate | arn:aws:acm:… | us-east-1 |