Audience: SREs, platform operators, engineering leadership
Last updated: 2026-02-23
Infrastructure: AWS (EKS, RDS, ElastiCache, S3)
Classification: Internal — Confidential
## Table of Contents
1. RTO/RPO Targets
2. Backup Strategy
3. Failover Procedures
4. Recovery Steps
5. DR Testing Schedule
## 1. RTO/RPO Targets
### Definitions
- RTO (Recovery Time Objective): Maximum acceptable time from failure to full service restoration.
- RPO (Recovery Point Objective): Maximum acceptable data loss measured in time.
### Targets by Tier
| Component | Tier | RTO | RPO | Strategy |
|---|---|---|---|---|
| Decision API | Critical | 5 min | 0 (zero data loss) | Multi-AZ, auto-failover |
| PostgreSQL (RDS) | Critical | 10 min | 1 min | Multi-AZ, continuous backup |
| Redis (ElastiCache) | High | 5 min | 5 min | Multi-AZ replica, auto-failover |
| Worker queues | High | 15 min | 5 min | Redis persistence + DLQ replay |
| Pipeline execution | Medium | 30 min | 15 min | Re-run from last checkpoint |
| Dashboard / UI | Medium | 15 min | N/A (stateless) | Multi-replica, ALB health checks |
| Batch analytics | Low | 4 hours | 1 hour | Re-run from source data |
### SLA Commitments
| SLA Metric | Target | Measurement Window |
|---|---|---|
| Availability | 99.95% | Monthly |
| Decision API uptime | 99.99% | Monthly |
| Data durability | 99.999999999% (11 nines) | Annual (S3 backed) |
| Planned maintenance downtime | < 30 min/month | Monthly |
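These targets translate into concrete monthly downtime budgets. A small helper for quick reference (a minimal sketch, assuming a 30-day month of 43,200 minutes):

```shell
# Convert an availability target into a monthly downtime budget.
# Assumes a 30-day month (43,200 minutes); adjust for exact calendars.
downtime_budget_minutes() {
    # $1 = availability percentage, e.g. 99.95
    awk -v pct="$1" 'BEGIN { printf "%.2f", 43200 * (1 - pct / 100) }'
}

echo "99.95% allows $(downtime_budget_minutes 99.95) min/month of downtime"   # 21.60
echo "99.99% allows $(downtime_budget_minutes 99.99) min/month of downtime"   # 4.32
```

Note that the 99.99% Decision API target leaves well under five minutes of monthly downtime, which is why it relies on automatic Multi-AZ failover rather than manual intervention.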
## 2. Backup Strategy
### PostgreSQL Backups
Automated (RDS):
- Automated daily snapshots at 02:00 UTC.
- Retention: 7 days (automated), 90 days (manual).
- Continuous WAL archiving enables point-in-time recovery (PITR) to any second within the retention window.
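PITR can be rehearsed against a throwaway instance without touching production. A hedged sketch: the target identifier and restore time are examples, and `DRY_RUN=1` (the default) prints the command instead of executing it:

```shell
# Sketch of a point-in-time restore to a throwaway verification instance.
# DRY_RUN=1 (default) only prints the command; run with DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

restore_pitr() {
    # $1 = restore target time, ISO 8601 UTC, e.g. 2026-02-23T01:55:00Z
    run aws rds restore-db-instance-to-point-in-time \
        --source-db-instance-identifier kaireon-prod \
        --target-db-instance-identifier kaireon-pitr-verify \
        --restore-time "$1" \
        --no-multi-az
}

restore_pitr "2026-02-23T01:55:00Z"
```

The restored instance should be deleted after verification; it is billed like any other RDS instance.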
### Redis Backups
ElastiCache snapshots:
### Application and Configuration Backups
| Asset | Backup Location | Frequency | Retention |
|---|---|---|---|
| Infrastructure code (Terraform) | Git (GitHub) | Every commit | Indefinite |
| Kubernetes manifests | Git (GitHub) | Every commit | Indefinite |
| Application code | Git (GitHub) | Every commit | Indefinite |
| Secrets | AWS Secrets Manager | On change | 30 days versioned |
| SSL certificates | AWS ACM / Secrets Manager | On renewal | Previous + current |
| PgBouncer config | Git + ConfigMap | On change | Indefinite |
| Pipeline artifacts | S3 (kaireon-artifacts) | Per execution | 90 days |
Backup Verification
Backups are verified monthly. See DR Testing Schedule.3. Failover Procedures
### 3.1 RDS Multi-AZ Failover
Automatic failover occurs when:
- The primary instance becomes unreachable.
- The primary AZ experiences an outage.
- The primary instance is rebooted with failover.
- The instance type is modified (with apply-immediately).
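The reboot-with-failover trigger above is also the standard way to exercise Multi-AZ failover deliberately (as in the March and September tests). A sketch, dry-run by default:

```shell
# Manually exercise RDS Multi-AZ failover by rebooting with failover.
# DRY_RUN=1 (default) prints the command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

run aws rds reboot-db-instance \
    --db-instance-identifier kaireon-prod \
    --force-failover
```

Expect 1-2 minutes of write unavailability while DNS flips to the former standby.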
### 3.2 Redis Failover
ElastiCache Multi-AZ automatic failover:
- Promotes a read replica to primary within 60 seconds.
- Application reconnects automatically via the primary endpoint.
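Failover can also be exercised on demand via the ElastiCache test-failover API (used for the February and August tests). A sketch; the node group ID `0001` is an assumption, so confirm it first with `aws elasticache describe-replication-groups`:

```shell
# Trigger a controlled ElastiCache failover test on one node group.
# DRY_RUN=1 (default) prints the command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

run aws elasticache test-failover \
    --replication-group-id kaireon-redis \
    --node-group-id 0001
```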
### 3.3 EKS Multi-Node Failover
Node failure handling:
- Kubernetes automatically reschedules pods from failed nodes.
- PodDisruptionBudgets ensure minimum replicas remain available.
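Node-failure handling can be rehearsed by draining a worker node (as in the May test). A sketch; the node name is a placeholder, and `--delete-emptydir-data` assumes kubectl 1.20 or newer:

```shell
# Rehearse node-failure handling by cordoning and draining one worker node.
# NODE is a placeholder; DRY_RUN=1 (default) prints commands only.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

NODE="${NODE:-ip-10-0-1-23.ec2.internal}"
run kubectl cordon "$NODE"
run kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=120s
# Afterwards, confirm PDB-protected workloads kept their minimum replicas:
run kubectl get pdb -A
```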
### 3.4 Full Region Failure
This is the most severe scenario. Follow these steps in order.
Prerequisites:
- DR region (us-west-2) has a standby EKS cluster (provisioned via Terraform).
- Database snapshots are replicated cross-region.
- Container images are in ECR with cross-region replication.
- DNS is managed via Route 53 with health checks.
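Before starting a regional failover, the prerequisites above can be spot-checked from the CLI. A sketch using the identifiers from the appendix; commands only print (DRY_RUN=1) by default:

```shell
# Spot-check regional-failover prerequisites in the DR region (us-west-2).
# DRY_RUN=1 (default) prints commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# Cross-region snapshot copies present?
run aws rds describe-db-snapshots --region us-west-2 \
    --query "DBSnapshots[?contains(DBSnapshotIdentifier, 'kaireon')].SnapshotCreateTime"
# Standby EKS cluster healthy?
run aws eks describe-cluster --region us-west-2 --name kaireon-dr --query cluster.status
# Route 53 health checks in place?
run aws route53 list-health-checks --query "HealthChecks[].HealthCheckConfig"
```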
## 4. Recovery Steps
### 4.1 Post-Incident Recovery Checklist
After any disaster recovery event, complete the following steps.
Immediate (0-1 hour):
- Verify all services are healthy (`/healthz` returns 200).
- Confirm the decision API is processing requests.
- Check error rates are at baseline levels.
- Verify database connectivity from all pods.
- Confirm Redis connectivity and cache warming.
- Validate worker queues are draining normally.
- Run data integrity checks:
- Replay any messages from the DLQ.
- Verify all scheduled jobs are running (cron, batch).
- Check replication lag if read replicas are active.
- Review and resolve any stuck pipelines.
- Run full application test suite against production.
- Verify dashboard data is complete and accurate.
- Confirm backup jobs are running in the new configuration.
- Update monitoring and alerting for the new infrastructure.
- Notify stakeholders of resolution and any data impact.
- Conduct a post-mortem and publish findings.
- Update this runbook with lessons learned.
- Plan failback to the primary region (if regional failover occurred).
- Review and update RTO/RPO targets based on actual recovery time.
- Reprovision DR infrastructure to be ready for the next event.
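The service-health items at the top of this checklist can be swept with a small script. A sketch: the endpoint URLs are placeholders, and `HTTP_GET` is injectable so the function can be tested without network access:

```shell
# Minimal post-recovery health sweep: every endpoint must return HTTP 200.
# Endpoint URLs are illustrative; override HTTP_GET to stub out curl in tests.
HTTP_GET="${HTTP_GET:-curl -s -o /dev/null -w %{http_code}}"

verify_endpoints() {
    failed=0
    for url in "$@"; do
        code=$($HTTP_GET "$url")
        if [ "$code" = "200" ]; then
            echo "OK   $url"
        else
            echo "FAIL $url (HTTP $code)"
            failed=1
        fi
    done
    return "$failed"
}

# Example sweep (hostnames are placeholders):
# verify_endpoints https://api.kaireon.example/healthz https://app.kaireon.example/healthz
```

A nonzero exit status means at least one endpoint is unhealthy, which makes the sweep easy to wire into a post-recovery CI job or cron check.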
### 4.2 Failback to Primary Region
After a regional failover, fail back to the primary region once it is stable.
### 4.3 Data Reconciliation
After any failover that may have caused data divergence, reconcile data between the regions before declaring recovery complete.
## 5. DR Testing Schedule
### Annual DR Testing Calendar
| Month | Test Type | Scope | Duration | Impact |
|---|---|---|---|---|
| January | Backup restore | Restore RDS snapshot to test instance | 2 hours | None |
| February | Redis failover | ElastiCache failover test | 30 min | Brief cache miss |
| March | RDS failover | Multi-AZ failover test | 1 hour | 1-2 min downtime |
| April | Backup restore | Full pg_dump restore verification | 2 hours | None |
| May | Node failure | Drain a worker node | 1 hour | None (PDB protected) |
| June | Full DR drill | Regional failover to us-west-2 | 4 hours | Planned downtime |
| July | Backup restore | Restore RDS snapshot to test instance | 2 hours | None |
| August | Redis failover | ElastiCache failover test | 30 min | Brief cache miss |
| September | RDS failover | Multi-AZ failover test | 1 hour | 1-2 min downtime |
| October | Backup restore | Full pg_dump restore verification | 2 hours | None |
| November | Node failure | Simulate AZ failure | 2 hours | None (multi-AZ) |
| December | Full DR drill | Regional failover to us-west-2 | 4 hours | Planned downtime |
### DR Test Procedure
Pre-test checklist:
- Announce the test in `#kaireon-incidents` and `#engineering` at least 48 hours in advance.
- Create a fresh RDS snapshot.
- Verify DR region infrastructure is provisioned.
- Assign roles: Incident Commander, Communications Lead, Technical Lead.
- Prepare the rollback plan.
During the test:
- Start a timer when the test begins.
- Follow the relevant failover procedure from Section 3.
- Record actual times for each step.
- Monitor error rates, latency, and data integrity throughout.
- Document any deviations from the runbook.
- Fail back to the primary region (if regional test).
- Verify all services are healthy.
After the test:
- Record actual RTO and RPO achieved.
- Write a DR test report including:
  - Actual RTO vs. target RTO.
  - Actual RPO vs. target RPO.
  - Issues encountered.
  - Runbook updates needed.
  - Action items with owners and deadlines.
- Update this runbook with findings.
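The achieved RTO recorded in the report can be computed directly from the timestamps captured during the test. A sketch; the epoch values below are examples, and real ones should be captured with `date -u +%s` at the moment of failure detection and full restoration:

```shell
# Compute achieved RTO (in whole minutes) from epoch-second timestamps.
rto_minutes() {
    # $1 = failure detected (epoch seconds), $2 = service restored (epoch seconds)
    echo $(( ($2 - $1) / 60 ))
}

start=1767225600   # example timestamp: failure detected
end=1767226740     # example timestamp: 19 minutes later
echo "Achieved RTO: $(rto_minutes "$start" "$end") min"
```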
### DR Test Report Template
## Appendix: Emergency Contacts
| Role | Name | Contact | Escalation Time |
|---|---|---|---|
| On-call engineer | PagerDuty rotation | PagerDuty | Immediate |
| Team lead | [TBD] | [TBD] | 5 min (SEV1), 15 min (SEV2) |
| VP Engineering | [TBD] | [TBD] | 15 min (SEV1) |
| AWS TAM | [TBD] | [TBD] | 30 min (SEV1) |
## Appendix: Key AWS Resources
| Resource | Identifier | Region |
|---|---|---|
| EKS Cluster (primary) | kaireon-prod | us-east-1 |
| EKS Cluster (DR) | kaireon-dr | us-west-2 |
| RDS Instance (primary) | kaireon-prod | us-east-1 |
| ElastiCache (primary) | kaireon-redis | us-east-1 |
| S3 Backups (primary) | kaireon-backups | us-east-1 |
| S3 Backups (DR) | kaireon-backups-dr | us-west-2 |
| Route 53 Hosted Zone | ZXXXXXXXXXXXXX | Global |
| ACM Certificate | arn:aws:acm:… | us-east-1 |