Audience: SREs, platform operators, engineering leadership
Last updated: 2026-02-23
Infrastructure: AWS (EKS, RDS, ElastiCache, S3)
Classification: Internal — Confidential
## Table of Contents
1. RTO/RPO Targets
2. Backup Strategy
3. Failover Procedures
4. Recovery Steps
5. DR Testing Schedule
## 1. RTO/RPO Targets
### Definitions
- RTO (Recovery Time Objective): Maximum acceptable time from failure to full service restoration.
- RPO (Recovery Point Objective): Maximum acceptable data loss measured in time.
### Targets by Tier
| Component | Tier | RTO | RPO | Strategy |
|---|---|---|---|---|
| Decision API | Critical | 5 min | 0 (zero data loss) | Multi-AZ, auto-failover |
| PostgreSQL (RDS) | Critical | 10 min | 1 min | Multi-AZ, continuous backup |
| Redis (ElastiCache) | High | 5 min | 5 min | Multi-AZ replica, auto-failover |
| Worker queues | High | 15 min | 5 min | Redis persistence + DLQ replay |
| Pipeline execution | Medium | 30 min | 15 min | Re-run from last checkpoint |
| Dashboard / UI | Medium | 15 min | N/A (stateless) | Multi-replica, ALB health checks |
| Batch analytics | Low | 4 hours | 1 hour | Re-run from source data |
### SLA Commitments
| SLA Metric | Target | Measurement Window |
|---|---|---|
| Availability | 99.95% | Monthly |
| Decision API uptime | 99.99% | Monthly |
| Data durability | 99.999999999% (11 nines) | Annual (S3 backed) |
| Planned maintenance downtime | < 30 min/month | Monthly |
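These targets translate into concrete monthly downtime budgets. A small helper for quick reference (a minimal sketch, assuming a 30-day month of 43,200 minutes):

```shell
# Convert an availability target into a monthly downtime budget.
# Assumes a 30-day month (43,200 minutes); adjust for exact calendars.
downtime_budget_minutes() {
    # $1 = availability percentage, e.g. 99.95
    awk -v pct="$1" 'BEGIN { printf "%.2f", 43200 * (1 - pct / 100) }'
}

echo "99.95% allows $(downtime_budget_minutes 99.95) min/month of downtime"   # 21.60
echo "99.99% allows $(downtime_budget_minutes 99.99) min/month of downtime"   # 4.32
```

Note that the 99.99% Decision API target leaves well under five minutes of monthly downtime, which is why it relies on automatic Multi-AZ failover rather than manual intervention.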
## 2. Backup Strategy
### PostgreSQL Backups
Automated (RDS):
- Automated daily snapshots at 02:00 UTC.
- Retention: 7 days (automated), 90 days (manual).
- Continuous WAL archiving enables point-in-time recovery (PITR) to any second within the retention window.
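PITR can be rehearsed against a throwaway instance without touching production. A hedged sketch: the target identifier and restore time are examples, and `DRY_RUN=1` (the default) prints the command instead of executing it:

```shell
# Sketch of a point-in-time restore to a throwaway verification instance.
# DRY_RUN=1 (default) only prints the command; run with DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

restore_pitr() {
    # $1 = restore target time, ISO 8601 UTC, e.g. 2026-02-23T01:55:00Z
    run aws rds restore-db-instance-to-point-in-time \
        --source-db-instance-identifier kaireon-prod \
        --target-db-instance-identifier kaireon-pitr-verify \
        --restore-time "$1" \
        --no-multi-az
}

restore_pitr "2026-02-23T01:55:00Z"
```

The restored instance should be deleted after verification; it is billed like any other RDS instance.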
### Redis Backups
ElastiCache snapshots:
### Application and Configuration Backups
| Asset | Backup Location | Frequency | Retention |
|---|---|---|---|
| Infrastructure code (Terraform) | Git (GitHub) | Every commit | Indefinite |
| Kubernetes manifests | Git (GitHub) | Every commit | Indefinite |
| Application code | Git (GitHub) | Every commit | Indefinite |
| Secrets | AWS Secrets Manager | On change | 30 days versioned |
| SSL certificates | AWS ACM / Secrets Manager | On renewal | Previous + current |
| PgBouncer config | Git + ConfigMap | On change | Indefinite |
| Pipeline artifacts | S3 (kaireon-artifacts) | Per execution | 90 days |
Backup Verification
Backups are verified monthly. See DR Testing Schedule.3. Failover Procedures
### 3.1 RDS Multi-AZ Failover
Automatic failover occurs when:
- The primary instance becomes unreachable.
- The primary AZ experiences an outage.
- The primary instance is rebooted with failover.
- The instance type is modified (with apply-immediately).
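The reboot-with-failover trigger above is also the standard way to exercise Multi-AZ failover deliberately (as in the March and September tests). A sketch, dry-run by default:

```shell
# Manually exercise RDS Multi-AZ failover by rebooting with failover.
# DRY_RUN=1 (default) prints the command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

run aws rds reboot-db-instance \
    --db-instance-identifier kaireon-prod \
    --force-failover
```

Expect 1-2 minutes of write unavailability while DNS flips to the former standby.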
### 3.2 Redis Failover
ElastiCache Multi-AZ automatic failover:
- Promotes a read replica to primary within 60 seconds.
- Application reconnects automatically via the primary endpoint.
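Failover can also be exercised on demand via the ElastiCache test-failover API (used for the February and August tests). A sketch; the node group ID `0001` is an assumption, so confirm it first with `aws elasticache describe-replication-groups`:

```shell
# Trigger a controlled ElastiCache failover test on one node group.
# DRY_RUN=1 (default) prints the command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

run aws elasticache test-failover \
    --replication-group-id kaireon-redis \
    --node-group-id 0001
```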
### 3.3 EKS Multi-Node Failover
Node failure handling:
- Kubernetes automatically reschedules pods from failed nodes.
- PodDisruptionBudgets ensure minimum replicas remain available.
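Node-failure handling can be rehearsed by draining a worker node (as in the May test). A sketch; the node name is a placeholder, and `--delete-emptydir-data` assumes kubectl 1.20 or newer:

```shell
# Rehearse node-failure handling by cordoning and draining one worker node.
# NODE is a placeholder; DRY_RUN=1 (default) prints commands only.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

NODE="${NODE:-ip-10-0-1-23.ec2.internal}"
run kubectl cordon "$NODE"
run kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=120s
# Afterwards, confirm PDB-protected workloads kept their minimum replicas:
run kubectl get pdb -A
```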
### 3.4 Full Region Failure
This is the most severe scenario. Follow these steps in order.
Prerequisites:
- DR region (us-west-2) has a standby EKS cluster (provisioned via Terraform).
- Database snapshots are replicated cross-region.
- Container images are in ECR with cross-region replication.
- DNS is managed via Route 53 with health checks.
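Before starting a regional failover, the prerequisites above can be spot-checked from the CLI. A sketch using the identifiers from the appendix; commands only print (DRY_RUN=1) by default:

```shell
# Spot-check regional-failover prerequisites in the DR region (us-west-2).
# DRY_RUN=1 (default) prints commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# Cross-region snapshot copies present?
run aws rds describe-db-snapshots --region us-west-2 \
    --query "DBSnapshots[?contains(DBSnapshotIdentifier, 'kaireon')].SnapshotCreateTime"
# Standby EKS cluster healthy?
run aws eks describe-cluster --region us-west-2 --name kaireon-dr --query cluster.status
# Route 53 health checks in place?
run aws route53 list-health-checks --query "HealthChecks[].HealthCheckConfig"
```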
## 4. Recovery Steps
### 4.1 Post-Incident Recovery Checklist
After any disaster recovery event, complete the following steps.
Immediate (0-1 hour):
- Verify all services are healthy (`/healthz` returns 200).
- Confirm the decision API is processing requests.
- Check error rates are at baseline levels.
- Verify database connectivity from all pods.
- Confirm Redis connectivity and cache warming.
- Validate worker queues are draining normally.
- Run data integrity checks:
- Replay any messages from the DLQ.
- Verify all scheduled jobs are running (cron, batch).
- Check replication lag if read replicas are active.
- Review and resolve any stuck pipelines.
- Run full application test suite against production.
- Verify dashboard data is complete and accurate.
- Confirm backup jobs are running in the new configuration.
- Update monitoring and alerting for the new infrastructure.
- Notify stakeholders of resolution and any data impact.
- Conduct a post-mortem and publish findings.
- Update this runbook with lessons learned.
- Plan failback to the primary region (if regional failover occurred).
- Review and update RTO/RPO targets based on actual recovery time.
- Reprovision DR infrastructure to be ready for the next event.
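The service-health items at the top of this checklist can be swept with a small script. A sketch: the endpoint URLs are placeholders, and `HTTP_GET` is injectable so the function can be tested without network access:

```shell
# Minimal post-recovery health sweep: every endpoint must return HTTP 200.
# Endpoint URLs are illustrative; override HTTP_GET to stub out curl in tests.
HTTP_GET="${HTTP_GET:-curl -s -o /dev/null -w %{http_code}}"

verify_endpoints() {
    failed=0
    for url in "$@"; do
        code=$($HTTP_GET "$url")
        if [ "$code" = "200" ]; then
            echo "OK   $url"
        else
            echo "FAIL $url (HTTP $code)"
            failed=1
        fi
    done
    return "$failed"
}

# Example sweep (hostnames are placeholders):
# verify_endpoints https://api.kaireon.example/healthz https://app.kaireon.example/healthz
```

A nonzero exit status means at least one endpoint is unhealthy, which makes the sweep easy to wire into a post-recovery CI job or cron check.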
### 4.2 Failback to Primary Region
After a regional failover, fail back to the primary region once it is stable.
### 4.3 Data Reconciliation
After any failover that may have caused data divergence, reconcile data between the regions before declaring recovery complete.
## 5. DR Testing Schedule
### Annual DR Testing Calendar
| Month | Test Type | Scope | Duration | Impact |
|---|---|---|---|---|
| January | Backup restore | Restore RDS snapshot to test instance | 2 hours | None |
| February | Redis failover | ElastiCache failover test | 30 min | Brief cache miss |
| March | RDS failover | Multi-AZ failover test | 1 hour | 1-2 min downtime |
| April | Backup restore | Full pg_dump restore verification | 2 hours | None |
| May | Node failure | Drain a worker node | 1 hour | None (PDB protected) |
| June | Full DR drill | Regional failover to us-west-2 | 4 hours | Planned downtime |
| July | Backup restore | Restore RDS snapshot to test instance | 2 hours | None |
| August | Redis failover | ElastiCache failover test | 30 min | Brief cache miss |
| September | RDS failover | Multi-AZ failover test | 1 hour | 1-2 min downtime |
| October | Backup restore | Full pg_dump restore verification | 2 hours | None |
| November | Node failure | Simulate AZ failure | 2 hours | None (multi-AZ) |
| December | Full DR drill | Regional failover to us-west-2 | 4 hours | Planned downtime |
### DR Test Procedure
Pre-test checklist:
- Announce the test in `#kaireon-incidents` and `#engineering` at least 48 hours in advance.
- Create a fresh RDS snapshot.
- Verify DR region infrastructure is provisioned.
- Assign roles: Incident Commander, Communications Lead, Technical Lead.
- Prepare the rollback plan.
During the test:
- Start a timer when the test begins.
- Follow the relevant failover procedure from Section 3.
- Record actual times for each step.
- Monitor error rates, latency, and data integrity throughout.
- Document any deviations from the runbook.
- Fail back to the primary region (if regional test).
- Verify all services are healthy.
After the test:
- Record actual RTO and RPO achieved.
- Write a DR test report including:
  - Actual RTO vs. target RTO.
  - Actual RPO vs. target RPO.
  - Issues encountered.
  - Runbook updates needed.
  - Action items with owners and deadlines.
- Update this runbook with findings.
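The achieved RTO recorded in the report can be computed directly from the timestamps captured during the test. A sketch; the epoch values below are examples, and real ones should be captured with `date -u +%s` at the moment of failure detection and full restoration:

```shell
# Compute achieved RTO (in whole minutes) from epoch-second timestamps.
rto_minutes() {
    # $1 = failure detected (epoch seconds), $2 = service restored (epoch seconds)
    echo $(( ($2 - $1) / 60 ))
}

start=1767225600   # example timestamp: failure detected
end=1767226740     # example timestamp: 19 minutes later
echo "Achieved RTO: $(rto_minutes "$start" "$end") min"
```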
### DR Test Report Template
## Appendix: Emergency Contacts
| Role | Name | Contact | Escalation Time |
|---|---|---|---|
| On-call engineer | PagerDuty rotation | PagerDuty | Immediate |
| Team lead | [TBD] | [TBD] | 5 min (SEV1), 15 min (SEV2) |
| VP Engineering | [TBD] | [TBD] | 15 min (SEV1) |
| AWS TAM | [TBD] | [TBD] | 30 min (SEV1) |
## Appendix: Key AWS Resources
| Resource | Identifier | Region |
|---|---|---|
| EKS Cluster (primary) | kaireon-prod | us-east-1 |
| EKS Cluster (DR) | kaireon-dr | us-west-2 |
| RDS Instance (primary) | kaireon-prod | us-east-1 |
| ElastiCache (primary) | kaireon-redis | us-east-1 |
| S3 Backups (primary) | kaireon-backups | us-east-1 |
| S3 Backups (DR) | kaireon-backups-dr | us-west-2 |
| Route 53 Hosted Zone | ZXXXXXXXXXXXXX | Global |
| ACM Certificate | arn:aws:acm:… | us-east-1 |