Audience: SREs, platform operators, DevOps engineers
Last updated: 2026-02-23
Infrastructure: EKS (Kubernetes), RDS PostgreSQL, ElastiCache Redis, PgBouncer
Table of Contents
- Scaling the API Layer
- Scaling Workers
- Scaling the Database
- Scaling Redis
- Scaling PgBouncer
- Capacity Planning
1. Scaling the API Layer
Current Architecture
Horizontal Pod Autoscaler (HPA)
Default HPA configuration (from Helm chart):
HPA Tuning
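For reference while tuning, a representative `autoscaling/v2` manifest of the kind the Helm chart renders; the names and values below are illustrative, not the chart's actual defaults:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api                       # illustrative; use the chart's release name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60    # the target CPU utilization being tuned
```

Tuning usually means adjusting `averageUtilization`, `minReplicas`, and `maxReplicas` in the chart values rather than editing the rendered object.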
Adjusting target CPU utilization:
Manual Scaling
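When the HPA needs to be overridden, replicas can be pinned directly; a sketch, assuming the deployment and HPA are both named `api` in namespace `prod`:

```shell
# Raise the HPA floor so it cannot scale back down below the pinned count
kubectl patch hpa api -n prod --patch '{"spec":{"minReplicas":15}}'

# Or scale the deployment directly (the HPA may revert this within its sync period)
kubectl scale deployment api -n prod --replicas=15
```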
For planned events or emergencies:
Pod Resource Tuning
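Resource tuning happens on the container spec; an illustrative request/limit block (values are assumptions, not measured baselines):

```yaml
# Nested under the API container in the pod template
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi
```

Requests drive HPA utilization math and scheduling; raising limits without raising requests changes neither.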
Pre-scaling for Planned Events
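One way to pre-scale is to raise the HPA floor ahead of the event and restore it afterwards (HPA name and namespace are assumptions):

```shell
# Before the event: guarantee capacity
kubectl patch hpa api -n prod --patch '{"spec":{"minReplicas":10}}'

# After the event: restore the default floor
kubectl patch hpa api -n prod --patch '{"spec":{"minReplicas":3}}'
```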
2. Scaling Workers
Current Architecture
KEDA Autoscaler
Current KEDA configuration:
KEDA Tuning
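A ScaledObject with a Redis list trigger exposes the fields tuned in this section; the object name, Redis endpoint, and queue key below are assumptions:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker
spec:
  scaleTargetRef:
    name: worker                     # worker Deployment name (assumed)
  pollingInterval: 15                # seconds between queue checks
  cooldownPeriod: 300                # seconds before scaling to minReplicaCount
  minReplicaCount: 2
  maxReplicaCount: 30
  triggers:
  - type: redis
    metadata:
      address: redis.internal:6379   # assumed endpoint
      listName: jobs                 # assumed queue key
      listLength: "50"               # target list length per replica
```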
Adjusting trigger thresholds:
| Scenario | listLength | pollingInterval (s) | cooldownPeriod (s) | maxReplicas |
|---|---|---|---|---|
| Normal | 50 | 15 | 300 | 30 |
| High throughput | 20 | 10 | 180 | 50 |
| Cost sensitive | 100 | 30 | 600 | 15 |
| Batch processing | 10 | 10 | 60 | 50 |
Manual Worker Scaling
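KEDA supports pinning replicas with the `autoscaling.keda.sh/paused-replicas` annotation (KEDA 2.7+), which is safer than fighting the autoscaler with `kubectl scale`; the ScaledObject name and namespace are assumptions:

```shell
# Pause autoscaling and hold the workers at a fixed replica count
kubectl annotate scaledobject worker -n prod autoscaling.keda.sh/paused-replicas="20"

# Remove the annotation to resume autoscaling
kubectl annotate scaledobject worker -n prod autoscaling.keda.sh/paused-replicas-
```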
Worker Resource Tuning
Workers are CPU-intensive during scoring and memory-intensive during pipeline execution.
Queue-Specific Worker Pools
For isolating workloads, deploy separate worker pools per queue:
3. Scaling the Database
Vertical Scaling (RDS)
Instance type progression:
| Tier | Instance Type | vCPUs | Memory | Max Connections | Use Case |
|---|---|---|---|---|---|
| Dev | db.t3.medium | 2 | 4 GB | 100 | Development |
| Small | db.r6g.large | 2 | 16 GB | 200 | Low traffic |
| Medium | db.r6g.xlarge | 4 | 32 GB | 400 | Standard |
| Large | db.r6g.2xlarge | 8 | 64 GB | 800 | High traffic |
| XLarge | db.r6g.4xlarge | 16 | 128 GB | 1600 | Peak traffic |
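Moving between tiers is a single `aws rds modify-db-instance` call; the instance identifier below is an assumption, and `--apply-immediately` triggers a failover on Multi-AZ instances (or brief downtime on single-AZ):

```shell
aws rds modify-db-instance \
  --db-instance-identifier prod-postgres \
  --db-instance-class db.r6g.2xlarge \
  --apply-immediately
```

Without `--apply-immediately`, the change waits for the next maintenance window.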
Read Replicas
Use read replicas to offload read-heavy queries (dashboards, analytics, reporting). Creating a read replica:
Storage Scaling
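Allocated storage can be grown in place (never shrunk); setting `--max-allocated-storage` also enables RDS storage autoscaling up to that ceiling. The identifier and sizes here are assumptions:

```shell
aws rds modify-db-instance \
  --db-instance-identifier prod-postgres \
  --allocated-storage 500 \
  --max-allocated-storage 1000 \
  --apply-immediately
```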
Connection Limit Scaling
When scaling the database vertically, adjust `max_connections` accordingly:
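`max_connections` is a static parameter, so it is changed through the instance's parameter group and takes effect at the next reboot; the group name is an assumption:

```shell
aws rds modify-db-parameter-group \
  --db-parameter-group-name prod-postgres-params \
  --parameters "ParameterName=max_connections,ParameterValue=800,ApplyMethod=pending-reboot"
```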
4. Scaling Redis
Vertical Scaling (ElastiCache)
Instance type progression:
| Tier | Instance Type | Memory | Network | Use Case |
|---|---|---|---|---|
| Dev | cache.t3.small | 1.37 GB | Low-Moderate | Development |
| Small | cache.r6g.large | 13.07 GB | Up to 10 Gbps | Standard |
| Medium | cache.r6g.xlarge | 26.32 GB | Up to 10 Gbps | High cache |
| Large | cache.r6g.2xlarge | 52.82 GB | Up to 10 Gbps | Large dataset |
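Node type changes go through `modify-replication-group`; the replication group id below is an assumption. Recent Redis engine versions scale vertically online, but plan for elevated latency during the resize:

```shell
aws elasticache modify-replication-group \
  --replication-group-id prod-redis \
  --cache-node-type cache.r6g.xlarge \
  --apply-immediately
```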
Cluster Mode
Use cluster mode for datasets that exceed single-node memory or workloads that need higher throughput. Enabling cluster mode:
Redis Memory Management
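To watch memory pressure and evictions, the standard counters are in `INFO`; the endpoint and parameter group names are assumptions. Note that ElastiCache blocks `CONFIG SET`, so eviction policy changes must go through a parameter group:

```shell
# Memory pressure and fragmentation
redis-cli -h prod-redis.internal INFO memory | grep -E 'used_memory_human|maxmemory_human|mem_fragmentation_ratio'

# A rising evicted_keys counter indicates maxmemory pressure
redis-cli -h prod-redis.internal INFO stats | grep evicted_keys

# Change eviction policy via the parameter group, not CONFIG SET
aws elasticache modify-cache-parameter-group \
  --cache-parameter-group-name prod-redis-params \
  --parameter-name-values "ParameterName=maxmemory-policy,ParameterValue=volatile-lru"
```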
Scaling Redis for Specific Use Cases
Decision caching (read-heavy):
- Use read replicas for read distribution.
- Set short TTLs (30-60s) to limit memory growth.
- Use `volatile-lru` eviction.

High-throughput workloads:
- Scale vertically for more throughput.
- Monitor `instantaneous_ops_per_sec`.
- Consider cluster mode if >100K ops/sec.

Session storage:
- Separate from cache Redis.
- Use the `noeviction` policy (sessions must not be evicted).
- Size for peak concurrent users.
5. Scaling PgBouncer
Pool Size Calculations
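A back-of-the-envelope sizing pass splits the PostgreSQL connection budget evenly across PgBouncer instances, leaving headroom for superuser and maintenance sessions. This is a simplification (real `default_pool_size` applies per user/database pair), and the reserved-connection count is an assumption:

```shell
# Assumptions: db.r6g.2xlarge tier from the table above, 20 reserved connections
pg_max_connections=800
reserved_superuser=20
pgbouncer_instances=4

# Connection budget available to PgBouncer
budget=$((pg_max_connections - reserved_superuser))

# Per-instance cap on server connections (max_db_connections)
max_db_connections=$((budget / pgbouncer_instances))
echo "max_db_connections per instance: ${max_db_connections}"
```

The sum of `max_db_connections` across instances must stay under the budget, which is why adding PgBouncer instances usually means shrinking the per-instance cap.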
Scaling PgBouncer Horizontally
Deploy multiple PgBouncer instances behind a Kubernetes Service:
| API Replicas | Worker Replicas | PgBouncer Instances | default_pool_size | max_db_connections |
|---|---|---|---|---|
| 3 | 2 | 1 | 25 | 80 |
| 10 | 10 | 2 | 25 | 100 |
| 20 | 20 | 3 | 30 | 120 |
| 30 | 30 | 4 | 25 | 100 |
`max_db_connections` across all PgBouncer instances must not exceed PostgreSQL `max_connections`.
Applying Pool Changes
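Pool settings in `pgbouncer.ini` take effect after a `RELOAD` issued on the PgBouncer admin console (host, port, and credentials below are assumptions); when the config is mounted from a ConfigMap, a rollout restart works as well:

```shell
# Reload config without dropping client connections
psql -h pgbouncer.internal -p 6432 -U admin pgbouncer -c 'RELOAD;'

# Alternative on Kubernetes after updating the ConfigMap
kubectl rollout restart deployment pgbouncer -n prod
```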
6. Capacity Planning
Metrics to Track
| Metric | Current | Warning Threshold | Scaling Action |
|---|---|---|---|
| API CPU utilization | - | >60% sustained | Increase HPA maxReplicas |
| API memory utilization | - | >75% sustained | Increase pod memory limit |
| Worker queue depth | - | >50 sustained | Lower KEDA listLength |
| DB CPU utilization | - | >70% sustained | Scale up instance type |
| DB connections used | - | >70% of max | Scale PgBouncer or DB |
| DB storage used | - | >70% of allocated | Increase storage |
| Redis memory used | - | >70% of maxmemory | Scale up instance type |
| Redis ops/sec | - | >80% of benchmark | Enable cluster mode |
Monthly Capacity Review Checklist
- Review 30-day trends for all metrics above.
- Project growth for the next 90 days based on customer onboarding pipeline.
- Identify any component within 30 days of hitting a threshold.
- Plan scaling actions with cost estimates.
- Update this document with new current values.
Cost-Aware Scaling
- Use Spot instances for batch workers (up to 70% savings).
- Use Reserved Instances for baseline API and database capacity.
- Use Graviton (arm64) instances for 20% better price-performance.
- Scale down non-production environments during off-hours:
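An off-hours scale-down can be as simple as a pair of scheduled jobs; the namespace and the flat baseline used here are assumptions:

```shell
# Evening job: scale every deployment in staging to zero
kubectl scale deployment --all --replicas=0 -n staging

# Morning job: restore a flat baseline (crude; per-deployment counts are better)
kubectl scale deployment --all --replicas=2 -n staging
```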
Load Testing Before Scaling
Before major scaling changes, validate with load tests:
- P50, P95, P99 latency at each VU level.
- Error rate at each VU level.
- Database connection count and query time.
- Redis memory and ops/sec.
- Pod CPU and memory utilization.
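The VU levels above can be stepped through with k6 while the database, Redis, and pod metrics are watched; the script path and VU levels here are assumptions:

```shell
# Run the same scenario at increasing VU levels (loadtest.js is assumed to exist)
k6 run --vus 50  --duration 5m loadtest.js
k6 run --vus 100 --duration 5m loadtest.js
k6 run --vus 200 --duration 5m loadtest.js
```

Ramping stages and latency thresholds can also be declared inside the script's `options` block instead of on the command line.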