Audience: On-call engineers, SREs, platform operators
First written: 2026-05-06 (after a free-tier Upstash quota exhaustion incident)
Related: /api/v1/cron/drain-queues reference, Env vars, Incident response
This runbook captures everything you need to operate KaireonAI’s BullMQ worker queues on free-tier or low-budget Redis. It exists because, on 2026-05-06, the playground hit Upstash’s 500K-commands/month limit purely from idle worker polling — zero queued jobs, ~1.3M ops/month wasted on BRPOPLPUSH polls. The fix took ~30 minutes; the patterns below prevent it from happening again.
1. The two worker modes
KaireonAI ships with two execution modes:

| Mode | Toggle | Behavior | Job latency | Idle Redis cost |
|---|---|---|---|---|
| Always-on | WORKER_INPROCESS=1 (default) | Five BullMQ workers run continuously inside the API container, each long-polling Redis with the BRPOPLPUSH command. | ≈ 0 (jobs picked up the moment they’re enqueued) | ~30 ops/min permanently |
| Cron-driven drain | WORKER_INPROCESS=0 | No always-on workers. A scheduled cron hits POST /api/v1/cron/drain-queues every few minutes; each invocation connects, processes available jobs, disconnects. | ≤ cron interval (5 min default) | ~20 ops per invocation |
With five queues (batch-jobs, dsar-jobs, journey-jobs, retrain-jobs, seed-jobs), a 5-min cron sustains low-latency batch + DSAR + retrain workloads while burning ~170K ops/month idle, comfortably under the Upstash free tier’s 500K. A 30-min cron drops to ~29K ops/month idle but adds up to 30 min of latency for batch / DSAR / retrain (still fine, since these aren’t real-time).
When to pick which
| Workload | Recommended |
|---|---|
| Free-tier Redis (Upstash 500K, Fly Redis, etc.) | WORKER_INPROCESS=0 + 5-min cron |
| Paid Redis + active batch/journey traffic | WORKER_INPROCESS=1 |
| Mixed (paid Redis, but workers run elsewhere) | WORKER_INPROCESS=0 on API + dedicated kaireon-worker container |
2. Upstash quota math (at a glance)
Free-tier Upstash gives 500,000 Redis commands per month. Here’s where the budget goes:

| Source | Ops / month |
|---|---|
| 5 BullMQ workers polling idle (WORKER_INPROCESS=1) | ~1,300,000 |
| WORKER_INPROCESS=0 + 5-min drain cron, idle queues | ~173,000 |
| WORKER_INPROCESS=0 + 15-min drain cron, idle queues | ~58,000 |
| WORKER_INPROCESS=0 + 30-min drain cron, idle queues | ~29,000 |
| Per /recommend call with rate-limit + flow cache hit | ~5–8 |
| Real /recommend traffic (e.g., 500 calls/day) | ~75,000–120,000 |
With WORKER_INPROCESS=0 + 5-min cron, you can sustain ~50K /recommend calls/month on free-tier Upstash before rate-limit/cache cost becomes the binding factor. Above that, upgrade Redis.
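The figures in the table can be reproduced with a few lines of arithmetic. The sketch below assumes a 30-day month and uses the per-mode costs quoted in §1 (~30 ops/min always-on, ~20 ops per drain invocation); the function names are illustrative and not part of KaireonAI.

```python
# Back-of-the-envelope quota math, assuming a 30-day month.
MINUTES_PER_MONTH = 60 * 24 * 30   # 43,200
FREE_TIER_OPS = 500_000            # Upstash free-tier commands per month


def always_on_idle_ops(ops_per_minute: int = 30) -> int:
    """Idle cost of WORKER_INPROCESS=1: workers poll continuously."""
    return ops_per_minute * MINUTES_PER_MONTH


def cron_idle_ops(interval_minutes: int, ops_per_invocation: int = 20) -> int:
    """Idle cost of WORKER_INPROCESS=0: one drain call per cron tick."""
    ticks = MINUTES_PER_MONTH // interval_minutes
    return ticks * ops_per_invocation


def recommend_call_budget(interval_minutes: int, ops_per_call: int) -> int:
    """Calls per month that fit in the free tier after idle drain cost."""
    return (FREE_TIER_OPS - cron_idle_ops(interval_minutes)) // ops_per_call


if __name__ == "__main__":
    print("always-on idle:", always_on_idle_ops())
    for interval in (5, 15, 30):
        print(f"{interval}-min cron idle:", cron_idle_ops(interval))
    # At 5-8 ops per call, the 5-min cron leaves room for roughly 41K-65K calls.
    print("call budget at 8 ops/call:", recommend_call_budget(5, 8))
    print("call budget at 5 ops/call:", recommend_call_budget(5, 5))
```

The ~50K calls/month figure sits inside that range; where it lands depends on how many of the 5–8 ops each /recommend call actually spends.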
3. Setup checklist (new deployment)
If you’re standing up a new App Runner service or migrating an existing one to cron-driven mode:

3.1 — Set the env vars on App Runner
Why the snapshot-merge dance: update-service replaces the entire runtime-environment-variables object on the service, so passing only the new vars would silently delete DATABASE_URL, REDIS_URL, and all your secrets. Always read-modify-write.
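The read-modify-write step comes down to a dictionary merge. The following is a minimal sketch, assuming you have already read the service’s current variables into a dict (for example from a describe-service snapshot); the helper name and the placeholder values are hypothetical.

```python
from typing import Dict, Mapping


def merge_env(current: Mapping[str, str], updates: Mapping[str, str]) -> Dict[str, str]:
    """Return the full variable set to send back: every existing key, with updates applied."""
    merged = dict(current)   # start from the snapshot so nothing is dropped
    merged.update(updates)   # overwrite or add only the keys being changed
    return merged


# Placeholder snapshot standing in for the service's current variables.
snapshot = {
    "DATABASE_URL": "postgres://placeholder",
    "REDIS_URL": "rediss://placeholder",
    "WORKER_INPROCESS": "1",
}
new_env = merge_env(
    snapshot,
    {"WORKER_INPROCESS": "0", "DRAIN_QUEUES_TOKEN": "placeholder-token"},
)

# Guard before writing back: no key from the snapshot may disappear.
missing = set(snapshot) - set(new_env)
if missing:
    raise RuntimeError(f"merge would delete: {sorted(missing)}")
```

Checking for missing keys before calling update-service is a cheap guard against the silent-deletion failure described above.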
3.2 — Schedule the drain endpoint
Pick one scheduler. All work; pick by ergonomics + cost.

Option A — cron-job.org (recommended, free, zero AWS resources)
- Sign in at https://cron-job.org.
- Create cronjob:
  - Title: Kaireon drain queues
  - URL: https://<your-domain>/api/v1/cron/drain-queues
  - Schedule: Every 5 minutes
  - Save responses: ✅ on
- Advanced tab:
  - Method: POST
  - Headers: X-Cron-Token: <DRAIN_QUEUES_TOKEN-value>
  - Timeout: 60 seconds
- Save, then confirm the first execution returns 200 OK.
Option B — AWS EventBridge (cleanest if all-in on AWS)
Option C — GitHub Actions cron (simplest if your repo is public)
.github/workflows/drain-queues.yml:
Cost note: free for public repos. Private repos are billed per runner minute: ~$0.008/min × (8,640 ticks/month minus 2,000 free min), which comes to roughly $53/month if each tick bills as one minute. Use cron-job.org or EventBridge for private repos.
Option D — UptimeRobot or other uptime-monitor
Same shape as Option A. Most uptime monitors support custom HTTP headers on free tiers.

3.3 — Verify
4. Token rotation procedure
Periodic rotation reduces the blast radius of a leaked token. The drain endpoint accepts (in priority order): DRAIN_QUEUES_TOKEN → CRON_SECRET → CRON_TOKEN. The fallback chain lets you rotate without downtime.
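One plausible reading of the priority order is “the first of these variables that is set is the token the endpoint expects.” The sketch below illustrates that reading only; the function names are hypothetical, and the real endpoint may resolve the chain differently (for example, by accepting any configured token).

```python
import hmac
from typing import Mapping, Optional

# Priority order as listed in this runbook.
TOKEN_ENV_PRIORITY = ("DRAIN_QUEUES_TOKEN", "CRON_SECRET", "CRON_TOKEN")


def expected_token(env: Mapping[str, str]) -> Optional[str]:
    """Return the first non-empty token in priority order, or None if none is set."""
    for name in TOKEN_ENV_PRIORITY:
        value = env.get(name, "")
        if value:
            return value
    return None


def is_authorized(presented: str, env: Mapping[str, str]) -> bool:
    """Compare the presented header value against the expected token in constant time."""
    expected = expected_token(env)
    if expected is None:
        return False  # fail closed when no token is configured
    return hmac.compare_digest(presented.encode(), expected.encode())
```

Under this reading, unsetting DRAIN_QUEUES_TOKEN falls back to CRON_SECRET, which is how the chain could keep the cron working mid-rotation.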
5. Token blast radius (reference)
What an attacker with DRAIN_QUEUES_TOKEN can do:

| Allowed | Blocked |
|---|---|
| Hit /api/v1/cron/drain-queues repeatedly | All other /api/v1/cron/* endpoints (need CRON_SECRET) |
| Trigger queue processing for already-enqueued jobs | Enqueue new jobs |
| Cause cost amplification (compute via repeated drains, rate-limited at 12/min/IP) | Read tenant data, /recommend, customer entities |
| | Modify roles, tenants, API keys |
| | Touch AWS / ECR / RDS / App Runner config |
| | Drop tables, delete the app |
6. Common operations
6.1 — Switch between worker modes
Don’t combine WORKER_INPROCESS=1 on the API with a separate kaireon-worker container running WORKER_INPROCESS=0 (otherwise both consume the same queues and you double-count metrics + risk job ordering issues).
6.2 — Trigger a one-shot drain manually
maxDurationMs caps each invocation’s wall-clock time: 1 min by default, up to a hard cap of 10 min. maxConcurrentQueues=2 (default) means at most 2 BullMQ workers run inside one invocation.
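A manual drain is a single authenticated POST. The sketch below builds that request with Python’s standard library and clamps the duration to the 10-min hard cap before sending. It assumes the two parameters travel as a JSON body, which this runbook does not confirm (they may be query parameters instead); the host and token are placeholders and the helper name is hypothetical.

```python
import json
import urllib.request

DEFAULT_MAX_DURATION_MS = 60_000   # 1 min default
HARD_CAP_MS = 600_000              # 10 min hard cap


def build_drain_request(
    base_url: str,
    token: str,
    max_duration_ms: int = DEFAULT_MAX_DURATION_MS,
    max_concurrent_queues: int = 2,
) -> urllib.request.Request:
    """Build (but do not send) a POST to the drain endpoint."""
    capped = min(max_duration_ms, HARD_CAP_MS)  # never ask for more than the cap
    # Assumption: parameters are sent as a JSON body.
    body = json.dumps(
        {"maxDurationMs": capped, "maxConcurrentQueues": max_concurrent_queues}
    ).encode()
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/cron/drain-queues",
        data=body,
        method="POST",
        headers={"X-Cron-Token": token, "Content-Type": "application/json"},
    )


# Placeholder host and token; a 15-min request is clamped to 10 min.
req = build_drain_request(
    "https://example.invalid", "placeholder-token", max_duration_ms=900_000
)
```

To send it, pass the request to `urllib.request.urlopen(req, timeout=...)` with a timeout a little longer than the requested duration, so the client does not give up before the drain finishes.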
6.3 — Move to a new Upstash database
Caveat: switching Redis instances abandons any in-flight queued jobs in the old database. For playground / dev environments this is fine; for production, wait for queues to drain (drain-queues returns totalProcessed: 0 for several consecutive ticks) before switching.
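The “several consecutive ticks” check can be made mechanical. This minimal sketch assumes you have collected the parsed JSON responses from recent drain calls, oldest first; the threshold of 3 is an arbitrary stand-in for “several,” and the helper name is hypothetical.

```python
from typing import Mapping, Sequence


def queues_look_drained(
    tick_responses: Sequence[Mapping[str, object]],
    required_consecutive: int = 3,
) -> bool:
    """True if the most recent N drain responses all report totalProcessed == 0."""
    if len(tick_responses) < required_consecutive:
        return False  # not enough evidence yet
    recent = tick_responses[-required_consecutive:]
    # A missing totalProcessed field counts as "not drained" (fail safe).
    return all(r.get("totalProcessed") == 0 for r in recent)
```

Only the tail of the history matters: a single non-zero tick resets the wait, which matches the intent of not switching while jobs are still trickling through.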
7. Deploy flow (this app)
Standard deploy path for code or env-var changes:

| Trigger | Command | What it does |
|---|---|---|
| Code change | bash tools/scripts/build-and-deploy.sh from repo root | Build Docker → push to ECR → trigger App Runner deployment |
| Env-var change only | aws apprunner update-service --cli-input-json file:///tmp/sc-new.json | Restart container with new env vars (no image rebuild) |
| Both | Run the deploy script first, then env-var update | Image rolls out, then env vars apply on the new image |
update-service replaces the entire runtime-environment-variables block on the service.
Each deploy takes ~4-5 minutes end-to-end:
- Docker build (~2 min)
- ECR push (~30 sec for incremental layers)
- App Runner roll-out (~2 min — container restart + health check)
8. Incident: “Upstash quota exhausted”
Symptoms: every Redis-backed feature returns ERR max requests limit exceeded (rate limit, cache, BullMQ enqueue). Email from Upstash announcing free-tier limit reached.
Diagnosis
Resolution paths
| Option | Effort | Cost |
|---|---|---|
| WORKER_INPROCESS=0 + cron drain (§3) | 30 min | $0 |
| New Upstash + same fix (cleaner reset) | 45 min | $0 |
| Upgrade Upstash to pay-as-you-go | 1 min | ~$0.20/100K ops |
| Move to a different Redis (Fly free, ElastiCache, self-hosted) | 1-2 hours | varies |
9. Dependency-vulnerability triage process
When GitHub Dependabot opens alerts on the repo:

@xmldom/xmldom, 2 medium postcss/fast-xml-parser, 1 low @tootallnate/once) plus a bonus axios HIGH via npm overrides; see commit 86eff3f for the exact change.