Audience: On-call engineers, SREs, platform operators
First written: 2026-05-06 (after a free-tier Upstash quota exhaustion incident)
Related: /api/v1/cron/drain-queues reference, Env vars, Incident response
This runbook captures everything you need to operate KaireonAI’s BullMQ worker queues on free-tier or low-budget Redis. It exists because, on 2026-05-06, the playground hit Upstash’s 500K-commands/month limit purely from idle worker polling — zero queued jobs, ~1.3M ops/month wasted on BRPOPLPUSH polls. The fix took ~30 minutes; the patterns below prevent it from happening again.
1. The two worker modes
KaireonAI ships with two execution modes:
| Mode | Toggle | Behavior | Job latency | Idle Redis cost |
|---|---|---|---|---|
| Always-on | WORKER_INPROCESS=1 (default) | Five BullMQ workers run continuously inside the API container, each long-polling Redis with BRPOPLPUSH. | ≈ 0 (jobs picked up the moment they’re enqueued) | ~30 ops/min permanently |
| Cron-driven drain | WORKER_INPROCESS=0 | No always-on workers. A scheduled cron hits POST /api/v1/cron/drain-queues every few minutes; each invocation connects, processes available jobs, disconnects. | ≤ cron interval (5 min default) | ~20 ops per invocation |
For the queue mix on a typical deployment (batch-jobs, dsar-jobs, journey-jobs, retrain-jobs, seed-jobs), a 5-min cron sustains batch + DSAR + retrain workloads at low latency while burning ~173K ops/month idle, comfortably under Upstash's 500K free tier. A 30-min cron drops to ~29K ops/month idle but adds up to 30 min of latency for batch / DSAR / retrain jobs (still fine; these aren't real-time).
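The idle figures fall out of simple arithmetic; a quick sanity check, assuming the ~20 ops per drain invocation from the table above:
# Idle Redis cost of cron-driven mode per 30-day month.
for interval in 5 15 30; do
  ticks=$(( 60 / interval * 24 * 30 ))   # drain invocations per month
  echo "${interval}-min cron: $(( ticks * 20 )) ops/month (${ticks} ticks)"
done
# 5-min → 172,800; 15-min → 57,600; 30-min → 28,800 (the §2 table rounds these).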
When to pick which
| Workload | Recommended |
|---|---|
| Free-tier Redis (Upstash 500K, Fly Redis, etc.) | WORKER_INPROCESS=0 + 5-min cron |
| Paid Redis + active batch/journey traffic | WORKER_INPROCESS=1 |
| Mixed (paid Redis, but workers run elsewhere) | WORKER_INPROCESS=0 on API + dedicated kaireon-worker container |
2. Upstash quota math (at a glance)
Free-tier Upstash gives 500,000 Redis commands per month. Here’s where the budget goes:
| Source | Ops / month |
|---|---|
| 5 BullMQ workers polling idle (WORKER_INPROCESS=1) | ~1,300,000 |
| WORKER_INPROCESS=0 + 5-min drain cron, idle queues | ~173,000 |
| WORKER_INPROCESS=0 + 15-min drain cron, idle queues | ~58,000 |
| WORKER_INPROCESS=0 + 30-min drain cron, idle queues | ~29,000 |
| Per /recommend call with rate-limit + flow cache hit | ~5–8 |
| Real /recommend traffic (e.g., 500 calls/day) | ~75,000–120,000 |
Rule of thumb: with WORKER_INPROCESS=0 + 5-min cron, you can sustain ~50K /recommend calls/month on free-tier Upstash before any rate-limit/cache cost becomes the binding factor. Above that, upgrade Redis.
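To see where the ~50K figure comes from, subtract the idle burn from the quota and divide by per-call cost; a sketch using the table's numbers:
# Free-tier headroom after the 5-min cron's idle burn, at the table's
# worst-case ~8 ops per /recommend call.
QUOTA=500000; IDLE=173000; OPS_PER_CALL=8
echo "calls/month budget: $(( (QUOTA - IDLE) / OPS_PER_CALL ))"
# → 40875 at 8 ops/call, 65400 at 5 ops/call; hence the ~50K rule of thumb.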
3. Setup checklist (new deployment)
If you’re standing up a new App Runner service or migrating an existing one to cron-driven mode:
3.1 — Set the env vars on App Runner
ARN="arn:aws:apprunner:us-east-1:422500312304:service/kaireon-playground/<service-id>"
# 1. Snapshot the current SourceConfiguration so we don't drop unrelated env vars.
aws apprunner describe-service --service-arn "$ARN" --region us-east-1 \
--query 'Service.SourceConfiguration' --output json > /tmp/sc.json
# 2. Inject the new env vars (preserve existing ones).
node -e "
const fs = require('fs');
const sc = JSON.parse(fs.readFileSync('/tmp/sc.json'));
const cur = sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables || {};
cur.WORKER_INPROCESS = '0';
cur.DRAIN_QUEUES_TOKEN = require('crypto').randomBytes(32).toString('hex');
cur.DRAIN_QUEUES_RATE_LIMIT = '12';
sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables = cur;
fs.writeFileSync('/tmp/sc-new.json', JSON.stringify({SourceConfiguration: sc}, null, 2));
console.log('DRAIN_QUEUES_TOKEN=' + cur.DRAIN_QUEUES_TOKEN);
"
# 3. Apply. App Runner restarts the container with the new vars.
aws apprunner update-service --service-arn "$ARN" --region us-east-1 \
--cli-input-json file:///tmp/sc-new.json \
--query 'OperationId' --output text
Why the snapshot-merge dance: update-service replaces the entire RuntimeEnvironmentVariables object, so passing only the new vars would silently delete DATABASE_URL, REDIS_URL, and all your other secrets. Always read-modify-write.
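If you touch env vars often, the pattern is worth wrapping; a convenience sketch (apprunner_set_env is our own helper, not an AWS CLI command), assuming the same ImageRepository layout as above:
# Hypothetical helper: read-modify-write a single App Runner env var.
apprunner_set_env() {  # usage: apprunner_set_env KEY VALUE
  local key="$1" value="$2"
  aws apprunner describe-service --service-arn "$ARN" --region us-east-1 \
    --query 'Service.SourceConfiguration' --output json > /tmp/sc.json
  KEY="$key" VALUE="$value" node -e "
    const fs = require('fs');
    const sc = JSON.parse(fs.readFileSync('/tmp/sc.json'));
    const cur = sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables || {};
    cur[process.env.KEY] = process.env.VALUE;
    sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables = cur;
    fs.writeFileSync('/tmp/sc-new.json', JSON.stringify({SourceConfiguration: sc}, null, 2));
  "
  aws apprunner update-service --service-arn "$ARN" --region us-east-1 \
    --cli-input-json file:///tmp/sc-new.json --query 'OperationId' --output text
}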
3.2 — Schedule the drain endpoint
Pick one scheduler. All work; pick by ergonomics + cost.
Option A — cron-job.org (recommended, free, zero AWS resources)
- Sign in at https://cron-job.org.
- Create cronjob:
  - Title: Kaireon drain queues
  - URL: https://<your-domain>/api/v1/cron/drain-queues
  - Schedule: Every 5 minutes
  - Save responses: ✅ on
- Advanced tab:
  - Method: POST
  - Headers: X-Cron-Token: <DRAIN_QUEUES_TOKEN-value>
  - Timeout: 60 seconds
First execution fires within 5 min. Verify in cron-job.org’s “Execution history” tab that you see 200 OK.
Option B — AWS EventBridge (cleanest if all-in on AWS)
# EventBridge rules can't POST to an App Runner URL as a plain target, so either
# relay through a small Lambda, or use an EventBridge API destination, the one
# rule target that can call an arbitrary HTTPS endpoint with a custom auth
# header (sketch below).
Cost: ~$0.002/month at 5-min cadence (well under EventBridge’s 14M-invocation free tier).
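If you go the API-destination route, a hedged sketch of the wiring (resource names like kaireon-drain are illustrative, and the RoleArn must allow events:InvokeApiDestination on the destination):
# Connection holds the auth header; the API destination holds the URL + method.
aws events create-connection --name kaireon-drain --region us-east-1 \
  --authorization-type API_KEY \
  --auth-parameters 'ApiKeyAuthParameters={ApiKeyName=X-Cron-Token,ApiKeyValue='"$DRAIN_QUEUES_TOKEN"'}'
aws events create-api-destination --name kaireon-drain --region us-east-1 \
  --connection-arn <connection-arn-from-above> \
  --invocation-endpoint "https://<your-domain>/api/v1/cron/drain-queues" \
  --http-method POST --invocation-rate-limit-per-second 1
aws events put-rule --name kaireon-drain-5min --region us-east-1 \
  --schedule-expression "rate(5 minutes)"
aws events put-targets --rule kaireon-drain-5min --region us-east-1 \
  --targets 'Id=drain,Arn=<api-destination-arn>,RoleArn=<events-invoke-role-arn>'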
Option C — GitHub Actions cron (simplest if your repo is public)
.github/workflows/drain-queues.yml:
name: Drain Queues
on:
  schedule:
    - cron: "*/5 * * * *"
  workflow_dispatch:
jobs:
  drain:
    runs-on: ubuntu-latest
    steps:
      - run: |
          curl -fsS -X POST "${{ secrets.PLAYGROUND_URL }}/api/v1/cron/drain-queues" \
            -H "X-Cron-Token: ${{ secrets.DRAIN_QUEUES_TOKEN }}"
Cost note: free for public repos. Private repos incur ~$53/month at a 5-min cadence (each tick rounds up to 1 billable minute: $0.008/min × 8,640 ticks/month, minus the 2,000 free minutes). Use cron-job.org or EventBridge for private repos.
Option D — UptimeRobot or other uptime-monitor
Same shape as Option A. Most uptime monitors support custom HTTP headers on free tiers.
3.3 — Verify
# 1. Endpoint healthy?
curl -fsS -X POST "https://<your-domain>/api/v1/cron/drain-queues" \
-H "X-Cron-Token: $DRAIN_QUEUES_TOKEN" | jq
# Expect: { "ok": true, "totalProcessed": 0, "queues": { ... all idleAt: 0 ... } }
# 2. Cron actually firing?
LG="/aws/apprunner/kaireon-playground/<service-id>/application"
START=$(($(date +%s) * 1000 - 1200000)) # 20 min ago
aws logs filter-log-events --region us-east-1 --log-group-name "$LG" \
--start-time $START --filter-pattern '"drain-queues complete"' \
--query 'length(events)'
# Expect: ≥ 4 events in the last 20 min for a 5-min cadence.
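To get paged when the cron silently stops firing, one option is a CloudWatch metric filter plus alarm on the same log pattern; a sketch assuming you have an SNS topic wired for alerts (the filter and alarm names are our own):
# Count "drain-queues complete" lines as a custom metric.
aws logs put-metric-filter --region us-east-1 --log-group-name "$LG" \
  --filter-name drain-ticks --filter-pattern '"drain-queues complete"' \
  --metric-transformations \
  metricName=DrainTicks,metricNamespace=Kaireon,metricValue=1,defaultValue=0
# Alarm if fewer than 4 ticks land in 30 min (expected: 6 at a 5-min cadence).
aws cloudwatch put-metric-alarm --region us-east-1 --alarm-name kaireon-drain-stalled \
  --namespace Kaireon --metric-name DrainTicks --statistic Sum \
  --period 1800 --evaluation-periods 1 --threshold 4 \
  --comparison-operator LessThanThreshold --treat-missing-data breaching \
  --alarm-actions <sns-topic-arn>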
4. Token rotation procedure
Periodic rotation reduces the blast radius of a leaked token. The drain endpoint accepts (in priority order): DRAIN_QUEUES_TOKEN → CRON_SECRET → CRON_TOKEN. The fallback chain lets you rotate without downtime.
# 1. Generate the new token.
NEW_TOKEN=$(openssl rand -hex 32)
echo "New: $NEW_TOKEN"
# 2. Append it as the active DRAIN_QUEUES_TOKEN; keep the old token as
# CRON_TOKEN (a fallback the endpoint still accepts) for the rotation window.
ARN="arn:aws:apprunner:us-east-1:422500312304:service/kaireon-playground/<service-id>"
aws apprunner describe-service --service-arn "$ARN" --region us-east-1 \
--query 'Service.SourceConfiguration' --output json > /tmp/sc.json
node -e "
const fs = require('fs');
const sc = JSON.parse(fs.readFileSync('/tmp/sc.json'));
const cur = sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables || {};
cur.CRON_TOKEN = cur.DRAIN_QUEUES_TOKEN; // demote old to fallback
cur.DRAIN_QUEUES_TOKEN = '$NEW_TOKEN'; // new active
sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables = cur;
fs.writeFileSync('/tmp/sc-new.json', JSON.stringify({SourceConfiguration: sc}, null, 2));
"
aws apprunner update-service --service-arn "$ARN" --region us-east-1 \
--cli-input-json file:///tmp/sc-new.json --query 'OperationId' --output text
# 3. Wait for restart (~5 min), then update the cron-job.org header
# to use $NEW_TOKEN instead of the old one. Test one tick manually.
# 4. After the next cron-job.org tick succeeds with the new token,
#    re-snapshot the live config (the earlier /tmp/sc.json predates the
#    rotation, so reusing it would revert DRAIN_QUEUES_TOKEN) and remove
#    CRON_TOKEN entirely:
aws apprunner describe-service --service-arn "$ARN" --region us-east-1 \
  --query 'Service.SourceConfiguration' --output json > /tmp/sc.json
node -e "
const fs = require('fs');
const sc = JSON.parse(fs.readFileSync('/tmp/sc.json'));
const cur = sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables || {};
delete cur.CRON_TOKEN;
sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables = cur;
fs.writeFileSync('/tmp/sc-final.json', JSON.stringify({SourceConfiguration: sc}, null, 2));
"
aws apprunner update-service --service-arn "$ARN" --region us-east-1 \
--cli-input-json file:///tmp/sc-final.json --query 'OperationId' --output text
The fallback chain means there’s never a window where the cron is rejected by the endpoint.
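A quick way to prove the overlap window is healthy, assuming OLD_TOKEN still holds the pre-rotation value:
# Both tokens should be accepted during the rotation window: the new one via
# DRAIN_QUEUES_TOKEN, the old one via the CRON_TOKEN fallback.
for t in "$NEW_TOKEN" "$OLD_TOKEN"; do
  curl -s -o /dev/null -w "%{http_code}\n" -X POST \
    "https://<your-domain>/api/v1/cron/drain-queues" -H "X-Cron-Token: $t"
done
# Expect: two lines of 200.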
5. Token blast radius (reference)
What an attacker with DRAIN_QUEUES_TOKEN can do:
| Allowed | Blocked |
|---|---|
| Hit /api/v1/cron/drain-queues repeatedly | All other /api/v1/cron/* endpoints (need CRON_SECRET) |
| Trigger queue processing for already-enqueued jobs | Enqueue new jobs |
| Cause cost amplification (compute via repeated drains, rate-limited at 12/min/IP) | Read tenant data, /recommend, customer entities |
| | Modify roles, tenants, API keys |
| | Touch AWS / ECR / RDS / App Runner config |
| | Drop tables, delete the app |
Realistic worst case: cost amplification on compute. Rate-limit blocks abuse beyond 12 req/min per IP. Even with a leaked token, no user data is exposed and the app cannot be destroyed.
If a leak is suspected: rotate (above) and the leaked token becomes invalid on the next App Runner restart (~5 min).
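To confirm the limiter itself is live, hammer the endpoint past the budget; a sketch assuming the limiter answers HTTP 429 once 12 req/min/IP is exceeded (check your build if it responds differently):
# Fire 15 rapid drains; the tail of the output should flip from 200 to 429.
for i in $(seq 15); do
  curl -s -o /dev/null -w "%{http_code} " -X POST \
    "https://<your-domain>/api/v1/cron/drain-queues" \
    -H "X-Cron-Token: $DRAIN_QUEUES_TOKEN"
done; echo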
6. Common operations
6.1 — Switch between worker modes
# Always-on → cron-driven
aws apprunner describe-service --service-arn "$ARN" --region us-east-1 \
--query 'Service.SourceConfiguration' --output json > /tmp/sc.json
node -e "
const fs = require('fs');
const sc = JSON.parse(fs.readFileSync('/tmp/sc.json'));
sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables.WORKER_INPROCESS = '0';
fs.writeFileSync('/tmp/sc-new.json', JSON.stringify({SourceConfiguration: sc}, null, 2));
"
aws apprunner update-service --service-arn "$ARN" --region us-east-1 \
--cli-input-json file:///tmp/sc-new.json
Reverse the value to go cron-driven → always-on. If you pair the API with a dedicated kaireon-worker container, keep WORKER_INPROCESS=0 on the API (otherwise both consume the same queues, metrics get double-counted, and you risk job-ordering issues).
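To confirm the in-process workers are really gone after a switch, one option is to count blocked polling clients on Redis; a sketch assuming redis-cli is built with TLS support and that your BullMQ version polls with BRPOPLPUSH as described in §1:
# Idle cron-driven deployments should show zero blocked pollers between ticks.
redis-cli -u "$REDIS_URL" client list | grep -c 'cmd=brpoplpush'
# Expect: 0 after the switch; ~5 in always-on mode (one per queue).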
6.2 — Trigger a one-shot drain manually
curl -fsS -X POST "https://<your-domain>/api/v1/cron/drain-queues?maxDurationMs=60000" \
-H "X-Cron-Token: $DRAIN_QUEUES_TOKEN" | jq
maxDurationMs caps the invocation's wall-clock time: 60,000 ms (1 min) by default, up to a hard cap of 600,000 ms (10 min). maxConcurrentQueues (default 2) caps how many BullMQ workers run inside one invocation.
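For a longer manual drain you can raise both knobs; a sketch that also trims the response to the interesting fields (the jq paths assume the response shape shown in §3.3):
curl -fsS -X POST \
  "https://<your-domain>/api/v1/cron/drain-queues?maxDurationMs=600000&maxConcurrentQueues=5" \
  -H "X-Cron-Token: $DRAIN_QUEUES_TOKEN" \
  | jq '{ok, totalProcessed, queues: (.queues | keys)}'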
6.3 — Move to a new Upstash database
# 1. Provision the new database in Upstash console; grab the standard
# Redis URL (rediss://default:<token>@<host>:6379), NOT the REST URL.
NEW_REDIS_URL='rediss://default:...@<new-host>.upstash.io:6379'
# 2. Smoke-test the URL first — bake into a local probe so we don't
# discover a typo only after App Runner restarts.
node -e "
const Redis = require('ioredis');
const r = new Redis('$NEW_REDIS_URL', { lazyConnect: true, connectTimeout: 5000, maxRetriesPerRequest: 1 });
(async () => {
await r.connect();
console.log('PING:', await r.ping());
await r.quit();
})();
"
# Expect: PING: PONG
# 3. Update App Runner.
aws apprunner describe-service --service-arn "$ARN" --region us-east-1 \
--query 'Service.SourceConfiguration' --output json > /tmp/sc.json
node -e "
const fs = require('fs');
const sc = JSON.parse(fs.readFileSync('/tmp/sc.json'));
sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables.REDIS_URL = '$NEW_REDIS_URL';
fs.writeFileSync('/tmp/sc-new.json', JSON.stringify({SourceConfiguration: sc}, null, 2));
"
aws apprunner update-service --service-arn "$ARN" --region us-east-1 \
--cli-input-json file:///tmp/sc-new.json --query 'OperationId' --output text
Caveat: switching Redis instances abandons any in-flight queued jobs in the old database. For playground / dev environments this is fine; for production, wait for queues to drain (drain-queues returns totalProcessed: 0 for several consecutive ticks) before switching.
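For the production path, a small wait loop makes "queues are drained" concrete; a sketch requiring three consecutive empty ticks (the threshold and 60 s spacing are our choices; the spacing also stays under the 12/min rate limit):
# Block until drain-queues reports totalProcessed=0 three times in a row.
empty=0
while [ "$empty" -lt 3 ]; do
  n=$(curl -fsS -X POST "https://<your-domain>/api/v1/cron/drain-queues" \
    -H "X-Cron-Token: $DRAIN_QUEUES_TOKEN" | jq .totalProcessed)
  if [ "$n" -eq 0 ]; then empty=$((empty + 1)); else empty=0; fi
  echo "processed=$n (consecutive empty ticks: $empty)"
  sleep 60
done
echo "Queues drained; safe to switch REDIS_URL."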
7. Deploy flow (this app)
Standard deploy path for code or env-var changes:
| Trigger | Command | What it does |
|---|---|---|
| Code change | bash tools/scripts/build-and-deploy.sh from repo root | Build Docker → push to ECR → trigger App Runner deployment |
| Env-var change only | aws apprunner update-service --cli-input-json file:///tmp/sc-new.json | Restart container with new env vars (no image rebuild) |
| Both | Run the deploy script first, then env-var update | Image rolls out, then env vars apply on the new image |
Always read-modify-write env vars via the snapshot pattern above — update-service replaces the entire RuntimeEnvironmentVariables object.
Each deploy takes ~4-5 minutes end-to-end:
- Docker build (~2 min)
- ECR push (~30 sec for incremental layers)
- App Runner roll-out (~2 min — container restart + health check)
Verify with:
aws apprunner list-operations --service-arn "$ARN" --region us-east-1 \
--max-results 1 --query 'OperationSummaryList[0].[Id,Status,EndedAt]' --output text
# Expect: <id> SUCCEEDED <timestamp>
curl -sm 8 https://<your-domain>/api/health | jq .uptime
# Expect: uptime < 300 (i.e., container restarted within last 5 min)
8. Incident: “Upstash quota exhausted”
Symptoms: every Redis-backed feature returns ERR max requests limit exceeded (rate limit, cache, BullMQ enqueue). Email from Upstash announcing free-tier limit reached.
Diagnosis
# Confirm via Upstash REST API:
curl -sm 8 -H "Authorization: Bearer <UPSTASH_REDIS_REST_TOKEN>" \
"https://<host>.upstash.io/PING"
# If quota exhausted: { "error": "ERR max requests limit exceeded. Limit: 500000, Usage: 500000" }
# Check current /recommend traffic to size the burn:
psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM decision_traces WHERE \"createdAt\" > NOW() - INTERVAL '30 days';"
If real /recommend traffic × ~5-8 ops/call < 500K, the burn is from idle worker polling — apply the fix in §3.
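Putting numbers on that check, a sketch reusing the psql count (8 ops/call is the §2 worst case):
CALLS=$(psql "$DATABASE_URL" -t -A -c \
  "SELECT COUNT(*) FROM decision_traces WHERE \"createdAt\" > NOW() - INTERVAL '30 days';")
echo "traffic burn (worst case): $(( CALLS * 8 )) ops/month vs 500,000 quota"
# A big gap between this number and actual Upstash usage = idle polling burn.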
Resolution paths
| Option | Effort | Cost |
|---|---|---|
| WORKER_INPROCESS=0 + cron drain (§3) | 30 min | $0 |
| New Upstash + same fix (cleaner reset) | 45 min | $0 |
| Upgrade Upstash to pay-as-you-go | 1 min | ~$0.20/100K ops |
| Move to a different Redis (Fly free, ElastiCache, self-hosted) | 1-2 hours | varies |
The first option is recommended for free-tier deployments; it removes the burn at its source. The second is the “panic-restart” version when you also want a clean Redis with no abandoned queue state.
9. Dependency-vulnerability triage process
When GitHub Dependabot opens alerts on the repo:
# 1. List open alerts.
gh api repos/<org>/<repo>/dependabot/alerts --paginate \
-q '.[] | select(.state=="open") | "\(.security_advisory.severity) [\(.dependency.package.ecosystem)] \(.dependency.package.name) — \(.security_advisory.summary) (alert #\(.number))"'
# 2. For each, check the safe version + ancestry.
npm view <pkg> version
npm ls <pkg>
# 3. Add an override in package.json (preferred when the vuln is transitive
# and the parent hasn't released a fix yet):
# "overrides": { "<pkg>": "^<safe-version>" }
# OR bump the direct dep if you control it.
# 4. Reinstall + re-audit.
npm install
npm audit
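If you'd rather not hand-edit package.json, the override can be added with the same node -e pattern used elsewhere in this runbook; <pkg> and <safe-version> are the placeholders from step 3:
node -e "
const fs = require('fs');
const pkg = JSON.parse(fs.readFileSync('package.json'));
pkg.overrides = { ...pkg.overrides, '<pkg>': '^<safe-version>' };
fs.writeFileSync('package.json', JSON.stringify(pkg, null, 2) + '\n');
"
npm install && npm audit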
The pattern shipped on 2026-05-06 closed 7 Dependabot alerts (4 high @xmldom/xmldom, 2 medium postcss/fast-xml-parser, 1 low @tootallnate/once) plus a bonus axios HIGH via npm overrides — see commit 86eff3f for the exact change.