
Audience: On-call engineers, SREs, platform operators
First written: 2026-05-06 (after a free-tier Upstash quota exhaustion incident)
Related: /api/v1/cron/drain-queues reference · Env vars · Incident response
This runbook captures everything you need to operate KaireonAI’s BullMQ worker queues on free-tier or low-budget Redis. It exists because, on 2026-05-06, the playground hit Upstash’s 500K-commands/month limit purely from idle worker polling — zero queued jobs, ~1.3M ops/month wasted on BRPOPLPUSH polls. The fix took ~30 minutes; the patterns below prevent it from happening again.

1. The two worker modes

KaireonAI ships with two execution modes:
| Mode | Toggle | Behavior | Job latency | Idle Redis cost |
| --- | --- | --- | --- | --- |
| Always-on | WORKER_INPROCESS=1 (default) | Five BullMQ workers run continuously inside the API container, each long-polling Redis with the BRPOPLPUSH command. | ≈ 0 (jobs picked up the moment they're enqueued) | ~30 ops/min, permanently |
| Cron-driven drain | WORKER_INPROCESS=0 | No always-on workers. A scheduled cron hits POST /api/v1/cron/drain-queues every few minutes; each invocation connects, processes available jobs, disconnects. | ≤ cron interval (5 min default) | ~20 ops per invocation |
For the queue mix on a typical deployment (batch-jobs, dsar-jobs, journey-jobs, retrain-jobs, seed-jobs), a 5-min cron sustains batch + DSAR + retrain workloads at acceptable latency while burning ~170K ops/month idle — comfortably under the Upstash free tier's 500K. A 30-min cron drops to ~29K ops/month idle but adds up to 30 min of latency for batch / DSAR / retrain (still fine; these aren't real-time).
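For intuition, here is the shape of one drain invocation (an illustration of the pattern only, not the actual handler source; drainOnce and the 250 ms poll interval are invented for this sketch):

// drain-sketch.js — illustrative shape of one cron-driven drain invocation
const { Queue, Worker } = require('bullmq');

async function drainOnce(name, connection, maxDurationMs = 60_000) {
  const queue = new Queue(name, { connection });
  let processed = 0;
  // a short-lived worker: spun up for this invocation, torn down at the end
  const worker = new Worker(name, async (job) => {
    // ...the app's real job processors run here...
    processed += 1;
  }, { connection });

  const deadline = Date.now() + maxDurationMs;
  // run until the queue is empty or the time budget is spent
  while (Date.now() < deadline && (await queue.getWaitingCount()) > 0) {
    await new Promise((resolve) => setTimeout(resolve, 250));
  }
  await worker.close();   // waits for in-flight jobs, then disconnects
  await queue.close();
  return processed;
}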

When to pick which

| Workload | Recommended |
| --- | --- |
| Free-tier Redis (Upstash 500K, Fly Redis, etc.) | WORKER_INPROCESS=0 + 5-min cron |
| Paid Redis + active batch/journey traffic | WORKER_INPROCESS=1 |
| Mixed (paid Redis, but workers run elsewhere) | WORKER_INPROCESS=0 on the API + dedicated kaireon-worker container |

2. Upstash quota math (at a glance)

Free-tier Upstash gives 500,000 Redis commands per month. Here’s where the budget goes:
| Source | Ops / month |
| --- | --- |
| 5 BullMQ workers polling idle (WORKER_INPROCESS=1) | ~1,300,000 |
| WORKER_INPROCESS=0 + 5-min drain cron, idle queues | ~173,000 |
| WORKER_INPROCESS=0 + 15-min drain cron, idle queues | ~58,000 |
| WORKER_INPROCESS=0 + 30-min drain cron, idle queues | ~29,000 |
| Per /recommend call with rate-limit + flow cache hit | ~5–8 |
| Real /recommend traffic (e.g., 500 calls/day) | ~75,000–120,000 |
Rule of thumb: with WORKER_INPROCESS=0 + 5-min cron, you can sustain ~50K /recommend calls/month on free-tier Upstash before any rate-limit/cache cost becomes the binding factor. Above that, upgrade Redis.
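The idle figures above fall straight out of the per-mode costs in §1 (~30 ops/min for always-on polling, ~20 ops per drain invocation). A quick sanity check:

node -e "
const MIN_PER_MONTH = 60 * 24 * 30;                       // 43,200 minutes
console.log('always-on :', 30 * MIN_PER_MONTH);           // 1,296,000 ≈ 1.3M
for (const interval of [5, 15, 30]) {
  const invocations = MIN_PER_MONTH / interval;
  console.log(interval + '-min cron:', 20 * invocations); // 172,800 / 57,600 / 28,800
}
"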

3. Setup checklist (new deployment)

If you’re standing up a new App Runner service or migrating an existing one to cron-driven mode:

3.1 — Set the env vars on App Runner

ARN="arn:aws:apprunner:us-east-1:422500312304:service/kaireon-playground/<service-id>"

# 1. Snapshot the current SourceConfiguration so we don't drop unrelated env vars.
aws apprunner describe-service --service-arn "$ARN" --region us-east-1 \
  --query 'Service.SourceConfiguration' --output json > /tmp/sc.json

# 2. Inject the new env vars (preserve existing ones).
node -e "
const fs = require('fs');
const sc = JSON.parse(fs.readFileSync('/tmp/sc.json'));
const cur = sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables || {};
cur.WORKER_INPROCESS = '0';
cur.DRAIN_QUEUES_TOKEN = require('crypto').randomBytes(32).toString('hex');
cur.DRAIN_QUEUES_RATE_LIMIT = '12';
sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables = cur;
fs.writeFileSync('/tmp/sc-new.json', JSON.stringify({SourceConfiguration: sc}, null, 2));
console.log('DRAIN_QUEUES_TOKEN=' + cur.DRAIN_QUEUES_TOKEN);
"

# 3. Apply. App Runner restarts the container with the new vars.
aws apprunner update-service --service-arn "$ARN" --region us-east-1 \
  --cli-input-json file:///tmp/sc-new.json \
  --query 'OperationId' --output text
Why the snapshot-merge dance: update-service replaces the entire runtime-environment-variables object on the service, so passing only the new vars would silently delete DATABASE_URL, REDIS_URL, and every other secret. Always read-modify-write.
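After the update completes, a quick sanity check (assuming jq is installed) confirms the merge kept everything:

aws apprunner describe-service --service-arn "$ARN" --region us-east-1 \
  --query 'Service.SourceConfiguration.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables' \
  --output json | jq 'keys'
# Expect: the full key list (DATABASE_URL, REDIS_URL, WORKER_INPROCESS,
# DRAIN_QUEUES_TOKEN, ...), not just the vars you touched.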

3.2 — Schedule the drain endpoint

Pick one scheduler. All of the options below work; choose by ergonomics and cost.

Option A — cron-job.org (free hosted scheduler)
  1. Sign in at https://cron-job.org.
  2. Create cronjob:
    • Title: Kaireon drain queues
    • URL: https://<your-domain>/api/v1/cron/drain-queues
    • Schedule: Every 5 minutes
    • Save responses: ✅ on
  3. Advanced tab:
    • Method: POST
    • Headers: X-Cron-Token: <DRAIN_QUEUES_TOKEN-value>
    • Timeout: 60 seconds
First execution fires within 5 min. Verify in cron-job.org’s “Execution history” tab that you see 200 OK.

Option B — AWS EventBridge (cleanest if all-in on AWS)

# EventBridge → Lambda relay → App Runner (because EventBridge can't POST directly to App Runner URLs).
# Or use EventBridge API Destinations if your App Runner service is behind API Gateway.
Cost: ~$0.002/month at 5-min cadence (well under EventBridge’s 14M-invocation free tier).
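If you take the Lambda-relay route, the relay can be tiny. A sketch, assuming a Node 18+ runtime (global fetch) and two Lambda env vars, PLAYGROUND_URL and DRAIN_QUEUES_TOKEN, named here for illustration:

// lambda relay sketch (env var names are illustrative)
exports.handler = async () => {
  const res = await fetch(`${process.env.PLAYGROUND_URL}/api/v1/cron/drain-queues`, {
    method: 'POST',
    headers: { 'X-Cron-Token': process.env.DRAIN_QUEUES_TOKEN },
  });
  if (!res.ok) throw new Error(`drain-queues returned ${res.status}`);
  return await res.json();
};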

Option C — GitHub Actions cron (simplest if your repo is public)

.github/workflows/drain-queues.yml:
name: Drain Queues
on:
  schedule:
    - cron: "*/5 * * * *"
  workflow_dispatch:
jobs:
  drain:
    runs-on: ubuntu-latest
    steps:
      - run: |
          curl -fsS -X POST "${{ secrets.PLAYGROUND_URL }}/api/v1/cron/drain-queues" \
            -H "X-Cron-Token: ${{ secrets.DRAIN_QUEUES_TOKEN }}"
Cost note: free for public repos. Private repos incur ~$53/month at a 5-min cadence (each tick is rounded up to 1 billable minute: 8,640 ticks/month × $0.008/min, minus the 2,000 free minutes). Use cron-job.org or EventBridge for private repos.

Option D — UptimeRobot or other uptime-monitor

Same shape as Option A. Most uptime monitors support custom HTTP headers on free tiers.

3.3 — Verify

# 1. Endpoint healthy?
curl -fsS -X POST "https://<your-domain>/api/v1/cron/drain-queues" \
  -H "X-Cron-Token: $DRAIN_QUEUES_TOKEN" | jq

# Expect: { "ok": true, "totalProcessed": 0, "queues": { ... all idleAt: 0 ... } }

# 2. Cron actually firing?
LG="/aws/apprunner/kaireon-playground/<service-id>/application"
START=$(($(date +%s) * 1000 - 1200000))   # 20 min ago
aws logs filter-log-events --region us-east-1 --log-group-name "$LG" \
  --start-time $START --filter-pattern '"drain-queues complete"' \
  --query 'length(events)'
# Expect: ≥ 4 events in the last 20 min for a 5-min cadence.

4. Token rotation procedure

Periodic rotation reduces the blast radius of a leaked token. The drain endpoint accepts tokens in priority order: DRAIN_QUEUES_TOKEN → CRON_SECRET → CRON_TOKEN. The fallback chain lets you rotate without downtime.
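Conceptually, the check amounts to something like this (a sketch, not the actual handler source; the timing-safe comparison is an assumption about the implementation):

// token-check sketch (illustrative only)
const crypto = require('crypto');

function safeEqual(a, b) {
  const ba = Buffer.from(String(a));
  const bb = Buffer.from(String(b));
  return ba.length === bb.length && crypto.timingSafeEqual(ba, bb);
}

function isAuthorized(presentedToken) {
  // priority order: DRAIN_QUEUES_TOKEN → CRON_SECRET → CRON_TOKEN
  return [process.env.DRAIN_QUEUES_TOKEN, process.env.CRON_SECRET, process.env.CRON_TOKEN]
    .filter(Boolean)
    .some((token) => safeEqual(presentedToken, token));
}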
# 1. Generate the new token.
NEW_TOKEN=$(openssl rand -hex 32)
echo "New: $NEW_TOKEN"

# 2. Append it as the active DRAIN_QUEUES_TOKEN; keep the old token as
# CRON_TOKEN (a fallback the endpoint still accepts) for the rotation window.
ARN="arn:aws:apprunner:us-east-1:422500312304:service/kaireon-playground/<service-id>"
aws apprunner describe-service --service-arn "$ARN" --region us-east-1 \
  --query 'Service.SourceConfiguration' --output json > /tmp/sc.json
node -e "
const fs = require('fs');
const sc = JSON.parse(fs.readFileSync('/tmp/sc.json'));
const cur = sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables || {};
cur.CRON_TOKEN = cur.DRAIN_QUEUES_TOKEN;        // demote old to fallback
cur.DRAIN_QUEUES_TOKEN = '$NEW_TOKEN';          // new active
sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables = cur;
fs.writeFileSync('/tmp/sc-new.json', JSON.stringify({SourceConfiguration: sc}, null, 2));
"
aws apprunner update-service --service-arn "$ARN" --region us-east-1 \
  --cli-input-json file:///tmp/sc-new.json --query 'OperationId' --output text

# 3. Wait for restart (~5 min), then update the cron-job.org header
# to use $NEW_TOKEN instead of the old one. Test one tick manually.

# 4. After the next cron-job.org tick succeeds with the new token,
# re-snapshot the live config (the step-2 /tmp/sc.json still holds the
# OLD token, so reusing it would revert the rotation) and remove
# CRON_TOKEN entirely:
aws apprunner describe-service --service-arn "$ARN" --region us-east-1 \
  --query 'Service.SourceConfiguration' --output json > /tmp/sc.json
node -e "
const fs = require('fs');
const sc = JSON.parse(fs.readFileSync('/tmp/sc.json'));
const cur = sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables || {};
delete cur.CRON_TOKEN;
sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables = cur;
fs.writeFileSync('/tmp/sc-final.json', JSON.stringify({SourceConfiguration: sc}, null, 2));
"
aws apprunner update-service --service-arn "$ARN" --region us-east-1 \
  --cli-input-json file:///tmp/sc-final.json --query 'OperationId' --output text
The fallback chain means there’s never a window where the cron is rejected by the endpoint.

5. Token blast radius (reference)

What an attacker with DRAIN_QUEUES_TOKEN can do:
| Allowed | Blocked |
| --- | --- |
| Hit /api/v1/cron/drain-queues repeatedly | All other /api/v1/cron/* endpoints (need CRON_SECRET) |
| Trigger queue processing for already-enqueued jobs | Enqueue new jobs |
| Cause cost amplification (compute via repeated drains, rate-limited at 12/min/IP) | Read tenant data, /recommend, customer entities |
|  | Modify roles, tenants, API keys |
|  | Touch AWS / ECR / RDS / App Runner config |
|  | Drop tables, delete the app |
Realistic worst case: cost amplification on compute. Rate-limit blocks abuse beyond 12 req/min per IP. Even with a leaked token, no user data is exposed and the app cannot be destroyed. If a leak is suspected: rotate (above) and the leaked token becomes invalid on the next App Runner restart (~5 min).

6. Common operations

6.1 — Switch between worker modes

# Always-on → cron-driven
aws apprunner describe-service --service-arn "$ARN" --region us-east-1 \
  --query 'Service.SourceConfiguration' --output json > /tmp/sc.json
node -e "
const fs = require('fs');
const sc = JSON.parse(fs.readFileSync('/tmp/sc.json'));
sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables.WORKER_INPROCESS = '0';
fs.writeFileSync('/tmp/sc-new.json', JSON.stringify({SourceConfiguration: sc}, null, 2));
"
aws apprunner update-service --service-arn "$ARN" --region us-east-1 \
  --cli-input-json file:///tmp/sc-new.json
Reverse the value to go cron-driven → always-on. Never pair WORKER_INPROCESS=1 on the API with a separate kaireon-worker container; when a dedicated worker runs, set =0 on the API (otherwise both consume the same queues, double-counting metrics and risking job-ordering issues).

6.2 — Trigger a one-shot drain manually

curl -fsS -X POST "https://<your-domain>/api/v1/cron/drain-queues?maxDurationMs=60000" \
  -H "X-Cron-Token: $DRAIN_QUEUES_TOKEN" | jq
maxDurationMs caps the invocation's wall-clock time: 1 min by default, 10 min hard cap. maxConcurrentQueues (default 2) caps how many BullMQ workers run inside a single invocation.
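For example, a longer and wider manual drain (parameter values here are illustrative):

curl -fsS -X POST "https://<your-domain>/api/v1/cron/drain-queues?maxDurationMs=300000&maxConcurrentQueues=4" \
  -H "X-Cron-Token: $DRAIN_QUEUES_TOKEN" | jq '.totalProcessed'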

6.3 — Move to a new Upstash database

# 1. Provision the new database in Upstash console; grab the standard
# Redis URL (rediss://default:<token>@<host>:6379), NOT the REST URL.
NEW_REDIS_URL='rediss://default:...@<new-host>.upstash.io:6379'

# 2. Smoke-test the URL first — bake into a local probe so we don't
# discover a typo only after App Runner restarts.
node -e "
const Redis = require('ioredis');
const r = new Redis('$NEW_REDIS_URL', { lazyConnect: true, connectTimeout: 5000, maxRetriesPerRequest: 1 });
(async () => {
  await r.connect();
  console.log('PING:', await r.ping());
  await r.quit();
})();
"
# Expect: PING: PONG

# 3. Update App Runner.
aws apprunner describe-service --service-arn "$ARN" --region us-east-1 \
  --query 'Service.SourceConfiguration' --output json > /tmp/sc.json
node -e "
const fs = require('fs');
const sc = JSON.parse(fs.readFileSync('/tmp/sc.json'));
sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables.REDIS_URL = '$NEW_REDIS_URL';
fs.writeFileSync('/tmp/sc-new.json', JSON.stringify({SourceConfiguration: sc}, null, 2));
"
aws apprunner update-service --service-arn "$ARN" --region us-east-1 \
  --cli-input-json file:///tmp/sc-new.json --query 'OperationId' --output text
Caveat: switching Redis instances abandons any in-flight queued jobs in the old database. For playground / dev environments this is fine; for production, wait for queues to drain (drain-queues returns totalProcessed: 0 for several consecutive ticks) before switching.
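For the production path, a small gate loop works (a sketch; the three-consecutive-empty-ticks threshold is arbitrary):

# wait until 3 consecutive drains report totalProcessed == 0
ZERO=0
while [ "$ZERO" -lt 3 ]; do
  N=$(curl -fsS -X POST "https://<your-domain>/api/v1/cron/drain-queues" \
        -H "X-Cron-Token: $DRAIN_QUEUES_TOKEN" | jq -r '.totalProcessed')
  if [ "$N" = "0" ]; then ZERO=$((ZERO + 1)); else ZERO=0; fi
  sleep 60
done
echo "queues drained; safe to switch REDIS_URL"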

7. Deploy flow (this app)

Standard deploy path for code or env-var changes:
| Trigger | Command | What it does |
| --- | --- | --- |
| Code change | bash tools/scripts/build-and-deploy.sh from repo root | Build Docker → push to ECR → trigger App Runner deployment |
| Env-var change only | aws apprunner update-service --cli-input-json file:///tmp/sc-new.json | Restart container with new env vars (no image rebuild) |
| Both | Run the deploy script first, then the env-var update | Image rolls out, then env vars apply on the new image |
Always read-modify-write env vars via the snapshot pattern above — update-service replaces the entire runtime-environment-variables block on the service. Each deploy takes ~4-5 minutes end-to-end:
  1. Docker build (~2 min)
  2. ECR push (~30 sec for incremental layers)
  3. App Runner roll-out (~2 min — container restart + health check)
Verify with:
aws apprunner list-operations --service-arn "$ARN" --region us-east-1 \
  --max-results 1 --query 'OperationSummaryList[0].[Id,Status,EndedAt]' --output text
# Expect: <id> SUCCEEDED <timestamp>

curl -sm 8 https://<your-domain>/api/health | jq .uptime
# Expect: uptime < 300 (i.e., container restarted within last 5 min)

8. Incident: “Upstash quota exhausted”

Symptoms: every Redis-backed feature (rate limiting, caching, BullMQ enqueue) returns ERR max requests limit exceeded, and Upstash emails you that the free-tier limit was reached.

Diagnosis

# Confirm via Upstash REST API:
curl -sm 8 -H "Authorization: Bearer <UPSTASH_REDIS_REST_TOKEN>" \
  "https://<host>.upstash.io/PING"
# If quota exhausted: { "error": "ERR max requests limit exceeded. Limit: 500000, Usage: 500000" }

# Check current /recommend traffic to size the burn:
psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM decision_traces WHERE \"createdAt\" > NOW() - INTERVAL '30 days';"
If real /recommend traffic × ~5-8 ops/call < 500K, the burn is from idle worker polling — apply the fix in §3.
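To put numbers on it (a sketch; 8 ops/call is the upper bound from §2):

CALLS=$(psql "$DATABASE_URL" -tA -c \
  "SELECT COUNT(*) FROM decision_traces WHERE \"createdAt\" > NOW() - INTERVAL '30 days';")
echo "worst-case traffic burn: $((CALLS * 8)) ops/month (free-tier quota: 500000)"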

Resolution paths

| Option | Effort | Cost |
| --- | --- | --- |
| WORKER_INPROCESS=0 + cron drain (§3) | 30 min | $0 |
| New Upstash database + same fix (cleaner reset) | 45 min | $0 |
| Upgrade Upstash to pay-as-you-go | 1 min | ~$0.20 / 100K ops |
| Move to a different Redis (Fly free tier, ElastiCache, self-hosted) | 1-2 hours | varies |
The first option is recommended for free-tier deployments; it removes the burn at its source. The second is the “panic-restart” version when you also want a clean Redis with no abandoned queue state.

9. Dependency-vulnerability triage process

When GitHub Dependabot opens alerts on the repo:
# 1. List open alerts.
gh api repos/<org>/<repo>/dependabot/alerts --paginate \
  -q '.[] | select(.state=="open") | "\(.security_advisory.severity) [\(.dependency.package.ecosystem)] \(.dependency.package.name) — \(.security_advisory.summary) (alert #\(.number))"'

# 2. For each, check the safe version + ancestry.
npm view <pkg> version
npm ls <pkg>

# 3. Add an override in package.json (preferred when the vuln is transitive
# and the parent hasn't released a fix yet):
#   "overrides": { "<pkg>": "^<safe-version>" }
# OR bump the direct dep if you control it.

# 4. Reinstall + re-audit.
npm install
npm audit
This pattern shipped on 2026-05-06 and closed 7 Dependabot alerts (4 high: @xmldom/xmldom; 2 medium: postcss, fast-xml-parser; 1 low: @tootallnate/once), plus a bonus high-severity axios fix via npm overrides; see commit 86eff3f for the exact change.
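For reference, the step-3 override can also be applied from the CLI instead of hand-editing package.json (a sketch; placeholders as in the steps above):

# apply an override without opening package.json (pkg/version are placeholders)
npm pkg set "overrides.<pkg>=^<safe-version>"
npm install && npm audit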