
Audience: On-call engineers, SREs, platform operators
First written: 2026-05-06 (after a free-tier Upstash quota exhaustion incident)
Related: /api/v1/cron/drain-queues reference · Env vars · Incident response
This runbook captures everything you need to operate KaireonAI’s BullMQ worker queues on free-tier or low-budget Redis. It exists because, on 2026-05-06, the playground hit Upstash’s 500K-commands/month limit purely from idle worker polling — zero queued jobs, ~1.3M ops/month wasted on BRPOPLPUSH polls. The fix took ~30 minutes; the patterns below prevent it from happening again.

1. The two worker modes

KaireonAI ships with two execution modes:
| Mode | Toggle | Behavior | Job latency | Idle Redis cost |
| --- | --- | --- | --- | --- |
| Always-on | WORKER_INPROCESS=1 (default) | Five BullMQ workers run continuously inside the API container, each long-polling Redis with the BRPOPLPUSH command. | ≈ 0 (jobs picked up the moment they're enqueued) | ~30 ops/min, permanently |
| Cron-driven drain | WORKER_INPROCESS=0 | No always-on workers. A scheduled cron hits POST /api/v1/cron/drain-queues every few minutes; each invocation connects, processes available jobs, disconnects. | ≤ cron interval (5 min default) | ~20 ops per invocation |
For the queue mix on a typical deployment (batch-jobs, dsar-jobs, journey-jobs, retrain-jobs, seed-jobs), a 5-min cron sustains batch + DSAR + retrain workloads at acceptable latency while burning ~170K ops/month idle — comfortably under the Upstash free tier's 500K. A 30-min cron drops to ~29K ops/month idle but adds up to 30 min of latency for batch / DSAR / retrain (still fine; these aren't real-time).
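For intuition, here is the shape of one drain invocation (an illustration of the pattern only, not the actual handler source; drainOnce and the 250 ms poll interval are invented for this sketch):

// drain-sketch.js — illustrative shape of one cron-driven drain invocation
const { Queue, Worker } = require('bullmq');

async function drainOnce(name, connection, maxDurationMs = 60_000) {
  const queue = new Queue(name, { connection });
  let processed = 0;
  // a short-lived worker: spun up for this invocation, torn down at the end
  const worker = new Worker(name, async (job) => {
    // ...the app's real job processors run here...
    processed += 1;
  }, { connection });

  const deadline = Date.now() + maxDurationMs;
  // run until the queue is empty or the time budget is spent
  while (Date.now() < deadline && (await queue.getWaitingCount()) > 0) {
    await new Promise((resolve) => setTimeout(resolve, 250));
  }
  await worker.close();   // waits for in-flight jobs, then disconnects
  await queue.close();
  return processed;
}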

When to pick which

| Workload | Recommended |
| --- | --- |
| Free-tier Redis (Upstash 500K, Fly Redis, etc.) | WORKER_INPROCESS=0 + 5-min cron |
| Paid Redis + active batch/journey traffic | WORKER_INPROCESS=1 |
| Mixed (paid Redis, but workers run elsewhere) | WORKER_INPROCESS=0 on the API + dedicated kaireon-worker container |

2. Upstash quota math (at a glance)

Free-tier Upstash gives 500,000 Redis commands per month. Here’s where the budget goes:
| Source | Ops / month |
| --- | --- |
| 5 BullMQ workers polling idle (WORKER_INPROCESS=1) | ~1,300,000 |
| WORKER_INPROCESS=0 + 5-min drain cron, idle queues | ~173,000 |
| WORKER_INPROCESS=0 + 15-min drain cron, idle queues | ~58,000 |
| WORKER_INPROCESS=0 + 30-min drain cron, idle queues | ~29,000 |
| Per /recommend call with rate-limit + flow cache hit | ~5–8 |
| Real /recommend traffic (e.g., 500 calls/day) | ~75,000–120,000 |
Rule of thumb: with WORKER_INPROCESS=0 + 5-min cron, you can sustain ~50K /recommend calls/month on free-tier Upstash before any rate-limit/cache cost becomes the binding factor. Above that, upgrade Redis.
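The idle figures above fall straight out of the per-mode costs in §1 (~30 ops/min for always-on polling, ~20 ops per drain invocation). A quick sanity check:

node -e "
const MIN_PER_MONTH = 60 * 24 * 30;                       // 43,200 minutes
console.log('always-on :', 30 * MIN_PER_MONTH);           // 1,296,000 ≈ 1.3M
for (const interval of [5, 15, 30]) {
  const invocations = MIN_PER_MONTH / interval;
  console.log(interval + '-min cron:', 20 * invocations); // 172,800 / 57,600 / 28,800
}
"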

3. Setup checklist (new deployment)

If you’re standing up a new App Runner service or migrating an existing one to cron-driven mode:

3.1 — Set the env vars on App Runner

ARN="arn:aws:apprunner:us-east-1:422500312304:service/kaireon-playground/<service-id>"

# 1. Snapshot the current SourceConfiguration so we don't drop unrelated env vars.
aws apprunner describe-service --service-arn "$ARN" --region us-east-1 \
  --query 'Service.SourceConfiguration' --output json > /tmp/sc.json

# 2. Inject the new env vars (preserve existing ones).
node -e "
const fs = require('fs');
const sc = JSON.parse(fs.readFileSync('/tmp/sc.json'));
const cur = sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables || {};
cur.WORKER_INPROCESS = '0';
cur.DRAIN_QUEUES_TOKEN = require('crypto').randomBytes(32).toString('hex');
cur.DRAIN_QUEUES_RATE_LIMIT = '12';
sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables = cur;
fs.writeFileSync('/tmp/sc-new.json', JSON.stringify({SourceConfiguration: sc}, null, 2));
console.log('DRAIN_QUEUES_TOKEN=' + cur.DRAIN_QUEUES_TOKEN);
"

# 3. Apply. App Runner restarts the container with the new vars.
aws apprunner update-service --service-arn "$ARN" --region us-east-1 \
  --cli-input-json file:///tmp/sc-new.json \
  --query 'OperationId' --output text
Why the snapshot-merge dance: update-service replaces the entire runtime-environment-variables object on the service, so passing only the new vars would silently delete DATABASE_URL, REDIS_URL, and every other secret. Always read-modify-write.
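After the update completes, a quick sanity check (assuming jq is installed) confirms the merge kept everything:

aws apprunner describe-service --service-arn "$ARN" --region us-east-1 \
  --query 'Service.SourceConfiguration.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables' \
  --output json | jq 'keys'
# Expect: the full key list (DATABASE_URL, REDIS_URL, WORKER_INPROCESS,
# DRAIN_QUEUES_TOKEN, ...), not just the vars you touched.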

3.2 — Schedule the drain endpoint

Pick one scheduler. All of the options below work; choose by ergonomics and cost.

Option A — cron-job.org (free hosted scheduler)
  1. Sign in at https://cron-job.org.
  2. Create cronjob:
    • Title: Kaireon drain queues
    • URL: https://<your-domain>/api/v1/cron/drain-queues
    • Schedule: Every 5 minutes
    • Save responses: ✅ on
  3. Advanced tab:
    • Method: POST
    • Headers: X-Cron-Token: <DRAIN_QUEUES_TOKEN-value>
    • Timeout: 60 seconds
First execution fires within 5 min. Verify in cron-job.org’s “Execution history” tab that you see 200 OK.

Option B — AWS EventBridge (cleanest if all-in on AWS)

# EventBridge → Lambda relay → App Runner (because EventBridge can't POST directly to App Runner URLs).
# Or use EventBridge API Destinations if your App Runner service is behind API Gateway.
Cost: ~$0.002/month at 5-min cadence (well under EventBridge’s 14M-invocation free tier).
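If you take the Lambda-relay route, the relay can be tiny. A sketch, assuming a Node 18+ runtime (global fetch) and two Lambda env vars, PLAYGROUND_URL and DRAIN_QUEUES_TOKEN, named here for illustration:

// lambda relay sketch (env var names are illustrative)
exports.handler = async () => {
  const res = await fetch(`${process.env.PLAYGROUND_URL}/api/v1/cron/drain-queues`, {
    method: 'POST',
    headers: { 'X-Cron-Token': process.env.DRAIN_QUEUES_TOKEN },
  });
  if (!res.ok) throw new Error(`drain-queues returned ${res.status}`);
  return await res.json();
};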

Option C — GitHub Actions cron (simplest if your repo is public)

.github/workflows/drain-queues.yml:
name: Drain Queues
on:
  schedule:
    - cron: "*/5 * * * *"
  workflow_dispatch:
jobs:
  drain:
    runs-on: ubuntu-latest
    steps:
      - run: |
          curl -fsS -X POST "${{ secrets.PLAYGROUND_URL }}/api/v1/cron/drain-queues" \
            -H "X-Cron-Token: ${{ secrets.DRAIN_QUEUES_TOKEN }}"
Cost note: free for public repos. Private repos incur ~$53/month at a 5-min cadence (each tick is rounded up to 1 billable minute: 8,640 ticks/month × $0.008/min, minus the 2,000 free minutes). Use cron-job.org or EventBridge for private repos.

Option D — UptimeRobot or other uptime-monitor

Same shape as Option A. Most uptime monitors support custom HTTP headers on free tiers.

3.3 — Verify

# 1. Endpoint healthy?
curl -fsS -X POST "https://<your-domain>/api/v1/cron/drain-queues" \
  -H "X-Cron-Token: $DRAIN_QUEUES_TOKEN" | jq

# Expect: { "ok": true, "totalProcessed": 0, "queues": { ... all idleAt: 0 ... } }

# 2. Cron actually firing?
LG="/aws/apprunner/kaireon-playground/<service-id>/application"
START=$(($(date +%s) * 1000 - 1200000))   # 20 min ago
aws logs filter-log-events --region us-east-1 --log-group-name "$LG" \
  --start-time $START --filter-pattern '"drain-queues complete"' \
  --query 'length(events)'
# Expect: ≥ 4 events in the last 20 min for a 5-min cadence.

4. Token rotation procedure

Periodic rotation reduces the blast radius of a leaked token. The drain endpoint accepts tokens in priority order: DRAIN_QUEUES_TOKEN → CRON_SECRET → CRON_TOKEN. The fallback chain lets you rotate without downtime.
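Conceptually, the check amounts to something like this (a sketch, not the actual handler source; the timing-safe comparison is an assumption about the implementation):

// token-check sketch (illustrative only)
const crypto = require('crypto');

function safeEqual(a, b) {
  const ba = Buffer.from(String(a));
  const bb = Buffer.from(String(b));
  return ba.length === bb.length && crypto.timingSafeEqual(ba, bb);
}

function isAuthorized(presentedToken) {
  // priority order: DRAIN_QUEUES_TOKEN → CRON_SECRET → CRON_TOKEN
  return [process.env.DRAIN_QUEUES_TOKEN, process.env.CRON_SECRET, process.env.CRON_TOKEN]
    .filter(Boolean)
    .some((token) => safeEqual(presentedToken, token));
}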
# 1. Generate the new token.
NEW_TOKEN=$(openssl rand -hex 32)
echo "New: $NEW_TOKEN"

# 2. Append it as the active DRAIN_QUEUES_TOKEN; keep the old token as
# CRON_TOKEN (a fallback the endpoint still accepts) for the rotation window.
ARN="arn:aws:apprunner:us-east-1:422500312304:service/kaireon-playground/<service-id>"
aws apprunner describe-service --service-arn "$ARN" --region us-east-1 \
  --query 'Service.SourceConfiguration' --output json > /tmp/sc.json
node -e "
const fs = require('fs');
const sc = JSON.parse(fs.readFileSync('/tmp/sc.json'));
const cur = sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables || {};
cur.CRON_TOKEN = cur.DRAIN_QUEUES_TOKEN;        // demote old to fallback
cur.DRAIN_QUEUES_TOKEN = '$NEW_TOKEN';          // new active
sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables = cur;
fs.writeFileSync('/tmp/sc-new.json', JSON.stringify({SourceConfiguration: sc}, null, 2));
"
aws apprunner update-service --service-arn "$ARN" --region us-east-1 \
  --cli-input-json file:///tmp/sc-new.json --query 'OperationId' --output text

# 3. Wait for restart (~5 min), then update the cron-job.org header
# to use $NEW_TOKEN instead of the old one. Test one tick manually.

# 4. After the next cron-job.org tick succeeds with the new token,
# re-snapshot the live config (the step-2 /tmp/sc.json still holds the
# OLD token, so reusing it would revert the rotation) and remove
# CRON_TOKEN entirely:
aws apprunner describe-service --service-arn "$ARN" --region us-east-1 \
  --query 'Service.SourceConfiguration' --output json > /tmp/sc.json
node -e "
const fs = require('fs');
const sc = JSON.parse(fs.readFileSync('/tmp/sc.json'));
const cur = sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables || {};
delete cur.CRON_TOKEN;
sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables = cur;
fs.writeFileSync('/tmp/sc-final.json', JSON.stringify({SourceConfiguration: sc}, null, 2));
"
aws apprunner update-service --service-arn "$ARN" --region us-east-1 \
  --cli-input-json file:///tmp/sc-final.json --query 'OperationId' --output text
The fallback chain means there’s never a window where the cron is rejected by the endpoint.

5. Token blast radius (reference)

What an attacker with DRAIN_QUEUES_TOKEN can do:
| Allowed | Blocked |
| --- | --- |
| Hit /api/v1/cron/drain-queues repeatedly | All other /api/v1/cron/* endpoints (need CRON_SECRET) |
| Trigger queue processing for already-enqueued jobs | Enqueue new jobs |
| Cause cost amplification (compute via repeated drains, rate-limited at 12/min/IP) | Read tenant data, /recommend, customer entities |
|  | Modify roles, tenants, API keys |
|  | Touch AWS / ECR / RDS / App Runner config |
|  | Drop tables, delete the app |
Realistic worst case: cost amplification on compute. Rate-limit blocks abuse beyond 12 req/min per IP. Even with a leaked token, no user data is exposed and the app cannot be destroyed. If a leak is suspected: rotate (above) and the leaked token becomes invalid on the next App Runner restart (~5 min).

6. Common operations

6.1 — Switch between worker modes

# Always-on → cron-driven
aws apprunner describe-service --service-arn "$ARN" --region us-east-1 \
  --query 'Service.SourceConfiguration' --output json > /tmp/sc.json
node -e "
const fs = require('fs');
const sc = JSON.parse(fs.readFileSync('/tmp/sc.json'));
sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables.WORKER_INPROCESS = '0';
fs.writeFileSync('/tmp/sc-new.json', JSON.stringify({SourceConfiguration: sc}, null, 2));
"
aws apprunner update-service --service-arn "$ARN" --region us-east-1 \
  --cli-input-json file:///tmp/sc-new.json
Reverse the value to go cron-driven → always-on. Never pair WORKER_INPROCESS=1 on the API with a separate kaireon-worker container; when a dedicated worker runs, set =0 on the API (otherwise both consume the same queues, double-counting metrics and risking job-ordering issues).

6.2 — Trigger a one-shot drain manually

curl -fsS -X POST "https://<your-domain>/api/v1/cron/drain-queues?maxDurationMs=60000" \
  -H "X-Cron-Token: $DRAIN_QUEUES_TOKEN" | jq
maxDurationMs caps the invocation's wall-clock time: 1 min by default, 10 min hard cap. maxConcurrentQueues (default 2) caps how many BullMQ workers run inside a single invocation.
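For example, a longer and wider manual drain (parameter values here are illustrative):

curl -fsS -X POST "https://<your-domain>/api/v1/cron/drain-queues?maxDurationMs=300000&maxConcurrentQueues=4" \
  -H "X-Cron-Token: $DRAIN_QUEUES_TOKEN" | jq '.totalProcessed'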

6.3 — Move to a new Upstash database

# 1. Provision the new database in Upstash console; grab the standard
# Redis URL (rediss://default:<token>@<host>:6379), NOT the REST URL.
NEW_REDIS_URL='rediss://default:...@<new-host>.upstash.io:6379'

# 2. Smoke-test the URL first — bake into a local probe so we don't
# discover a typo only after App Runner restarts.
node -e "
const Redis = require('ioredis');
const r = new Redis('$NEW_REDIS_URL', { lazyConnect: true, connectTimeout: 5000, maxRetriesPerRequest: 1 });
(async () => {
  await r.connect();
  console.log('PING:', await r.ping());
  await r.quit();
})();
"
# Expect: PING: PONG

# 3. Update App Runner.
aws apprunner describe-service --service-arn "$ARN" --region us-east-1 \
  --query 'Service.SourceConfiguration' --output json > /tmp/sc.json
node -e "
const fs = require('fs');
const sc = JSON.parse(fs.readFileSync('/tmp/sc.json'));
sc.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables.REDIS_URL = '$NEW_REDIS_URL';
fs.writeFileSync('/tmp/sc-new.json', JSON.stringify({SourceConfiguration: sc}, null, 2));
"
aws apprunner update-service --service-arn "$ARN" --region us-east-1 \
  --cli-input-json file:///tmp/sc-new.json --query 'OperationId' --output text
Caveat: switching Redis instances abandons any in-flight queued jobs in the old database. For playground / dev environments this is fine; for production, wait for queues to drain (drain-queues returns totalProcessed: 0 for several consecutive ticks) before switching.
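For the production path, a small gate loop works (a sketch; the three-consecutive-empty-ticks threshold is arbitrary):

# wait until 3 consecutive drains report totalProcessed == 0
ZERO=0
while [ "$ZERO" -lt 3 ]; do
  N=$(curl -fsS -X POST "https://<your-domain>/api/v1/cron/drain-queues" \
        -H "X-Cron-Token: $DRAIN_QUEUES_TOKEN" | jq -r '.totalProcessed')
  if [ "$N" = "0" ]; then ZERO=$((ZERO + 1)); else ZERO=0; fi
  sleep 60
done
echo "queues drained; safe to switch REDIS_URL"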

7. Deploy flow (this app)

Standard deploy path for code or env-var changes:
| Trigger | Command | What it does |
| --- | --- | --- |
| Code change | bash tools/scripts/build-and-deploy.sh from repo root | Build Docker → push to ECR → trigger App Runner deployment |
| Env-var change only | aws apprunner update-service --cli-input-json file:///tmp/sc-new.json | Restart container with new env vars (no image rebuild) |
| Both | Run the deploy script first, then the env-var update | Image rolls out, then env vars apply on the new image |
Always read-modify-write env vars via the snapshot pattern above — update-service replaces the entire runtime-environment-variables block on the service. Each deploy takes ~4-5 minutes end-to-end:
  1. Docker build (~2 min)
  2. ECR push (~30 sec for incremental layers)
  3. App Runner roll-out (~2 min — container restart + health check)
Verify with:
aws apprunner list-operations --service-arn "$ARN" --region us-east-1 \
  --max-results 1 --query 'OperationSummaryList[0].[Id,Status,EndedAt]' --output text
# Expect: <id> SUCCEEDED <timestamp>

curl -sm 8 https://<your-domain>/api/health | jq .uptime
# Expect: uptime < 300 (i.e., container restarted within last 5 min)

8. Incident: “Upstash quota exhausted”

Symptoms: every Redis-backed feature (rate limiting, caching, BullMQ enqueue) returns ERR max requests limit exceeded, and Upstash emails you that the free-tier limit was reached.

Diagnosis

# Confirm via Upstash REST API:
curl -sm 8 -H "Authorization: Bearer <UPSTASH_REDIS_REST_TOKEN>" \
  "https://<host>.upstash.io/PING"
# If quota exhausted: { "error": "ERR max requests limit exceeded. Limit: 500000, Usage: 500000" }

# Check current /recommend traffic to size the burn:
psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM decision_traces WHERE \"createdAt\" > NOW() - INTERVAL '30 days';"
If real /recommend traffic × ~5-8 ops/call < 500K, the burn is from idle worker polling — apply the fix in §3.
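To put numbers on it (a sketch; 8 ops/call is the upper bound from §2):

CALLS=$(psql "$DATABASE_URL" -tA -c \
  "SELECT COUNT(*) FROM decision_traces WHERE \"createdAt\" > NOW() - INTERVAL '30 days';")
echo "worst-case traffic burn: $((CALLS * 8)) ops/month (free-tier quota: 500000)"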

Resolution paths

| Option | Effort | Cost |
| --- | --- | --- |
| WORKER_INPROCESS=0 + cron drain (§3) | 30 min | $0 |
| New Upstash database + same fix (cleaner reset) | 45 min | $0 |
| Upgrade Upstash to pay-as-you-go | 1 min | ~$0.20 / 100K ops |
| Move to a different Redis (Fly free tier, ElastiCache, self-hosted) | 1-2 hours | varies |
The first option is recommended for free-tier deployments; it removes the burn at its source. The second is the “panic-restart” version when you also want a clean Redis with no abandoned queue state.

9. Dependency-vulnerability triage process

When GitHub Dependabot opens alerts on the repo:
# 1. List open alerts.
gh api repos/<org>/<repo>/dependabot/alerts --paginate \
  -q '.[] | select(.state=="open") | "\(.security_advisory.severity) [\(.dependency.package.ecosystem)] \(.dependency.package.name) — \(.security_advisory.summary) (alert #\(.number))"'

# 2. For each, check the safe version + ancestry.
npm view <pkg> version
npm ls <pkg>

# 3. Add an override in package.json (preferred when the vuln is transitive
# and the parent hasn't released a fix yet):
#   "overrides": { "<pkg>": "^<safe-version>" }
# OR bump the direct dep if you control it.

# 4. Reinstall + re-audit.
npm install
npm audit
This pattern shipped on 2026-05-06 and closed 7 Dependabot alerts (4 high: @xmldom/xmldom; 2 medium: postcss, fast-xml-parser; 1 low: @tootallnate/once), plus a bonus high-severity axios fix via npm overrides; see commit 86eff3f for the exact change.
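For reference, the step-3 override can also be applied from the CLI instead of hand-editing package.json (a sketch; placeholders as in the steps above):

# apply an override without opening package.json (pkg/version are placeholders)
npm pkg set "overrides.<pkg>=^<safe-version>"
npm install && npm audit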