- Always-on (WORKER_INPROCESS=1, the legacy default): five workers run continuously inside the API container, long-polling Redis with blocking pop operations. Job latency ≈ 0. Idle Redis cost ≈ 30 ops/min permanently.
- Cron-driven drain (WORKER_INPROCESS=0, recommended for free-tier Redis): no always-on workers. A scheduled cron hits POST /api/v1/cron/drain-queues every few minutes. Each invocation connects, processes available jobs, and disconnects. Job latency ≤ cron interval. Idle Redis cost ≈ 20 ops per invocation × invocations/day.
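As a rough worked example of the idle-cost figures above, the sketch below compares the two modes over one day. The function names and the 5-minute interval are illustrative assumptions; the per-minute and per-invocation constants are the approximations quoted above, not measurements.

```python
# Approximate idle Redis cost per day for each worker mode.
# The constants are the rough figures quoted in the mode descriptions.
ALWAYS_ON_OPS_PER_MIN = 30
CRON_OPS_PER_INVOCATION = 20
MINUTES_PER_DAY = 24 * 60


def always_on_ops_per_day() -> int:
    """Idle cost when workers long-poll Redis continuously."""
    return ALWAYS_ON_OPS_PER_MIN * MINUTES_PER_DAY


def cron_ops_per_day(interval_minutes: int) -> int:
    """Idle cost when a cron hits the drain endpoint every `interval_minutes`."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    invocations_per_day = MINUTES_PER_DAY // interval_minutes
    return CRON_OPS_PER_INVOCATION * invocations_per_day


print(always_on_ops_per_day())  # 43200
print(cron_ops_per_day(5))      # 5760
```

Under these assumptions a 5-minute cron uses roughly one seventh of the idle operations of always-on workers, at the price of up to five minutes of job latency.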
When to use
| Workload | Recommended mode |
|---|---|
| Free-tier Redis or low-traffic playground | Cron-driven, every 5 min |
| Production with paid Redis + active batch/journey traffic | Always-on |
| Mixed (paid Redis, but workers run elsewhere as a separate service) | Cron-driven on the API; standalone worker container for the heavy queues |
Set the toggle
Set the WORKER_INPROCESS env var on your API container and redeploy. When WORKER_INPROCESS is unset or =1, the API container runs the legacy always-on worker. When =0, jobs are consumed only when /api/v1/cron/drain-queues is called.
POST /api/v1/cron/drain-queues
Drain queued jobs across the 5 BullMQ queues (batch-jobs, dsar-jobs, journey-jobs, retrain-jobs, seed-jobs).
Auth
Three env vars are accepted in priority order:
| Env var | Scope | When to use |
|---|---|---|
| DRAIN_QUEUES_TOKEN | this endpoint only | Recommended for external schedulers (cron-job.org, etc.); narrow blast radius if the token leaks |
| CRON_SECRET | shared by all /api/v1/cron/* | Use when the same internal scheduler hits multiple cron endpoints |
| CRON_TOKEN | backwards-compat alias | Pre-existing envs that haven’t migrated |
Send the token in one of these headers: X-Cron-Secret, X-Cron-Token, or Authorization: Bearer <token>.
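The priority order and the accepted headers can be sketched as follows. This is a minimal illustration, not the server's actual code: the helper names are hypothetical, and two behaviours are assumptions, namely that the first non-empty env var wins and that the headers are checked in the order listed. The comparison uses Python's hmac.compare_digest to mirror the constant-time compare listed under Hardening.

```python
import hmac
from typing import Mapping, Optional

# Env vars in the priority order given in the table above.
TOKEN_ENV_VARS = ("DRAIN_QUEUES_TOKEN", "CRON_SECRET", "CRON_TOKEN")


def expected_token(env: Mapping[str, str]) -> Optional[str]:
    """Return the first configured token (assumption: first non-empty wins)."""
    for name in TOKEN_ENV_VARS:
        value = env.get(name, "")
        if value:
            return value
    return None


def presented_token(headers: Mapping[str, str]) -> Optional[str]:
    """Extract the token from any of the three accepted headers."""
    lowered = {k.lower(): v for k, v in headers.items()}
    for name in ("x-cron-secret", "x-cron-token"):
        if lowered.get(name):
            return lowered[name]
    auth = lowered.get("authorization", "")
    if auth.startswith("Bearer "):
        return auth[len("Bearer "):]
    return None


def is_authorized(env: Mapping[str, str], headers: Mapping[str, str]) -> bool:
    """True only when a token is configured and the presented one matches."""
    expected = expected_token(env)
    presented = presented_token(headers)
    if not expected or not presented:
        return False
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(expected.encode(), presented.encode())
```

For example, with only CRON_SECRET and CRON_TOKEN set, this sketch accepts a request carrying the CRON_SECRET value in any of the three headers and rejects one carrying the CRON_TOKEN value.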
Hardening (also active by default):
- Sliding-window per-IP rate limit: DRAIN_QUEUES_RATE_LIMIT requests/minute (default 12). Returns 429 with Retry-After when exceeded.
- Optional IP allowlist: set CRON_ALLOWED_IPS=ip1,ip2,… (comma-separated). When set, only requests from listed IPs (matched against the left-most X-Forwarded-For entry) pass. When unset, the IP check is skipped (token-only).
- Constant-time token compare: prevents timing-attack token discovery.
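To make the first two hardening measures concrete, here is a small sketch of a sliding-window limiter and an allowlist check. It only illustrates the technique: the class and function names are hypothetical, the limiter keeps its state in process memory, and the allowlist does exact string matching (no CIDR ranges), none of which is confirmed by the description above.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class SlidingWindowLimiter:
    """Allow at most `limit` requests per IP in any `window_s`-second window."""

    def __init__(self, limit: int = 12, window_s: float = 60.0) -> None:
        self.limit = limit
        self.window_s = window_s
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def check(self, ip: str, now: Optional[float] = None) -> float:
        """Return 0.0 if allowed, else the seconds to wait (a Retry-After value)."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return self.window_s - (now - hits[0])
        hits.append(now)
        return 0.0


def client_ip(x_forwarded_for: str) -> str:
    """Left-most X-Forwarded-For entry, as described above."""
    return x_forwarded_for.split(",")[0].strip()


def ip_allowed(ip: str, cron_allowed_ips: str) -> bool:
    """An empty allowlist skips the check; otherwise require an exact match."""
    allowed = {p.strip() for p in cron_allowed_ips.split(",") if p.strip()}
    return not allowed or ip in allowed
```

A rejected request in this sketch gets a positive wait time, which a handler could round up and return in the Retry-After header alongside the 429 status.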
Query parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| queue | string | (all 5) | Drain only this queue. Allowed: batch-jobs, dsar-jobs, journey-jobs, retrain-jobs, seed-jobs. |
| maxDurationMs | number | 60000 (1 min) | Hard wall-clock cap. Max 600000 (10 min). The endpoint exits as soon as queues report idle, but never runs longer than this. |
| maxConcurrentQueues | number | 2 | How many queues run BullMQ workers in parallel inside one invocation. Caps concurrent Redis connections so idle ops stay low on free-tier Redis. Bump to 5 when self-hosted. |
| maxJobsPerQueue | number | unbounded | Safety stop per queue. Useful when chaining short cron ticks. |
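A caller can assemble and sanity-check these parameters before sending a request. The helper below is a hypothetical client-side sketch: the function name and the example.com base URL are placeholders, and rejecting an out-of-range maxDurationMs locally is a choice made here, since the table does not say whether the server clamps or rejects such values.

```python
from typing import Optional
from urllib.parse import urlencode

ALLOWED_QUEUES = {"batch-jobs", "dsar-jobs", "journey-jobs", "retrain-jobs", "seed-jobs"}
MAX_DURATION_MS = 600_000  # documented upper bound (10 min)


def drain_url(
    base_url: str,
    queue: Optional[str] = None,
    max_duration_ms: Optional[int] = None,
    max_concurrent_queues: Optional[int] = None,
    max_jobs_per_queue: Optional[int] = None,
) -> str:
    """Build the drain-queues URL, omitting parameters left at their defaults."""
    if queue is not None and queue not in ALLOWED_QUEUES:
        # The server documents a 400 for this case; fail early instead.
        raise ValueError(f"unknown queue: {queue}")
    if max_duration_ms is not None and not 0 < max_duration_ms <= MAX_DURATION_MS:
        raise ValueError(f"maxDurationMs must be in 1..{MAX_DURATION_MS}")
    params = {
        "queue": queue,
        "maxDurationMs": max_duration_ms,
        "maxConcurrentQueues": max_concurrent_queues,
        "maxJobsPerQueue": max_jobs_per_queue,
    }
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = base_url.rstrip("/") + "/api/v1/cron/drain-queues"
    return f"{url}?{query}" if query else url


print(drain_url("https://example.com", queue="seed-jobs", max_duration_ms=30000))
# https://example.com/api/v1/cron/drain-queues?queue=seed-jobs&maxDurationMs=30000
```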
Response 200
idleAt is the number of wall-clock milliseconds, measured from the start of the invocation, at which the queue first reported idle. 0 means the queue was already empty when probed: no worker was started, only the cheap getJobCounts probe ran, and the endpoint moved on.
Error codes
| Code | Reason |
|---|---|
| 400 | Unknown queue parameter. |
| 401 | Missing or invalid CRON_SECRET. |
| 500 | REDIS_URL not configured. |
Scheduling — pick one
Option A — GitHub Actions cron (simplest, free)
Add a scheduled workflow at .github/workflows/drain-queues.yml that sends the authenticated POST request on each run.
Option B — AWS EventBridge schedule (preferred when already on AWS)
Option C — External uptime monitor (cheap, hands-off)
Services like Cron-Job.org, EasyCron, or UptimeRobot can hit any HTTPS URL on a schedule. Configure:
- URL: https://playground.kaireonai.com/api/v1/cron/drain-queues
- Method: HTTP POST
- Headers: X-Cron-Secret: <CRON_SECRET>
- Schedule: */5 * * * *
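The request such a service issues on each tick can be reproduced with Python's standard library, which is handy for testing the endpoint by hand before wiring up a scheduler. This is a sketch under assumptions: the function name is hypothetical, and the request is sent without a body because none is described above.

```python
import urllib.request


def build_drain_request(url: str, secret: str) -> urllib.request.Request:
    """Build (but do not send) the POST a scheduler issues on each tick."""
    return urllib.request.Request(
        url,
        method="POST",
        headers={"X-Cron-Secret": secret},
    )


req = build_drain_request(
    "https://playground.kaireonai.com/api/v1/cron/drain-queues",
    "<CRON_SECRET>",  # placeholder: substitute the real secret
)
print(req.get_method(), req.full_url)

# To actually send it:
#   with urllib.request.urlopen(req) as resp:
#       print(resp.status)
```

Keeping construction separate from sending lets you inspect the method, URL, and headers without making a network call.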
Option D — Self-managed (k8s CronJob, supervised cron, etc.)
Use whatever scheduler your platform provides. Each tick should send one authenticated POST to /api/v1/cron/drain-queues.
Cost math
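For a tenant with zero queued jobs (the common idle case on playground), cron-driven mode costs about 20 Redis ops per invocation, so the daily total depends only on the schedule. Always-on mode costs about 30 ops/min regardless of traffic.
Caveats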
- Job-failure semantics differ from always-on: in always-on mode, a failed job is retried as soon as its BullMQ exponential-backoff delay elapses. In cron-driven mode, a retry is picked up no earlier than the next tick. For low-frequency workloads this is fine. For SLA-sensitive workloads, run always-on workers on paid Redis.
- Long-running jobs (> maxDurationMs) will be aborted mid-flight when the worker closes. They’ll be re-enqueued by BullMQ’s stalled-job detector on the next tick. Set maxDurationMs higher than your longest expected job, or split jobs into smaller chunks.
- The drain endpoint is idempotent: re-hitting it during an in-flight invocation just no-ops on jobs already in-flight (BullMQ’s lock semantics).