Why a separate tier
The outbox table guarantees at-least-once event delivery: events written inside a transaction (e.g., interaction.recorded.v1, outcome.recorded)
are durable even when the configured EventPublisher backend is down or
slow. Without a dedicated publisher tier, the worker pods that run BullMQ
own this publish loop alongside long-running batch jobs, and a backed-up
batch can starve the publish loop. Splitting these tiers keeps the
publish tail latency independent of batch contention.
What changed in the respond hot path
/api/v1/respond/route.ts no longer calls publishInteractionRecorded
synchronously. The interaction.recorded.v1 event is now enqueued
inside the same prisma.$transaction that writes the
InteractionHistory row.
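A minimal sketch of that transactional enqueue, assuming hypothetical Prisma model and field names (interactionHistory, outboxEvent, eventType, payload, status); the actual schema and the route code in /api/v1/respond/route.ts may differ:

```typescript
// Sketch: write the outbox row in the same transaction as the interaction write.
// Model and field names here are assumptions for illustration, not the real schema.
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

export async function recordInteraction(data: { userId: string; payload: Record<string, unknown> }) {
  return prisma.$transaction(async (tx) => {
    const interaction = await tx.interactionHistory.create({
      data: { userId: data.userId, payload: data.payload },
    });

    // If this insert fails, the whole transaction (including the interaction row)
    // rolls back, so the caller sees a 500 rather than a silently dropped event.
    await tx.outboxEvent.create({
      data: {
        eventType: "interaction.recorded.v1",
        payload: { interactionId: interaction.id },
        status: "pending",
      },
    });

    return interaction;
  });
}
```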
Behavior-change note: an outbox row insertion failure now rolls back
the interaction row. This is a correctness improvement vs the prior
fail-open path — the system no longer claims outcomes whose downstream
events it can’t persist. The cost is that pathological insert failures
(JSON-too-large, constraint violation, mid-tx connection drop) surface
as 500s to the caller instead of silent drops. Operators investigating
“respond returned 500” should check outbox_events insert errors first.
Operator visibility — what surfaces when things go wrong
| Failure mode | Signal | Operator action |
|---|---|---|
| Publisher pod down | Pod restart count + Pending events backlog (see metric below) | kubectl describe pod kaireon-outbox-publisher-* |
| Loop hangs | Liveness probe fails → k8s restarts | kubectl logs --previous |
| EventPublisher backend down | outbox-publisher tick failed ERROR logs with isTransient: true | Check Kafka/Redpanda health |
| Bad env config | Process exits with code 2 → CrashLoopBackOff | Fix OUTBOX_POLL_INTERVAL_MS etc. |
| Stuck processing rows | outbox_events.status='processing' AND now() - updatedAt > OUTBOX_REAPER_STALENESS_SECONDS | Auto-handled by the outboxReaper cron (see below) |
| Sustained non-transient errors | Process exits with code 3 after 30 consecutive failures | Investigate root cause; pod will CrashLoopBackOff |
Recommended Prometheus alert
The kaireon_outbox_pending_count gauge is registered at lib/metrics.ts:191 and refreshed by refreshOutboxPendingGauge() in lib/outbox-processor.ts on every poll tick. See Metrics Reference for the full PromQL alert family.
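A minimal sketch of an alert on this gauge; the alert name, threshold (100 pending events), and duration (10m) are assumptions to tune for your traffic, and the Metrics Reference holds the canonical rules:

```yaml
# Sketch only: name, threshold, and duration are illustrative, not taken from the Metrics Reference.
groups:
  - name: kaireon-outbox
    rules:
      - alert: KaireonOutboxBacklogGrowing
        expr: kaireon_outbox_pending_count > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Outbox pending backlog above 100 for 10 minutes"
          description: "Check the outbox-publisher pods and the EventPublisher backend."
```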
Outbox reaper cron — /api/v1/cron/outbox-reaper
A dedicated cron job sweeps outbox_events and resets any row stuck in
processing whose updatedAt is older than the configured staleness
threshold back to pending. This closes the failure mode where a worker
dies between claiming a row (UPDATE → processing) and either
publishing it or marking it failed — without the reaper those rows
would sit in processing forever and never be retried.
The route at src/app/api/v1/cron/outbox-reaper/route.ts:24-71 calls
reapStuckProcessing(stalenessSeconds) from
src/lib/outbox-processor.ts:182. The underlying SQL is a single
bulk UPDATE — re-running on already-pending rows is a no-op.
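A sketch of what that bulk reset could look like, assuming Prisma's $executeRaw and hypothetical column names (status, updated_at); the real reapStuckProcessing in src/lib/outbox-processor.ts:182 is the source of truth:

```typescript
// Sketch: reset stuck 'processing' rows back to 'pending' in one bulk UPDATE.
// Table and column names are assumptions for illustration.
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

export async function reapStuckProcessing(stalenessSeconds: number): Promise<number> {
  // Rows already in 'pending' never match the WHERE clause, so re-running is a no-op.
  const reset = await prisma.$executeRaw`
    UPDATE outbox_events
    SET status = 'pending'
    WHERE status = 'processing'
      AND updated_at < now() - ${stalenessSeconds} * interval '1 second'
  `;
  return reset; // number of rows reset
}
```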
Helm wiring
The reaper's schedule is wired in helm/values.yaml. A cadence of 1–5 minutes is fine
because the operation is idempotent.
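A sketch of what the values entry could look like; the key names (outboxReaper.enabled, outboxReaper.schedule) are illustrative assumptions rather than the actual keys in helm/values.yaml:

```yaml
# Hypothetical values.yaml keys for the reaper CronJob; check helm/values.yaml for the real structure.
outboxReaper:
  enabled: true
  schedule: "*/5 * * * *"   # every 5 minutes; anywhere in the 1-5 minute range is fine
```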
Auth
The cron route fails closed when CRON_SECRET is unset (route.ts:25-32).
Authenticated callers present the secret via either Authorization: Bearer <secret> or x-cron-secret. Mismatched values return 401.
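For example, assuming the route is called with GET and using a placeholder host (adjust the method if your deployment invokes it with POST):

```bash
# Both header forms are accepted; a mismatched or missing secret returns 401.
curl -H "Authorization: Bearer $CRON_SECRET" \
  https://kaireon.example.com/api/v1/cron/outbox-reaper

curl -H "x-cron-secret: $CRON_SECRET" \
  https://kaireon.example.com/api/v1/cron/outbox-reaper
```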
Response
Configuration
| Variable | Default | Effect |
|---|---|---|
| OUTBOX_REAPER_STALENESS_SECONDS | 300 (5 min) | Rows in processing whose updatedAt is older than this are reset to pending. Invalid or non-positive values fall back to the default with a warning (route.ts:78-82). |
| CRON_SECRET | unset → 401 | Shared secret for the cron route. Required. |
Configuration knobs
OUTBOX_LIVENESS_FILE (default
/tmp/outbox-publisher.alive, set in src/worker/outbox-publisher.ts:58).
The liveness probe checks this file’s age — when the publisher loop
stops touching it, the probe fails and Kubernetes restarts the pod.
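A sketch of the kind of exec liveness probe this implies; the one-minute freshness window, probe intervals, and thresholds below are illustrative assumptions, and the real probe is defined in the Helm chart:

```yaml
# Hypothetical probe: fails when /tmp/outbox-publisher.alive was last touched
# more than a minute ago. All values are illustrative.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - test "$(find /tmp/outbox-publisher.alive -mmin -1)"
  initialDelaySeconds: 30
  periodSeconds: 30
  failureThreshold: 3
```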
Honest known gaps
- logError + errorId shipped on the worker tick path; not yet on every helper. src/worker/outbox-publisher.ts:92 mints a per-tick errorId via logError (src/lib/api-error.ts:118) and threads it into the next-attempt log line. Other in-tier helpers (the shutdown drain and the outboxReaper companion) still emit bare getLogger().error() and are tracked as residual #18 in .planning/RESIDUALS_2026-04-29.md for migration. SIEM correlation works for the main loop today.
- kaireon_outbox_pending_count gauge shipped (W10 wave). Registered at lib/metrics.ts:191 and refreshed every poll tick by refreshOutboxPendingGauge() in lib/outbox-processor.ts. The recommended Prometheus alert above can be wired today. outboxProcessedTotal + outboxEventAge from W8.3 still cover throughput and freshness; this gauge closes the backlog visibility gap. See Metrics Reference for the full alert PromQL.