Why a separate tier
The outbox table guarantees at-least-once event delivery: events written inside a transaction (e.g., interaction.recorded.v1, outcome.recorded) are durable even when the configured EventPublisher backend is down or slow. Without a dedicated publisher tier, the worker pods that run BullMQ own this loop alongside long-running batch jobs, and a backed-up batch can starve the publish loop. Splitting these tiers keeps publish tail latency independent of batch contention.
What changed in the respond hot path
The /api/v1/respond endpoint no longer publishes events synchronously.
The interaction.recorded.v1 event is now enqueued inside the same
database transaction that writes the interaction history row.
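A minimal sketch of the transactional write, assuming a hypothetical withTransaction helper; the table names match this page, but the persistence API shown here is a stand-in for the real layer:

```typescript
// Stand-ins for the real persistence layer (hypothetical API).
interface Tx {
  insert(table: string, row: Record<string, unknown>): Promise<void>;
}

// In the real code this opens a database transaction; here it is a stub
// so the sketch type-checks standalone.
async function withTransaction<T>(fn: (tx: Tx) => Promise<T>): Promise<T> {
  const tx: Tx = { insert: async () => {} };
  return fn(tx);
}

async function recordInteraction(interaction: Record<string, unknown>): Promise<void> {
  await withTransaction(async (tx) => {
    // 1) Durable interaction history row.
    await tx.insert("interaction_history", interaction);
    // 2) Outbox row in the SAME transaction: if this insert fails, the
    //    interaction row rolls back and /api/v1/respond returns a 500.
    await tx.insert("outbox_events", {
      type: "interaction.recorded.v1",
      payload: JSON.stringify(interaction),
      status: "pending",
    });
  });
}
```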
Behavior-change note: an outbox row insertion failure now rolls back
the interaction row. This is a correctness improvement over the prior
fail-open path: the system no longer claims outcomes whose downstream
events it cannot persist. The cost is that pathological insert failures
(JSON-too-large, constraint violation, mid-transaction connection drop)
surface as 500s to the caller instead of silent drops. Operators
investigating “respond returned 500” should check outbox_events insert
errors first.
Operator visibility — what surfaces when things go wrong
| Failure mode | Signal | Operator action |
|---|---|---|
| Publisher pod down | Pod restart count + Pending events backlog (see metric below) | kubectl describe pod kaireon-outbox-publisher-* |
| Loop hangs | Liveness probe fails → k8s restarts | kubectl logs --previous |
| EventPublisher backend down | outbox-publisher tick failed ERROR logs with isTransient: true | Check Kafka/Redpanda health |
| Bad env config | Process exits with code 2 → CrashLoopBackOff | Fix the offending env var (e.g., OUTBOX_POLL_INTERVAL_MS) |
| Stuck processing rows | outbox_events.status='processing' AND now() - updatedAt > OUTBOX_REAPER_STALENESS_SECONDS | Auto-handled by the outboxReaper cron; see below |
| Sustained non-transient errors | Process exits with code 3 after 30 consecutive failures | Investigate root cause; pod will CrashLoopBackOff |
Recommended Prometheus alert
The kaireon_outbox_pending_count gauge is registered with the platform metrics registry and refreshed by the outbox processor on every poll tick. (See the Metrics Reference for the full PromQL alert family.)
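An illustrative rule, assuming only the metric name documented here; the threshold, duration, and alert name are placeholders, and the canonical family lives in the Metrics Reference:

```yaml
# Illustrative only -- tune expr threshold and `for` to your traffic;
# the canonical alert family is in the Metrics Reference.
groups:
  - name: kaireon-outbox
    rules:
      - alert: KaireonOutboxBacklog
        expr: kaireon_outbox_pending_count > 500
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Outbox backlog is not draining"
          description: "Pending outbox events above 500 for 10m; check publisher pods and the EventPublisher backend."
```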
Outbox reaper cron — /api/v1/cron/outbox-reaper
A dedicated cron job sweeps outbox_events and resets any row stuck in
processing whose updatedAt is older than the configured staleness
threshold back to pending. This closes the failure mode where a worker
dies between claiming a row (UPDATE → processing) and either
publishing it or marking it failed; without the reaper, those rows
would sit in processing forever and never be re-attempted.
The cron route invokes the outbox processor’s stuck-row reaper, which
performs a single bulk SQL UPDATE driven by the configured staleness
threshold. The operation is idempotent: re-running on already-pending
rows is a no-op.
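A SQL sketch of that reset, assuming Postgres and the column names shown on this page (status, updatedAt); the actual query lives in the outbox processor and may differ:

```sql
-- Reset rows stuck in 'processing' back to 'pending' once they exceed
-- the staleness threshold. Idempotent: already-pending rows never match.
UPDATE outbox_events
SET status = 'pending',
    "updatedAt" = now()
WHERE status = 'processing'
  AND now() - "updatedAt" > make_interval(secs => 300); -- OUTBOX_REAPER_STALENESS_SECONDS
```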
Helm wiring
The reaper cron is wired in helm/values.yaml. A cadence of 1–5 minutes
is fine because the operation is idempotent.
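A minimal values sketch; the key names below are illustrative, not the chart's actual schema:

```yaml
# Hypothetical values.yaml shape -- check the chart for the real keys.
cron:
  outboxReaper:
    enabled: true
    schedule: "*/5 * * * *"   # every 5 minutes; any 1-5 min cadence is safe (idempotent)
    path: /api/v1/cron/outbox-reaper
```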
Auth
The cron route fails closed when CRON_SECRET is unset (route.ts:25-32).
Authenticated callers present the secret via either Authorization: Bearer <secret> or the x-cron-secret header. Mismatched values return 401.
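A caller sketch; the base URL is illustrative and the HTTP method depends on the route implementation. Either header form authenticates:

```typescript
// Hypothetical caller -- base URL is illustrative. A missing or
// mismatched secret returns 401.
const secret = process.env.CRON_SECRET ?? "";

const res = await fetch("https://app.example.com/api/v1/cron/outbox-reaper", {
  headers: { Authorization: `Bearer ${secret}` },
  // Equivalent alternative: headers: { "x-cron-secret": secret },
});
console.log(res.status);
```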
Response
Configuration
| Variable | Default | Effect |
|---|---|---|
| OUTBOX_REAPER_STALENESS_SECONDS | 300 (5 min) | Rows in processing whose updatedAt is older than this are reset to pending. Invalid or non-positive values fall back to the default with a warning (route.ts:78-82). |
| CRON_SECRET | unset → 401 | Shared secret for the cron route. Required. |
Configuration knobs
OUTBOX_LIVENESS_FILE (default /tmp/outbox-publisher.alive) sets the
path of the publisher's heartbeat file. The liveness probe checks this
file’s age; when the publisher loop stops touching it, the probe fails
and Kubernetes restarts the pod.
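A minimal sketch of the heartbeat side, assuming the loop touches the file once per tick; the helper name here is hypothetical:

```typescript
import { closeSync, openSync, utimesSync } from "node:fs";

const LIVENESS_FILE =
  process.env.OUTBOX_LIVENESS_FILE ?? "/tmp/outbox-publisher.alive";

// Called at the top of every poll tick. Updating the mtime is what the
// k8s liveness probe observes; if ticks stop, the file goes stale and
// the probe fails.
function touchLiveness(): void {
  const now = new Date();
  try {
    utimesSync(LIVENESS_FILE, now, now);
  } catch {
    // First tick: create the file if it does not exist yet.
    closeSync(openSync(LIVENESS_FILE, "w"));
  }
}
```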
Honest known gaps
- Structured error IDs shipped on the worker tick path; not yet on every helper. The publisher's main poll loop mints a per-tick errorId and threads it into the next-attempt log line so SIEM tooling can correlate retries. Other in-tier helpers (shutdown drain, reaper companion) still emit bare structured logs and are tracked as a residual for migration. SIEM correlation works for the main loop today.
- kaireon_outbox_pending_count gauge shipped (W10 wave). It is registered with the platform metrics registry and refreshed every poll tick, so the recommended Prometheus alert above can be wired today. outboxProcessedTotal + outboxEventAge from W8.3 still cover throughput and freshness; this gauge closes the backlog-visibility gap. See the Metrics Reference for the full alert PromQL.