
Why a separate tier

The outbox table guarantees at-least-once event delivery: events written inside a transaction (e.g., interaction.recorded.v1, outcome.recorded) are durable even when the configured EventPublisher backend is down or slow. Without a dedicated publisher tier, the worker pods running BullMQ would own the publish loop alongside long-running batch jobs, and a backed-up batch queue could starve it. Splitting these tiers keeps publish tail latency independent of batch contention.
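A minimal sketch of the loop the dedicated tier runs, assuming an outbox_events table with status, topic, and payload columns and a generic EventPublisher interface; the names, batch size, and claim strategy are illustrative, not the actual implementation:

```typescript
// Sketch of the publisher tier's poll loop over the outbox table.
// Table/column names and the EventPublisher interface are assumptions.
import { Pool } from "pg";

interface EventPublisher {
  publish(topic: string, payload: unknown): Promise<void>;
}

const pool = new Pool();

async function pollOnce(publisher: EventPublisher): Promise<void> {
  // Claim a batch of pending rows so concurrent publishers never double-claim.
  const { rows } = await pool.query(
    `UPDATE outbox_events
        SET status = 'processing', updated_at = now()
      WHERE id IN (SELECT id FROM outbox_events
                    WHERE status = 'pending'
                    ORDER BY created_at
                    LIMIT 100
                      FOR UPDATE SKIP LOCKED)
      RETURNING id, topic, payload`
  );

  for (const row of rows) {
    try {
      await publisher.publish(row.topic, row.payload);
      await pool.query(
        "UPDATE outbox_events SET status = 'published', updated_at = now() WHERE id = $1",
        [row.id]
      );
    } catch {
      // Put the row back so the next tick retries it; rows stranded in
      // 'processing' by a crash are handled by the reaper cron below.
      await pool.query(
        "UPDATE outbox_events SET status = 'pending', updated_at = now() WHERE id = $1",
        [row.id]
      );
    }
  }
}
```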

What changed in the respond hot path

The /api/v1/respond endpoint no longer publishes events synchronously. The interaction.recorded.v1 event is now enqueued inside the same database transaction that writes the interaction history row. Behavior-change note: an outbox row insertion failure now rolls back the interaction row. This is a correctness improvement vs the prior fail-open path — the system no longer claims outcomes whose downstream events it can’t persist. The cost is that pathological insert failures (JSON-too-large, constraint violation, mid-tx connection drop) surface as 500s to the caller instead of silent drops. Operators investigating “respond returned 500” should check outbox_events insert errors first.
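A sketch of the transactional enqueue described above, assuming a node-postgres style client; table and column names are placeholders rather than the actual schema:

```typescript
// Illustrative respond hot path: the interaction row and the outbox row are
// written in one transaction, so an outbox insert failure rolls both back.
import { Pool } from "pg";

const pool = new Pool();

async function recordInteraction(interaction: { id: string; payload: unknown }) {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");

    // 1. Write the interaction history row.
    await client.query(
      "INSERT INTO interaction_history (id, payload) VALUES ($1, $2)",
      [interaction.id, JSON.stringify(interaction.payload)]
    );

    // 2. Enqueue the event in the same transaction. If this insert fails,
    //    the catch below rolls back the interaction row as well, so the
    //    system never records an interaction whose event it cannot persist.
    await client.query(
      "INSERT INTO outbox_events (topic, payload, status) VALUES ($1, $2, 'pending')",
      ["interaction.recorded.v1", JSON.stringify(interaction.payload)]
    );

    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err; // surfaces to the caller as a 500
  } finally {
    client.release();
  }
}
```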

Operator visibility — what surfaces when things go wrong

| Failure mode | Signal | Operator action |
| --- | --- | --- |
| Publisher pod down | Pod restart count + pending events backlog (see metric below) | kubectl describe pod kaireon-outbox-publisher-* |
| Loop hangs | Liveness probe fails → k8s restarts | kubectl logs --previous |
| EventPublisher backend down | outbox-publisher tick failed ERROR logs with isTransient: true | Check Kafka/Redpanda health |
| Bad env config | Process exits with code 2 → CrashLoopBackOff | Fix the OUTBOX_POLL_INTERVAL_MS env var etc. |
| Stuck processing rows | outbox_events.status='processing' AND now() - updatedAt > OUTBOX_REAPER_STALENESS_SECONDS | Auto-handled by outboxReaper cron — see below |
| Sustained non-transient errors | Process exits with code 3 after 30 consecutive failures | Investigate root cause; pod will CrashLoopBackOff |
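The exit codes in the table come from the publisher process itself. A rough sketch of what such a guard could look like; only the exit codes and the 30-failure threshold come from the table, everything else is illustrative:

```typescript
// Illustrative failure-handling guard for the publisher loop. The real tier
// additionally classifies transient vs non-transient errors, elided here.
const MAX_CONSECUTIVE_FAILURES = 30;

function requireEnvInt(name: string): number {
  const parsed = Number.parseInt(process.env[name] ?? "", 10);
  if (!Number.isFinite(parsed) || parsed <= 0) {
    console.error(`invalid ${name}`);
    process.exit(2); // bad env config → pod enters CrashLoopBackOff
  }
  return parsed;
}

async function runLoop(tick: () => Promise<void>): Promise<never> {
  const pollIntervalMs = requireEnvInt("OUTBOX_POLL_INTERVAL_MS");
  let consecutiveFailures = 0;
  for (;;) {
    try {
      await tick();
      consecutiveFailures = 0;
    } catch (err) {
      consecutiveFailures += 1;
      if (consecutiveFailures >= MAX_CONSECUTIVE_FAILURES) {
        console.error("sustained errors, exiting for a clean restart", err);
        process.exit(3); // let Kubernetes restart the pod
      }
    }
    await new Promise((resolve) => setTimeout(resolve, pollIntervalMs));
  }
}
```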
Recommended Prometheus alert for a growing backlog:

```yaml
- alert: OutboxBacklog
  expr: |
    pg_stat_activity_count{state="pending"} > 0
    or (kaireon_outbox_pending_count > 100)
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Outbox backlog growing — publisher tier may be unhealthy"
```
(The kaireon_outbox_pending_count gauge is registered with the platform metrics registry and refreshed by the outbox processor on every poll tick. See Metrics Reference for the full PromQL alert family.)
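A sketch of how such a gauge could be wired, assuming the prom-client library; only the metric name comes from this page, the registry and count helper are placeholders:

```typescript
// Illustrative gauge wiring for kaireon_outbox_pending_count.
import { Gauge, Registry } from "prom-client";

const registry = new Registry(); // stand-in for the platform metrics registry

const outboxPendingCount = new Gauge({
  name: "kaireon_outbox_pending_count",
  help: "Number of outbox_events rows currently in status 'pending'",
  registers: [registry],
});

// Called by the outbox processor on every poll tick.
async function refreshPendingGauge(countPending: () => Promise<number>) {
  outboxPendingCount.set(await countPending());
}
```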

Outbox reaper cron — /api/v1/cron/outbox-reaper

A dedicated cron job sweeps outbox_events and resets any row stuck in processing whose updatedAt is older than the configured staleness threshold back to pending. This closes the failure mode where a worker dies between claiming a row (UPDATE → processing) and either publishing it or marking it failed — without the reaper those rows would sit in processing forever and never be re-attempted. The cron route invokes the outbox processor’s stuck-row reaper, which performs a single bulk SQL UPDATE driven by the configured staleness threshold. The operation is idempotent — re-running it on already-pending rows is a no-op.
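A sketch of the bulk reset, assuming a node-postgres client and the column names used elsewhere on this page; the actual implementation may differ:

```typescript
// Illustrative reaper query: one bulk UPDATE resets stale 'processing' rows
// back to 'pending'. Re-running it on already-pending rows matches nothing,
// which is why the operation is idempotent.
import { Pool } from "pg";

const pool = new Pool();

async function reapStuckOutboxRows(stalenessSeconds: number): Promise<number> {
  const result = await pool.query(
    `UPDATE outbox_events
        SET status = 'pending', updated_at = now()
      WHERE status = 'processing'
        AND updated_at < now() - make_interval(secs => $1)`,
    [stalenessSeconds]
  );
  return result.rowCount ?? 0; // number of rows reset
}
```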

Helm wiring

```yaml
cron:
  schedules:
    outboxReaper:
      enabled: true
      schedule: "*/2 * * * *"
      path: "/api/v1/cron/outbox-reaper"
```
Wired by default in helm/values.yaml. Cadence of 1–5 minutes is fine because the operation is idempotent.

Auth

The cron route fails closed when CRON_SECRET is unset (route.ts:25-32). Authenticated callers present the secret via either an Authorization: Bearer <secret> header or the x-cron-secret header. Mismatched values return 401.
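A sketch of calling the route with the shared secret; the hostname and HTTP method here are assumptions:

```typescript
// Illustrative caller for the reaper cron route. Either header shown below
// is accepted; the base URL is a placeholder for your deployment.
async function triggerOutboxReaper(): Promise<void> {
  const res = await fetch("https://kaireon.example.com/api/v1/cron/outbox-reaper", {
    headers: {
      // Authorization: `Bearer ${process.env.CRON_SECRET}`,
      "x-cron-secret": process.env.CRON_SECRET ?? "",
    },
  });

  if (res.status === 401) {
    throw new Error("CRON_SECRET mismatch, or unset on the server");
  }
  console.log(await res.json()); // see Response below
}
```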

Response

```json
{
  "status": "ok",
  "stalenessSeconds": 300,
  "resetCount": 2,
  "durationMs": 14,
  "timestamp": "2026-04-30T14:20:00.000Z"
}
```

Configuration

| Variable | Default | Effect |
| --- | --- | --- |
| OUTBOX_REAPER_STALENESS_SECONDS | 300 (5 min) | Rows in processing whose updatedAt is older than this are reset to pending. Invalid or non-positive values fall back to the default with a warning (route.ts:78-82). |
| CRON_SECRET | unset → 401 | Shared secret for the cron route. Required. |
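A sketch of the documented fallback behaviour; the helper name is illustrative and route.ts may differ:

```typescript
// Illustrative parsing of OUTBOX_REAPER_STALENESS_SECONDS with the documented
// fallback: invalid or non-positive values revert to the default with a warning.
const DEFAULT_STALENESS_SECONDS = 300;

function resolveStalenessSeconds(): number {
  const raw = process.env.OUTBOX_REAPER_STALENESS_SECONDS;
  const parsed = Number.parseInt(raw ?? "", 10);
  if (!Number.isFinite(parsed) || parsed <= 0) {
    console.warn(
      `OUTBOX_REAPER_STALENESS_SECONDS=${raw ?? "<unset>"} is invalid; using ${DEFAULT_STALENESS_SECONDS}`
    );
    return DEFAULT_STALENESS_SECONDS;
  }
  return parsed;
}
```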

Configuration knobs

```yaml
outboxPublisher:
  enabled: true
  replicas: 1
  pollIntervalMs: 2000               # tick cadence on idle
  shutdownDrainTimeoutMs: 15000      # SIGTERM drain budget
  livenessFile: "/tmp/outbox-publisher.alive"
  livenessProbe:
    enabled: true
    initialDelaySeconds: 15
    periodSeconds: 30
    maxStaleSeconds: 90
    failureThreshold: 3
  pdb:
    enabled: true
    minAvailable: 1
```
The publisher pod reads OUTBOX_LIVENESS_FILE (default /tmp/outbox-publisher.alive). The liveness probe checks this file’s age — when the publisher loop stops touching it, the probe fails and Kubernetes restarts the pod.
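A sketch of the heartbeat side of this contract; the function name is illustrative:

```typescript
// Illustrative liveness heartbeat for the publisher loop, using the
// OUTBOX_LIVENESS_FILE env var documented above.
import { writeFileSync } from "node:fs";

const livenessFile =
  process.env.OUTBOX_LIVENESS_FILE ?? "/tmp/outbox-publisher.alive";

// Called at the end of every poll tick. If the loop stops running, the
// file's mtime goes stale, the age check fails, and Kubernetes restarts
// the pod.
function touchLivenessFile(): void {
  writeFileSync(livenessFile, String(Date.now()));
}
```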

Honest known gaps

  1. Structured error IDs shipped on the worker tick path, but not yet on every helper. The publisher’s main poll loop mints a per-tick errorId and threads it into the next-attempt log line so SIEM tooling can correlate retries (see the sketch after this list). Other in-tier helpers (shutdown drain, reaper companion) still emit bare structured logs and are tracked as a residual for migration. SIEM correlation works for the main loop today.
  2. kaireon_outbox_pending_count gauge shipped (W10 wave). It is registered with the platform metrics registry and refreshed on every poll tick, so the recommended Prometheus alert above can be wired today. outboxProcessedTotal + outboxEventAge from W8.3 still cover throughput and freshness; this gauge closes the backlog-visibility gap. See Metrics Reference for the full alert PromQL.
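A sketch of the per-tick errorId threading described in gap 1; the logger shape and field names are assumptions, not the actual helper:

```typescript
// Illustrative per-tick errorId: the same identifier appears on the failure
// line and the next-attempt line so SIEM tooling can correlate retries.
import { randomUUID } from "node:crypto";

type StructuredLog = Record<string, unknown>;

async function tick(
  publishPending: () => Promise<void>,
  log: (entry: StructuredLog) => void
): Promise<void> {
  const errorId = randomUUID(); // minted once per tick
  try {
    await publishPending();
  } catch (err) {
    log({ level: "error", msg: "outbox-publisher tick failed", errorId, err: String(err) });
    log({ level: "info", msg: "scheduling next attempt", errorId });
  }
}
```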