Why a separate tier

The outbox table guarantees at-least-once event delivery: events written inside a transaction (e.g., interaction.recorded.v1, outcome.recorded) are durable even when the configured EventPublisher backend is down or slow. Without a dedicated publisher tier, the worker pods that run BullMQ own this publish loop alongside long-running batch jobs, and a backed-up batch can starve it. Splitting the tiers keeps publish tail latency independent of batch contention.
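For orientation, the loop this tier owns follows the standard outbox shape: claim pending rows, hand each to the EventPublisher, and record the outcome. A minimal sketch, assuming Prisma model and field names (outboxEvent, status, eventType, payload, createdAt) and a simplified EventPublisher interface that may not match the real code:

import { PrismaClient } from "@prisma/client";

// Assumed shape; the real EventPublisher backend abstraction may differ.
interface EventPublisher {
  publish(eventType: string, payload: unknown): Promise<void>;
}

const prisma = new PrismaClient();

// One tick of the publish loop: claim pending rows, publish, mark the outcome.
// Simplified: the real tier also distinguishes transient from permanent errors.
async function publishPendingBatch(publisher: EventPublisher, batchSize = 50) {
  const rows = await prisma.outboxEvent.findMany({
    where: { status: "pending" },
    orderBy: { createdAt: "asc" },
    take: batchSize,
  });

  for (const row of rows) {
    // Claim the row so a concurrent publisher replica skips it.
    await prisma.outboxEvent.update({
      where: { id: row.id },
      data: { status: "processing" },
    });
    try {
      await publisher.publish(row.eventType, row.payload);
      await prisma.outboxEvent.update({
        where: { id: row.id },
        data: { status: "sent" },
      });
    } catch {
      // Return the row for a later retry attempt.
      await prisma.outboxEvent.update({
        where: { id: row.id },
        data: { status: "pending" },
      });
    }
  }
}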

What changed in the respond hot path

/api/v1/respond/route.ts no longer calls publishInteractionRecorded synchronously. The interaction.recorded.v1 event is now enqueued inside the same prisma.$transaction that writes the InteractionHistory row. Behavior-change note: a failed outbox row insert now rolls back the interaction row. This is a correctness improvement over the prior fail-open path — the system no longer claims outcomes whose downstream events it cannot persist. The cost is that pathological insert failures (JSON too large, constraint violation, mid-transaction connection drop) surface as 500s to the caller instead of being silently dropped. Operators investigating “respond returned 500” should check for outbox_events insert errors first.
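A minimal sketch of the transactional enqueue, assuming hypothetical Prisma model names (interactionHistory, outboxEvent) and fields that may not match the real schema:

import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

async function recordInteraction(input: { userId: string; content: string }) {
  return prisma.$transaction(async (tx) => {
    // Write the interaction row.
    const interaction = await tx.interactionHistory.create({
      data: { userId: input.userId, content: input.content },
    });

    // Enqueue the event in the same transaction: if this insert fails,
    // the interaction row above rolls back too (the fail-closed behavior
    // described in this section).
    await tx.outboxEvent.create({
      data: {
        eventType: "interaction.recorded.v1",
        payload: { interactionId: interaction.id, userId: input.userId },
        status: "pending",
      },
    });

    return interaction;
  });
}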

Operator visibility — what surfaces when things go wrong

| Failure mode | Signal | Operator action |
| --- | --- | --- |
| Publisher pod down | Pod restart count + pending-events backlog (see metric below) | kubectl describe pod kaireon-outbox-publisher-* |
| Loop hangs | Liveness probe fails → k8s restarts | kubectl logs --previous |
| EventPublisher backend down | outbox-publisher tick failed ERROR logs with isTransient: true | Check Kafka/Redpanda health |
| Bad env config | Process exits with code 2 → CrashLoopBackoff | Fix OUTBOX_POLL_INTERVAL_MS etc. |
| Stuck processing rows | outbox_events.status='processing' AND now() - updatedAt > OUTBOX_REAPER_STALENESS_SECONDS | Auto-handled by the outboxReaper cron — see below |
| Sustained non-transient errors | Process exits with code 3 after 30 consecutive failures | Investigate root cause; pod will CrashLoopBackoff |
A recommended Prometheus alert on the backlog gauge:

- alert: OutboxBacklog
  expr: kaireon_outbox_pending_count > 100
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Outbox backlog growing — publisher tier may be unhealthy"
(The kaireon_outbox_pending_count gauge is registered at lib/metrics.ts:191 and refreshed by refreshOutboxPendingGauge() in lib/outbox-processor.ts on every poll tick. See Metrics Reference for the full PromQL alert family.)

Outbox reaper cron — /api/v1/cron/outbox-reaper

A dedicated cron job sweeps outbox_events and resets back to pending any row stuck in processing whose updatedAt is older than the configured staleness threshold. This closes the failure mode where a worker dies between claiming a row (UPDATE → processing) and either publishing it or marking it failed — without the reaper those rows would sit in processing forever and never be re-attempted. The route at src/app/api/v1/cron/outbox-reaper/route.ts:24-71 calls reapStuckProcessing(stalenessSeconds) from src/lib/outbox-processor.ts:182. The underlying SQL is a single bulk UPDATE, so re-running it over already-pending rows is a no-op.
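As an illustration, the bulk reset can be pictured as follows (a sketch assuming a Prisma outboxEvent model; the real reapStuckProcessing in outbox-processor.ts may differ):

import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// One bulk UPDATE; idempotent because rows already in 'pending' never match.
async function reapStuckProcessing(stalenessSeconds: number): Promise<number> {
  const cutoff = new Date(Date.now() - stalenessSeconds * 1000);
  const result = await prisma.outboxEvent.updateMany({
    where: { status: "processing", updatedAt: { lt: cutoff } },
    data: { status: "pending" },
  });
  return result.count; // surfaced as resetCount in the cron response
}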

Helm wiring

cron:
  schedules:
    outboxReaper:
      enabled: true
      schedule: "*/2 * * * *"
      path: "/api/v1/cron/outbox-reaper"
This is wired by default in helm/values.yaml. A cadence of 1–5 minutes is fine because the operation is idempotent.

Auth

The cron route fails closed when CRON_SECRET is unset (route.ts:25-32). Authenticated callers present the secret via either Authorization: Bearer <secret> or the x-cron-secret header. Mismatched values return 401.
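For illustration, the check amounts to something like the following (a sketch; the actual route.ts may structure it differently):

import { NextRequest, NextResponse } from "next/server";

// Sketch of the fail-closed secret check described above.
function authorizeCron(req: NextRequest): NextResponse | null {
  const secret = process.env.CRON_SECRET;
  if (!secret) {
    // Fail closed: with no secret configured, every caller gets 401.
    return NextResponse.json({ error: "unauthorized" }, { status: 401 });
  }
  const bearer = req.headers.get("authorization")?.replace(/^Bearer\s+/i, "");
  const headerSecret = req.headers.get("x-cron-secret");
  if (bearer !== secret && headerSecret !== secret) {
    return NextResponse.json({ error: "unauthorized" }, { status: 401 });
  }
  return null; // authorized; the handler proceeds with the reap
}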

Response

{
  "status": "ok",
  "stalenessSeconds": 300,
  "resetCount": 2,
  "durationMs": 14,
  "timestamp": "2026-04-30T14:20:00.000Z"
}

Configuration

| Variable | Default | Effect |
| --- | --- | --- |
| OUTBOX_REAPER_STALENESS_SECONDS | 300 (5 min) | Rows in processing whose updatedAt is older than this are reset to pending. Invalid or non-positive values fall back to the default with a warning (route.ts:78-82). |
| CRON_SECRET | unset → 401 | Shared secret for the cron route. Required. |
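The documented fallback for OUTBOX_REAPER_STALENESS_SECONDS can be pictured as (a sketch, not the actual route.ts code):

const DEFAULT_STALENESS_SECONDS = 300;

// Invalid or non-positive values revert to the default with a warning.
function resolveStalenessSeconds(
  raw = process.env.OUTBOX_REAPER_STALENESS_SECONDS,
): number {
  if (raw === undefined || raw === "") return DEFAULT_STALENESS_SECONDS;
  const parsed = Number(raw);
  if (!Number.isFinite(parsed) || parsed <= 0) {
    console.warn(
      `Invalid OUTBOX_REAPER_STALENESS_SECONDS="${raw}"; using ${DEFAULT_STALENESS_SECONDS}`,
    );
    return DEFAULT_STALENESS_SECONDS;
  }
  return parsed;
}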

Configuration knobs

outboxPublisher:
  enabled: true
  replicas: 1
  pollIntervalMs: 2000               # tick cadence on idle
  shutdownDrainTimeoutMs: 15000      # SIGTERM drain budget
  livenessFile: "/tmp/outbox-publisher.alive"
  livenessProbe:
    enabled: true
    initialDelaySeconds: 15
    periodSeconds: 30
    maxStaleSeconds: 90
    failureThreshold: 3
  pdb:
    enabled: true
    minAvailable: 1
The publisher pod reads OUTBOX_LIVENESS_FILE (default /tmp/outbox-publisher.alive, set in src/worker/outbox-publisher.ts:58). The liveness probe checks this file’s age — when the publisher loop stops touching it, the probe fails and Kubernetes restarts the pod.
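A sketch of the liveness contract, assuming the loop touches the file after each successful tick (details of the real worker differ):

import { utimes, writeFile } from "node:fs/promises";

const LIVENESS_FILE =
  process.env.OUTBOX_LIVENESS_FILE ?? "/tmp/outbox-publisher.alive";
const POLL_INTERVAL_MS = Number(process.env.OUTBOX_POLL_INTERVAL_MS ?? 2000);

// Refresh the file's mtime; the probe fails once the age exceeds maxStaleSeconds.
async function touchLiveness(): Promise<void> {
  const now = new Date();
  try {
    await utimes(LIVENESS_FILE, now, now);
  } catch {
    await writeFile(LIVENESS_FILE, ""); // first tick: file does not exist yet
  }
}

// Simplified loop: do one publish tick, then prove liveness, then sleep.
async function runPublisherLoop(tick: () => Promise<void>): Promise<void> {
  for (;;) {
    await tick();
    await touchLiveness();
    await new Promise((resolve) => setTimeout(resolve, POLL_INTERVAL_MS));
  }
}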

Honest known gaps

  1. logError + errorId shipped on the worker tick path; not yet on every helper. src/worker/outbox-publisher.ts:92 mints a per-tick errorId via logError (src/lib/api-error.ts:118) and threads it into the next-attempt log line. Other in-tier helpers (shutdown drain, outboxReaper companion) still emit bare getLogger().error() and are tracked as residual #18 in .planning/RESIDUALS_2026-04-29.md for migration. SIEM correlation works for the main loop today.
  2. kaireon_outbox_pending_count gauge shipped (W10 wave). Registered at lib/metrics.ts:191 and refreshed every poll tick by refreshOutboxPendingGauge() in lib/outbox-processor.ts. The recommended Prometheus alert above can be wired today. outboxProcessedTotal + outboxEventAge from W8.3 still cover throughput + freshness; this gauge closes the backlog visibility gap. See Metrics Reference for full alert PromQL.