Why a separate tier

The outbox table guarantees at-least-once event delivery: events written inside a transaction (e.g., interaction.recorded.v1, outcome.recorded) are durable even when the configured EventPublisher backend is down or slow. Without a dedicated publisher tier, the worker pods that run BullMQ own this publish loop alongside long-running batch jobs, and a backed-up batch can starve it. Splitting the tiers keeps publish tail latency independent of batch contention.
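For orientation, the loop this tier owns follows the standard outbox shape: claim pending rows, hand each to the EventPublisher, and record the outcome. A minimal sketch, assuming Prisma model and field names (outboxEvent, status, eventType, payload, createdAt) and a simplified EventPublisher interface that may not match the real code:

import { PrismaClient } from "@prisma/client";

// Assumed shape; the real EventPublisher backend abstraction may differ.
interface EventPublisher {
  publish(eventType: string, payload: unknown): Promise<void>;
}

const prisma = new PrismaClient();

// One tick of the publish loop: claim pending rows, publish, mark the outcome.
// Simplified: the real tier also distinguishes transient from permanent errors.
async function publishPendingBatch(publisher: EventPublisher, batchSize = 50) {
  const rows = await prisma.outboxEvent.findMany({
    where: { status: "pending" },
    orderBy: { createdAt: "asc" },
    take: batchSize,
  });

  for (const row of rows) {
    // Claim the row so a concurrent publisher replica skips it.
    await prisma.outboxEvent.update({
      where: { id: row.id },
      data: { status: "processing" },
    });
    try {
      await publisher.publish(row.eventType, row.payload);
      await prisma.outboxEvent.update({
        where: { id: row.id },
        data: { status: "sent" },
      });
    } catch {
      // Return the row for a later retry attempt.
      await prisma.outboxEvent.update({
        where: { id: row.id },
        data: { status: "pending" },
      });
    }
  }
}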

What changed in the respond hot path

/api/v1/respond/route.ts no longer calls publishInteractionRecorded synchronously. The interaction.recorded.v1 event is now enqueued inside the same prisma.$transaction that writes the InteractionHistory row. Behavior-change note: a failed outbox row insert now rolls back the interaction row. This is a correctness improvement over the prior fail-open path — the system no longer claims outcomes whose downstream events it cannot persist. The cost is that pathological insert failures (JSON too large, constraint violation, mid-transaction connection drop) surface as 500s to the caller instead of being silently dropped. Operators investigating “respond returned 500” should check for outbox_events insert errors first.
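A minimal sketch of the transactional enqueue, assuming hypothetical Prisma model names (interactionHistory, outboxEvent) and fields that may not match the real schema:

import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

async function recordInteraction(input: { userId: string; content: string }) {
  return prisma.$transaction(async (tx) => {
    // Write the interaction row.
    const interaction = await tx.interactionHistory.create({
      data: { userId: input.userId, content: input.content },
    });

    // Enqueue the event in the same transaction: if this insert fails,
    // the interaction row above rolls back too (the fail-closed behavior
    // described in this section).
    await tx.outboxEvent.create({
      data: {
        eventType: "interaction.recorded.v1",
        payload: { interactionId: interaction.id, userId: input.userId },
        status: "pending",
      },
    });

    return interaction;
  });
}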

Operator visibility — what surfaces when things go wrong

| Failure mode | Signal | Operator action |
| --- | --- | --- |
| Publisher pod down | Pod restart count + pending-events backlog (see metric below) | kubectl describe pod kaireon-outbox-publisher-* |
| Loop hangs | Liveness probe fails → k8s restarts | kubectl logs --previous |
| EventPublisher backend down | outbox-publisher tick failed ERROR logs with isTransient: true | Check Kafka/Redpanda health |
| Bad env config | Process exits with code 2 → CrashLoopBackoff | Fix OUTBOX_POLL_INTERVAL_MS etc. |
| Stuck processing rows | outbox_events.status='processing' AND now() - updatedAt > OUTBOX_REAPER_STALENESS_SECONDS | Auto-handled by the outboxReaper cron — see below |
| Sustained non-transient errors | Process exits with code 3 after 30 consecutive failures | Investigate root cause; pod will CrashLoopBackoff |
A recommended Prometheus alert on the backlog gauge:

- alert: OutboxBacklog
  expr: kaireon_outbox_pending_count > 100
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Outbox backlog growing — publisher tier may be unhealthy"
(The kaireon_outbox_pending_count gauge is registered at lib/metrics.ts:191 and refreshed by refreshOutboxPendingGauge() in lib/outbox-processor.ts on every poll tick. See Metrics Reference for the full PromQL alert family.)

Outbox reaper cron — /api/v1/cron/outbox-reaper

A dedicated cron job sweeps outbox_events and resets back to pending any row stuck in processing whose updatedAt is older than the configured staleness threshold. This closes the failure mode where a worker dies between claiming a row (UPDATE → processing) and either publishing it or marking it failed — without the reaper those rows would sit in processing forever and never be re-attempted. The route at src/app/api/v1/cron/outbox-reaper/route.ts:24-71 calls reapStuckProcessing(stalenessSeconds) from src/lib/outbox-processor.ts:182. The underlying SQL is a single bulk UPDATE, so re-running it over already-pending rows is a no-op.
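As an illustration, the bulk reset can be pictured as follows (a sketch assuming a Prisma outboxEvent model; the real reapStuckProcessing in outbox-processor.ts may differ):

import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// One bulk UPDATE; idempotent because rows already in 'pending' never match.
async function reapStuckProcessing(stalenessSeconds: number): Promise<number> {
  const cutoff = new Date(Date.now() - stalenessSeconds * 1000);
  const result = await prisma.outboxEvent.updateMany({
    where: { status: "processing", updatedAt: { lt: cutoff } },
    data: { status: "pending" },
  });
  return result.count; // surfaced as resetCount in the cron response
}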

Helm wiring

cron:
  schedules:
    outboxReaper:
      enabled: true
      schedule: "*/2 * * * *"
      path: "/api/v1/cron/outbox-reaper"
This is wired by default in helm/values.yaml. A cadence of 1–5 minutes is fine because the operation is idempotent.

Auth

The cron route fails closed when CRON_SECRET is unset (route.ts:25-32). Authenticated callers present the secret via either Authorization: Bearer <secret> or the x-cron-secret header. Mismatched values return 401.
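For illustration, the check amounts to something like the following (a sketch; the actual route.ts may structure it differently):

import { NextRequest, NextResponse } from "next/server";

// Sketch of the fail-closed secret check described above.
function authorizeCron(req: NextRequest): NextResponse | null {
  const secret = process.env.CRON_SECRET;
  if (!secret) {
    // Fail closed: with no secret configured, every caller gets 401.
    return NextResponse.json({ error: "unauthorized" }, { status: 401 });
  }
  const bearer = req.headers.get("authorization")?.replace(/^Bearer\s+/i, "");
  const headerSecret = req.headers.get("x-cron-secret");
  if (bearer !== secret && headerSecret !== secret) {
    return NextResponse.json({ error: "unauthorized" }, { status: 401 });
  }
  return null; // authorized; the handler proceeds with the reap
}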

Response

{
  "status": "ok",
  "stalenessSeconds": 300,
  "resetCount": 2,
  "durationMs": 14,
  "timestamp": "2026-04-30T14:20:00.000Z"
}

Configuration

| Variable | Default | Effect |
| --- | --- | --- |
| OUTBOX_REAPER_STALENESS_SECONDS | 300 (5 min) | Rows in processing whose updatedAt is older than this are reset to pending. Invalid or non-positive values fall back to the default with a warning (route.ts:78-82). |
| CRON_SECRET | unset → 401 | Shared secret for the cron route. Required. |
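The documented fallback for OUTBOX_REAPER_STALENESS_SECONDS can be pictured as (a sketch, not the actual route.ts code):

const DEFAULT_STALENESS_SECONDS = 300;

// Invalid or non-positive values revert to the default with a warning.
function resolveStalenessSeconds(
  raw = process.env.OUTBOX_REAPER_STALENESS_SECONDS,
): number {
  if (raw === undefined || raw === "") return DEFAULT_STALENESS_SECONDS;
  const parsed = Number(raw);
  if (!Number.isFinite(parsed) || parsed <= 0) {
    console.warn(
      `Invalid OUTBOX_REAPER_STALENESS_SECONDS="${raw}"; using ${DEFAULT_STALENESS_SECONDS}`,
    );
    return DEFAULT_STALENESS_SECONDS;
  }
  return parsed;
}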

Configuration knobs

outboxPublisher:
  enabled: true
  replicas: 1
  pollIntervalMs: 2000               # tick cadence on idle
  shutdownDrainTimeoutMs: 15000      # SIGTERM drain budget
  livenessFile: "/tmp/outbox-publisher.alive"
  livenessProbe:
    enabled: true
    initialDelaySeconds: 15
    periodSeconds: 30
    maxStaleSeconds: 90
    failureThreshold: 3
  pdb:
    enabled: true
    minAvailable: 1
The publisher pod reads OUTBOX_LIVENESS_FILE (default /tmp/outbox-publisher.alive, set in src/worker/outbox-publisher.ts:58). The liveness probe checks this file’s age — when the publisher loop stops touching it, the probe fails and Kubernetes restarts the pod.
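A sketch of the liveness contract, assuming the loop touches the file after each successful tick (details of the real worker differ):

import { utimes, writeFile } from "node:fs/promises";

const LIVENESS_FILE =
  process.env.OUTBOX_LIVENESS_FILE ?? "/tmp/outbox-publisher.alive";
const POLL_INTERVAL_MS = Number(process.env.OUTBOX_POLL_INTERVAL_MS ?? 2000);

// Refresh the file's mtime; the probe fails once the age exceeds maxStaleSeconds.
async function touchLiveness(): Promise<void> {
  const now = new Date();
  try {
    await utimes(LIVENESS_FILE, now, now);
  } catch {
    await writeFile(LIVENESS_FILE, ""); // first tick: file does not exist yet
  }
}

// Simplified loop: do one publish tick, then prove liveness, then sleep.
async function runPublisherLoop(tick: () => Promise<void>): Promise<void> {
  for (;;) {
    await tick();
    await touchLiveness();
    await new Promise((resolve) => setTimeout(resolve, POLL_INTERVAL_MS));
  }
}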

Honest known gaps

  1. logError + errorId shipped on the worker tick path; not yet on every helper. src/worker/outbox-publisher.ts:92 mints a per-tick errorId via logError (src/lib/api-error.ts:118) and threads it into the next-attempt log line. Other in-tier helpers (shutdown drain, outboxReaper companion) still emit bare getLogger().error() and are tracked as residual #18 in .planning/RESIDUALS_2026-04-29.md for migration. SIEM correlation works for the main loop today.
  2. kaireon_outbox_pending_count gauge shipped (W10 wave). Registered at lib/metrics.ts:191 and refreshed every poll tick by refreshOutboxPendingGauge() in lib/outbox-processor.ts. The recommended Prometheus alert above can be wired today. outboxProcessedTotal + outboxEventAge from W8.3 still cover throughput + freshness; this gauge closes the backlog visibility gap. See Metrics Reference for full alert PromQL.