
Why a separate tier

The outbox table guarantees at-least-once event delivery: events written inside a transaction (e.g., interaction.recorded.v1, outcome.recorded) are durable even when the configured EventPublisher backend is down or slow. Without a dedicated publisher tier, the worker pods running BullMQ would own the publish loop alongside long-running batch jobs, and a backed-up batch queue could starve it. Splitting these tiers keeps publish tail latency independent of batch contention.
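A minimal sketch of the loop the dedicated tier runs, assuming an outbox_events table with status, topic, and payload columns and a generic EventPublisher interface; the names, batch size, and claim strategy are illustrative, not the actual implementation:

```typescript
// Sketch of the publisher tier's poll loop over the outbox table.
// Table/column names and the EventPublisher interface are assumptions.
import { Pool } from "pg";

interface EventPublisher {
  publish(topic: string, payload: unknown): Promise<void>;
}

const pool = new Pool();

async function pollOnce(publisher: EventPublisher): Promise<void> {
  // Claim a batch of pending rows so concurrent publishers never double-claim.
  const { rows } = await pool.query(
    `UPDATE outbox_events
        SET status = 'processing', updated_at = now()
      WHERE id IN (SELECT id FROM outbox_events
                    WHERE status = 'pending'
                    ORDER BY created_at
                    LIMIT 100
                      FOR UPDATE SKIP LOCKED)
      RETURNING id, topic, payload`
  );

  for (const row of rows) {
    try {
      await publisher.publish(row.topic, row.payload);
      await pool.query(
        "UPDATE outbox_events SET status = 'published', updated_at = now() WHERE id = $1",
        [row.id]
      );
    } catch {
      // Put the row back so the next tick retries it; rows stranded in
      // 'processing' by a crash are handled by the reaper cron below.
      await pool.query(
        "UPDATE outbox_events SET status = 'pending', updated_at = now() WHERE id = $1",
        [row.id]
      );
    }
  }
}
```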

What changed in the respond hot path

The /api/v1/respond endpoint no longer publishes events synchronously. The interaction.recorded.v1 event is now enqueued inside the same database transaction that writes the interaction history row. Behavior-change note: an outbox row insertion failure now rolls back the interaction row. This is a correctness improvement vs the prior fail-open path — the system no longer claims outcomes whose downstream events it can’t persist. The cost is that pathological insert failures (JSON-too-large, constraint violation, mid-tx connection drop) surface as 500s to the caller instead of silent drops. Operators investigating “respond returned 500” should check outbox_events insert errors first.
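A sketch of the transactional enqueue described above, assuming a node-postgres style client; table and column names are placeholders rather than the actual schema:

```typescript
// Illustrative respond hot path: the interaction row and the outbox row are
// written in one transaction, so an outbox insert failure rolls both back.
import { Pool } from "pg";

const pool = new Pool();

async function recordInteraction(interaction: { id: string; payload: unknown }) {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");

    // 1. Write the interaction history row.
    await client.query(
      "INSERT INTO interaction_history (id, payload) VALUES ($1, $2)",
      [interaction.id, JSON.stringify(interaction.payload)]
    );

    // 2. Enqueue the event in the same transaction. If this insert fails,
    //    the catch below rolls back the interaction row as well, so the
    //    system never records an interaction whose event it cannot persist.
    await client.query(
      "INSERT INTO outbox_events (topic, payload, status) VALUES ($1, $2, 'pending')",
      ["interaction.recorded.v1", JSON.stringify(interaction.payload)]
    );

    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err; // surfaces to the caller as a 500
  } finally {
    client.release();
  }
}
```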

Operator visibility — what surfaces when things go wrong

| Failure mode | Signal | Operator action |
| --- | --- | --- |
| Publisher pod down | Pod restart count + pending events backlog (see metric below) | kubectl describe pod kaireon-outbox-publisher-* |
| Loop hangs | Liveness probe fails → k8s restarts | kubectl logs --previous |
| EventPublisher backend down | outbox-publisher tick failed ERROR logs with isTransient: true | Check Kafka/Redpanda health |
| Bad env config | Process exits with code 2 → CrashLoopBackOff | Fix the OUTBOX_POLL_INTERVAL_MS env var etc. |
| Stuck processing rows | outbox_events.status='processing' AND now() - updatedAt > OUTBOX_REAPER_STALENESS_SECONDS | Auto-handled by outboxReaper cron — see below |
| Sustained non-transient errors | Process exits with code 3 after 30 consecutive failures | Investigate root cause; pod will CrashLoopBackOff |
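The exit codes in the table come from the publisher process itself. A rough sketch of what such a guard could look like; only the exit codes and the 30-failure threshold come from the table, everything else is illustrative:

```typescript
// Illustrative failure-handling guard for the publisher loop. The real tier
// additionally classifies transient vs non-transient errors, elided here.
const MAX_CONSECUTIVE_FAILURES = 30;

function requireEnvInt(name: string): number {
  const parsed = Number.parseInt(process.env[name] ?? "", 10);
  if (!Number.isFinite(parsed) || parsed <= 0) {
    console.error(`invalid ${name}`);
    process.exit(2); // bad env config → pod enters CrashLoopBackOff
  }
  return parsed;
}

async function runLoop(tick: () => Promise<void>): Promise<never> {
  const pollIntervalMs = requireEnvInt("OUTBOX_POLL_INTERVAL_MS");
  let consecutiveFailures = 0;
  for (;;) {
    try {
      await tick();
      consecutiveFailures = 0;
    } catch (err) {
      consecutiveFailures += 1;
      if (consecutiveFailures >= MAX_CONSECUTIVE_FAILURES) {
        console.error("sustained errors, exiting for a clean restart", err);
        process.exit(3); // let Kubernetes restart the pod
      }
    }
    await new Promise((resolve) => setTimeout(resolve, pollIntervalMs));
  }
}
```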
Recommended Prometheus alert for a growing backlog:

```yaml
- alert: OutboxBacklog
  expr: |
    pg_stat_activity_count{state="pending"} > 0
    or (kaireon_outbox_pending_count > 100)
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Outbox backlog growing — publisher tier may be unhealthy"
```
(The kaireon_outbox_pending_count gauge is registered with the platform metrics registry and refreshed by the outbox processor on every poll tick. See Metrics Reference for the full PromQL alert family.)
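A sketch of how such a gauge could be wired, assuming the prom-client library; only the metric name comes from this page, the registry and count helper are placeholders:

```typescript
// Illustrative gauge wiring for kaireon_outbox_pending_count.
import { Gauge, Registry } from "prom-client";

const registry = new Registry(); // stand-in for the platform metrics registry

const outboxPendingCount = new Gauge({
  name: "kaireon_outbox_pending_count",
  help: "Number of outbox_events rows currently in status 'pending'",
  registers: [registry],
});

// Called by the outbox processor on every poll tick.
async function refreshPendingGauge(countPending: () => Promise<number>) {
  outboxPendingCount.set(await countPending());
}
```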

Outbox reaper cron — /api/v1/cron/outbox-reaper

A dedicated cron job sweeps outbox_events and resets any row stuck in processing whose updatedAt is older than the configured staleness threshold back to pending. This closes the failure mode where a worker dies between claiming a row (UPDATE → processing) and either publishing it or marking it failed — without the reaper those rows would sit in processing forever and never be re-attempted. The cron route invokes the outbox processor’s stuck-row reaper, which performs a single bulk SQL UPDATE driven by the configured staleness threshold. The operation is idempotent — re-running it on already-pending rows is a no-op.
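A sketch of the bulk reset, assuming a node-postgres client and the column names used elsewhere on this page; the actual implementation may differ:

```typescript
// Illustrative reaper query: one bulk UPDATE resets stale 'processing' rows
// back to 'pending'. Re-running it on already-pending rows matches nothing,
// which is why the operation is idempotent.
import { Pool } from "pg";

const pool = new Pool();

async function reapStuckOutboxRows(stalenessSeconds: number): Promise<number> {
  const result = await pool.query(
    `UPDATE outbox_events
        SET status = 'pending', updated_at = now()
      WHERE status = 'processing'
        AND updated_at < now() - make_interval(secs => $1)`,
    [stalenessSeconds]
  );
  return result.rowCount ?? 0; // number of rows reset
}
```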

Helm wiring

```yaml
cron:
  schedules:
    outboxReaper:
      enabled: true
      schedule: "*/2 * * * *"
      path: "/api/v1/cron/outbox-reaper"
```
Wired by default in helm/values.yaml. Cadence of 1–5 minutes is fine because the operation is idempotent.

Auth

The cron route fails closed when CRON_SECRET is unset (route.ts:25-32). Authenticated callers present the secret via either an Authorization: Bearer <secret> header or the x-cron-secret header. Mismatched values return 401.
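A sketch of calling the route with the shared secret; the hostname and HTTP method here are assumptions:

```typescript
// Illustrative caller for the reaper cron route. Either header shown below
// is accepted; the base URL is a placeholder for your deployment.
async function triggerOutboxReaper(): Promise<void> {
  const res = await fetch("https://kaireon.example.com/api/v1/cron/outbox-reaper", {
    headers: {
      // Authorization: `Bearer ${process.env.CRON_SECRET}`,
      "x-cron-secret": process.env.CRON_SECRET ?? "",
    },
  });

  if (res.status === 401) {
    throw new Error("CRON_SECRET mismatch, or unset on the server");
  }
  console.log(await res.json()); // see Response below
}
```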

Response

```json
{
  "status": "ok",
  "stalenessSeconds": 300,
  "resetCount": 2,
  "durationMs": 14,
  "timestamp": "2026-04-30T14:20:00.000Z"
}
```

Configuration

| Variable | Default | Effect |
| --- | --- | --- |
| OUTBOX_REAPER_STALENESS_SECONDS | 300 (5 min) | Rows in processing whose updatedAt is older than this are reset to pending. Invalid or non-positive values fall back to the default with a warning (route.ts:78-82). |
| CRON_SECRET | unset → 401 | Shared secret for the cron route. Required. |
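A sketch of the documented fallback behaviour; the helper name is illustrative and route.ts may differ:

```typescript
// Illustrative parsing of OUTBOX_REAPER_STALENESS_SECONDS with the documented
// fallback: invalid or non-positive values revert to the default with a warning.
const DEFAULT_STALENESS_SECONDS = 300;

function resolveStalenessSeconds(): number {
  const raw = process.env.OUTBOX_REAPER_STALENESS_SECONDS;
  const parsed = Number.parseInt(raw ?? "", 10);
  if (!Number.isFinite(parsed) || parsed <= 0) {
    console.warn(
      `OUTBOX_REAPER_STALENESS_SECONDS=${raw ?? "<unset>"} is invalid; using ${DEFAULT_STALENESS_SECONDS}`
    );
    return DEFAULT_STALENESS_SECONDS;
  }
  return parsed;
}
```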

Configuration knobs

```yaml
outboxPublisher:
  enabled: true
  replicas: 1
  pollIntervalMs: 2000               # tick cadence on idle
  shutdownDrainTimeoutMs: 15000      # SIGTERM drain budget
  livenessFile: "/tmp/outbox-publisher.alive"
  livenessProbe:
    enabled: true
    initialDelaySeconds: 15
    periodSeconds: 30
    maxStaleSeconds: 90
    failureThreshold: 3
  pdb:
    enabled: true
    minAvailable: 1
```
The publisher pod reads OUTBOX_LIVENESS_FILE (default /tmp/outbox-publisher.alive). The liveness probe checks this file’s age — when the publisher loop stops touching it, the probe fails and Kubernetes restarts the pod.
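A sketch of the heartbeat side of this contract; the function name is illustrative:

```typescript
// Illustrative liveness heartbeat for the publisher loop, using the
// OUTBOX_LIVENESS_FILE env var documented above.
import { writeFileSync } from "node:fs";

const livenessFile =
  process.env.OUTBOX_LIVENESS_FILE ?? "/tmp/outbox-publisher.alive";

// Called at the end of every poll tick. If the loop stops running, the
// file's mtime goes stale, the age check fails, and Kubernetes restarts
// the pod.
function touchLivenessFile(): void {
  writeFileSync(livenessFile, String(Date.now()));
}
```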

Honest known gaps

  1. Structured error IDs shipped on the worker tick path, but not yet on every helper. The publisher’s main poll loop mints a per-tick errorId and threads it into the next-attempt log line so SIEM tooling can correlate retries (see the sketch after this list). Other in-tier helpers (shutdown drain, reaper companion) still emit bare structured logs and are tracked as a residual for migration. SIEM correlation works for the main loop today.
  2. kaireon_outbox_pending_count gauge shipped (W10 wave). It is registered with the platform metrics registry and refreshed on every poll tick, so the recommended Prometheus alert above can be wired today. outboxProcessedTotal + outboxEventAge from W8.3 still cover throughput and freshness; this gauge closes the backlog-visibility gap. See Metrics Reference for the full alert PromQL.
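A sketch of the per-tick errorId threading described in gap 1; the logger shape and field names are assumptions, not the actual helper:

```typescript
// Illustrative per-tick errorId: the same identifier appears on the failure
// line and the next-attempt line so SIEM tooling can correlate retries.
import { randomUUID } from "node:crypto";

type StructuredLog = Record<string, unknown>;

async function tick(
  publishPending: () => Promise<void>,
  log: (entry: StructuredLog) => void
): Promise<void> {
  const errorId = randomUUID(); // minted once per tick
  try {
    await publishPending();
  } catch (err) {
    log({ level: "error", msg: "outbox-publisher tick failed", errorId, err: String(err) });
    log({ level: "info", msg: "scheduling next attempt", errorId });
  }
}
```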