System Health is the operational alerts feed for a tenant. It is separate from the bell, which is reserved for product-update content. System Health surfaces things that need an operator’s attention right now: pipeline failures, configuration tripwires, threshold trips, license limits, and pending approvals.

The widget

Top-right of every page, an Activity icon (the EKG-line glyph from Lucide) shows the current operational state:
| State | Icon | Behavior |
| --- | --- | --- |
| All clean | muted grey, no badge | hover: “System Health — all clear” |
| Has warnings | amber, count badge | click → drawer |
| Has error | red, count badge | click → drawer |
| Has critical | red, pulsing, count badge | click → drawer |
Clicking the icon opens a 320px drawer with the 20 most recent alerts. “View all” routes to /system-health for the full filterable table.
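The state table above can be read as a "highest severity wins" rule. The following is a minimal sketch of that rule, not the widget's actual code: the `widgetView` helper, its return shape, and the choice to count only warning, error, and critical alerts toward the badge are all assumptions made for illustration.

```typescript
type Severity = "info" | "success" | "warning" | "error" | "critical";
type WidgetState = "clean" | "warning" | "error" | "critical";

interface WidgetView {
  state: WidgetState;
  count: number;    // number shown on the badge; 0 means no badge
  pulsing: boolean; // only the critical state pulses
}

// Hypothetical helper: derive the widget state from the alerts currently in view.
// Assumption: info and success alerts never raise the widget out of "clean".
function widgetView(severities: Severity[]): WidgetView {
  const actionable = severities.filter(
    (s) => s === "warning" || s === "error" || s === "critical",
  );

  let state: WidgetState = "clean";
  if (actionable.includes("critical")) state = "critical";
  else if (actionable.includes("error")) state = "error";
  else if (actionable.includes("warning")) state = "warning";

  return { state, count: actionable.length, pulsing: state === "critical" };
}
```

Whether the real badge counts unread alerts only, or every alert in the window, is not stated here; the sketch simply counts whatever list it is given.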

Severity taxonomy

| Severity | Color | Auto-purge | External channel (Slack/Teams/email) |
| --- | --- | --- | --- |
| info | foreground | 90 days | no |
| success | green | 90 days | no |
| warning | amber | 90 days | no |
| error | red | 90 days | yes (when tenant has a provider configured) |
| critical | red, pulses | not auto-purged (pinned: true semantics) | yes |
The 90-day default lives in RetentionConfig keyed by dataClass: "system_health" per tenant. Set a custom value via the standard retention API.

Server emitter

Any module records an alert via recordHealthAlert:
```typescript
import { recordHealthAlert } from "@/lib/system-health/emit";

await recordHealthAlert({
  tenantId,
  userId: null,                 // null = fan-out to every user; set for a specific operator
  severity: "warning",
  source: "flow",               // "flow" | "decisioning" | "approvals" | "license" | "system" | …
  title: "Pipeline source had no files",
  message: "Source node \"src\" matched 0 files for pattern \"*.csv\".",
  link: "/data/flow-runs",      // optional in-app route; validated against an allow-list
  metadata: { pipelineId, runId, retries },
  pinned: false,                // true = exclude from auto-purge (compliance-class)
});
```
Recording is best-effort: failures are logged but never thrown, because alert recording must never break the calling code path. Side-channel routing to external notification providers fires only for error and critical severities.
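To illustrate the best-effort contract, here is a minimal sketch of an emitter that swallows failures and routes externally only for error and critical. It is not the implementation of recordHealthAlert: `recordBestEffort`, `EmitDeps`, and the injected `persist`, `dispatchExternal`, and `logError` functions are hypothetical names, and skipping the external dispatch when persistence fails is an assumption.

```typescript
type Severity = "info" | "success" | "warning" | "error" | "critical";

interface HealthAlertInput {
  tenantId: string;
  severity: Severity;
  title: string;
}

// Hypothetical dependencies, injected so the sketch stays self-contained.
interface EmitDeps {
  persist: (alert: HealthAlertInput) => Promise<void>;
  dispatchExternal: (alert: HealthAlertInput) => Promise<void>;
  logError: (err: unknown) => void;
}

// Only error and critical are routed to Slack/Teams/email.
function shouldDispatchExternally(severity: Severity): boolean {
  return severity === "error" || severity === "critical";
}

// Resolves to true when the alert was stored, false otherwise.
// Never rejects: every failure is logged and swallowed.
async function recordBestEffort(
  alert: HealthAlertInput,
  deps: EmitDeps,
): Promise<boolean> {
  let stored = false;
  try {
    await deps.persist(alert);
    stored = true;
  } catch (err) {
    deps.logError(err);
  }

  // Assumption: an alert that could not be stored is not sent externally.
  if (stored && shouldDispatchExternally(alert.severity)) {
    try {
      await deps.dispatchExternal(alert);
    } catch (err) {
      deps.logError(err);
    }
  }
  return stored;
}
```

Keeping the two try blocks separate means a failing external provider cannot make a successfully stored alert look lost.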

Read API

| Verb | Path | Notes |
| --- | --- | --- |
| GET | /api/v1/system-health | Cursor-paginated feed. Query params: ?cursor=&limit=&unreadOnly=&since=&severity=&source=. Tenant-scoped, per-user read state. |
| PATCH | /api/v1/system-health/:id/read | Mark one alert read for the current user. Idempotent. |
| POST | /api/v1/system-health/read-all | Bulk-mark every visible alert read for the current user. Optional { source: "flow" } to scope. |
| DELETE | /api/v1/system-health/:id | Dismiss for the tenant. Pinned alerts and tenant-wide alerts (userId = null) can be dismissed only by admins; non-admins must mark them read instead. |
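The dismiss rules on the DELETE route can be sketched as a single predicate. This is an illustration under assumptions, not the route's actual authorization code: `canDismiss`, `AlertLike`, and `Actor` are hypothetical names, and the sketch assumes that admins may dismiss any alert and that a non-admin may dismiss only an alert addressed to them.

```typescript
interface AlertLike {
  userId: string | null; // null = tenant-wide alert
  pinned: boolean;
}

interface Actor {
  id: string;
  isAdmin: boolean;
}

// Hypothetical predicate mirroring the DELETE rules described above.
function canDismiss(alert: AlertLike, actor: Actor): boolean {
  if (actor.isAdmin) return true;           // assumption: admins may dismiss anything
  if (alert.pinned) return false;           // pinned alerts are admin-only
  if (alert.userId === null) return false;  // tenant-wide alerts are admin-only
  return alert.userId === actor.id;         // assumption: own alerts only
}
```

A non-admin who fails this check would fall back to the PATCH mark-read route.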
Polling: the topbar widget fetches every 30s while the tab is focused; polling pauses on tab background.
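One polling tick could look like the sketch below: build the feed URL from the query parameters of the GET route, then fetch only while the tab is visible. The names `buildFeedUrl`, `pollOnce`, and `PollerEnv` are hypothetical, as is the serialization of `unreadOnly` as the string "true"; the response shape is deliberately left out because it is not described here.

```typescript
const POLL_INTERVAL_MS = 30_000; // 30s, as described above

type FeedQuery = {
  cursor?: string;
  limit?: number;
  unreadOnly?: boolean;
  since?: string;    // assumed to be a timestamp string
  severity?: string;
  source?: string;
};

// Serialize only the parameters that were actually provided.
function buildFeedUrl(query: FeedQuery = {}): string {
  const parts: string[] = [];
  for (const [key, value] of Object.entries(query)) {
    if (value !== undefined) {
      parts.push(`${encodeURIComponent(key)}=${encodeURIComponent(String(value))}`);
    }
  }
  const base = "/api/v1/system-health";
  return parts.length > 0 ? `${base}?${parts.join("&")}` : base;
}

// Hypothetical environment, injected so the tick is testable outside a browser.
interface PollerEnv {
  isTabVisible: () => boolean;
  fetchFeed: (url: string) => void;
}

// One tick: skip the request while the tab is in the background.
// Returns whether a fetch was issued.
function pollOnce(env: PollerEnv, query: FeedQuery = {}): boolean {
  if (!env.isTabVisible()) return false;
  env.fetchFeed(buildFeedUrl(query));
  return true;
}
```

In a browser this might be wired with `setInterval(() => pollOnce(env), POLL_INTERVAL_MS)`, with `isTabVisible` reading `document.visibilityState === "visible"`.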

First consumers

| Source | When | Severity |
| --- | --- | --- |
| flow | Pipeline run fails for any reason | error |
| flow | Source node’s onMissAction: alert matches 0 files | warning |
| flow | Target’s expectedRowCountDelta band tripped | warning |
More consumers (approvals waiting, license-tier soft limits, model retraining done, decision flow errors) will be wired in as their respective features ship.

Retention purge

GET /api/v1/cron/system-health-purge deletes alerts past their tenant’s RetentionConfig.dataClass="system_health" window. Pinned alerts are skipped. Auth: CRON_SECRET Bearer token (matches the rest of the cron tier).
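The purge rule reduces to a per-alert check: skip pinned alerts, delete the rest once they are older than the tenant's retention window. The sketch below shows that check only. `isPurgeable`, `StoredAlert`, and the `createdAt` field are hypothetical names, and measuring age from creation time is an assumption.

```typescript
interface StoredAlert {
  createdAt: Date; // assumed field name for the alert's creation time
  pinned: boolean;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical check applied to each alert during a purge run.
// retentionDays would come from the tenant's RetentionConfig (90 by default).
function isPurgeable(alert: StoredAlert, retentionDays: number, now: Date): boolean {
  if (alert.pinned) return false; // pinned alerts are never auto-purged
  const ageMs = now.getTime() - alert.createdAt.getTime();
  return ageMs > retentionDays * DAY_MS;
}
```

Because critical alerts follow pinned: true semantics, they would fail this check regardless of age.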

Honest residuals

  • External side-channel is currently a no-op when the tenant hasn’t installed a Slack/Teams/email provider. Wiring is duck-typed through lib/notifications/provider#dispatchExternal so providers can land later without churning the emitter.
  • No SSE / WebSocket push. 30-second polling is the v1 pattern; real-time push lands in a follow-up if latency becomes a problem.
  • No bulk dismiss UI on /system-health. Per-row mark-read / dismiss only.