Prometheus metrics are exposed with the kaireon_ prefix at GET /api/metrics. The product-facing key metrics are summarised in Dashboards. This page covers the six metrics added or upgraded for operator alerting in the W1–W10 + post-W10 wave: the ones that are not yet tabled on the Dashboards page and need explicit alert wiring before production traffic ramps.
What it does
Each metric below is registered in lib/metrics.ts and emitted from a single owner module. The “Alert on” note under each metric states the failure mode the metric surfaces; the PromQL expression is a sane default for a Prometheus alert rule, not an aspirational SLO.
Reference
kaireon_connector_test_total
| Field | Value |
|---|---|
| Type | Counter |
| Labels | connectorType, status |
| Registered at | lib/metrics.ts:62 |
| Emitter | app/api/v1/connectors/test/route.ts (per test attempt) |
Counts connector test attempts by connector type (aws_s3, snowflake, kafka, etc.) and outcome (success / failure). Pairs with the existing kaireon_connector_test_latency_ms histogram: that one tells you how long tests took, this one tells you whether they passed.
Alert on: sustained failure ratio per connector type. A single failure during a credential rotation is normal; a 5-minute window where every test fails for one type usually means the connector is down or credentials expired.
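The failure-ratio condition described above could be expressed roughly as follows. This is a minimal sketch, not the shipped alert rule: the helper name connectorFailureRatioExpr, the 5m window, and the threshold of 1 are assumptions chosen to match the "every test fails for one type" case, and the status value "failure" is taken from the outcome values listed above.

```typescript
// Hypothetical helper: builds a PromQL expression for the per-connector-type
// failure ratio of kaireon_connector_test_total. Window and threshold are
// illustrative assumptions, not values taken from the product.
function connectorFailureRatioExpr(window: string, threshold: number): string {
  const metric = "kaireon_connector_test_total";
  const failures = `sum by (connectorType) (rate(${metric}{status="failure"}[${window}]))`;
  const total = `sum by (connectorType) (rate(${metric}[${window}]))`;
  // Both sides carry only the connectorType label, so the division matches
  // series one-to-one per connector type.
  return `(${failures} / ${total}) >= ${threshold}`;
}

// A ratio of 1 over 5 minutes means every test for that type failed.
const connectorAlertExpr = connectorFailureRatioExpr("5m", 1);
console.log(connectorAlertExpr);
```

Pairing the expression with a `for:` duration in the alert rule keeps a single failed test during a credential rotation from paging anyone.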
kaireon_qualification_unknown_rule_total
| Field | Value |
|---|---|
| Type | Counter |
| Labels | ruleType |
| Registered at | lib/metrics.ts:127 |
| Emitter | lib/qualification engine, on rule-type miss in the classifier |
Incremented when the qualification engine encounters a QualificationRule.ruleType value its classifier does not know how to handle. This is a drift signal: the database rule type does not match the code dispatch table, usually because a migration shipped a new rule type without the matching engine update, or a tenant edited the database directly.
Alert on: any non-zero increment. The healthy rate is exactly zero. Alerting on the first sample protects against silent drops (the engine returns the rule as “did not pass” on unknown types — see decisioning-gates.mdx).
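The drift scenario can be illustrated with a small dispatch-table sketch. Everything here is hypothetical (the handler map, the rule type names min_age and geo_fence, the in-memory counter, and evaluateRule); the real engine in lib/qualification may be structured differently. It only shows why an unknown rule type both increments the counter and evaluates as "did not pass".

```typescript
// Hypothetical shapes; not the real QualificationRule model.
type Rule = { ruleType: string; params: Record<string, unknown> };
type Handler = (rule: Rule) => boolean;

// In-memory stand-in for kaireon_qualification_unknown_rule_total,
// keyed by the ruleType label.
const unknownRuleTotal = new Map<string, number>();

// Dispatch table: one handler per rule type the code knows about.
const handlers = new Map<string, Handler>([
  ["min_age", (rule) => Number(rule.params["age"]) >= Number(rule.params["min"])],
]);

function evaluateRule(rule: Rule): boolean {
  const handler = handlers.get(rule.ruleType);
  if (handler === undefined) {
    // Rule type exists in the data but not in the code: count it and fail closed.
    unknownRuleTotal.set(rule.ruleType, (unknownRuleTotal.get(rule.ruleType) ?? 0) + 1);
    return false;
  }
  return handler(rule);
}

console.log(evaluateRule({ ruleType: "min_age", params: { age: 30, min: 18 } }));
console.log(evaluateRule({ ruleType: "geo_fence", params: {} }));
console.log(unknownRuleTotal.get("geo_fence"));
```

Because the unknown type fails closed, nothing errors at request time; the counter is the only visible trace, which is why a first-sample alert is worth the noise.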
kaireon_identity_resolution_total
| Field | Value |
|---|---|
| Type | Counter |
| Labels | result |
| Registered at | lib/metrics.ts:204 |
| Emitter | identity-graph resolution pipeline |
Counts identity resolution outcomes by result (typical values: matched, created, ambiguous, error). The identity graph stitches anonymous sessions to known customer ids; a sustained drop in matched usually means an upstream id source stopped flowing.
Alert on: a spike in error results relative to the recent baseline, or a sudden collapse of the matched rate.
kaireon_ai_intelligence_calls_total
| Field | Value |
|---|---|
| Type | Counter |
| Labels | tool, status |
| Registered at | lib/metrics.ts:251 |
| Emitter | AI intelligence layer (explainDecision, analyzeOfferPerformance, simulateRuleChange, etc. — same set the MCP tool reference lists under Intelligence & Analytics) |
Counts AI intelligence tool calls by tool name and status (success / error). The MCP server, the in-app AI assistant, and the intelligence dashboard all share this counter.
Alert on: elevated error ratio per tool. A single tool spiking in errors usually means the underlying API contract drifted; an across-the-board error spike usually means the configured AI provider is down.
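The distinction between a single-tool spike and an across-the-board spike can be sketched as a small classifier over per-tool counts. The function name classifyErrorSpike, the 0.2 ratio threshold, and the sample numbers are assumptions for illustration; only the tool names come from the emitter list above.

```typescript
type ToolCounts = { success: number; error: number };
type SpikeKind = "healthy" | "tool-specific" | "across-the-board";

// Hypothetical triage helper: flags tools whose error ratio exceeds
// maxErrorRatio and reports whether the problem is isolated or global.
function classifyErrorSpike(
  perTool: Record<string, ToolCounts>,
  maxErrorRatio: number,
): { kind: SpikeKind; failingTools: string[] } {
  // Ignore tools with no traffic in the window; their ratio is undefined.
  const active = Object.entries(perTool).filter(([, c]) => c.success + c.error > 0);
  const failingTools = active
    .filter(([, c]) => c.error / (c.success + c.error) > maxErrorRatio)
    .map(([tool]) => tool);
  if (failingTools.length === 0) return { kind: "healthy", failingTools };
  const everyToolFailing = active.length > 1 && failingTools.length === active.length;
  return { kind: everyToolFailing ? "across-the-board" : "tool-specific", failingTools };
}

// Made-up counts for one evaluation window.
const windowCounts: Record<string, ToolCounts> = {
  explainDecision: { success: 95, error: 5 },
  analyzeOfferPerformance: { success: 40, error: 60 },
  simulateRuleChange: { success: 50, error: 0 },
};
console.log(classifyErrorSpike(windowCounts, 0.2));
```

In Prometheus terms, the same split corresponds to one alert grouped by the tool label and a second alert on the ratio summed across all tools.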
kaireon_ai_intelligence_duration_seconds
| Field | Value |
|---|---|
| Type | Histogram |
| Labels | tool |
| Registered at | lib/metrics.ts:257 |
| Buckets | 0.1, 0.5, 1, 2, 5, 10 seconds |
Pairs with kaireon_ai_intelligence_calls_total: the calls counter tells you “are tools being invoked”; this histogram tells you “how long they take.”
Alert on: P95 above the longest bucket boundary that is acceptable for the call site. The default bucket layout assumes most tools complete inside 2 s; persistent P95 above 5 s usually means the AI provider degraded or a tool is fanning out into an N+1 fetch.
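To make the P95 reasoning concrete, the sketch below estimates a quantile from cumulative bucket counts by linear interpolation, which is the same idea Prometheus's histogram_quantile() applies. The function names and the sample counts are made up for illustration; only the bucket boundaries come from the table above.

```typescript
// Cumulative bucket as exposed by a Prometheus histogram: `count` is the
// number of observations <= `le`. Use Infinity for the +Inf bucket.
type Bucket = { le: number; count: number };

// Estimates a quantile (0 < q <= 1) by interpolating linearly inside the
// bucket that contains the target rank.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const last = sorted[sorted.length - 1];
  if (last === undefined || last.count === 0) return NaN;
  const rank = q * last.count;
  let prevLe = 0;
  let prevCount = 0;
  for (const b of sorted) {
    if (b.count >= rank) {
      // A rank inside the +Inf bucket cannot be interpolated; report the
      // highest finite boundary instead.
      if (b.le === Infinity) return prevLe;
      return prevLe + (b.le - prevLe) * ((rank - prevCount) / (b.count - prevCount));
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return NaN;
}

// Observations slower than `le`: total count minus the cumulative bucket at `le`.
function countAbove(le: number, buckets: Bucket[]): number {
  const total = Math.max(...buckets.map((b) => b.count));
  const atBound = buckets.find((b) => b.le === le);
  return atBound === undefined ? NaN : total - atBound.count;
}

// Made-up sample using the documented boundaries (seconds).
const sampleBuckets: Bucket[] = [
  { le: 0.1, count: 10 }, { le: 0.5, count: 40 }, { le: 1, count: 70 },
  { le: 2, count: 90 }, { le: 5, count: 96 }, { le: 10, count: 99 },
  { le: Infinity, count: 100 },
];
const p95 = estimateQuantile(0.95, sampleBuckets);
const slowerThan10s = countAbove(10, sampleBuckets);
console.log(p95, slowerThan10s);
```

With these sample counts the 95th-percentile rank falls in the 2 s to 5 s bucket, so the estimate lands between those boundaries; the coarse bucket spacing is why P95 alerts are best set at a bucket boundary rather than an arbitrary value.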
kaireon_outbox_pending_count
| Field | Value |
|---|---|
| Type | Gauge |
| Labels | tenant |
| Registered at | lib/metrics.ts:191 |
| Refreshed by | The outbox processor’s pending-gauge refresh, called on every poll tick |
Reports the outbox_events.status='pending' row count per tenant. The outbox publisher tier polls this set; sustained backlog implies the publisher cannot keep up with insert rate (slow EventPublisher backend, downed pod, or a runaway producer). Closes the alert gap called out in Outbox publisher.
Alert on: absolute backlog, plus growth rate. A few-row spike during a deploy is normal; a sustained four-figure backlog is a real incident.
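The two conditions (absolute backlog and growth rate) can be sketched as a check over a window of gauge samples. The function name evaluateBacklog, the 1000-row floor, the 1 row/s growth limit, and the sample values are all assumptions for illustration, not product defaults.

```typescript
// One scraped gauge sample: t is seconds since the window start.
type Sample = { t: number; value: number };

// Hypothetical check: `sustained` is true when every sample in the window is
// at or above minBacklog; `growing` is true when the average slope across the
// window exceeds maxGrowthPerSec.
function evaluateBacklog(
  samples: Sample[],
  minBacklog: number,
  maxGrowthPerSec: number,
): { sustained: boolean; growing: boolean; growthPerSec: number } {
  const first = samples[0];
  const last = samples[samples.length - 1];
  if (first === undefined || last === undefined || last.t === first.t) {
    return { sustained: false, growing: false, growthPerSec: 0 };
  }
  const sustained = samples.every((s) => s.value >= minBacklog);
  const growthPerSec = (last.value - first.value) / (last.t - first.t);
  return { sustained, growing: growthPerSec > maxGrowthPerSec, growthPerSec };
}

// Made-up windows: a brief deploy blip versus a publisher that has fallen behind.
const deployBlip: Sample[] = [
  { t: 0, value: 0 }, { t: 60, value: 40 }, { t: 120, value: 5 }, { t: 180, value: 0 },
];
const incident: Sample[] = [
  { t: 0, value: 1200 }, { t: 100, value: 1700 }, { t: 200, value: 2200 }, { t: 300, value: 2700 },
];
console.log(evaluateBacklog(deployBlip, 1000, 1));
console.log(evaluateBacklog(incident, 1000, 1));
```

Requiring every sample in the window to exceed the floor plays the role of a `for:` clause in an alert rule, so the deploy blip never fires while the incident window trips both conditions.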
Configuration / Operator notes
- The /api/metrics scrape endpoint requires the admin role. The same Bearer-token auth shown in Dashboards — Scrape Configuration applies.
- All six metrics above are registered unconditionally in lib/metrics.ts; no env flag gates them. Operators only need to wire the alert rules into Prometheus and the gauge polling loop into the outbox-publisher deployment.
- The companion JSON endpoint /api/v1/metrics/summary (see Metrics Summary API) does NOT include these six metrics by default; it returns the curated subset the Operations Dashboard renders.
Honest limits
- kaireon_outbox_pending_count is sampled at the publisher tick cadence (default 2 s; see lib/outbox-processor.ts). A burst of inserts shorter than the tick interval can be missed by the gauge between samples; use the outbox_events row count directly for sub-second accounting.
- kaireon_ai_intelligence_duration_seconds buckets stop at 10 s. Tool calls that exceed 10 s land in the +Inf overflow bucket. Because bucket counts are cumulative, operators alerting on tail latency above 10 s should query _count minus the le="10" bucket rather than the histogram quantile, which cannot report values above the highest finite boundary.
- The result label on kaireon_identity_resolution_total is owner-defined; the exact label values depend on the resolver implementation. Confirm the live value set against lib/metrics.ts:204 and the resolver source before authoring tenant-specific alerts.
- These metrics live in process memory. A pod restart resets every counter and gauge, so long-term trend analysis must come from the Prometheus scrape store, not a single pod.
Related
- Dashboards — Prometheus Metrics — the curated key-metric tables the Operations Dashboard renders.
- Metrics Summary API — JSON view of the same scrape feed.
- Outbox publisher — owner of the outbox_pending_count gauge.
- Decisioning Gates — owner of the qualification_unknown_rule_total counter.
- MCP Integration — owner of ai_intelligence_* counters via the intelligence tool category.