

The platform exposes 38 Prometheus metrics under the kaireon_ prefix at GET /api/metrics. The product-facing key metrics are summarised in Dashboards. This page covers the six metrics added or upgraded for operator alerting in the W1–W10 + post-W10 wave — the ones that are not yet listed in the tables on the Dashboards page and which need explicit alert wiring before production traffic ramps.

What it does

Each metric below is registered in lib/metrics.ts and emitted from a single owner module. The “what to alert on” column states the failure mode the metric surfaces; the PromQL expression is a sane default for a Prometheus alert rule, not an aspirational SLO.
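
The alert expressions on this page drop into an ordinary Prometheus rule file. A minimal skeleton for wiring them up — the file name and group name are illustrative, not a shipped convention:

```yaml
# kaireon-operator-alerts.rules.yml — file and group names are illustrative.
groups:
  - name: kaireon-operator-alerts
    rules:
      # Each "- alert:" block from this page drops in here unchanged, e.g.:
      - alert: UnknownQualificationRuleType
        expr: increase(kaireon_qualification_unknown_rule_total[5m]) > 0
        for: 1m
        labels:
          severity: critical
```

Point Prometheus at the file via rule_files in prometheus.yml and reload.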

Reference

kaireon_connector_test_total

Field          Value
Type           Counter
Labels         connectorType, status
Registered at  lib/metrics.ts:62
Emitter        app/api/v1/connectors/test/route.ts (per test attempt)
Counts every connector connectivity test, broken down by connector type (aws_s3, snowflake, kafka, etc.) and outcome (success / failure). Pairs with the existing kaireon_connector_test_latency_ms histogram — that one tells you how long tests took, this one tells you whether they passed. Alert on: sustained failure ratio per connector type. A single failure during a credential rotation is normal; a 5-minute window where every test fails for one type usually means the connector is down or credentials expired.
- alert: ConnectorTestsFailing
  expr: |
    sum by (connectorType) (rate(kaireon_connector_test_total{status="failure"}[5m]))
      / sum by (connectorType) (rate(kaireon_connector_test_total[5m])) > 0.5
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "More than 50% of connector tests are failing for {{ $labels.connectorType }}"

kaireon_qualification_unknown_rule_total

Field          Value
Type           Counter
Labels         ruleType
Registered at  lib/metrics.ts:127
Emitter        lib/qualification engine, on rule-type miss in the classifier
Increments whenever the qualification engine encounters a QualificationRule.ruleType value its classifier does not know how to handle. This is a drift signal — the database rule type does not match the code dispatch table, usually because a migration shipped a new rule type without the matching engine update, or a tenant edited the database directly. Alert on: any non-zero increment. The healthy rate is exactly zero. Alerting on the first sample protects against silent drops (the engine returns the rule as “did not pass” on unknown types — see decisioning-gates.mdx).
- alert: UnknownQualificationRuleType
  expr: increase(kaireon_qualification_unknown_rule_total[5m]) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Unknown qualification ruleType '{{ $labels.ruleType }}' encountered — check schema/code drift"
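
The drift path is easiest to see in code. A minimal sketch of a dispatch-table miss that fails closed and counts the unknown type — the rule-type names, handler map, and plain-Map counter here are illustrative stand-ins, not the actual lib/qualification engine implementation:

```typescript
// Hypothetical sketch — not the real lib/qualification engine code.
type QualificationRule = { ruleType: string; params: Record<string, unknown> };

// Stand-in for the kaireon_qualification_unknown_rule_total counter.
const unknownRuleCount = new Map<string, number>();

// Dispatch table: one evaluator per known ruleType (names made up here).
const evaluators: Record<string, (rule: QualificationRule) => boolean> = {
  min_spend: () => true,      // placeholder evaluators
  segment_match: () => true,
};

function evaluateRule(rule: QualificationRule): boolean {
  const evaluator = evaluators[rule.ruleType];
  if (!evaluator) {
    // Schema/code drift: count it, then fail closed ("did not pass").
    unknownRuleCount.set(
      rule.ruleType,
      (unknownRuleCount.get(rule.ruleType) ?? 0) + 1,
    );
    return false;
  }
  return evaluator(rule);
}
```

The key property is the fail-closed default branch: an unknown type never silently passes, and every miss leaves a trace in the counter.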

kaireon_identity_resolution_total

Field          Value
Type           Counter
Labels         result
Registered at  lib/metrics.ts:204
Emitter        identity-graph resolution pipeline
Counts identity-resolution attempts by result (typical values: matched, created, ambiguous, error). The identity graph stitches anonymous sessions to known customer ids — a sustained drop in matched usually means an upstream id source stopped flowing. Alert on: a meaningful spike in error results, or a sudden collapse of matched rate.
- alert: IdentityResolutionErrors
  expr: |
    sum(rate(kaireon_identity_resolution_total{result="error"}[5m]))
      / sum(rate(kaireon_identity_resolution_total[5m])) > 0.05
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "More than 5% of identity-resolution attempts are erroring"
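The rule above covers the error ratio; the matched-rate collapse called out in the description needs its own rule. A hedged starting point — the "half of an hour ago" threshold is an assumption to tune against real traffic, not a shipped default:

```yaml
- alert: IdentityResolutionMatchedCollapse
  expr: |
    sum(rate(kaireon_identity_resolution_total{result="matched"}[10m]))
      < 0.5 * sum(rate(kaireon_identity_resolution_total{result="matched"}[10m] offset 1h))
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Identity-resolution matched rate fell below half of its level an hour ago"
```

Comparing against an offset of the same series keeps the rule self-calibrating across tenants with very different baseline volumes.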

kaireon_ai_intelligence_calls_total

Field          Value
Type           Counter
Labels         tool, status
Registered at  lib/metrics.ts:251
Emitter        AI intelligence layer (explainDecision, analyzeOfferPerformance, simulateRuleChange, etc. — same set the MCP tool reference lists under Intelligence & Analytics)
Counts every AI intelligence tool invocation by tool name and status (success / error). The MCP server, the in-app AI assistant, and the intelligence dashboard all share this counter. Alert on: elevated error ratio per tool. A single tool spiking in errors usually means the underlying API contract drifted; an across-the-board error spike usually means the configured AI provider is down.
- alert: AIIntelligenceToolErrors
  expr: |
    sum by (tool) (rate(kaireon_ai_intelligence_calls_total{status="error"}[5m]))
      / sum by (tool) (rate(kaireon_ai_intelligence_calls_total[5m])) > 0.1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "AI tool {{ $labels.tool }} error ratio above 10%"

kaireon_ai_intelligence_duration_seconds

Field          Value
Type           Histogram
Labels         tool
Registered at  lib/metrics.ts:257
Buckets        0.1, 0.5, 1, 2, 5, 10 seconds
Per-tool latency distribution for the AI intelligence layer. Pairs with kaireon_ai_intelligence_calls_total — calls counter tells you “are tools being invoked”; this histogram tells you “how long they take.” Alert on: P95 above the longest bucket boundary that is acceptable for the call site. The default bucket layout assumes most tools complete inside 2 s; persistent P95 above 5 s usually means the AI provider degraded or a tool is fanning out into an N+1 fetch.
- alert: AIIntelligenceP95Slow
  expr: |
    histogram_quantile(0.95,
      sum by (tool, le) (rate(kaireon_ai_intelligence_duration_seconds_bucket[5m]))
    ) > 5
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "AI tool {{ $labels.tool }} P95 latency above 5s"
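P95 can hide a slow drift in the bulk of the distribution. Pairing the quantile with a per-tool mean (the standard _sum over _count ratio) keeps cheap regressions visible; thresholds are left to the operator:

```promql
sum by (tool) (rate(kaireon_ai_intelligence_duration_seconds_sum[5m]))
  / sum by (tool) (rate(kaireon_ai_intelligence_duration_seconds_count[5m]))
```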

kaireon_outbox_pending_count

Field          Value
Type           Gauge
Labels         tenant
Registered at  lib/metrics.ts:191
Refreshed by   refreshOutboxPendingGauge() in lib/outbox-processor.ts (every poll tick)
Per-tenant gauge of outbox_events.status='pending' row count. The outbox publisher tier polls this set; sustained backlog implies the publisher cannot keep up with insert rate (slow EventPublisher backend, downed pod, or a runaway producer). Closes the alert gap called out in Outbox publisher. Alert on: absolute backlog, plus growth rate. A few-row spike during a deploy is normal; a sustained four-figure backlog is a real incident.
- alert: OutboxBacklog
  expr: kaireon_outbox_pending_count > 100
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Tenant {{ $labels.tenant }} outbox backlog above 100 — publisher tier may be unhealthy"
- alert: OutboxBacklogGrowing
  expr: deriv(kaireon_outbox_pending_count[10m]) > 1
  for: 15m
  labels:
    severity: critical
  annotations:
    summary: "Tenant {{ $labels.tenant }} outbox backlog growing — investigate publisher pod"
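
The gauge refresh reduces to a per-tenant count of pending rows. A self-contained sketch of that aggregation step — the row shape and function name below are illustrative; the actual query and gauge wiring live in refreshOutboxPendingGauge() in lib/outbox-processor.ts:

```typescript
// Hypothetical sketch of the aggregation inside a gauge refresh tick.
type OutboxRow = { tenant: string; status: "pending" | "published" | "failed" };

// Count pending rows per tenant; each map entry would become one
// kaireon_outbox_pending_count{tenant="..."} gauge sample.
function pendingByTenant(rows: OutboxRow[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const row of rows) {
    if (row.status !== "pending") continue;
    counts.set(row.tenant, (counts.get(row.tenant) ?? 0) + 1);
  }
  return counts;
}
```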

Configuration / Operator notes

  • The /api/metrics scrape endpoint requires the admin role. The same Bearer-token auth shown in Dashboards — Scrape Configuration applies.
  • All six metrics above are registered unconditionally in lib/metrics.ts — no env flag gates them. Operators only need to wire the alert rules into Prometheus and the gauge polling loop into the outbox-publisher deployment.
  • The companion JSON endpoint /api/v1/metrics/summary (see Metrics Summary API) does NOT include these six metrics by default; it returns the curated subset the Operations Dashboard renders.
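
For reference, a minimal scrape block matching the auth requirement above — the target host and token file path are placeholders, and Dashboards — Scrape Configuration remains the authoritative version:

```yaml
scrape_configs:
  - job_name: kaireon
    metrics_path: /api/metrics
    scheme: https
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/kaireon-admin-token  # placeholder path
    static_configs:
      - targets: ["app.kaireon.example"]                     # placeholder host
```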

Honest limits

  • kaireon_outbox_pending_count is sampled at the publisher tick cadence (default 2 s — see lib/outbox-processor.ts). A burst of inserts shorter than the tick interval can be missed by the gauge between samples; use the outbox_events row count directly for sub-second accounting.
  • kaireon_ai_intelligence_duration_seconds buckets stop at 10 s. Tool calls that exceed 10 s land in the +Inf overflow bucket — operators alerting on tail latency above 10 s should query _count minus the cumulative le="10" bucket (which already contains every faster call), not the histogram quantile, since histogram_quantile cannot resolve above the largest finite bucket boundary.
  • The result label on kaireon_identity_resolution_total is owner-defined; the exact label values depend on the resolver implementation. Confirm the live value set against lib/metrics.ts:204 and the resolver source before authoring tenant-specific alerts.
  • These metrics live in process memory. A pod restart resets every counter and gauge — long-term trend analysis must come from the Prometheus scrape store, not a single pod.
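
For the over-10 s tail specifically, the buckets are cumulative, so the per-second rate of calls slower than 10 s is the _count rate minus the le="10" bucket rate:

```promql
sum by (tool) (rate(kaireon_ai_intelligence_duration_seconds_count[5m]))
  - sum by (tool) (rate(kaireon_ai_intelligence_duration_seconds_bucket{le="10"}[5m]))
```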