The platform exposes 38 Prometheus metrics under the kaireon_ prefix at GET /api/metrics. The product-facing key metrics are summarised in Dashboards. This page covers the six metrics added or upgraded for operator alerting in the W1–W10 + post-W10 wave — the ones not yet listed on the Dashboards page, which need explicit alert wiring before production traffic ramps.
## What it does
Each metric below is registered in lib/metrics.ts and emitted from a single owner module. The “Alert on” note under each metric states the failure mode the metric surfaces; the PromQL expression that follows is a sane default for a Prometheus alert rule, not an aspirational SLO.
## Reference
### kaireon_connector_test_total

| Field | Value |
|---|---|
| Type | Counter |
| Labels | connectorType, status |
| Registered at | lib/metrics.ts:62 |
| Emitter | app/api/v1/connectors/test/route.ts (per test attempt) |
Counts every connector connectivity test, broken down by connector type (aws_s3, snowflake, kafka, etc.) and outcome (success / failure). Pairs with the existing kaireon_connector_test_latency_ms histogram — that one tells you how long tests took, this one tells you whether they passed.
Alert on: sustained failure ratio per connector type. A single failure during a credential rotation is normal; a 5-minute window where every test fails for one type usually means the connector is down or credentials expired.
```yaml
- alert: ConnectorTestsFailing
  expr: |
    sum by (connectorType) (rate(kaireon_connector_test_total{status="failure"}[5m]))
      / sum by (connectorType) (rate(kaireon_connector_test_total[5m])) > 0.5
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "More than 50% of connector tests are failing for {{ $labels.connectorType }}"
```
### kaireon_qualification_unknown_rule_total

| Field | Value |
|---|---|
| Type | Counter |
| Labels | ruleType |
| Registered at | lib/metrics.ts:127 |
| Emitter | lib/qualification engine, on rule-type miss in the classifier |
Increments whenever the qualification engine encounters a QualificationRule.ruleType value its classifier does not know how to handle. This is a drift signal — the database rule type does not match the code dispatch table, usually because a migration shipped a new rule type without the matching engine update, or a tenant edited the database directly.
Alert on: any non-zero increment. The healthy rate is exactly zero. Alerting on the first sample protects against silent drops (the engine returns the rule as “did not pass” on unknown types — see decisioning-gates.mdx).
```yaml
- alert: UnknownQualificationRuleType
  expr: increase(kaireon_qualification_unknown_rule_total[5m]) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Unknown qualification ruleType '{{ $labels.ruleType }}' encountered — check schema/code drift"
```
### kaireon_identity_resolution_total

| Field | Value |
|---|---|
| Type | Counter |
| Labels | result |
| Registered at | lib/metrics.ts:204 |
| Emitter | identity-graph resolution pipeline |
Counts identity-resolution attempts by result (typical values: matched, created, ambiguous, error). The identity graph stitches anonymous sessions to known customer ids — a sustained drop in matched usually means an upstream id source stopped flowing.
Alert on: a meaningful spike in error results, or a sudden collapse of matched rate.
```yaml
- alert: IdentityResolutionErrors
  expr: |
    sum(rate(kaireon_identity_resolution_total{result="error"}[5m]))
      / sum(rate(kaireon_identity_resolution_total[5m])) > 0.05
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "More than 5% of identity-resolution attempts are erroring"
```
### kaireon_ai_intelligence_calls_total

| Field | Value |
|---|---|
| Type | Counter |
| Labels | tool, status |
| Registered at | lib/metrics.ts:251 |
| Emitter | AI intelligence layer (explainDecision, analyzeOfferPerformance, simulateRuleChange, etc. — same set the MCP tool reference lists under Intelligence & Analytics) |
Counts every AI intelligence tool invocation by tool name and status (success / error). The MCP server, the in-app AI assistant, and the intelligence dashboard all share this counter.
Alert on: elevated error ratio per tool. A single tool spiking in errors usually means the underlying API contract drifted; an across-the-board error spike usually means the configured AI provider is down.
```yaml
- alert: AIIntelligenceToolErrors
  expr: |
    sum by (tool) (rate(kaireon_ai_intelligence_calls_total{status="error"}[5m]))
      / sum by (tool) (rate(kaireon_ai_intelligence_calls_total[5m])) > 0.1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "AI tool {{ $labels.tool }} error ratio above 10%"
```
### kaireon_ai_intelligence_duration_seconds

| Field | Value |
|---|---|
| Type | Histogram |
| Labels | tool |
| Registered at | lib/metrics.ts:257 |
| Buckets | 0.1, 0.5, 1, 2, 5, 10 seconds |
Per-tool latency distribution for the AI intelligence layer. Pairs with kaireon_ai_intelligence_calls_total — calls counter tells you “are tools being invoked”; this histogram tells you “how long they take.”
Alert on: P95 above the longest bucket boundary that is acceptable for the call site. The default bucket layout assumes most tools complete inside 2 s; persistent P95 above 5 s usually means the AI provider degraded or a tool is fanning out into an N+1 fetch.
```yaml
- alert: AIIntelligenceP95Slow
  expr: |
    histogram_quantile(0.95,
      sum by (tool, le) (rate(kaireon_ai_intelligence_duration_seconds_bucket[5m]))
    ) > 5
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "AI tool {{ $labels.tool }} P95 latency above 5s"
```
### kaireon_outbox_pending_count

| Field | Value |
|---|---|
| Type | Gauge |
| Labels | tenant |
| Registered at | lib/metrics.ts:191 |
| Refreshed by | refreshOutboxPendingGauge() in lib/outbox-processor.ts (every poll tick) |
Per-tenant gauge of outbox_events.status='pending' row count. The outbox publisher tier polls this set; sustained backlog implies the publisher cannot keep up with insert rate (slow EventPublisher backend, downed pod, or a runaway producer). Closes the alert gap called out in Outbox publisher.
Alert on: absolute backlog, plus growth rate. A few-row spike during a deploy is normal; a sustained four-figure backlog is a real incident.
```yaml
- alert: OutboxBacklog
  expr: kaireon_outbox_pending_count > 100
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Tenant {{ $labels.tenant }} outbox backlog above 100 — publisher tier may be unhealthy"
- alert: OutboxBacklogGrowing
  expr: deriv(kaireon_outbox_pending_count[10m]) > 1
  for: 15m
  labels:
    severity: critical
  annotations:
    summary: "Tenant {{ $labels.tenant }} outbox backlog growing — investigate publisher pod"
```
## Configuration / Operator notes

- The /api/metrics scrape endpoint requires the admin role. The same Bearer-token auth shown in Dashboards — Scrape Configuration applies; a minimal scrape job is sketched after this list.
- All six metrics above are registered unconditionally in lib/metrics.ts — no env flag gates them. Operators only need to wire the alert rules into Prometheus and the gauge polling loop into the outbox-publisher deployment.
- The companion JSON endpoint /api/v1/metrics/summary (see Metrics Summary API) does NOT include these six metrics by default; it returns the curated subset the Operations Dashboard renders.
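A minimal scrape job for the endpoint, assuming the admin Bearer token is mounted at /etc/prometheus/kaireon-token and the app answers at kaireon-app:3000 (both are illustrative placeholders; the authoritative configuration lives in Dashboards — Scrape Configuration):

```yaml
# Hypothetical Prometheus scrape job; token path and target are placeholders.
scrape_configs:
  - job_name: kaireon
    metrics_path: /api/metrics
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/kaireon-token
    static_configs:
      - targets: ["kaireon-app:3000"]
```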
## Honest limits

- kaireon_outbox_pending_count is sampled at the publisher tick cadence (default 2 s — see lib/outbox-processor.ts). A burst of inserts shorter than the tick interval can be missed by the gauge between samples; use the outbox_events row count directly for sub-second accounting.
- kaireon_ai_intelligence_duration_seconds buckets stop at 10 s. Tool calls that exceed 10 s land in the +Inf overflow bucket — operators alerting on tail latency above 10 s should subtract the cumulative le="10" bucket from _count (Prometheus buckets are cumulative, so the le="10" series already contains every finite bucket) rather than rely on the histogram quantile; see the sketch after this list.
- The result label on kaireon_identity_resolution_total is owner-defined; the exact label values depend on the resolver implementation. Confirm the live value set against lib/metrics.ts:204 and the resolver source before authoring tenant-specific alerts.
- These metrics live in process memory. A pod restart resets every counter and gauge — long-term trend analysis must come from the Prometheus scrape store, not a single pod.
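That tail-latency query, sketched as an alert rule. Because buckets are cumulative, subtracting the le="10" bucket rate from the _count rate yields the rate of calls longer than 10 s; the window and for duration are assumptions to tune:

```yaml
# Hypothetical rule: rate of AI tool calls exceeding the 10 s bucket bound.
- alert: AIIntelligenceCallsOver10s
  expr: |
    sum by (tool) (rate(kaireon_ai_intelligence_duration_seconds_count[5m]))
      - sum by (tool) (rate(kaireon_ai_intelligence_duration_seconds_bucket{le="10"}[5m])) > 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "AI tool {{ $labels.tool }} has calls exceeding the 10s histogram bound"
```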