

The platform exposes 38 Prometheus metrics under the kaireon_ prefix at GET /api/metrics. The product-facing key metrics are summarised in Dashboards. This page covers the six metrics added or upgraded for operator alerting in the W1–W10 + post-W10 wave — the ones that are not yet listed in the tables on the Dashboards page and which need explicit alert wiring before production traffic ramps.

What it does

Each metric below is registered in lib/metrics.ts and emitted from a single owner module. The “what to alert on” column states the failure mode the metric surfaces; the PromQL expression is a sane default for a Prometheus alert rule, not an aspirational SLO.
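
The alert expressions on this page drop into an ordinary Prometheus rule file. A minimal skeleton for wiring them up — the file name and group name are illustrative, not a shipped convention:

```yaml
# kaireon-operator-alerts.rules.yml — file and group names are illustrative.
groups:
  - name: kaireon-operator-alerts
    rules:
      # Each "- alert:" block from this page drops in here unchanged, e.g.:
      - alert: UnknownQualificationRuleType
        expr: increase(kaireon_qualification_unknown_rule_total[5m]) > 0
        for: 1m
        labels:
          severity: critical
```

Point Prometheus at the file via rule_files in prometheus.yml and reload.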

Reference

kaireon_connector_test_total

Field          Value
Type           Counter
Labels         connectorType, status
Registered at  lib/metrics.ts:62
Emitter        app/api/v1/connectors/test/route.ts (per test attempt)
Counts every connector connectivity test, broken down by connector type (aws_s3, snowflake, kafka, etc.) and outcome (success / failure). Pairs with the existing kaireon_connector_test_latency_ms histogram — that one tells you how long tests took, this one tells you whether they passed. Alert on: sustained failure ratio per connector type. A single failure during a credential rotation is normal; a 5-minute window where every test fails for one type usually means the connector is down or credentials expired.
- alert: ConnectorTestsFailing
  expr: |
    sum by (connectorType) (rate(kaireon_connector_test_total{status="failure"}[5m]))
      / sum by (connectorType) (rate(kaireon_connector_test_total[5m])) > 0.5
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "More than 50% of connector tests are failing for {{ $labels.connectorType }}"

kaireon_qualification_unknown_rule_total

Field          Value
Type           Counter
Labels         ruleType
Registered at  lib/metrics.ts:127
Emitter        lib/qualification engine, on rule-type miss in the classifier
Increments whenever the qualification engine encounters a QualificationRule.ruleType value its classifier does not know how to handle. This is a drift signal — the database rule type does not match the code dispatch table, usually because a migration shipped a new rule type without the matching engine update, or a tenant edited the database directly. Alert on: any non-zero increment. The healthy rate is exactly zero. Alerting on the first sample protects against silent drops (the engine returns the rule as “did not pass” on unknown types — see decisioning-gates.mdx).
- alert: UnknownQualificationRuleType
  expr: increase(kaireon_qualification_unknown_rule_total[5m]) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Unknown qualification ruleType '{{ $labels.ruleType }}' encountered — check schema/code drift"
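
The drift path is easiest to see in code. A minimal sketch of a dispatch-table miss that fails closed and counts the unknown type — the rule-type names, handler map, and plain-Map counter here are illustrative stand-ins, not the actual lib/qualification engine implementation:

```typescript
// Hypothetical sketch — not the real lib/qualification engine code.
type QualificationRule = { ruleType: string; params: Record<string, unknown> };

// Stand-in for the kaireon_qualification_unknown_rule_total counter.
const unknownRuleCount = new Map<string, number>();

// Dispatch table: one evaluator per known ruleType (names made up here).
const evaluators: Record<string, (rule: QualificationRule) => boolean> = {
  min_spend: () => true,      // placeholder evaluators
  segment_match: () => true,
};

function evaluateRule(rule: QualificationRule): boolean {
  const evaluator = evaluators[rule.ruleType];
  if (!evaluator) {
    // Schema/code drift: count it, then fail closed ("did not pass").
    unknownRuleCount.set(
      rule.ruleType,
      (unknownRuleCount.get(rule.ruleType) ?? 0) + 1,
    );
    return false;
  }
  return evaluator(rule);
}
```

The key property is the fail-closed default branch: an unknown type never silently passes, and every miss leaves a trace in the counter.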

kaireon_identity_resolution_total

Field          Value
Type           Counter
Labels         result
Registered at  lib/metrics.ts:204
Emitter        identity-graph resolution pipeline
Counts identity-resolution attempts by result (typical values: matched, created, ambiguous, error). The identity graph stitches anonymous sessions to known customer ids — a sustained drop in matched usually means an upstream id source stopped flowing. Alert on: a meaningful spike in error results, or a sudden collapse of matched rate.
- alert: IdentityResolutionErrors
  expr: |
    sum(rate(kaireon_identity_resolution_total{result="error"}[5m]))
      / sum(rate(kaireon_identity_resolution_total[5m])) > 0.05
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "More than 5% of identity-resolution attempts are erroring"
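The rule above covers the error ratio; the matched-rate collapse called out in the description needs its own rule. A hedged starting point — the "half of an hour ago" threshold is an assumption to tune against real traffic, not a shipped default:

```yaml
- alert: IdentityResolutionMatchedCollapse
  expr: |
    sum(rate(kaireon_identity_resolution_total{result="matched"}[10m]))
      < 0.5 * sum(rate(kaireon_identity_resolution_total{result="matched"}[10m] offset 1h))
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Identity-resolution matched rate fell below half of its level an hour ago"
```

Comparing against an offset of the same series keeps the rule self-calibrating across tenants with very different baseline volumes.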

kaireon_ai_intelligence_calls_total

Field          Value
Type           Counter
Labels         tool, status
Registered at  lib/metrics.ts:251
Emitter        AI intelligence layer (explainDecision, analyzeOfferPerformance, simulateRuleChange, etc. — same set the MCP tool reference lists under Intelligence & Analytics)
Counts every AI intelligence tool invocation by tool name and status (success / error). The MCP server, the in-app AI assistant, and the intelligence dashboard all share this counter. Alert on: elevated error ratio per tool. A single tool spiking in errors usually means the underlying API contract drifted; an across-the-board error spike usually means the configured AI provider is down.
- alert: AIIntelligenceToolErrors
  expr: |
    sum by (tool) (rate(kaireon_ai_intelligence_calls_total{status="error"}[5m]))
      / sum by (tool) (rate(kaireon_ai_intelligence_calls_total[5m])) > 0.1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "AI tool {{ $labels.tool }} error ratio above 10%"

kaireon_ai_intelligence_duration_seconds

Field          Value
Type           Histogram
Labels         tool
Registered at  lib/metrics.ts:257
Buckets        0.1, 0.5, 1, 2, 5, 10 seconds
Per-tool latency distribution for the AI intelligence layer. Pairs with kaireon_ai_intelligence_calls_total — calls counter tells you “are tools being invoked”; this histogram tells you “how long they take.” Alert on: P95 above the longest bucket boundary that is acceptable for the call site. The default bucket layout assumes most tools complete inside 2 s; persistent P95 above 5 s usually means the AI provider degraded or a tool is fanning out into an N+1 fetch.
- alert: AIIntelligenceP95Slow
  expr: |
    histogram_quantile(0.95,
      sum by (tool, le) (rate(kaireon_ai_intelligence_duration_seconds_bucket[5m]))
    ) > 5
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "AI tool {{ $labels.tool }} P95 latency above 5s"
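P95 can hide a slow drift in the bulk of the distribution. Pairing the quantile with a per-tool mean (the standard _sum over _count ratio) keeps cheap regressions visible; thresholds are left to the operator:

```promql
sum by (tool) (rate(kaireon_ai_intelligence_duration_seconds_sum[5m]))
  / sum by (tool) (rate(kaireon_ai_intelligence_duration_seconds_count[5m]))
```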

kaireon_outbox_pending_count

Field          Value
Type           Gauge
Labels         tenant
Registered at  lib/metrics.ts:191
Refreshed by   refreshOutboxPendingGauge() in lib/outbox-processor.ts (every poll tick)
Per-tenant gauge of outbox_events.status='pending' row count. The outbox publisher tier polls this set; sustained backlog implies the publisher cannot keep up with insert rate (slow EventPublisher backend, downed pod, or a runaway producer). Closes the alert gap called out in Outbox publisher. Alert on: absolute backlog, plus growth rate. A few-row spike during a deploy is normal; a sustained four-figure backlog is a real incident.
- alert: OutboxBacklog
  expr: kaireon_outbox_pending_count > 100
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Tenant {{ $labels.tenant }} outbox backlog above 100 — publisher tier may be unhealthy"
- alert: OutboxBacklogGrowing
  expr: deriv(kaireon_outbox_pending_count[10m]) > 1
  for: 15m
  labels:
    severity: critical
  annotations:
    summary: "Tenant {{ $labels.tenant }} outbox backlog growing — investigate publisher pod"
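
The gauge refresh reduces to a per-tenant count of pending rows. A self-contained sketch of that aggregation step — the row shape and function name below are illustrative; the actual query and gauge wiring live in refreshOutboxPendingGauge() in lib/outbox-processor.ts:

```typescript
// Hypothetical sketch of the aggregation inside a gauge refresh tick.
type OutboxRow = { tenant: string; status: "pending" | "published" | "failed" };

// Count pending rows per tenant; each map entry would become one
// kaireon_outbox_pending_count{tenant="..."} gauge sample.
function pendingByTenant(rows: OutboxRow[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const row of rows) {
    if (row.status !== "pending") continue;
    counts.set(row.tenant, (counts.get(row.tenant) ?? 0) + 1);
  }
  return counts;
}
```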

Configuration / Operator notes

  • The /api/metrics scrape endpoint requires the admin role. The same Bearer-token auth shown in Dashboards — Scrape Configuration applies.
  • All six metrics above are registered unconditionally in lib/metrics.ts — no env flag gates them. Operators only need to wire the alert rules into Prometheus and the gauge polling loop into the outbox-publisher deployment.
  • The companion JSON endpoint /api/v1/metrics/summary (see Metrics Summary API) does NOT include these six metrics by default; it returns the curated subset the Operations Dashboard renders.
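
For reference, a minimal scrape block matching the auth requirement above — the target host and token file path are placeholders, and Dashboards — Scrape Configuration remains the authoritative version:

```yaml
scrape_configs:
  - job_name: kaireon
    metrics_path: /api/metrics
    scheme: https
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/kaireon-admin-token  # placeholder path
    static_configs:
      - targets: ["app.kaireon.example"]                     # placeholder host
```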

Honest limits

  • kaireon_outbox_pending_count is sampled at the publisher tick cadence (default 2 s — see lib/outbox-processor.ts). A burst of inserts shorter than the tick interval can be missed by the gauge between samples; use the outbox_events row count directly for sub-second accounting.
  • kaireon_ai_intelligence_duration_seconds buckets stop at 10 s. Tool calls that exceed 10 s land in the +Inf overflow bucket — operators alerting on tail latency above 10 s should query _count minus the cumulative le="10" bucket (which already contains every faster call), not the histogram quantile, since histogram_quantile cannot resolve above the largest finite bucket boundary.
  • The result label on kaireon_identity_resolution_total is owner-defined; the exact label values depend on the resolver implementation. Confirm the live value set against lib/metrics.ts:204 and the resolver source before authoring tenant-specific alerts.
  • These metrics live in process memory. A pod restart resets every counter and gauge — long-term trend analysis must come from the Prometheus scrape store, not a single pod.
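
For the over-10 s tail specifically, the buckets are cumulative, so the per-second rate of calls slower than 10 s is the _count rate minus the le="10" bucket rate:

```promql
sum by (tool) (rate(kaireon_ai_intelligence_duration_seconds_count[5m]))
  - sum by (tool) (rate(kaireon_ai_intelligence_duration_seconds_bucket{le="10"}[5m]))
```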