## Prometheus Metrics

All metrics are exposed at `GET /api/v1/metrics` in Prometheus text format. Scrape this endpoint from your Prometheus server or any compatible collector. Default metrics (Node.js process stats) are auto-collected with the `kaireon_` prefix.
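A minimal scrape configuration might look like the following (the job name, host, and port are illustrative and depend on your deployment; note the non-default `metrics_path`):

```yaml
scrape_configs:
  - job_name: kaireon
    metrics_path: /api/v1/metrics
    static_configs:
      - targets: ["kaireon.internal:3000"]  # hypothetical host:port
```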
### Key Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `kaireon_http_request_duration_seconds` | Histogram | `method`, `route`, `status` | HTTP request latency across all API routes |
| `kaireon_decision_latency_ms` | Histogram | `channel` | End-to-end decision engine latency |
| `kaireon_decision_flow_execution_latency_ms` | Histogram | — | Decision flow pipeline execution time |
| `kaireon_decision_pipeline_duration_ms` | Histogram | — | Total pipeline duration including all sub-stages |
| `kaireon_qualification_filter_latency_ms` | Histogram | — | Qualification rule evaluation time |
| `kaireon_contact_policy_filter_latency_ms` | Histogram | — | Contact policy filter time |
| `kaireon_scoring_latency_ms` | Histogram | — | Model scoring stage time |
| `kaireon_ranking_latency_ms` | Histogram | — | Final ranking stage time |
| `kaireon_scoring_model_failure_total` | Counter | `modelKey` | Scoring model failures (triggers circuit breaker) |
| `kaireon_circuit_breaker_state_change_total` | Counter | `name`, `from`, `to` | Circuit breaker state transitions |
| `kaireon_dlq_depth` | Gauge | `tenant` | Current dead letter queue depth |
| `kaireon_cache_hits_total` | Counter | `key` | Cache hits |
| `kaireon_cache_misses_total` | Counter | `key` | Cache misses |
| `kaireon_offers_evaluated_total` | Counter | — | Total offers evaluated in decisions |
| `kaireon_decision_delivery_total` | Counter | — | Total recommendation responses delivered |
| `kaireon_respond_outcome_total` | Counter | `outcomeType`, `classification` | Outcomes recorded by type |
| `kaireon_experiment_assignment_total` | Counter | `variant` | Experiment variant assignments |
| `kaireon_guardrail_evaluation_total` | Counter | `result` | Guardrail evaluation outcomes |
| `kaireon_mandatory_cap_hit_total` | Counter | — | Mandatory offer daily cap hits |
| `kaireon_http_error_total` | Counter | `method`, `route`, `status_class` | HTTP 4xx/5xx error responses |
| `kaireon_pipeline_execution_latency_ms` | Histogram | — | Data pipeline execution latency |
| `kaireon_pipeline_rows_processed_total` | Counter | — | Rows processed by data pipelines |
| `kaireon_outbox_event_age_seconds` | Histogram | `topic` | Age of outbox events when processed |
| `kaireon_outbox_processed_total` | Counter | `status` | Outbox events processed |
| `kaireon_decision_candidates` | Gauge | `stage` | Candidate count at each pipeline stage |
| `kaireon_decision_qualification_filter_rate` | Gauge | — | Ratio of candidates filtered by qualification |
| `kaireon_decision_contact_policy_filter_rate` | Gauge | — | Ratio of candidates filtered by contact policy |
## Rate Limiting

The platform uses a sliding-window algorithm to enforce per-key request limits. Each request timestamp is recorded; when the count within the window exceeds the configured maximum, subsequent requests are rejected with `429 Too Many Requests`.
### How It Works

- On each request, timestamps older than the window are pruned
- If the remaining count is at or above `maxRequests`, the request is rejected
- Otherwise the timestamp is recorded and the request proceeds
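The steps above can be sketched as a minimal in-memory limiter. This is an illustrative reimplementation, not the platform's actual code; the class and method names are hypothetical.

```typescript
// Sliding-window limiter sketch: one timestamp array per key.
class SlidingWindowLimiter {
  private timestamps = new Map<string, number[]>();

  constructor(
    private windowMs: number,
    private maxRequests: number,
  ) {}

  check(key: string, now: number = Date.now()): boolean {
    // Prune timestamps that have fallen outside the window.
    const recent = (this.timestamps.get(key) ?? []).filter(
      (t) => t > now - this.windowMs,
    );
    if (recent.length >= this.maxRequests) {
      this.timestamps.set(key, recent);
      return false; // reject: window is full
    }
    recent.push(now); // record this request and allow it
    this.timestamps.set(key, recent);
    return true;
  }
}
```

The production variant also bounds the `Map` (the docs mention a 50,000-entry cap with eviction) and mirrors this logic in Redis sorted sets for multi-node deployments.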
### Storage Modes

| Mode | When Used | How It Works |
|---|---|---|
| In-memory | Single-process deployments, or Redis unavailable | Timestamps stored in a `Map` per key (max 50,000 entries with automatic eviction) |
| Redis-backed | Multi-process / multi-node | Uses sorted sets (`ZADD` + `ZREMRANGEBYSCORE` + `ZCARD`) in an atomic pipeline per check |
| Fallback | Redis configured but temporarily down | `checkWithFallback()` tries Redis first, falls back to in-memory |

Redis keys follow the pattern `ratelimit:sw:{key}` with automatic expiry set to the window duration.
### Response Headers

When a request is rate-limited, the API returns:

| Header | Value |
|---|---|
| `Retry-After` | Milliseconds until the earliest window slot frees up |

The `RateLimitResult` returned to the caller includes `allowed` (boolean), `remaining` (requests left in window), and `retryAfterMs` (when rejected).
### Configuration

Rate limiters are instantiated with two parameters: the window duration and the maximum number of requests allowed per window.

If Redis is not configured (`REDIS_URL` not set), rate limiting falls back to in-memory mode. The platform still works, but limits are per-process rather than global.

## Circuit Breakers

KaireonAI uses circuit breakers to prevent cascading failures when external services (connectors, webhooks, scoring models) become unavailable.

### State Machine
| State | Behavior |
|---|---|
| Closed | All requests pass through normally |
| Open | All requests are short-circuited immediately (fail fast). No calls to the downstream service |
| Half-open | After the cooldown expires, up to `maxHalfOpenProbes` probe requests are allowed through. A success resets to Closed; a failure returns to Open |
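The state machine can be sketched as follows. This is an illustrative model, not the platform's implementation: the parameter names mirror the defaults table, and probe-count limiting (`maxHalfOpenProbes`) is omitted for brevity.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Minimal three-state breaker: fail fast while open, probe after cooldown.
class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 5,
    private cooldownMs = 300_000,
  ) {}

  async exec<T>(fn: () => Promise<T>, now = Date.now()): Promise<T> {
    if (this.state === "open") {
      if (now - this.openedAt < this.cooldownMs) {
        throw new Error("circuit open"); // fail fast, no downstream call
      }
      this.state = "half-open"; // cooldown elapsed: allow a probe through
    }
    try {
      const result = await fn();
      this.state = "closed"; // success resets the breaker
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open"; // trip (or re-trip after a failed probe)
        this.openedAt = now;
      }
      throw err;
    }
  }
}
```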
### Default Thresholds

| Parameter | Default | Description |
|---|---|---|
| `failureThreshold` | 5 | Consecutive failures before opening the circuit |
| `cooldownMs` | 300,000 (5 min) | Time in Open state before allowing probes |
| `maxHalfOpenProbes` | 3 | Maximum concurrent probe requests in Half-open |
### Scoring Model Circuit Breaker

The decision engine has a dedicated circuit breaker for scoring models with tighter thresholds:

| Parameter | Value |
|---|---|
| Failure threshold | 5 consecutive failures |
| Cooldown | 60 seconds |
| Fallback | Default priority-based score |
### Persistence

Circuit breaker state is persisted to Redis (key prefix `kaireon:cb:`) when `REDIS_URL` is set, so state survives process restarts. If Redis is unavailable, state is maintained in-memory only.
### Where Circuit Breakers Are Used

| Component | Circuit Key Pattern | Thresholds |
|---|---|---|
| Scoring models | Per model key | 5 failures / 60s cooldown |
| Alert webhooks | `alert-webhook:{target}` | 3 failures / 60s cooldown |
| Trigger webhooks | `trigger-webhook:{url}` | 3 failures / 60s cooldown |
| Audit logging | `audit-log` | Default (5 / 5 min) |

The `/api/health` endpoint reports all circuit breaker statuses. An open breaker sets health to `degraded`.
### Prometheus Integration

Every state transition emits a `kaireon_circuit_breaker_state_change_total` counter increment with labels `name`, `from`, and `to`. Alert on transitions to `open`.
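A hedged PromQL expression for such an alert (adjust the lookback window to your scrape interval):

```promql
# Fires when any breaker transitioned to open in the last 5 minutes.
sum by (name) (
  increase(kaireon_circuit_breaker_state_change_total{to="open"}[5m])
) > 0
```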
## Dead Letter Queue

Events that fail processing after retries are moved from the outbox to the dead letter queue (DLQ). The DLQ is backed by the `DeadLetterEvent` database table, scoped per tenant and organized by topic.
### Admin API

`GET /api/v1/admin/dlq` — Retrieve DLQ summary and events (admin role required).

| Parameter | Type | Default | Description |
|---|---|---|---|
| `limit` | query | 50 | Max events returned (capped at 200) |
| `topic` | query | — | Filter by topic |
The response includes `totalEvents`, a `byTopic` breakdown, the event list, and an `alert` field:

| Alert Level | Condition |
|---|---|
| OK | 10 or fewer events |
| WARNING | 11-100 events |
| CRITICAL | More than 100 events |
`POST /api/v1/admin/dlq` — Retry or purge DLQ events (admin role required).

- Retry re-enqueues events back to the outbox with `status: "pending"` and `retryCount: 0`, then deletes the DLQ entry (transactional).
- Purge permanently deletes matching DLQ events.
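A sketch of building a retry call against this endpoint. The `action: "retry"` body field comes from the docs; the bearer-token header and the optional `topic` filter field are assumptions about the request shape, so verify them against your deployment.

```typescript
// Build the request for a DLQ retry; pass the result to fetch(req.url, req.init).
function buildDlqRetryRequest(baseUrl: string, token: string, topic?: string) {
  return {
    url: `${baseUrl}/api/v1/admin/dlq`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${token}`, // assumed auth scheme
      },
      // Only include the topic filter when one is given.
      body: JSON.stringify({ action: "retry", ...(topic ? { topic } : {}) }),
    },
  };
}
```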
### Monitoring

Track DLQ growth with the `kaireon_dlq_depth` gauge and outbox health with `kaireon_outbox_event_age_seconds` and `kaireon_outbox_processed_total`.
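For example, a PromQL alert condition mirroring the admin endpoint's CRITICAL threshold (more than 100 events) could look like this sketch:

```promql
# Per-tenant DLQ depth above the CRITICAL threshold from the alert table.
max by (tenant) (kaireon_dlq_depth) > 100
```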
## Cache Management

The platform caches offers, qualification rules, and contact policies to reduce database load during decision execution. An emergency flush endpoint is available for situations where cached data becomes stale.

`POST /api/v1/admin/cache` — Emergency cache invalidation (admin role required).

Monitor cache effectiveness with `kaireon_cache_hits_total` and `kaireon_cache_misses_total`.
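A hedged PromQL sketch for the overall hit ratio (add a `by (key)` grouping to break it down per cache key):

```promql
# Fraction of cache lookups served from cache over the last 5 minutes.
sum(rate(kaireon_cache_hits_total[5m]))
  / (sum(rate(kaireon_cache_hits_total[5m]))
     + sum(rate(kaireon_cache_misses_total[5m])))
```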
## Performance

KaireonAI applies several optimizations to keep decision latency low and pipeline throughput high.

### Decision Pipeline Caching

The decision engine caches frequently accessed data in Redis to avoid repeated database queries during recommendation processing:

| Cached Data | TTL | Key Pattern | Impact |
|---|---|---|---|
| Qualification rules | 120s | `qual:{tenantId}` | Avoids per-request rule loading |
| Contact policies | 120s | `policy:{tenantId}` | Avoids per-request policy loading |
| Decision flow config | 120s | `flow:{flowId}` | Flow route resolution cached per flow |
| Enrichment data | 120s | `enrich:{schemaId}:{customerId}` | Customer data cached across offers |
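The caching pattern above is a standard cache-aside lookup. A minimal sketch, assuming a hypothetical `cached` helper and a Redis-like client exposing `get`/`set` with a TTL (the platform's actual client API may differ):

```typescript
interface TtlCache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSec: number): Promise<void>;
}

// Cache-aside: return the cached value if present, otherwise load and store it.
async function cached<T>(
  cache: TtlCache,
  key: string,
  ttlSec: number,
  load: () => Promise<T>,
): Promise<T> {
  const hit = await cache.get(key);
  if (hit !== null) return JSON.parse(hit) as T;
  const value = await load(); // cache miss: hit the database once
  await cache.set(key, JSON.stringify(value), ttlSec);
  return value;
}
```

Usage mirrors the table, e.g. `cached(redis, `qual:${tenantId}`, 120, loadRules)`, so concurrent requests within the 120-second TTL share one database load.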
### Query Optimizations
- Creative queries are filtered by the set of candidate offer IDs, not loaded for the entire tenant. This prevents unbounded memory usage when a tenant has thousands of creatives across many offers.
- Flow route resolution is cached with a 120-second TTL, avoiding repeated database lookups for the same flow across concurrent requests.
### Pipeline Throughput
- Chunked inserts — CSV ingestion writes to the database in batches of 1,000 rows, preventing memory exhaustion on large files and reducing transaction lock duration.
- Streaming batch execution — The batch executor uses summary counters (rows loaded, failed, skipped) instead of accumulating all row results in memory, allowing pipelines to process files larger than available RAM.
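The chunked-insert step can be sketched as follows; the `chunk` helper is illustrative (the docs only state the batch size of 1,000, not the implementation):

```typescript
// Split rows into fixed-size batches so each insert transaction stays small.
function chunk<T>(rows: T[], size = 1000): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < rows.length; i += size) {
    batches.push(rows.slice(i, i + size));
  }
  return batches;
}
```

Each batch is then written in its own insert, bounding memory use and transaction lock duration regardless of file size.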
## Decision Traces

Decision traces provide forensic visibility into every stage of the decision pipeline. Configure tracing in Settings > General > Retention > Decision Trace.

| Setting | Description |
|---|---|
| `decisionTraceEnabled` | Master toggle for trace capture |
| `decisionTraceSampleRate` | Percentage of requests to trace (0-100) |
| Retention period | How long traces are retained before cleanup |

Traces are stored in the `DecisionTrace` table and viewable from the Decision Flows detail page. Each trace records the full pipeline execution: candidates at each stage, filter reasons, scores, rankings, and timing breakdowns.
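One plausible reading of the 0-100 sample rate is a per-request random draw; this sketch is an assumption about the semantics, not the platform's actual sampling code:

```typescript
// Trace a request when a uniform 0-100 roll falls under the sample rate.
// `roll` is injectable for testing; defaults to a fresh random draw.
function shouldTrace(sampleRate: number, roll: number = Math.random() * 100): boolean {
  return sampleRate > 0 && roll < sampleRate;
}
```

With this interpretation, `decisionTraceSampleRate: 25` traces roughly one request in four.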
## Monitoring Recipes

### Latency Spiked

Decision latency suddenly increased. Identify the bottleneck by comparing the per-stage histograms (`kaireon_qualification_filter_latency_ms`, `kaireon_contact_policy_filter_latency_ms`, `kaireon_scoring_latency_ms`, `kaireon_ranking_latency_ms`) against end-to-end `kaireon_decision_latency_ms`.

### Conversion Dropped

Response outcomes stopped improving. Check delivery and outcome rates using `kaireon_decision_delivery_total` and `kaireon_respond_outcome_total`.
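Hedged PromQL sketches for these two checks (the `_bucket` suffix assumes standard Prometheus histogram exposition; exact label sets may differ):

```promql
# Latency: p95 for one stage; repeat per stage histogram to find the bottleneck.
histogram_quantile(0.95,
  sum by (le) (rate(kaireon_scoring_latency_ms_bucket[5m])))

# Conversion: outcomes recorded per recommendation delivered, last hour.
sum(rate(kaireon_respond_outcome_total[1h]))
  / sum(rate(kaireon_decision_delivery_total[1h]))
```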
### DLQ Growing

The dead letter queue is accumulating events:

- Check the DLQ admin endpoint (`GET /api/v1/admin/dlq`) to identify failing topics
- Investigate the root cause (downstream service outage, schema mismatch)
- Fix the underlying issue
- Retry events with `POST /api/v1/admin/dlq` with `action: "retry"`
### HTTP Error Rate Elevated

Monitor 4xx and 5xx error rates across all API routes with the `kaireon_http_error_total` counter.

## Related
- **Dashboards** — Monitor platform health with the built-in dashboards.
- **Scaling & Deployment** — Scaling configuration for multi-node deployments.
- **Troubleshooting** — Common issues and resolution steps.
- **API Reference** — Full API endpoint documentation.