KaireonAI ships with production-grade operational infrastructure built into the platform. Rate limiting, circuit breakers, dead letter queues, and Prometheus metrics all work out of the box — configure them through environment variables and tenant settings.

Prometheus Metrics

All metrics are exposed at GET /api/v1/metrics in Prometheus text format. Scrape this endpoint from your Prometheus server or any compatible collector. Default metrics (Node.js process stats) are auto-collected with the kaireon_ prefix.

Key Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| kaireon_http_request_duration_seconds | Histogram | method, route, status | HTTP request latency across all API routes |
| kaireon_decision_latency_ms | Histogram | channel | End-to-end decision engine latency |
| kaireon_decision_flow_execution_latency_ms | Histogram | — | Decision flow pipeline execution time |
| kaireon_decision_pipeline_duration_ms | Histogram | — | Total pipeline duration including all sub-stages |
| kaireon_qualification_filter_latency_ms | Histogram | — | Qualification rule evaluation time |
| kaireon_contact_policy_filter_latency_ms | Histogram | — | Contact policy filter time |
| kaireon_scoring_latency_ms | Histogram | — | Model scoring stage time |
| kaireon_ranking_latency_ms | Histogram | — | Final ranking stage time |
| kaireon_scoring_model_failure_total | Counter | modelKey | Scoring model failures (triggers circuit breaker) |
| kaireon_circuit_breaker_state_change_total | Counter | name, from, to | Circuit breaker state transitions |
| kaireon_dlq_depth | Gauge | tenant | Current dead letter queue depth |
| kaireon_cache_hits_total | Counter | key | Cache hits |
| kaireon_cache_misses_total | Counter | key | Cache misses |
| kaireon_offers_evaluated_total | Counter | — | Total offers evaluated in decisions |
| kaireon_decision_delivery_total | Counter | — | Total recommendation responses delivered |
| kaireon_respond_outcome_total | Counter | outcomeType, classification | Outcomes recorded by type |
| kaireon_experiment_assignment_total | Counter | variant | Experiment variant assignments |
| kaireon_guardrail_evaluation_total | Counter | result | Guardrail evaluation outcomes |
| kaireon_mandatory_cap_hit_total | Counter | — | Mandatory offer daily cap hits |
| kaireon_http_error_total | Counter | method, route, status_class | HTTP 4xx/5xx error responses |
| kaireon_pipeline_execution_latency_ms | Histogram | — | Data pipeline execution latency |
| kaireon_pipeline_rows_processed_total | Counter | — | Rows processed by data pipelines |
| kaireon_outbox_event_age_seconds | Histogram | topic | Age of outbox events when processed |
| kaireon_outbox_processed_total | Counter | status | Outbox events processed |
| kaireon_decision_candidates | Gauge | stage | Candidate count at each pipeline stage |
| kaireon_decision_qualification_filter_rate | Gauge | — | Ratio of candidates filtered by qualification |
| kaireon_decision_contact_policy_filter_rate | Gauge | — | Ratio of candidates filtered by contact policy |

Rate Limiting

The platform uses a sliding window algorithm to enforce per-key request limits. Each request timestamp is recorded; when the count within the window exceeds the configured maximum, subsequent requests are rejected with 429 Too Many Requests.

How It Works

  1. On each request, timestamps older than the window are pruned
  2. If the remaining count is at or above maxRequests, the request is rejected
  3. Otherwise the timestamp is recorded and the request proceeds
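The three steps above can be sketched in TypeScript. This is a minimal in-memory sketch, not the platform's implementation; the `SlidingWindowLimiter` name, `check` signature, and injectable clock are assumptions, but the result shape mirrors the documented `RateLimitResult` fields (`allowed`, `remaining`, `retryAfterMs`):

```typescript
// Minimal in-memory sliding-window limiter sketch (illustrative names).
interface RateLimitResult {
  allowed: boolean;
  remaining: number;
  retryAfterMs?: number;
}

class SlidingWindowLimiter {
  private hits = new Map<string, number[]>();

  constructor(
    private maxRequests: number,
    private windowMs: number,
    private now: () => number = Date.now, // injectable clock for testing
  ) {}

  check(key: string): RateLimitResult {
    const t = this.now();
    // 1. Prune timestamps older than the window.
    const recent = (this.hits.get(key) ?? []).filter((ts) => t - ts < this.windowMs);
    // 2. Reject if the window is already full.
    if (recent.length >= this.maxRequests) {
      this.hits.set(key, recent);
      // Retry once the oldest timestamp leaves the window.
      return { allowed: false, remaining: 0, retryAfterMs: recent[0] + this.windowMs - t };
    }
    // 3. Record this request and allow it.
    recent.push(t);
    this.hits.set(key, recent);
    return { allowed: true, remaining: this.maxRequests - recent.length };
  }
}
```

Note that pruning happens on every check, so a key's timestamp list never grows beyond `maxRequests` entries plus the request being evaluated.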

Storage Modes

| Mode | When Used | How It Works |
|---|---|---|
| In-memory | Single-process deployments, or Redis unavailable | Timestamps stored in a Map per key (max 50,000 entries with automatic eviction) |
| Redis-backed | Multi-process / multi-node | Uses sorted sets (ZADD + ZREMRANGEBYSCORE + ZCARD) in an atomic pipeline per check |
| Fallback | Redis configured but temporarily down | checkWithFallback() tries Redis first, falls back to in-memory |

Redis keys follow the pattern `ratelimit:sw:{key}` with automatic expiry set to the window duration.

Response Headers

When a request is rate-limited, the API returns:
| Header | Value |
|---|---|
| Retry-After | Milliseconds until the earliest window slot frees up |

The `RateLimitResult` returned to the caller includes `allowed` (boolean), `remaining` (requests left in the window), and `retryAfterMs` (set when the request is rejected).

Configuration

Rate limiters are instantiated with two parameters:
```typescript
new RateLimiter({
  maxRequests: 100,  // requests per window
  windowMs: 60_000,  // window size in milliseconds
});
```

If Redis is not configured (`REDIS_URL` not set), rate limiting falls back to in-memory mode. The platform still works, but limits are per-process rather than global.

Circuit Breakers

KaireonAI uses circuit breakers to prevent cascading failures when external services (connectors, webhooks, scoring models) become unavailable.

State Machine

```
CLOSED  ──[failures >= threshold]──>  OPEN
  ^                                     |
  |                             [cooldown expires]
  |                                     v
  └───────[probe succeeds]──────── HALF_OPEN
```

| State | Behavior |
|---|---|
| Closed | All requests pass through normally |
| Open | All requests are short-circuited immediately (fail fast); no calls reach the downstream service |
| Half-open | After the cooldown expires, up to maxHalfOpenProbes probe requests are allowed through. A success resets to Closed; a failure returns to Open |
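The state machine above can be sketched as follows. This is a minimal illustration, not the platform's implementation; the `CircuitBreaker` class and `exec` method names are assumptions, but the transitions follow the documented behavior:

```typescript
// Minimal circuit breaker sketch following the documented state machine.
type State = "closed" | "open" | "half_open";

class CircuitBreaker {
  private state: State = "closed";
  private failures = 0;
  private openedAt = 0;
  private probes = 0;

  constructor(
    private failureThreshold = 5,
    private cooldownMs = 300_000,
    private maxHalfOpenProbes = 3,
    private now: () => number = Date.now, // injectable clock for testing
  ) {}

  getState(): State {
    // Open -> Half-open once the cooldown expires.
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half_open";
      this.probes = 0;
    }
    return this.state;
  }

  async exec<T>(fn: () => Promise<T>): Promise<T> {
    const state = this.getState();
    if (state === "open") throw new Error("circuit open"); // fail fast
    if (state === "half_open" && this.probes >= this.maxHalfOpenProbes) {
      throw new Error("circuit open"); // probe budget exhausted
    }
    if (state === "half_open") this.probes++;
    try {
      const result = await fn();
      this.failures = 0;
      this.state = "closed"; // a success (including a probe) closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      if (this.state === "half_open" || this.failures >= this.failureThreshold) {
        this.state = "open"; // trip, or re-open after a failed probe
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

The injected clock keeps the cooldown logic deterministic under test; in production the default `Date.now` applies.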

Default Thresholds

| Parameter | Default | Description |
|---|---|---|
| failureThreshold | 5 | Consecutive failures before opening the circuit |
| cooldownMs | 300,000 (5 min) | Time in Open state before allowing probes |
| maxHalfOpenProbes | 3 | Maximum concurrent probe requests in Half-open |

Scoring Model Circuit Breaker

The decision engine has a dedicated circuit breaker for scoring models with tighter thresholds:
| Parameter | Value |
|---|---|
| Failure threshold | 5 consecutive failures |
| Cooldown | 60 seconds |
| Fallback | Default priority-based score |
When a model circuit opens, the engine falls back to the offer’s configured priority score so decisions continue without interruption.
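The fallback path can be sketched as a small wrapper. The `scoreOffer` function, `Offer` shape, and `priorityScore` field name are illustrative assumptions, not the platform's API; failure counting is handled by the breaker elsewhere:

```typescript
// Illustrative fallback: if the model circuit is open or the model call fails,
// serve the offer's configured priority score instead. Names are assumed.
interface Offer { id: string; priorityScore: number; }
type ScoringModel = (offer: Offer) => Promise<number>;

async function scoreOffer(
  offer: Offer,
  model: ScoringModel,
  circuitOpen: () => boolean,
): Promise<number> {
  if (circuitOpen()) return offer.priorityScore; // fail fast, no model call
  try {
    return await model(offer);
  } catch {
    // The breaker records the failure; the decision still gets a score.
    return offer.priorityScore;
  }
}
```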

Persistence

Circuit breaker state is persisted to Redis (key prefix kaireon:cb:) when REDIS_URL is set, so state survives process restarts. If Redis is unavailable, state is maintained in-memory only.

Where Circuit Breakers Are Used

| Component | Circuit Key Pattern | Thresholds |
|---|---|---|
| Scoring models | Per model key | 5 failures / 60s cooldown |
| Alert webhooks | alert-webhook:{target} | 3 failures / 60s cooldown |
| Trigger webhooks | trigger-webhook:{url} | 3 failures / 60s cooldown |
| Audit logging | audit-log | Default (5 / 5 min) |
The /api/health endpoint reports all circuit breaker statuses. An open breaker sets health to degraded.

Prometheus Integration

Every state transition emits a kaireon_circuit_breaker_state_change_total counter increment with labels name, from, and to. Alert on transitions to open:
```promql
increase(kaireon_circuit_breaker_state_change_total{to="open"}[5m]) > 0
```

Dead Letter Queue

Events that fail processing after retries are moved from the outbox to the dead letter queue (DLQ). The DLQ is backed by the DeadLetterEvent database table, scoped per tenant and organized by topic.

Admin API

GET /api/v1/admin/dlq — Retrieve DLQ summary and events (admin role required).
| Parameter | Type | Default | Description |
|---|---|---|---|
| limit | query | 50 | Max events returned (capped at 200) |
| topic | query | — | Filter by topic |
Response includes totalEvents, a byTopic breakdown, the event list, and an alert field:
| Alert Level | Condition |
|---|---|
| OK | 10 or fewer events |
| WARNING | 11–100 events |
| CRITICAL | More than 100 events |
POST /api/v1/admin/dlq — Retry or purge DLQ events (admin role required).
```jsonc
{
  "action": "retry",       // "retry" or "purge"
  "eventIds": ["..."],     // optional: specific event IDs
  "topic": "decisions"     // optional: all events for a topic
}
```
  • Retry re-enqueues events back to the outbox with status: "pending" and retryCount: 0, then deletes the DLQ entry (transactional).
  • Purge permanently deletes matching DLQ events.
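The retry semantics can be sketched with in-memory collections. The real implementation runs inside a database transaction; the record shapes and the `retryDlqEvents` helper below are illustrative assumptions:

```typescript
// In-memory sketch of DLQ retry: re-enqueue to the outbox with a fresh retry
// budget, then remove the DLQ entry. Record shapes are illustrative.
interface OutboxEvent { id: string; topic: string; payload: unknown; status: string; retryCount: number; }
interface DlqEvent { id: string; topic: string; payload: unknown; }

function retryDlqEvents(dlq: DlqEvent[], outbox: OutboxEvent[], topic?: string): number {
  const matching = dlq.filter((e) => !topic || e.topic === topic);
  for (const e of matching) {
    // Re-enqueue with status "pending" and retryCount reset to 0...
    outbox.push({ id: e.id, topic: e.topic, payload: e.payload, status: "pending", retryCount: 0 });
    // ...and delete the DLQ entry.
    dlq.splice(dlq.indexOf(e), 1);
  }
  return matching.length;
}
```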

Monitoring

Track DLQ growth with the kaireon_dlq_depth gauge and outbox health with kaireon_outbox_event_age_seconds and kaireon_outbox_processed_total.

Cache Management

The platform caches offers, qualification rules, and contact policies to reduce database load during decision execution. An emergency flush endpoint is available for situations where cached data becomes stale.

POST /api/v1/admin/cache — Emergency cache invalidation (admin role required).
```jsonc
{
  "scope": "all"  // "all" | "offers" | "qualificationRule" | "contactPolicy"
}
```
If no body is provided, all caches are flushed. Every flush is audit-logged. Monitor cache effectiveness with kaireon_cache_hits_total and kaireon_cache_misses_total.

Performance

KaireonAI applies several optimizations to keep decision latency low and pipeline throughput high.

Decision Pipeline Caching

The decision engine caches frequently accessed data in Redis to avoid repeated database queries during recommendation processing:
| Cached Data | TTL | Key Pattern | Impact |
|---|---|---|---|
| Qualification rules | 120s | qual:{tenantId} | Avoids per-request rule loading |
| Contact policies | 120s | policy:{tenantId} | Avoids per-request policy loading |
| Decision flow config | 120s | flow:{flowId} | Flow route resolution cached per flow |
| Enrichment data | 120s | enrich:{schemaId}:{customerId} | Customer data cached across offers |
Cache entries are automatically invalidated when the underlying entity is updated through the API.

Query Optimizations

  • Creative queries are filtered by the set of candidate offer IDs, not loaded for the entire tenant. This prevents unbounded memory usage when a tenant has thousands of creatives across many offers.
  • Flow route resolution is cached with a 120-second TTL, avoiding repeated database lookups for the same flow across concurrent requests.

Pipeline Throughput

  • Chunked inserts — CSV ingestion writes to the database in batches of 1,000 rows, preventing memory exhaustion on large files and reducing transaction lock duration.
  • Streaming batch execution — The batch executor uses summary counters (rows loaded, failed, skipped) instead of accumulating all row results in memory, allowing pipelines to process files larger than available RAM.
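The chunked-insert pattern above can be sketched as a small helper. The `insertInBatches` function and its `insertMany` callback are illustrative, not the platform's API; the 1,000-row default matches the documented batch size:

```typescript
// Sketch of batched writes: split rows into fixed-size chunks so each insert
// stays bounded in memory and transaction lock time.
function chunk<T>(rows: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < rows.length; i += size) {
    batches.push(rows.slice(i, i + size));
  }
  return batches;
}

async function insertInBatches<T>(
  rows: T[],
  insertMany: (batch: T[]) => Promise<void>, // e.g. one DB transaction per batch
  batchSize = 1000,
): Promise<number> {
  let written = 0;
  for (const batch of chunk(rows, batchSize)) {
    await insertMany(batch);
    written += batch.length;
  }
  return written;
}
```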

Decision Traces

Decision traces provide forensic visibility into every stage of the decision pipeline. Configure tracing in Settings > General > Retention > Decision Trace.
| Setting | Description |
|---|---|
| decisionTraceEnabled | Master toggle for trace capture |
| decisionTraceSampleRate | Percentage of requests to trace (0–100) |
| Retention period | How long traces are retained before cleanup |
Traces are stored in the DecisionTrace table and viewable from the Decision Flows detail page. Each trace records the full pipeline execution: candidates at each stage, filter reasons, scores, rankings, and timing breakdowns.
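A percentage-based sample rate typically reduces to a per-request random check. The `shouldTrace` helper below is an illustrative sketch of how such a check could work, not the platform's implementation:

```typescript
// Illustrative sampling check for a 0-100 sample rate.
// The rng parameter is injectable for testing; Math.random is the default.
function shouldTrace(
  enabled: boolean,
  sampleRate: number,
  rng: () => number = Math.random,
): boolean {
  if (!enabled) return false; // master toggle wins
  return rng() * 100 < sampleRate;
}
```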

Monitoring Recipes

Latency Spiked

Decision latency suddenly increased. Identify which pipeline stage is the bottleneck:
```promql
# Overall decision latency p99
histogram_quantile(0.99, rate(kaireon_decision_pipeline_duration_ms_bucket[5m]))

# Break down by sub-stage to find the bottleneck
histogram_quantile(0.99, rate(kaireon_qualification_filter_latency_ms_bucket[5m]))
histogram_quantile(0.99, rate(kaireon_scoring_latency_ms_bucket[5m]))
histogram_quantile(0.99, rate(kaireon_ranking_latency_ms_bucket[5m]))
```
Check for scoring model circuit breakers opening (model inference may be slow or failing):
```promql
increase(kaireon_scoring_model_failure_total[5m]) > 0
```
Also check cache miss rates — a spike in misses can indicate a recent cache flush or deployment:
```promql
rate(kaireon_cache_misses_total[5m]) / (rate(kaireon_cache_hits_total[5m]) + rate(kaireon_cache_misses_total[5m]))
```

Conversion Dropped

Response outcomes stopped improving. Check delivery and outcome rates:
```promql
# Decision delivery rate over time
rate(kaireon_decision_delivery_total[1h])

# Outcome recording by type
rate(kaireon_respond_outcome_total[1h])
```
Look at qualification and contact policy filter rates to see if too many candidates are being filtered:
```promql
kaireon_decision_qualification_filter_rate
kaireon_decision_contact_policy_filter_rate
```
Check if experiment assignments are skewed:
```promql
rate(kaireon_experiment_assignment_total[1h])
```

DLQ Growing

The dead letter queue is accumulating events:
```promql
# DLQ depth by tenant
kaireon_dlq_depth

# Outbox processing failures
rate(kaireon_outbox_processed_total{status="failed"}[5m])

# Age of oldest unprocessed event
histogram_quantile(0.99, rate(kaireon_outbox_event_age_seconds_bucket[5m]))
```
Remediation steps:
  1. Check the DLQ admin endpoint (GET /api/v1/admin/dlq) to identify failing topics
  2. Investigate the root cause (downstream service outage, schema mismatch)
  3. Fix the underlying issue
  4. Retry events with POST /api/v1/admin/dlq with action: "retry"

HTTP Error Rate Elevated

Monitor 4xx and 5xx error rates across all API routes:
```promql
# 5xx error rate
rate(kaireon_http_error_total{status_class="5xx"}[5m])

# By route
topk(5, rate(kaireon_http_error_total{status_class="5xx"}[5m]))
```
Cross-reference with circuit breaker state changes — open breakers often correlate with elevated 5xx rates:
```promql
increase(kaireon_circuit_breaker_state_change_total{to="open"}[5m])
```
