KaireonAI ships with production-grade operational infrastructure built into the platform. Rate limiting, circuit breakers, dead letter queues, and Prometheus metrics all work out of the box — configure them through environment variables and tenant settings.
All metrics are exposed at GET /api/metrics in Prometheus text format. Scrape this endpoint from your Prometheus server or any compatible collector. Default metrics (Node.js process stats) are auto-collected with the kaireon_ prefix.
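If you want a quick sanity check outside Prometheus, the exposition format is plain text and easy to inspect. The sketch below (`kaireonMetricNames` is a hypothetical helper, not part of the platform) pulls the `kaireon_`-prefixed series names out of a scraped payload:

```typescript
// Minimal scan of Prometheus text exposition format: collect the names
// of all kaireon_-prefixed series. Illustrative only — a real scraper
// (Prometheus itself) handles parsing for you.
function kaireonMetricNames(exposition: string): string[] {
  const names = new Set<string>();
  for (const line of exposition.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.startsWith("#")) continue; // skip comments and blanks
    const name = trimmed.split(/[{\s]/, 1)[0]; // metric name ends at '{' or whitespace
    if (name.startsWith("kaireon_")) names.add(name);
  }
  return [...names].sort();
}

const sample = [
  "# HELP kaireon_process_cpu_seconds_total Total CPU time.",
  "kaireon_process_cpu_seconds_total 12.5",
  'kaireon_cache_hits_total{cache="offers"} 42',
  "up 1",
].join("\n");

console.log(kaireonMetricNames(sample));
// → ["kaireon_cache_hits_total", "kaireon_process_cpu_seconds_total"]
```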
The platform uses a sliding window algorithm to enforce per-key request limits. Each request timestamp is recorded; when the count within the window exceeds the configured maximum, subsequent requests are rejected with 429 Too Many Requests.
Each rate-limit decision exposes allowed (a boolean), remaining (requests left in the window), and retryAfterMs (milliseconds until the earliest window slot frees up, set only when the request is rejected) so callers can surface helpful retry guidance to clients.
Rate limiters are instantiated with two parameters:
```typescript
new RateLimiter({
  maxRequests: 100, // requests per window
  windowMs: 60_000, // window size in milliseconds
});
```
If Redis is not configured (REDIS_URL not set), rate limiting falls back to in-memory mode. The platform still works, but limits are per-process rather than global.
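The sliding-window algorithm described above can be sketched as a small in-memory limiter, matching the Redis-less fallback mode. Class and method names here are illustrative, not the platform's actual API:

```typescript
interface RateLimitDecision {
  allowed: boolean;
  remaining: number;
  retryAfterMs?: number; // set only when the request is rejected
}

// In-memory sliding-window limiter (per-process, like the fallback mode
// described above). Each request's timestamp is recorded per key.
class SlidingWindowLimiter {
  private timestamps = new Map<string, number[]>();

  constructor(
    private maxRequests: number,
    private windowMs: number,
  ) {}

  check(key: string, now = Date.now()): RateLimitDecision {
    const cutoff = now - this.windowMs;
    // Drop timestamps that have slid out of the window.
    const recent = (this.timestamps.get(key) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.maxRequests) {
      this.timestamps.set(key, recent);
      // The earliest slot frees up when the oldest timestamp expires.
      return { allowed: false, remaining: 0, retryAfterMs: recent[0] + this.windowMs - now };
    }
    recent.push(now);
    this.timestamps.set(key, recent);
    return { allowed: true, remaining: this.maxRequests - recent.length };
  }
}

const limiter = new SlidingWindowLimiter(2, 60_000);
const t0 = 1_000_000;
console.log(limiter.check("api-key-1", t0));      // allowed, remaining 1
console.log(limiter.check("api-key-1", t0 + 10)); // allowed, remaining 0
console.log(limiter.check("api-key-1", t0 + 20)); // rejected, retryAfterMs 59980
```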
Circuit breaker state is persisted to Redis (key prefix kaireon:cb:) when REDIS_URL is set, so state survives process restarts. If Redis is unavailable, state is maintained in-memory only.
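The document does not spell out the breaker's internals, but a standard closed/open/half-open state machine looks roughly like this. The threshold and timeout defaults are assumptions, and the Redis persistence described above is omitted for brevity:

```typescript
type CircuitState = "closed" | "open" | "half-open";

// Standard three-state breaker sketch. failureThreshold and resetTimeoutMs
// are illustrative defaults, not the platform's actual configuration.
class CircuitBreaker {
  private state: CircuitState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    public readonly name: string,
    private failureThreshold = 5,
    private resetTimeoutMs = 30_000,
    private onTransition?: (from: CircuitState, to: CircuitState) => void,
  ) {}

  private transition(to: CircuitState): void {
    if (to === this.state) return;
    this.onTransition?.(this.state, to); // e.g. increment the state-change counter
    this.state = to;
  }

  currentState(now = Date.now()): CircuitState {
    if (this.state === "open" && now - this.openedAt >= this.resetTimeoutMs) {
      this.transition("half-open"); // allow one probe request through
    }
    return this.state;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.transition("closed");
  }

  recordFailure(now = Date.now()): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.openedAt = now;
      this.transition("open");
    }
  }
}

const breaker = new CircuitBreaker("scoring-model", 2, 1_000, (from, to) =>
  console.log(`transition ${from} -> ${to}`),
);
breaker.recordFailure(0);
breaker.recordFailure(0); // threshold hit: closed -> open
console.log(breaker.currentState(0)); // → "open"
```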
Every state transition emits a kaireon_circuit_breaker_state_change_total counter increment with labels name, from, and to. Alert on transitions to open:
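For example, an alerting expression along these lines (a sketch — tune the window and threshold to your environment):

```promql
# Fire when any breaker transitions to open in the last 5 minutes
increase(kaireon_circuit_breaker_state_change_total{to="open"}[5m]) > 0
```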
Events that fail processing after retries are moved from the outbox to the dead letter queue (DLQ). DLQ entries are persisted, scoped per tenant, and organized by topic so admins can triage failed events by source.
GET /api/v1/admin/dlq — Retrieve DLQ summary and events (admin role required).
| Parameter | Type  | Default | Description                          |
|-----------|-------|---------|--------------------------------------|
| limit     | query | 50      | Max events returned (capped at 200)  |
| topic     | query | —       | Filter by topic                      |
Response includes totalEvents, a byTopic breakdown, the event list, and an alert field:
| Alert Level  | Condition            |
|--------------|----------------------|
| "OK"         | 10 or fewer events   |
| "WARNING"    | 11-100 events        |
| "CRITICAL"   | More than 100 events |
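The thresholds above translate directly into a small helper; a sketch (the function name is illustrative, not a platform API):

```typescript
type DlqAlert = "OK" | "WARNING" | "CRITICAL";

// Maps total DLQ depth to the alert levels in the table above.
function dlqAlertLevel(totalEvents: number): DlqAlert {
  if (totalEvents > 100) return "CRITICAL";
  if (totalEvents > 10) return "WARNING";
  return "OK";
}

console.log(dlqAlertLevel(10));  // → "OK"
console.log(dlqAlertLevel(11));  // → "WARNING"
console.log(dlqAlertLevel(101)); // → "CRITICAL"
```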
POST /api/v1/admin/dlq — Retry or purge DLQ events (admin role required).
```json
{
  "action": "retry",    // "retry" or "purge"
  "eventIds": ["..."],  // optional: specific event IDs
  "topic": "decisions"  // optional: all events for a topic
}
```
Retry re-enqueues events back to the outbox with status: "pending" and retryCount: 0, then deletes the DLQ entry (transactional).
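The retry semantics can be sketched with in-memory maps standing in for the outbox and DLQ tables. In the real platform the re-enqueue and delete happen in one database transaction; this sketch only simulates that, and all names are illustrative:

```typescript
interface OutboxEvent {
  id: string;
  topic: string;
  status: "pending" | "failed";
  retryCount: number;
  payload: unknown;
}

// In-memory stand-ins for the outbox and DLQ tables.
const outbox = new Map<string, OutboxEvent>();
const dlq = new Map<string, OutboxEvent>();

// Re-enqueue a DLQ entry: reset status and retryCount, then delete the
// DLQ row. In the real platform both writes share one transaction.
function retryDlqEvent(eventId: string): boolean {
  const event = dlq.get(eventId);
  if (!event) return false;
  outbox.set(event.id, { ...event, status: "pending", retryCount: 0 });
  dlq.delete(eventId);
  return true;
}

dlq.set("evt-1", { id: "evt-1", topic: "decisions", status: "failed", retryCount: 5, payload: {} });
retryDlqEvent("evt-1");
console.log(outbox.get("evt-1")?.status, dlq.has("evt-1")); // → pending false
```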
The platform caches offers, qualification rules, and contact policies to reduce database load during decision execution. An emergency flush endpoint is available for situations where cached data becomes stale.

POST /api/v1/admin/cache — Emergency cache invalidation (admin role required).
If no body is provided, all caches are flushed. Every flush is audit-logged.

Monitor cache effectiveness with kaireon_cache_hits_total and kaireon_cache_misses_total.
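A common way to watch effectiveness is the hit ratio over a recent window; a sketch of the query:

```promql
# Cache hit ratio over the last 5 minutes
sum(rate(kaireon_cache_hits_total[5m]))
  / (sum(rate(kaireon_cache_hits_total[5m])) + sum(rate(kaireon_cache_misses_total[5m])))
```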
- Candidate-scoped creative queries — Creative queries are filtered by the set of candidate offer IDs rather than loaded for the entire tenant, preventing unbounded memory usage when a tenant has thousands of creatives across many offers.
- Flow route caching — Flow route resolution is cached with a 120-second TTL, avoiding repeated database lookups for the same flow across concurrent requests.
- Chunked inserts — CSV ingestion writes to the database in batches of 1,000 rows, preventing memory exhaustion on large files and reducing transaction lock duration.
- Streaming batch execution — The batch executor uses summary counters (rows loaded, failed, skipped) instead of accumulating all row results in memory, allowing pipelines to process files larger than available RAM.
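The chunked-insert pattern described above can be sketched as follows; `insertBatch` is a hypothetical stand-in for the actual database write:

```typescript
// Split rows into fixed-size batches so a single INSERT never holds
// more than batchSize rows. insertBatch is a hypothetical stand-in
// for the real database write.
async function insertInChunks<T>(
  rows: T[],
  insertBatch: (batch: T[]) => Promise<void>,
  batchSize = 1_000,
): Promise<number> {
  let written = 0;
  for (let i = 0; i < rows.length; i += batchSize) {
    const batch = rows.slice(i, i + batchSize);
    await insertBatch(batch); // one transaction per batch
    written += batch.length;
  }
  return written;
}
```

A 2,500-row file, for example, would be written as three batches of 1,000, 1,000, and 500 rows.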
Decision traces provide forensic visibility into every stage of the decision pipeline. Configure tracing in Settings > General > Retention > Decision Trace.
| Setting                 | Description                                  |
|-------------------------|----------------------------------------------|
| decisionTraceEnabled    | Master toggle for trace capture              |
| decisionTraceSampleRate | Percentage of requests to trace (0-100)      |
| Retention period        | How long traces are retained before cleanup  |
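Percentage-based sampling like decisionTraceSampleRate is typically a per-request random draw; a minimal sketch (`shouldTrace` is illustrative, not the platform's function):

```typescript
// Decide whether to capture a trace for this request, given a sample
// rate expressed as a percentage (0-100). The rng parameter exists so
// the decision is testable; production code would use Math.random().
function shouldTrace(sampleRate: number, rng: () => number = Math.random): boolean {
  return rng() * 100 < sampleRate;
}

console.log(shouldTrace(100)); // → true  (always traces)
console.log(shouldTrace(0));   // → false (never traces)
```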
Traces are persisted to the decision-trace store and viewable from the Decision Flows detail page. Each trace records the full pipeline execution: candidates at each stage, filter reasons, scores, rankings, and timing breakdowns.
Decision latency suddenly increased. Identify which pipeline stage is the bottleneck:
```promql
# Overall decision latency p99
histogram_quantile(0.99, rate(kaireon_decision_pipeline_duration_ms_bucket[5m]))

# Break down by sub-stage to find the bottleneck
histogram_quantile(0.99, rate(kaireon_qualification_filter_latency_ms_bucket[5m]))
histogram_quantile(0.99, rate(kaireon_scoring_latency_ms_bucket[5m]))
histogram_quantile(0.99, rate(kaireon_ranking_latency_ms_bucket[5m]))
```
Check for scoring model circuit breakers opening (model inference may be slow or failing):
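A query along these lines surfaces recent opens per breaker (a sketch; adjust the window to your environment):

```promql
# Circuit breaker transitions to open in the last 15 minutes, by breaker name
sum by (name) (increase(kaireon_circuit_breaker_state_change_total{to="open"}[15m]))
```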
```promql
# DLQ depth by tenant
kaireon_dlq_depth

# Outbox processing failures
rate(kaireon_outbox_processed_total{status="failed"}[5m])

# Age of oldest unprocessed event
histogram_quantile(0.99, rate(kaireon_outbox_event_age_seconds_bucket[5m]))
```
Remediation steps:

1. Check the DLQ admin endpoint (GET /api/v1/admin/dlq) to identify failing topics.
2. Investigate the root cause (downstream service outage, schema mismatch).
3. Fix the underlying issue.
4. Retry events with POST /api/v1/admin/dlq and action: "retry".