Request Lifecycle
Every Recommend API request passes through a series of stages. Understanding where caching and rate limiting apply helps you tune for your workload.

Caching Strategy
KaireonAI uses Redis as its caching layer. All cache reads go through the `getCache().getOrFetch()` helper, which transparently handles cache misses by querying PostgreSQL and storing the result with a configurable TTL.
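As a minimal sketch of the cache-aside pattern this helper implements: an in-memory `Map` stands in for Redis, and the `fetcher` callback stands in for the PostgreSQL query. The class below is illustrative, not the actual `getCache()` implementation.

```typescript
type Entry<T> = { value: T; expiresAt: number };

class Cache {
  private store = new Map<string, Entry<unknown>>();

  async getOrFetch<T>(
    key: string,
    ttlSeconds: number,
    fetcher: () => Promise<T>,
  ): Promise<T> {
    const hit = this.store.get(key);
    // Cache hit: return the stored value while it is still fresh.
    if (hit && hit.expiresAt > Date.now()) return hit.value as T;
    // Cache miss: query the source, then store with the configured TTL.
    const value = await fetcher();
    this.store.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
    return value;
  }
}
```

A caller would use it with the key patterns from the table below, for example `cache.getOrFetch(`t:${tenantId}:offers:active`, 300, loadActiveOffers)`.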
What gets cached
| Data | Cache Key Pattern | Default TTL | Notes |
|---|---|---|---|
| Active offers | `t:{tenantId}:offers:active` | 300s (5 min) | Includes creatives, categories, placements |
| Qualification rules | `t:{tenantId}:policies:eligibility` | 300s (5 min) | Active rules ordered by priority |
| Contact policies | `t:{tenantId}:policies:contactPolicy` | 300s (5 min) | Active policies ordered by priority |
| Guardrail rules | `t:{tenantId}:guardrails:active` | 30s | Shorter TTL for faster policy iteration |
| Enrichment data | `enrich:{tenantId}:{customerId}:{schemaId}` | Configurable per source (default 60s) | Per-customer, per-schema-source |
| Rate limit counters | `ratelimit:{route}:{tenantId}:{identifier}` | Window duration + 1s | Redis sorted sets |
| Cap counters | `kaireon:cap:{key}` | End of current UTC day | Auto-expire at midnight UTC |
Tuning TTLs
Enrichment sources support per-source TTL configuration in the Decision Flow's enrichment stage. Set `cacheTtlSeconds` lower for volatile data (real-time signals, session context) and higher for stable data (customer demographics, account details).
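For illustration, a per-source override might look like the following. Only `cacheTtlSeconds` is documented above; the surrounding field names (`enrichment`, `sources`, `sourceId`) are assumptions for the sake of the example.

```json
{
  "enrichment": {
    "sources": [
      { "sourceId": "session-context", "cacheTtlSeconds": 30 },
      { "sourceId": "customer-demographics", "cacheTtlSeconds": 600 }
    ]
  }
}
```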
Cache Invalidation
Entity caches (offers, rules, policies) use a TTL-based expiration strategy. After updating an offer or policy via the CRUD API, changes propagate within the TTL window (up to 5 minutes for offers/policies, 30 seconds for guardrails). For immediate invalidation, restart the API process or reduce the TTL via environment configuration.

Rate Limiting
The Recommend API enforces per-tenant, per-endpoint rate limiting using a Redis sorted-set sliding window algorithm.

Algorithm
- `ZREMRANGEBYSCORE` — remove entries outside the current window
- `ZADD` — add the current request with its timestamp as score
- `ZCARD` — count entries remaining in the window
- `EXPIRE` — set key TTL to window duration + 1 second for cleanup
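The four steps above can be sketched in-process to show the sliding-window logic. An array of timestamps per key stands in for the Redis sorted set; each line mirrors one of the commands (the real limiter runs these atomically in Redis, and `EXPIRE` is left to garbage collection here).

```typescript
class SlidingWindowLimiter {
  private windows = new Map<string, number[]>();

  constructor(private limit: number, private windowMs: number) {}

  allow(key: string, now = Date.now()): boolean {
    // ZREMRANGEBYSCORE: drop timestamps that fell out of the window.
    const entries = (this.windows.get(key) ?? []).filter(
      (ts) => ts > now - this.windowMs,
    );
    // ZADD: record this request with its timestamp as the score.
    entries.push(now);
    this.windows.set(key, entries);
    // ZCARD: count entries remaining in the window and compare to the limit.
    return entries.length <= this.limit;
  }
}
```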
Tier Configuration
Rate limits are tenant-scoped with three built-in tiers:

| Tier | Requests per Minute |
|---|---|
| free | 100 |
| standard | 1,000 |
| enterprise | 10,000 |
Response Headers
When a request is rate-limited, the API returns HTTP 429 with:

| Header | Description |
|---|---|
| `X-RateLimit-Limit` | Maximum requests allowed in the window |
| `X-RateLimit-Remaining` | Requests remaining (0 when limited) |
| `Retry-After` | Seconds until the client should retry |
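On the client side, a 429 is typically handled by waiting for the `Retry-After` duration before retrying. The helper below is hypothetical (not part of any KaireonAI SDK) and shows one defensive way to read the header.

```typescript
// Read Retry-After from a 429 response and decide how long to wait, in seconds.
// getHeader abstracts over whatever HTTP client is in use.
function parseRetryAfter(getHeader: (name: string) => string | null): number {
  const raw = getHeader("Retry-After");
  const seconds = raw === null ? NaN : Number(raw);
  // Fall back to a 1 s wait when the header is missing or malformed.
  return Number.isFinite(seconds) && seconds >= 0 ? seconds : 1;
}
```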
Fail-Open vs Fail-Closed
The rate limiter supports two failure modes when Redis is unavailable:

- Fail-open (default): Falls back to in-memory rate limiting. Use for standard API endpoints where availability matters more than strict enforcement.
- Fail-closed: Returns 429 when Redis is down. The Recommend API uses `failOpen: false` to prevent abuse when rate limit state is unavailable.
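The distinction reduces to one branch in the error handler. A minimal sketch, where `redisCheck` stands in for the sorted-set check above and a thrown error simulates Redis being unreachable:

```typescript
// Returns true when the request is admitted, false when it should get a 429.
async function checkLimit(
  redisCheck: () => Promise<boolean>, // true = caller is under its limit
  failOpen: boolean,
): Promise<boolean> {
  try {
    return await redisCheck();
  } catch {
    // Redis unavailable: fail-open admits the request, fail-closed blocks it.
    return failOpen;
  }
}
```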
Edge-Layer Rate Limiting
For DDoS protection, layer edge-level rate limiting in front of the API: nginx `limit_req_zone`, AWS WAF rate rules, or Cloudflare rate limiting rules.
The application-level limiter handles tenant-scoped business logic limits;
the edge layer handles volumetric protection.
Circuit Breakers
The Decision Flow engine includes a per-model circuit breaker to prevent cascading failures when a scoring model is unhealthy.

Parameters
| Parameter | Value | Source |
|---|---|---|
| Failure threshold | 5 consecutive failures | `MODEL_CB_THRESHOLD` |
| Cooldown period | 60 seconds | `MODEL_CB_COOLDOWN_MS` |
| Fallback score | 0.5 (configurable) | `SCORING_FALLBACK_SCORE` env var |
| Fallback method | priority_weighted scoring | Uses offer priority and creative weight |
Behavior
- Each scoring model is tracked by its `modelKey`.
- On a model error, the failure counter increments via `recordModelFailure()`.
- When failures reach the threshold (5), the circuit opens and sets a cooldown expiry at now + 60 seconds.
- While open, all requests for that model skip the model call entirely and receive the fallback score (`0.5 * weight * fitMultiplier`).
- After the cooldown expires, the next request acts as a probe — if it succeeds, `recordModelSuccess()` resets the counter (circuit closes). If it fails, the circuit re-opens for another cooldown period.
- The response includes `degradedScoring: true` when any model was bypassed.
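The behavior above can be condensed into a small state holder. This is a sketch using the documented defaults (5 failures, 60 s cooldown); the method names follow the doc, but the class itself is illustrative rather than the engine's actual code.

```typescript
class ModelCircuitBreaker {
  private failures = 0;
  private openUntil = 0; // timestamp (ms) until which the circuit stays open

  constructor(
    private threshold = 5,      // MODEL_CB_THRESHOLD default
    private cooldownMs = 60_000, // MODEL_CB_COOLDOWN_MS default
  ) {}

  // While open, callers skip the model call and use the fallback score.
  isOpen(now = Date.now()): boolean {
    return now < this.openUntil;
  }

  recordModelFailure(now = Date.now()): void {
    this.failures++;
    if (this.failures >= this.threshold) {
      this.openUntil = now + this.cooldownMs; // open the circuit
    }
  }

  recordModelSuccess(): void {
    // A successful probe closes the circuit and resets the counter.
    this.failures = 0;
    this.openUntil = 0;
  }
}
```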
Monitoring
The `scoringModelFailureTotal` Prometheus counter tracks failures by model key.
Monitor this metric to detect model health issues before they impact all decisions.
The `decisionFlowExecutionLatency` histogram tracks end-to-end pipeline latency.
Atomic Cap Checking

Mandatory offer daily caps use Redis INCR for race-free atomic counting:

- `INCR kaireon:cap:mandatory:{customerId}:{YYYY-MM-DD}` — single atomic read-and-increment
- On first increment (`current === 1`), set EXPIRE to end of current UTC day
- If `current > cap`, the offer is blocked
- Default cap: 5 mandatory offers per customer per day (configurable via `MAX_MANDATORY_OFFERS_PER_DAY` env var)

If Redis is unavailable, the check falls back to counting the day's interactionSummary records in PostgreSQL. This fallback has a small race window under concurrent requests but is acceptable for resilience.

The cap check fails closed — if both Redis and the database count fail, mandatory offers are blocked entirely (safety-first design).
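Put together, the happy path looks roughly like this. A `Map` stands in for Redis (with a real client the increment would be `INCR` and the first-increment branch would set `EXPIRE` to UTC midnight); the function is illustrative, not the production implementation.

```typescript
const counters = new Map<string, number>();

// INCR stand-in: a single atomic read-and-increment per key.
function incr(key: string): number {
  const next = (counters.get(key) ?? 0) + 1;
  counters.set(key, next);
  return next;
}

// Returns true when the offer may be shown, false when the daily cap blocks it.
// Default cap of 5 matches MAX_MANDATORY_OFFERS_PER_DAY.
function checkMandatoryCap(customerId: string, cap = 5, date = new Date()): boolean {
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD in UTC
  const current = incr(`kaireon:cap:mandatory:${customerId}:${day}`);
  // On current === 1, a real implementation would also EXPIRE the key at UTC midnight.
  return current <= cap;
}
```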
Connection Pooling
KaireonAI uses Prisma 7 with the `@prisma/adapter-pg` driver adapter. Connection pooling is handled by the underlying pg Pool.
Configuration
The database connection is configured in `prisma.config.ts` via the `DATABASE_URL` environment variable. Pool sizing is controlled through connection string parameters:
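For example (illustrative only — the exact parameter names depend on the driver and Prisma version; with `@prisma/adapter-pg`, pool size is often set in code via the pg Pool's `max` option instead of the URL):

```shell
# Cap the pool at 10 connections with a 30 s acquisition timeout.
DATABASE_URL="postgresql://user:pass@db-host:5432/kaireon?connection_limit=10&pool_timeout=30"
```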
Sizing Guidance
| Deployment Size | Suggested Pool Size | Notes |
|---|---|---|
| Single instance | 5-10 | Default pg Pool settings are sufficient |
| 2-5 replicas | 10-15 per replica | Total connections = replicas x pool size |
| 10+ replicas | 5-10 per replica | Use PgBouncer or RDS Proxy to multiplex |
Keep the total connection count across all replicas below PostgreSQL's max_connections minus a buffer for admin/monitoring connections.
Horizontal Scaling
The KaireonAI API is stateless — all shared state lives in Redis and PostgreSQL. This means you can scale API instances horizontally with no coordination overhead.

Architecture
Key Properties
- No session affinity required: Any API pod can handle any request. Rate limit state and caching are in Redis; all persistent state is in PostgreSQL.
- In-memory circuit breakers are per-process: Each pod tracks its own model failure counts. This is intentional — a model failure on one pod does not cascade to others, and each pod independently probes recovery.
- In-memory rate limit fallback is per-process: When Redis is down, each pod maintains its own rate limit counters. Effective limits become `configured_limit x num_pods` during Redis outages.
- Scale API pods independently from worker pods: Decision API pods handle synchronous request/response. Data pipeline worker pods handle asynchronous ETL. Size each tier based on its workload.
Batch vs Streaming Pipelines
Data pipelines support two execution modes, configured per-pipeline in the `executionConfig` JSON field:
Batch Mode
For scheduled or on-demand data loads:

| Parameter | Description | Default |
|---|---|---|
| `batchSize` | Records per processing chunk | 1000 |
| `parallelism` | Concurrent processing threads | 1 |
| `partitioning.strategy` | How to split data (hash, range, round_robin) | None |
| `partitioning.key` | Field to partition on | — |
| `partitioning.partitions` | Number of partitions | — |
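Putting the table together, a batch `executionConfig` might look like the following (field names follow the table above; the overall JSON shape and the example values are assumptions):

```json
{
  "batchSize": 1000,
  "parallelism": 4,
  "partitioning": {
    "strategy": "hash",
    "key": "customerId",
    "partitions": 8
  }
}
```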
Streaming Mode
For real-time event processing from Kafka, Kinesis, or Confluent.

K8s Worker Pod Configuration

Pipeline execution runs on dedicated worker pods, separate from the API tier. Configure resource limits based on pipeline complexity:

- CPU-bound transforms (expression evaluation, hashing): Scale `parallelism` up to available CPU cores.
- I/O-bound transforms (external lookups, PII masking): Higher `parallelism` with moderate CPU allocation.
- Memory-bound transforms (large batch joins): Increase pod memory limits and reduce `batchSize`.
Production Tuning Checklist
Under 1K decisions/day
- Single API instance with default settings
- In-memory rate limiting and caching are sufficient
- Default connection pool (5-10 connections)
- No Redis required (in-memory fallbacks handle the load)
1K — 100K decisions/day
- Redis required for rate limiting, enrichment caching, and atomic cap checks
- Tune enrichment TTLs: increase to 300s+ for stable customer data
- Increase connection pool to 15-20 per instance
- Enable decision tracing with a sample rate (e.g., 10%) rather than 100%
- Monitor `decisionLatencyMs` and `scoringModelFailureTotal` metrics
100K+ decisions/day
- Multiple API replicas behind a load balancer (3+ pods recommended)
- Dedicated Redis instance (ElastiCache or equivalent) with sufficient memory for rate limit sorted sets + enrichment cache
- PostgreSQL read replicas for offer/policy reads; primary for writes only
- Use PgBouncer or RDS Proxy to multiplex database connections
- Set `MAX_ACTIVE_OFFERS` env var to limit the offer scan set (default: 5,000)
- Lower guardrail TTL if policy changes need sub-30s propagation
- Configure edge-layer rate limiting (AWS WAF, Cloudflare) for DDoS protection
- Enable budget pacing for high-volume offers to spread delivery across the day
- Set `SCORING_FALLBACK_SCORE` to a value appropriate for your scoring distribution
Environment Variables Reference
| Variable | Description | Default |
|---|---|---|
| `REDIS_URL` | Redis connection string | redis://localhost:6379 |
| `DATABASE_URL` | PostgreSQL connection string | — (required) |
| `MAX_ACTIVE_OFFERS` | Max offers loaded per decision | 5000 |
| `MAX_MANDATORY_OFFERS_PER_DAY` | Daily mandatory cap per customer | 5 |
| `SCORING_FALLBACK_SCORE` | Score when model is unavailable | 0.5 |
| `RATE_LIMIT_TIER` | Global rate limit tier | standard |
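Assembled from the table above, a minimal production `.env` might look like this (hostnames and the password are illustrative placeholders):

```shell
REDIS_URL=redis://redis.internal:6379
DATABASE_URL=postgresql://kaireon:change-me@db.internal:5432/kaireon
MAX_ACTIVE_OFFERS=5000
MAX_MANDATORY_OFFERS_PER_DAY=5
SCORING_FALLBACK_SCORE=0.5
RATE_LIMIT_TIER=standard
```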