Request Lifecycle
Every Recommend API request passes through a series of stages. Understanding where caching and rate limiting apply helps you tune for your workload.

Caching Strategy
KaireonAI uses Redis as its caching layer. All cache reads go through the `getCache().getOrFetch()` helper, which transparently handles cache misses by querying PostgreSQL and storing the result with a configurable TTL.
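As a minimal sketch of the cache-aside pattern this helper implements: an in-memory `Map` stands in for Redis, and the `fetcher` callback stands in for the PostgreSQL query. The class below is illustrative, not the actual `getCache()` implementation.

```typescript
type Entry<T> = { value: T; expiresAt: number };

class Cache {
  private store = new Map<string, Entry<unknown>>();

  async getOrFetch<T>(
    key: string,
    ttlSeconds: number,
    fetcher: () => Promise<T>,
  ): Promise<T> {
    const hit = this.store.get(key);
    // Cache hit: return the stored value while it is still fresh.
    if (hit && hit.expiresAt > Date.now()) return hit.value as T;
    // Cache miss: query the source, then store with the configured TTL.
    const value = await fetcher();
    this.store.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
    return value;
  }
}
```

A caller would use it with the key patterns from the table below, for example `cache.getOrFetch(`t:${tenantId}:offers:active`, 300, loadActiveOffers)`.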
What gets cached
| Data | Cache Key Pattern | Default TTL | Notes |
|---|---|---|---|
| Active offers | `t:{tenantId}:offers:active` | 300s (5 min) | Includes creatives, categories, placements |
| Qualification rules | `t:{tenantId}:policies:eligibility` | 300s (5 min) | Active rules ordered by priority |
| Contact policies | `t:{tenantId}:policies:contactPolicy` | 300s (5 min) | Active policies ordered by priority |
| Guardrail rules | `t:{tenantId}:guardrails:active` | 30s | Shorter TTL for faster policy iteration |
| Enrichment data | `enrich:{tenantId}:{customerId}:{schemaId}` | Configurable per source (default 60s) | Per-customer, per-schema-source |
| Rate limit counters | `ratelimit:{route}:{tenantId}:{identifier}` | Window duration + 1s | Redis sorted sets |
| Cap counters | `kaireon:cap:{key}` | End of current UTC day | Auto-expire at midnight UTC |
Tuning TTLs
Enrichment sources support per-source TTL configuration in the Decision Flow's enrichment stage. Set `cacheTtlSeconds` lower for volatile data (real-time signals, session context) and higher for stable data (customer demographics, account details).
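For illustration, a per-source override might look like the following. Only `cacheTtlSeconds` is documented above; the surrounding field names (`enrichment`, `sources`, `sourceId`) are assumptions for the sake of the example.

```json
{
  "enrichment": {
    "sources": [
      { "sourceId": "session-context", "cacheTtlSeconds": 30 },
      { "sourceId": "customer-demographics", "cacheTtlSeconds": 600 }
    ]
  }
}
```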
Cache Invalidation
Entity caches (offers, rules, policies) use a TTL-based expiration strategy. After updating an offer or policy via the CRUD API, changes propagate within the TTL window (up to 5 minutes for offers/policies, 30 seconds for guardrails). For immediate invalidation, restart the API process or reduce the TTL via environment configuration.

Rate Limiting
The Recommend API enforces per-tenant, per-endpoint rate limiting using a Redis sorted-set sliding window algorithm.

Algorithm
- `ZREMRANGEBYSCORE` — remove entries outside the current window
- `ZADD` — add the current request with its timestamp as score
- `ZCARD` — count entries remaining in the window
- `EXPIRE` — set key TTL to window duration + 1 second for cleanup
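The four steps above can be sketched in-process to show the sliding-window logic. An array of timestamps per key stands in for the Redis sorted set; each line mirrors one of the commands (the real limiter runs these atomically in Redis, and `EXPIRE` is left to garbage collection here).

```typescript
class SlidingWindowLimiter {
  private windows = new Map<string, number[]>();

  constructor(private limit: number, private windowMs: number) {}

  allow(key: string, now = Date.now()): boolean {
    // ZREMRANGEBYSCORE: drop timestamps that fell out of the window.
    const entries = (this.windows.get(key) ?? []).filter(
      (ts) => ts > now - this.windowMs,
    );
    // ZADD: record this request with its timestamp as the score.
    entries.push(now);
    this.windows.set(key, entries);
    // ZCARD: count entries remaining in the window and compare to the limit.
    return entries.length <= this.limit;
  }
}
```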
Tier Configuration
Rate limits are tenant-scoped with three built-in tiers:

| Tier | Requests per Minute |
|---|---|
| free | 100 |
| standard | 1,000 |
| enterprise | 10,000 |
Response Headers
When a request is rate-limited, the API returns HTTP 429 with:

| Header | Description |
|---|---|
| `X-RateLimit-Limit` | Maximum requests allowed in the window |
| `X-RateLimit-Remaining` | Requests remaining (0 when limited) |
| `Retry-After` | Seconds until the client should retry |
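On the client side, a 429 is typically handled by waiting for the `Retry-After` duration before retrying. The helper below is hypothetical (not part of any KaireonAI SDK) and shows one defensive way to read the header.

```typescript
// Read Retry-After from a 429 response and decide how long to wait, in seconds.
// getHeader abstracts over whatever HTTP client is in use.
function parseRetryAfter(getHeader: (name: string) => string | null): number {
  const raw = getHeader("Retry-After");
  const seconds = raw === null ? NaN : Number(raw);
  // Fall back to a 1 s wait when the header is missing or malformed.
  return Number.isFinite(seconds) && seconds >= 0 ? seconds : 1;
}
```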
Fail-Open vs Fail-Closed
The rate limiter supports two failure modes when Redis is unavailable:

- Fail-open (default): Falls back to in-memory rate limiting. Use for standard API endpoints where availability matters more than strict enforcement.
- Fail-closed: Returns 429 when Redis is down. The Recommend API uses `failOpen: false` to prevent abuse when rate limit state is unavailable.
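The distinction reduces to one branch in the error handler. A minimal sketch, where `redisCheck` stands in for the sorted-set check above and a thrown error simulates Redis being unreachable:

```typescript
// Returns true when the request is admitted, false when it should get a 429.
async function checkLimit(
  redisCheck: () => Promise<boolean>, // true = caller is under its limit
  failOpen: boolean,
): Promise<boolean> {
  try {
    return await redisCheck();
  } catch {
    // Redis unavailable: fail-open admits the request, fail-closed blocks it.
    return failOpen;
  }
}
```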
Edge-Layer Rate Limiting
For DDoS protection, layer edge-level rate limiting in front of the API: nginx `limit_req_zone`, AWS WAF rate rules, or Cloudflare rate limiting rules.
The application-level limiter handles tenant-scoped business logic limits;
the edge layer handles volumetric protection.
Circuit Breakers
The Decision Flow engine includes a per-model circuit breaker to prevent cascading failures when a scoring model is unhealthy.

Parameters
| Parameter | Value | Source |
|---|---|---|
| Failure threshold | 5 consecutive failures | `MODEL_CB_THRESHOLD` |
| Cooldown period | 60 seconds | `MODEL_CB_COOLDOWN_MS` |
| Fallback score | 0.5 (configurable) | `SCORING_FALLBACK_SCORE` env var |
| Fallback method | priority_weighted scoring | Uses offer priority and creative weight |
Behavior
- Each scoring model is tracked by its `modelKey`.
- On a model error, the failure counter increments via `recordModelFailure()`.
- When failures reach the threshold (5), the circuit opens and sets a cooldown expiry at now + 60 seconds.
- While open, all requests for that model skip the model call entirely and receive the fallback score (`0.5 * weight * fitMultiplier`).
- After the cooldown expires, the next request acts as a probe — if it succeeds, `recordModelSuccess()` resets the counter (circuit closes). If it fails, the circuit re-opens for another cooldown period.
- The response includes `degradedScoring: true` when any model was bypassed.
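The behavior above can be condensed into a small state holder. This is a sketch using the documented defaults (5 failures, 60 s cooldown); the method names follow the doc, but the class itself is illustrative rather than the engine's actual code.

```typescript
class ModelCircuitBreaker {
  private failures = 0;
  private openUntil = 0; // timestamp (ms) until which the circuit stays open

  constructor(
    private threshold = 5,      // MODEL_CB_THRESHOLD default
    private cooldownMs = 60_000, // MODEL_CB_COOLDOWN_MS default
  ) {}

  // While open, callers skip the model call and use the fallback score.
  isOpen(now = Date.now()): boolean {
    return now < this.openUntil;
  }

  recordModelFailure(now = Date.now()): void {
    this.failures++;
    if (this.failures >= this.threshold) {
      this.openUntil = now + this.cooldownMs; // open the circuit
    }
  }

  recordModelSuccess(): void {
    // A successful probe closes the circuit and resets the counter.
    this.failures = 0;
    this.openUntil = 0;
  }
}
```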
Monitoring
The `scoringModelFailureTotal` Prometheus counter tracks failures by model key.
Monitor this metric to detect model health issues before they impact all decisions.
The `decisionFlowExecutionLatency` histogram tracks end-to-end pipeline latency.
Atomic Cap Checking

Mandatory offer daily caps use Redis INCR for race-free atomic counting:

- `INCR kaireon:cap:mandatory:{customerId}:{YYYY-MM-DD}` — single atomic read-and-increment
- On first increment (`current === 1`), set EXPIRE to end of current UTC day
- If `current > cap`, the offer is blocked
- Default cap: 5 mandatory offers per customer per day (configurable via `MAX_MANDATORY_OFFERS_PER_DAY` env var)

If Redis is unavailable, the check falls back to counting the day's interactionSummary records in PostgreSQL. This fallback has a small race window under concurrent requests but is acceptable for resilience.

The cap check fails closed — if both Redis and the database count fail, mandatory offers are blocked entirely (safety-first design).
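Put together, the happy path looks roughly like this. A `Map` stands in for Redis (with a real client the increment would be `INCR` and the first-increment branch would set `EXPIRE` to UTC midnight); the function is illustrative, not the production implementation.

```typescript
const counters = new Map<string, number>();

// INCR stand-in: a single atomic read-and-increment per key.
function incr(key: string): number {
  const next = (counters.get(key) ?? 0) + 1;
  counters.set(key, next);
  return next;
}

// Returns true when the offer may be shown, false when the daily cap blocks it.
// Default cap of 5 matches MAX_MANDATORY_OFFERS_PER_DAY.
function checkMandatoryCap(customerId: string, cap = 5, date = new Date()): boolean {
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD in UTC
  const current = incr(`kaireon:cap:mandatory:${customerId}:${day}`);
  // On current === 1, a real implementation would also EXPIRE the key at UTC midnight.
  return current <= cap;
}
```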
Connection Pooling
KaireonAI uses Prisma 7 with the `@prisma/adapter-pg` driver adapter. Connection pooling is handled by the underlying pg Pool.
Configuration
The database connection is configured in `prisma.config.ts` via the `DATABASE_URL` environment variable. Pool sizing is controlled through connection string parameters:
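For example (illustrative only — the exact parameter names depend on the driver and Prisma version; with `@prisma/adapter-pg`, pool size is often set in code via the pg Pool's `max` option instead of the URL):

```shell
# Cap the pool at 10 connections with a 30 s acquisition timeout.
DATABASE_URL="postgresql://user:pass@db-host:5432/kaireon?connection_limit=10&pool_timeout=30"
```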
Sizing Guidance
| Deployment Size | Suggested Pool Size | Notes |
|---|---|---|
| Single instance | 5-10 | Default pg Pool settings are sufficient |
| 2-5 replicas | 10-15 per replica | Total connections = replicas x pool size |
| 10+ replicas | 5-10 per replica | Use PgBouncer or RDS Proxy to multiplex |
Keep the total connection count across all replicas below PostgreSQL's max_connections minus a buffer for admin/monitoring connections.
Horizontal Scaling
The KaireonAI API is stateless — all shared state lives in Redis and PostgreSQL. This means you can scale API instances horizontally with no coordination overhead.

Architecture
Key Properties
- No session affinity required: Any API pod can handle any request. Rate limit state and caching are in Redis; all persistent state is in PostgreSQL.
- In-memory circuit breakers are per-process: Each pod tracks its own model failure counts. This is intentional — a model failure on one pod does not cascade to others, and each pod independently probes recovery.
- In-memory rate limit fallback is per-process: When Redis is down, each pod maintains its own rate limit counters. Effective limits become `configured_limit x num_pods` during Redis outages.
- Scale API pods independently from worker pods: Decision API pods handle synchronous request/response. Data pipeline worker pods handle asynchronous ETL. Size each tier based on its workload.
Batch vs Streaming Pipelines
Data pipelines support two execution modes, configured per-pipeline in the `executionConfig` JSON field:
Batch Mode
For scheduled or on-demand data loads:

| Parameter | Description | Default |
|---|---|---|
| `batchSize` | Records per processing chunk | 1000 |
| `parallelism` | Concurrent processing threads | 1 |
| `partitioning.strategy` | How to split data (hash, range, round_robin) | None |
| `partitioning.key` | Field to partition on | — |
| `partitioning.partitions` | Number of partitions | — |
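Putting the table together, a batch `executionConfig` might look like the following (field names follow the table above; the overall JSON shape and the example values are assumptions):

```json
{
  "batchSize": 1000,
  "parallelism": 4,
  "partitioning": {
    "strategy": "hash",
    "key": "customerId",
    "partitions": 8
  }
}
```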
Streaming Mode
For real-time event processing from Kafka, Kinesis, or Confluent.

K8s Worker Pod Configuration

Pipeline execution runs on dedicated worker pods, separate from the API tier. Configure resource limits based on pipeline complexity:

- CPU-bound transforms (expression evaluation, hashing): Scale `parallelism` up to available CPU cores.
- I/O-bound transforms (external lookups, PII masking): Higher `parallelism` with moderate CPU allocation.
- Memory-bound transforms (large batch joins): Increase pod memory limits and reduce `batchSize`.
Production Tuning Checklist
Under 1K decisions/day
- Single API instance with default settings
- In-memory rate limiting and caching are sufficient
- Default connection pool (5-10 connections)
- No Redis required (in-memory fallbacks handle the load)
1K — 100K decisions/day
- Redis required for rate limiting, enrichment caching, and atomic cap checks
- Tune enrichment TTLs: increase to 300s+ for stable customer data
- Increase connection pool to 15-20 per instance
- Enable decision tracing with a sample rate (e.g., 10%) rather than 100%
- Monitor `decisionLatencyMs` and `scoringModelFailureTotal` metrics
100K+ decisions/day
- Multiple API replicas behind a load balancer (3+ pods recommended)
- Dedicated Redis instance (ElastiCache or equivalent) with sufficient memory for rate limit sorted sets + enrichment cache
- PostgreSQL read replicas for offer/policy reads; primary for writes only
- Use PgBouncer or RDS Proxy to multiplex database connections
- Set `MAX_ACTIVE_OFFERS` env var to limit the offer scan set (default: 5,000)
- Lower guardrail TTL if policy changes need sub-30s propagation
- Configure edge-layer rate limiting (AWS WAF, Cloudflare) for DDoS protection
- Enable budget pacing for high-volume offers to spread delivery across the day
- Set `SCORING_FALLBACK_SCORE` to a value appropriate for your scoring distribution
Environment Variables Reference
| Variable | Description | Default |
|---|---|---|
| `REDIS_URL` | Redis connection string | redis://localhost:6379 |
| `DATABASE_URL` | PostgreSQL connection string | — (required) |
| `MAX_ACTIVE_OFFERS` | Max offers loaded per decision | 5000 |
| `MAX_MANDATORY_OFFERS_PER_DAY` | Daily mandatory cap per customer | 5 |
| `SCORING_FALLBACK_SCORE` | Score when model is unavailable | 0.5 |
| `RATE_LIMIT_TIER` | Global rate limit tier | standard |
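Assembled from the table above, a minimal production `.env` might look like this (hostnames and the password are illustrative placeholders):

```shell
REDIS_URL=redis://redis.internal:6379
DATABASE_URL=postgresql://kaireon:change-me@db.internal:5432/kaireon
MAX_ACTIVE_OFFERS=5000
MAX_MANDATORY_OFFERS_PER_DAY=5
SCORING_FALLBACK_SCORE=0.5
RATE_LIMIT_TIER=standard
```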