Deployment Tiers

Choose a tier based on your traffic volume, latency requirements, and budget.
| Tier | Compute | Database | Cache | Monthly Cost | Decisions/sec | P99 Latency |
|---|---|---|---|---|---|---|
| Hobby | App Runner 0.25 vCPU / 0.5 GB | Supabase Free (500 MB) | Upstash Free (10K cmd/day) | ~$6-8 | 5-10 | ~800ms |
| Startup | App Runner 1 vCPU / 2 GB | Neon Pro or RDS t3.micro | Upstash Pro or ElastiCache t3.micro | ~$50-80 | 50-100 | ~200ms |
| Growth | ECS 2x t3.medium | RDS r6g.large (multi-AZ) | ElastiCache r6g.large | ~$300-500 | 500-1K | ~100ms |
| Enterprise | EKS 4+ nodes (c6g.xlarge) | Aurora r6g.xlarge (multi-AZ) | ElastiCache cluster (3 shards) | ~$1,500-3K+ | 5K-50K+ | ~50ms |
The Hobby tier is perfect for evaluation and proof-of-concept work. You can scale up to Startup with minimal configuration changes when you are ready for production traffic.

Background Jobs (BullMQ)

Several features run asynchronously through BullMQ queues backed by Redis: DSAR exports, model retraining, journey wait-step advancement, drift detection, batch decisioning, and sample-data seeding. Each queue needs two halves to function:
  1. A producer — the API enqueues jobs into Redis. ✅ Always on, regardless of tier.
  2. A consumer — a Node.js process running BullMQ Worker instances that read jobs back out and execute the handlers. Required; without a consumer, queued jobs pile up in Redis indefinitely.
KaireonAI ships two ways to run the consumer:
| Mode | When to use | Setup |
|---|---|---|
| In-process worker (default for Hobby) | Hobby & low-volume Startup tiers | WORKER_INPROCESS=1 on the API service. The same Node.js process that serves /api/v1/* also consumes the queues. Zero extra infrastructure. |
| Dedicated worker container | Growth & Enterprise tiers; any deployment that needs to scale workers independently of the API | Run kaireon-worker from ECR (or self-built). Set WORKER_INPROCESS=0 on the API so jobs aren't double-consumed. Run 1 to N worker pods — BullMQ load-balances across them automatically. |
Hobby tier guarantee: the $6-8/mo Hobby figure assumes WORKER_INPROCESS=1. The single App Runner service hosts both producer and consumer. No second container, no second bill.
If you run a dedicated kaireon-worker container, you must set WORKER_INPROCESS=0 on the API service. Otherwise both processes will compete for the same jobs (BullMQ prevents double-execution via job locks, but it’s wasteful).
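The flag's effect can be captured in a tiny guard at boot; this is an illustrative sketch that assumes the codebase reads the variable as a plain string (the real parsing may differ):

```typescript
// WORKER_INPROCESS=1 -> this API process also runs BullMQ Worker instances
// (the Hobby default). Anything else -> only the dedicated kaireon-worker
// container consumes the queues.
function shouldStartInProcessWorkers(
  env: Record<string, string | undefined>,
): boolean {
  return env.WORKER_INPROCESS === "1";
}
```

The API would call this once at startup and only instantiate its Worker objects when it returns true, so a misconfigured pair of services never double-subscribes by accident.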

Redis backend portability

Any Redis-protocol-compatible service works. The codebase has zero Upstash-specific dependencies — it uses plain ioredis. To switch providers, change REDIS_URL and restart:
| Provider | Notes |
|---|---|
| Upstash (default) | Serverless; free tier allows 10K commands/day. Use rediss:// URLs (TLS). |
| Amazon ElastiCache | Managed Redis on AWS. Use redis:// (or rediss:// if TLS is enabled). |
| AWS MemoryDB | Redis-compatible and durable. |
| Dragonfly | Multi-threaded, Redis-compatible drop-in replacement. |
| GCP Memorystore / Azure Cache | Managed Redis on the respective clouds. |
| Self-hosted Redis ≥ 5 | Anything that supports Streams + Lua scripting. |
For Redis Cluster or Sentinel HA setups, BullMQ needs a small wrapper around new Redis(url); see Self-Hosted Deployment for the Helm value.
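The node-list parsing such a wrapper would do might look like this. The comma-separated "host:port" format, and the idea of passing it via a single value, are assumptions for illustration; the actual Helm value may be shaped differently:

```typescript
// Parse a "host:port,host:port" node list into the { host, port } shape
// that ioredis's Cluster constructor accepts. Format is an assumption,
// not taken from the KaireonAI Helm chart.
function parseClusterNodes(spec: string): { host: string; port: number }[] {
  return spec.split(",").map((entry) => {
    const [host, port] = entry.trim().split(":");
    return { host, port: Number(port) };
  });
}
```

A connection factory would feed this list to new Cluster(...) when it is set, and fall back to new Redis(REDIS_URL) otherwise.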

Optional Services

ML Worker (Python)

The ML Worker is a separate Python/FastAPI service required for training gradient_boosted models and for higher-accuracy scikit-learn analysis on datasets > 5K rows. Scoring trained models works without it — the tree ensemble runs in-process in Node.js, so the /recommend hot path never calls Python.
| Tier | ML Worker config | Idle cost | Heavy-training peak |
|---|---|---|---|
| Hobby — skip it | LLM fallback for analysis; cannot train GBMs | $0 | — |
| Hobby — add it | App Runner 0.25 vCPU / 0.5 GB | ~$1.60 | ~$14 |
| Startup+ | Same App Runner config, or co-locate on ECS | ~$1.60 | ~$14 |
Without the ML Worker:
  • ✅ /recommend works for all model types (scoring is in-process)
  • ✅ Scorecard, Bayesian, linear, and logistic models train in-process
  • ❌ gradient_boosted model training returns a 503 (ML_WORKER_URL is not configured)
  • ⚠️ Auto-Segmentation, Policy Recommender, and Content Intelligence fall back to LLM-based analysis (still functional, lower accuracy on > 5K rows)
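The dispatch implied by the list above can be sketched as follows. The function and set names are illustrative, not from the codebase; only ML_WORKER_URL and the model-type names come from this page:

```typescript
// Model types that train inside the Node.js API process.
const IN_PROCESS_TRAINABLE = new Set(["scorecard", "bayesian", "linear", "logistic"]);

// Returns which backend would train a model, or throws the 503 described
// above when gradient_boosted training is requested without an ML Worker.
function trainingBackend(
  modelType: string,
  mlWorkerUrl?: string,
): "in-process" | "ml-worker" {
  if (modelType === "gradient_boosted") {
    if (!mlWorkerUrl) throw new Error("503: ML_WORKER_URL is not configured");
    return "ml-worker";
  }
  if (IN_PROCESS_TRAINABLE.has(modelType)) return "in-process";
  throw new Error(`unknown model type: ${modelType}`);
}
```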
See ML Worker Setup for deployment instructions.

Response Time Breakdown

What happens during a /api/v1/recommend call:
| Stage | Description | Typical Duration |
|---|---|---|
| Request parsing & validation | Zod schema validation | 1-2ms |
| Inventory lookup | Load active offers from DB | 5-15ms |
| Enrichment (if configured) | Query schema tables for customer data | 10-30ms |
| Qualification | Apply eligibility rules | 2-5ms |
| Contact policy check | Frequency cap, cooldown evaluation | 2-5ms |
| Scoring | Run scoring model (scorecard/Bayesian/etc.) | 5-20ms |
| Ranking | Sort and select top N | 1-3ms |
| Portfolio optimization (if configured) | Multi-objective optimization | 3-10ms |
| Response serialization | Build JSON response | 1-2ms |
| Total P50 | | 30-90ms |
| Total P95 | | 80-200ms |
| Total P99 | | 120-400ms |
These timings are for a typical pipeline with 50-100 candidate offers. Latency scales roughly linearly with candidate count.
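Summing the per-stage ranges shows where the P50 band comes from. The best case over all stages lands at exactly 30ms; the worst case sums to 92ms, slightly above the published 90ms upper bound because not every stage hits its maximum on the same request:

```typescript
// Per-stage typical durations in ms, as [min, max], copied from the table.
const stageBudgets: [number, number][] = [
  [1, 2],   // request parsing & validation
  [5, 15],  // inventory lookup
  [10, 30], // enrichment
  [2, 5],   // qualification
  [2, 5],   // contact policy check
  [5, 20],  // scoring
  [1, 3],   // ranking
  [3, 10],  // portfolio optimization
  [1, 2],   // response serialization
];

const bestCaseMs = stageBudgets.reduce((sum, [lo]) => sum + lo, 0);    // 30
const worstCaseMs = stageBudgets.reduce((sum, [, hi]) => sum + hi, 0); // 92
```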

Scaling Levers

Add more App Runner or ECS instances. Each instance handles approximately 100-500 req/s depending on pipeline complexity. App Runner auto-scales based on concurrency — tune the auto-scaling configuration’s max-concurrency-per-instance value to control when new instances spin up.
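A quick capacity estimate follows from those figures; this helper is illustrative arithmetic, not a KaireonAI API:

```typescript
// Instances needed for a target sustained rate, given per-instance
// throughput (roughly 100-500 req/s depending on pipeline complexity).
function instancesNeeded(targetRps: number, perInstanceRps: number): number {
  return Math.ceil(targetRps / perInstanceRps);
}
```

For example, 1,000 req/s against a complex pipeline at ~100 req/s per instance needs 10 instances; the same load on a simple pipeline at ~500 req/s needs only 2.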
A larger database instance reduces query time. This is most impactful when qualification rules or enrichment queries are complex. Moving from t3.micro to r6g.large can cut DB-bound latency by 60-70%.
Enrichment caching reduces database load by 80%+. A cache hit resolves in ~1-2ms versus ~10-30ms for a cache miss. Set the REDIS_URL environment variable to enable caching in production. This is the single highest-impact optimization.
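The cache-aside pattern behind those numbers can be sketched as below. A Map stands in for Redis so the sketch is self-contained; in production the store would be the ioredis client behind REDIS_URL, and all names here are illustrative:

```typescript
type CustomerData = Record<string, unknown>;

// Wraps a DB fetch (~10-30ms) with a TTL cache (~1-2ms on a hit).
function makeCachedEnricher(
  fetchFromDb: (customerId: string) => Promise<CustomerData>,
  ttlMs: number,
) {
  const cache = new Map<string, { value: CustomerData; expiresAt: number }>();
  return async (customerId: string): Promise<CustomerData> => {
    const hit = cache.get(customerId);
    if (hit && hit.expiresAt > Date.now()) return hit.value; // cache hit
    const value = await fetchFromDb(customerId);             // cache miss
    cache.set(customerId, { value, expiresAt: Date.now() + ttlMs });
    return value;
  };
}
```

Repeated lookups for the same customer within the TTL never touch the database, which is where the 80%+ load reduction comes from.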
Use read replicas for dashboard and analytics queries. Keep the primary instance dedicated to decision writes and real-time reads. RDS and Aurora both support up to 15 read replicas.
PgBouncer reduces PostgreSQL connection overhead and is essential at more than 50 concurrent connections. The Helm chart includes a PgBouncer sidecar by default. For App Runner, use Supabase’s built-in connection pooler or deploy PgBouncer separately.
For non-real-time use cases, use the /api/v1/decide endpoint with batch customer lists. Batch mode amortizes connection and parsing overhead across many decisions, achieving higher throughput at the cost of individual response latency.

Cost Optimization Tips

Start small

Begin with the Hobby tier for evaluation. Scale up only when you have real traffic that demands it.

Free database tiers

Supabase and Neon free tiers provide sufficient PostgreSQL capacity for development and small production workloads.

Enable Redis early

Redis caching is the single biggest performance win. Enable it before scaling compute.

Pay-per-request pricing

App Runner charges per request-second, making it cost-effective for unpredictable or bursty traffic patterns.

Graduate to containers

Move to ECS or EKS only when you need sustained throughput above 100 req/s. Container orchestration adds operational overhead.

Spot instances for workers

Use Spot instances for worker pods handling non-latency-sensitive batch processing. Spot pricing can reduce compute costs by 60-90%.