Deployment Tiers

Choose a tier based on your traffic volume, latency requirements, and budget.
| Tier | Compute | Database | Cache | Monthly Cost | Decisions/sec | P99 Latency |
|---|---|---|---|---|---|---|
| Hobby | App Runner 0.25 vCPU / 0.5 GB | Supabase Free (500 MB) | Upstash Free (10K cmd/day) | ~$6-8 | 5-10 | ~800ms |
| Startup | App Runner 1 vCPU / 2 GB | Neon Pro or RDS t3.micro | Upstash Pro or ElastiCache t3.micro | ~$50-80 | 50-100 | ~200ms |
| Growth | ECS 2x t3.medium | RDS r6g.large (multi-AZ) | ElastiCache r6g.large | ~$300-500 | 500-1K | ~100ms |
| Enterprise | EKS 4+ nodes (c6g.xlarge) | Aurora r6g.xlarge (multi-AZ) | ElastiCache cluster (3 shards) | ~$1,500-3K+ | 5K-50K+ | ~50ms |
The Hobby tier is perfect for evaluation and proof-of-concept work. You can scale up to Startup with minimal configuration changes when you are ready for production traffic.

Background Jobs (BullMQ)

Several features run asynchronously through BullMQ queues backed by Redis: DSAR exports, model retraining, journey wait-step advancement, drift detection, batch decisioning, and sample-data seeding. Each queue needs two halves to function:
  1. A producer — the API enqueues jobs into Redis. ✅ Always on, regardless of tier.
  2. A consumer — a Node.js process running BullMQ Worker instances that read jobs back out and execute the handlers. Required; without a consumer, queued jobs pile up in Redis indefinitely.
KaireonAI ships two ways to run the consumer:
| Mode | When to use | Setup |
|---|---|---|
| In-process worker (default for Hobby) | Hobby & low-volume Startup tiers | WORKER_INPROCESS=1 on the API service. The same Node.js process that serves /api/v1/* also consumes the queues. Zero extra infrastructure. |
| Dedicated worker container | Growth & Enterprise tiers; any deployment that needs to scale workers independently of the API | Run kaireon-worker from ECR (or self-built). Set WORKER_INPROCESS=0 on the API so jobs aren't double-consumed. Run 1 to N worker pods — BullMQ load-balances across them automatically. |
Hobby tier guarantee: the $6-8/mo Hobby figure assumes WORKER_INPROCESS=1. The single App Runner service hosts both producer and consumer. No second container, no second bill.
If you run a dedicated kaireon-worker container, you must set WORKER_INPROCESS=0 on the API service. Otherwise both processes will compete for the same jobs (BullMQ prevents double-execution via job locks, but it’s wasteful).
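The flag's effect can be captured in a tiny guard at boot; this is an illustrative sketch that assumes the codebase reads the variable as a plain string (the real parsing may differ):

```typescript
// WORKER_INPROCESS=1 -> this API process also runs BullMQ Worker instances
// (the Hobby default). Anything else -> only the dedicated kaireon-worker
// container consumes the queues.
function shouldStartInProcessWorkers(
  env: Record<string, string | undefined>,
): boolean {
  return env.WORKER_INPROCESS === "1";
}
```

The API would call this once at startup and only instantiate its Worker objects when it returns true, so a misconfigured pair of services never double-subscribes by accident.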

Redis backend portability

Any Redis-protocol-compatible service works. The codebase has zero Upstash-specific dependencies — it uses plain ioredis. To switch providers, change REDIS_URL and restart:
| Provider | Notes |
|---|---|
| Upstash (default) | Serverless; free tier allows 10K commands/day. Use rediss:// URLs (TLS). |
| Amazon ElastiCache | Managed Redis on AWS. Use redis:// (or rediss:// if TLS is enabled). |
| AWS MemoryDB | Redis-compatible and durable. |
| Dragonfly | Multi-threaded, Redis-compatible drop-in replacement. |
| GCP Memorystore / Azure Cache | Managed Redis on the respective clouds. |
| Self-hosted Redis ≥ 5 | Anything that supports Streams + Lua scripting. |
For Redis Cluster or Sentinel HA setups, BullMQ needs a small wrapper around new Redis(url); see Self-Hosted Deployment for the Helm value.
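The node-list parsing such a wrapper would do might look like this. The comma-separated "host:port" format, and the idea of passing it via a single value, are assumptions for illustration; the actual Helm value may be shaped differently:

```typescript
// Parse a "host:port,host:port" node list into the { host, port } shape
// that ioredis's Cluster constructor accepts. Format is an assumption,
// not taken from the KaireonAI Helm chart.
function parseClusterNodes(spec: string): { host: string; port: number }[] {
  return spec.split(",").map((entry) => {
    const [host, port] = entry.trim().split(":");
    return { host, port: Number(port) };
  });
}
```

A connection factory would feed this list to new Cluster(...) when it is set, and fall back to new Redis(REDIS_URL) otherwise.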

Optional Services

ML Worker (Python)

The ML Worker is a separate Python/FastAPI service required for training gradient_boosted models and for higher-accuracy scikit-learn analysis on datasets > 5K rows. Scoring trained models works without it — the tree ensemble runs in-process in Node.js, so the /recommend hot path never calls Python.
| Tier | ML Worker config | Idle cost | Heavy-training peak |
|---|---|---|---|
| Hobby — skip it | LLM fallback for analysis; cannot train GBMs | $0 | — |
| Hobby — add it | App Runner 0.25 vCPU / 0.5 GB | ~$1.60 | ~$14 |
| Startup+ | Same App Runner config, or co-locate on ECS | ~$1.60 | ~$14 |
Without the ML Worker:
  • ✅ /recommend works for all model types (scoring is in-process)
  • ✅ Scorecard, Bayesian, linear, and logistic models train in-process
  • ❌ gradient_boosted model training returns a 503 (ML_WORKER_URL is not configured)
  • ⚠️ Auto-Segmentation, Policy Recommender, and Content Intelligence fall back to LLM-based analysis (still functional, lower accuracy on > 5K rows)
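The dispatch implied by the list above can be sketched as follows. The function and set names are illustrative, not from the codebase; only ML_WORKER_URL and the model-type names come from this page:

```typescript
// Model types that train inside the Node.js API process.
const IN_PROCESS_TRAINABLE = new Set(["scorecard", "bayesian", "linear", "logistic"]);

// Returns which backend would train a model, or throws the 503 described
// above when gradient_boosted training is requested without an ML Worker.
function trainingBackend(
  modelType: string,
  mlWorkerUrl?: string,
): "in-process" | "ml-worker" {
  if (modelType === "gradient_boosted") {
    if (!mlWorkerUrl) throw new Error("503: ML_WORKER_URL is not configured");
    return "ml-worker";
  }
  if (IN_PROCESS_TRAINABLE.has(modelType)) return "in-process";
  throw new Error(`unknown model type: ${modelType}`);
}
```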
See ML Worker Setup for deployment instructions.

Response Time Breakdown

What happens during a /api/v1/recommend call:
| Stage | Description | Typical Duration |
|---|---|---|
| Request parsing & validation | Zod schema validation | 1-2ms |
| Inventory lookup | Load active offers from DB | 5-15ms |
| Enrichment (if configured) | Query schema tables for customer data | 10-30ms |
| Qualification | Apply eligibility rules | 2-5ms |
| Contact policy check | Frequency cap, cooldown evaluation | 2-5ms |
| Scoring | Run scoring model (scorecard/Bayesian/etc.) | 5-20ms |
| Ranking | Sort and select top N | 1-3ms |
| Portfolio optimization (if configured) | Multi-objective optimization | 3-10ms |
| Response serialization | Build JSON response | 1-2ms |
| Total P50 | | 30-90ms |
| Total P95 | | 80-200ms |
| Total P99 | | 120-400ms |
These timings are for a typical pipeline with 50-100 candidate offers. Latency scales roughly linearly with candidate count.
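Summing the per-stage ranges shows where the P50 band comes from. The best case over all stages lands at exactly 30ms; the worst case sums to 92ms, slightly above the published 90ms upper bound because not every stage hits its maximum on the same request:

```typescript
// Per-stage typical durations in ms, as [min, max], copied from the table.
const stageBudgets: [number, number][] = [
  [1, 2],   // request parsing & validation
  [5, 15],  // inventory lookup
  [10, 30], // enrichment
  [2, 5],   // qualification
  [2, 5],   // contact policy check
  [5, 20],  // scoring
  [1, 3],   // ranking
  [3, 10],  // portfolio optimization
  [1, 2],   // response serialization
];

const bestCaseMs = stageBudgets.reduce((sum, [lo]) => sum + lo, 0);    // 30
const worstCaseMs = stageBudgets.reduce((sum, [, hi]) => sum + hi, 0); // 92
```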

Scaling Levers

Add more App Runner or ECS instances. Each instance handles approximately 100-500 req/s depending on pipeline complexity. App Runner auto-scales based on concurrency — tune the auto-scaling configuration’s max-concurrency-per-instance value to control when new instances spin up.
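A quick capacity estimate follows from those figures; this helper is illustrative arithmetic, not a KaireonAI API:

```typescript
// Instances needed for a target sustained rate, given per-instance
// throughput (roughly 100-500 req/s depending on pipeline complexity).
function instancesNeeded(targetRps: number, perInstanceRps: number): number {
  return Math.ceil(targetRps / perInstanceRps);
}
```

For example, 1,000 req/s against a complex pipeline at ~100 req/s per instance needs 10 instances; the same load on a simple pipeline at ~500 req/s needs only 2.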
A larger database instance reduces query time. This is most impactful when qualification rules or enrichment queries are complex. Moving from t3.micro to r6g.large can cut DB-bound latency by 60-70%.
Enrichment caching reduces database load by 80%+. A cache hit resolves in ~1-2ms versus ~10-30ms for a cache miss. Set the REDIS_URL environment variable to enable caching in production. This is the single highest-impact optimization.
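The cache-aside pattern behind those numbers can be sketched as below. A Map stands in for Redis so the sketch is self-contained; in production the store would be the ioredis client behind REDIS_URL, and all names here are illustrative:

```typescript
type CustomerData = Record<string, unknown>;

// Wraps a DB fetch (~10-30ms) with a TTL cache (~1-2ms on a hit).
function makeCachedEnricher(
  fetchFromDb: (customerId: string) => Promise<CustomerData>,
  ttlMs: number,
) {
  const cache = new Map<string, { value: CustomerData; expiresAt: number }>();
  return async (customerId: string): Promise<CustomerData> => {
    const hit = cache.get(customerId);
    if (hit && hit.expiresAt > Date.now()) return hit.value; // cache hit
    const value = await fetchFromDb(customerId);             // cache miss
    cache.set(customerId, { value, expiresAt: Date.now() + ttlMs });
    return value;
  };
}
```

Repeated lookups for the same customer within the TTL never touch the database, which is where the 80%+ load reduction comes from.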
Use read replicas for dashboard and analytics queries. Keep the primary instance dedicated to decision writes and real-time reads. RDS and Aurora both support up to 15 read replicas.
PgBouncer reduces PostgreSQL connection overhead and is essential at more than 50 concurrent connections. The Helm chart includes a PgBouncer sidecar by default. For App Runner, use Supabase’s built-in connection pooler or deploy PgBouncer separately.
For non-real-time use cases, use the /api/v1/decide endpoint with batch customer lists. Batch mode amortizes connection and parsing overhead across many decisions, achieving higher throughput at the cost of individual response latency.

Cost Optimization Tips

Start small

Begin with the Hobby tier for evaluation. Scale up only when you have real traffic that demands it.

Free database tiers

Supabase and Neon free tiers provide sufficient PostgreSQL capacity for development and small production workloads.

Enable Redis early

Redis caching is the single biggest performance win. Enable it before scaling compute.

Pay-per-request pricing

App Runner charges per request-second, making it cost-effective for unpredictable or bursty traffic patterns.

Graduate to containers

Move to ECS or EKS only when you need sustained throughput above 100 req/s. Container orchestration adds operational overhead.

Spot instances for workers

Use Spot instances for worker pods handling non-latency-sensitive batch processing. Spot pricing can reduce compute costs by 60-90%.