KaireonAI includes a production-ready Helm chart for deploying to any Kubernetes cluster. This gives you full control over scaling, networking, monitoring, and security.

## Documentation Index
Fetch the complete documentation index at: https://docs.kaireonai.com/llms.txt
Use this file to discover all available pages before exploring further.
## Prerequisites
- Kubernetes cluster (1.24+)
- Helm 3.x installed
- `kubectl` configured for your cluster
- PostgreSQL database (self-managed, RDS, or CloudNativePG)
- Redis (self-managed, ElastiCache, or included via Helm)
## What’s Included
The Helm chart in `helm/` provides:
| Resource | Description |
|---|---|
| API Deployment | Main Next.js application with health checks and HPA |
| Worker Deployment | Background job processor for pipelines and model retraining |
| ML Worker Deployment | Python/FastAPI service for scikit-learn analysis (optional) |
| ConfigMaps | Application configuration (non-sensitive) |
| Secrets | Database URLs, API keys, encryption keys |
| Ingress | HTTPS ingress with TLS termination (ALB or nginx) |
| HPA | Horizontal Pod Autoscaler for API pods |
| PodDisruptionBudget | Ensures availability during node maintenance |
| NetworkPolicies | Restrict pod-to-pod and egress traffic |
| ServiceMonitor | Prometheus metrics scraping |
## Quick Install
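A minimal install might look like the following sketch. The release name, namespace, and value keys other than the chart path `helm/` are assumptions for illustration; check `helm/values.yaml` for the authoritative schema.

```shell
# Install the chart from the repository checkout (chart path helm/).
# Release name "kaireon", namespace "kaireon", and the value keys
# below are placeholders -- adjust for your environment.
helm install kaireon ./helm \
  --namespace kaireon \
  --create-namespace \
  --set secrets.databaseUrl="postgres://user:pass@db-host:5432/kaireon"
```

Verify the rollout with `kubectl -n kaireon get pods` once the release is installed.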
### With ML Worker
To include the ML Worker for AI-powered segmentation, policy analysis, and content intelligence, set `mlWorker.enabled=true`. With this flag set, the chart automatically injects `ML_WORKER_URL` into the API pods — no manual configuration needed.
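For example, the flag can be flipped on an existing release. The release name and chart path here are assumptions; `mlWorker.enabled` is the chart's documented switch.

```shell
# Enable the optional ML Worker on an existing release
# (release name "kaireon" and chart path ./helm are placeholders).
helm upgrade kaireon ./helm \
  --namespace kaireon \
  --reuse-values \
  --set mlWorker.enabled=true
```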
## Configuration
Key Helm values you can customize live in the chart's `values.yaml`; use `--set` (or a values file with `-f`) to override any value at install or upgrade time.
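A hypothetical override file might look like this. Apart from `mlWorker.enabled`, the key names below are assumptions for illustration only — consult `helm/values.yaml` for the real schema.

```yaml
# values.override.yaml -- illustrative overrides; key names other than
# mlWorker.enabled are assumptions, check helm/values.yaml.
api:
  replicas: 3
  resources:
    limits:
      memory: 1Gi
mlWorker:
  enabled: true
ingress:
  enabled: true
  host: kaireon.example.com
```

Apply it with `helm upgrade kaireon ./helm -f values.override.yaml`.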
## Architecture
The API communicates with the ML Worker over an internal ClusterIP service (`kaireon-ml-worker:8000`). The ML Worker reads directly from PostgreSQL for schema data and analysis inputs.
## Monitoring Stack
When `monitoring.prometheus.enabled=true`, the chart deploys the monitoring components described below.
### Prometheus Metrics
KaireonAI exposes metrics at `/api/v1/metrics`:
| Metric | Type | Description |
|---|---|---|
| `kaireon_decisions_total` | Counter | Total decisions made |
| `kaireon_decision_latency_ms` | Histogram | Decision latency distribution |
| `kaireon_pipeline_executions_total` | Counter | Pipeline run counts |
| `kaireon_api_requests_total` | Counter | API request counts by endpoint |
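You can spot-check the endpoint through a port-forward. The service name `kaireon-api` and port 3000 are assumptions for this sketch; the metrics path comes from the section above.

```shell
# Forward a local port to the API service and sample one metric
# (service name and port are placeholders for your deployment).
kubectl -n kaireon port-forward svc/kaireon-api 3000:3000 &
curl -s http://localhost:3000/api/v1/metrics | grep kaireon_decisions_total
```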
### Grafana Dashboards
Pre-built dashboards are included in `helm/dashboards/`:
- API Overview — Request rates, latency percentiles, error rates
- Decision Engine — Decision throughput, scoring latency, cache hit rates
- Infrastructure Health — CPU, memory, pod restarts, network
- Model Health — Model prediction distributions, feature drift
- Worker Queues — Queue depth, processing times, failure rates
## Database Options
### Self-Managed PostgreSQL
Deploy PostgreSQL inside the cluster. The chart includes an internal PostgreSQL StatefulSet by default.

### Amazon RDS (External)
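To point the chart at RDS instead of the in-cluster StatefulSet, you would disable the internal database and supply connection details. The key names below are assumptions for illustration — check `helm/values.yaml` for the real schema.

```yaml
# Illustrative values for an external RDS instance; key names are
# placeholders, consult helm/values.yaml for the actual schema.
postgresql:
  enabled: false          # skip the internal StatefulSet
externalDatabase:
  host: kaireon.abc123.us-east-1.rds.amazonaws.com
  port: 5432
  database: kaireon
  existingSecret: kaireon-db-credentials
```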
## Upgrading
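A typical upgrade flow is sketched below. The release name and deployment name are assumptions; `helm diff` requires the separately installed helm-diff plugin.

```shell
# Preview the rendered changes (requires the helm-diff plugin),
# then apply the upgrade while preserving existing values.
helm diff upgrade kaireon ./helm --reuse-values
helm upgrade kaireon ./helm --namespace kaireon --reuse-values
# Deployment name "kaireon-api" is a placeholder for your release.
kubectl -n kaireon rollout status deploy/kaireon-api
```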
## Rate Limiting & Circuit Breakers
- Rate limiting — KaireonAI protects API endpoints with a sliding-window rate limiter backed by Redis. You configure limits per endpoint via environment variables or platform settings.
- Circuit breakers — External integrations (connectors, webhooks) use circuit breaker patterns to prevent cascade failures. States cycle: closed → open → half-open.
## Troubleshooting
### Secret creation errors ('already exists')
Secrets that already exist are not overwritten or updated in place. To update secrets, delete and recreate them, then restart the pods to pick up the new values:
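The secret and deployment names below are placeholders for this sketch; substitute the names your release actually creates.

```shell
# Delete and recreate the secret (names are placeholders).
kubectl -n kaireon delete secret kaireon-secrets
kubectl -n kaireon create secret generic kaireon-secrets \
  --from-literal=DATABASE_URL="postgres://user:pass@db-host:5432/kaireon"
# Restart the pods so they pick up the new values.
kubectl -n kaireon rollout restart deploy/kaireon-api deploy/kaireon-worker
```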
### API pods OOMKilled or CrashLoopBackOff
The Next.js application requires at least 512Mi of memory. If pods are being OOMKilled, increase the memory limit. For the worker, allocate at least 1Gi. Check pod events for the specific reason:
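For example, limits can be raised at upgrade time and the termination reason inspected from pod state. The value keys and label selector are assumptions; check `helm/values.yaml` for the real key names.

```shell
# Raise memory limits (value keys are placeholders), then inspect
# the last container state for the OOMKilled reason.
helm upgrade kaireon ./helm --reuse-values \
  --set api.resources.limits.memory=1Gi \
  --set worker.resources.limits.memory=1Gi
kubectl -n kaireon describe pod -l app=kaireon-api | grep -A5 "Last State"
```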
### Health probe failures (readiness/liveness)
The API pods expose a health endpoint at `/api/v1/metrics`. Startup can take 15-30 seconds as the app validates environment variables and connects to PostgreSQL. If probes fail during startup, increase the `initialDelaySeconds`; check pod logs if probes continue to fail.
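A probe override might look like the following sketch. The key names and port are assumptions for illustration; the path comes from the health endpoint described above.

```yaml
# Illustrative probe overrides; key names and port are placeholders,
# check helm/values.yaml for the actual schema.
api:
  readinessProbe:
    httpGet:
      path: /api/v1/metrics
      port: 3000
    initialDelaySeconds: 30
  livenessProbe:
    httpGet:
      path: /api/v1/metrics
      port: 3000
    initialDelaySeconds: 45
```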
### ECR image pull errors ('ImagePullBackOff')
Ensure the node IAM role or service account has `ecr:GetAuthorizationToken` and `ecr:BatchGetImage` permissions. For EKS, verify that the OIDC provider is configured and the service account is annotated:
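An IRSA-annotated service account looks like the following; the account name, namespace, and role ARN are placeholders for your environment.

```yaml
# IRSA sketch: the eks.amazonaws.com/role-arn annotation binds the
# service account to an IAM role with ECR pull permissions.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kaireon-api        # placeholder name
  namespace: kaireon
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/kaireon-ecr-pull
```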
### Cannot connect to PostgreSQL from pods
Verify the database is reachable from within the cluster. Common issues include missing VPC peering, security group rules, or incorrect hostnames. Test connectivity from a debug pod:
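For example, a one-off debug pod can run `pg_isready` against the database. The hostname below is a placeholder; substitute your actual database endpoint.

```shell
# Launch a throwaway pod from the official postgres image and test
# reachability of the database host (hostname is a placeholder).
kubectl -n kaireon run pg-debug --rm -it --image=postgres:16 --restart=Never -- \
  pg_isready -h mydb.example.internal -p 5432
```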
## Next Steps
- **ML Worker** — Configure the ML Worker for AI features.
- **Operations** — Configure Prometheus metrics and Grafana dashboards.
- **Scaling Guide** — Guidance for high-throughput deployments.
- **Cloud Deployment** — One-click deployment to AWS App Runner.