Why self-host

Regulated industries — banking, healthcare, insurance, defense — often require decisioning data to stay inside an owned VPC or physical data center. KaireonAI is Apache 2.0 licensed and ships as a first-class Helm chart. There is no SaaS lock-in; the same code that runs on playground.kaireonai.com runs in your cluster. This page is the production runbook for:
  • Private VPC deployments (AWS / GCP / Azure) with no public ingress from the platform’s side
  • Air-gapped clusters (no outbound egress at all)
  • BYOK (bring-your-own-key) encryption and secrets

Prerequisites

  • Kubernetes: 1.27+ (EKS, GKE, AKS, OpenShift, k3s, and Rancher are all supported)
  • Helm: 3.11+
  • PostgreSQL: 14+ (managed RDS / Cloud SQL / self-hosted)
  • Redis: 6.2+ (ElastiCache / Cloud Memorystore / self-hosted)
  • Container registry: any OCI-compliant registry (ECR, Artifact Registry, ACR, Harbor, Nexus)
  • Ingress controller: nginx / AWS ALB / Istio / any L7 proxy
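
The minimums above can be sanity-checked before installing. A minimal preflight sketch, assuming GNU sort -V is available; substitute the versions reported by your own kubectl version and helm version for the hard-coded examples:

```shell
# ver_ge A B succeeds when version A >= version B (relies on sort -V).
ver_ge() { [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]; }

# Example checks against the floors in the table above.
ver_ge 1.28.3 1.27 && echo "kubernetes: ok"          # detected 1.28.3 >= 1.27
ver_ge 3.10.0 3.11 || echo "helm: upgrade required"  # detected 3.10.0 < 3.11
```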

Architecture options

Option A — Private VPC, managed add-ons

  • Platform (API / worker / ml-worker): Helm chart → your K8s cluster
  • Database: managed Postgres (RDS / Cloud SQL / Neon VPC)
  • Cache / queue: managed Redis (ElastiCache / Memorystore / Upstash VPC)
  • Secrets: AWS Secrets Manager / GCP Secret Manager / Azure Key Vault, mounted via External Secrets Operator
  • Outbound traffic: optionally allowed to your LLM provider of choice; otherwise disabled
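
To illustrate the External Secrets Operator flow, an ExternalSecret that materializes the database credential consumed later in this runbook (kaireon-db-credentials) might look like the following sketch; the store name and remote key path are placeholders for your environment:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: kaireon-db-credentials
  namespace: kaireon-prod
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager      # placeholder: your (Cluster)SecretStore
    kind: ClusterSecretStore
  target:
    name: kaireon-db-credentials   # referenced by database.external.existingSecret
  data:
    - secretKey: password          # referenced by database.external.secretKey
      remoteRef:
        key: kaireon/prod/db-password   # placeholder path in your backend
```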

Option B — Fully air-gapped

  • All three images mirrored to an internal registry
  • Internal Postgres StatefulSet + internal Redis StatefulSet (the chart ships both)
  • LLM explanations feature either disabled or pointed at an in-VPC LLM endpoint (vLLM, Ollama, self-hosted Claude via Bedrock PrivateLink, on-prem GPU box, etc.)
  • All ML training happens in-cluster via the bundled kaireon-ml-worker pod
  • Zero outbound egress required
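
To make "zero outbound egress" enforceable rather than assumed, many teams add a namespace-wide default-deny egress policy alongside the chart's own allow-list NetworkPolicy (which supplies the permitted in-cluster flows). A minimal sketch; the DNS allowance will likely need tuning for your CNI:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: kaireon-prod
spec:
  podSelector: {}                 # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:                         # permit in-cluster DNS only; everything else
        - namespaceSelector: {}   # is denied unless another policy allows it
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```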

Step-by-step (Option A)

1. Mirror images to your registry

# Source images (public ECR)
API=422500312304.dkr.ecr.us-east-1.amazonaws.com/kaireon-api:latest
WORKER=422500312304.dkr.ecr.eu-west-2.amazonaws.com/kaireon/worker:latest
ML=422500312304.dkr.ecr.us-east-1.amazonaws.com/kaireon-ml:latest

# Pull, retag, push to your private registry.
# Target names are listed explicitly: the worker repo path (kaireon/worker)
# would otherwise retag as plain "worker" instead of "kaireon-worker".
for PAIR in "$API kaireon-api" "$WORKER kaireon-worker" "$ML kaireon-ml"; do
  set -- $PAIR
  docker pull "$1"
  docker tag "$1" "your-registry.internal/$2:latest"
  docker push "your-registry.internal/$2:latest"
done

Every image is published with provenance. If you require signed images:
# Requires cosign; verify each image before allow-listing in your admission policy
cosign verify your-registry.internal/kaireon-api:latest \
  --certificate-identity=https://github.com/kaireonai/platform/.github/workflows/build.yml@refs/heads/main \
  --certificate-oidc-issuer=https://token.actions.githubusercontent.com

3. Create a values override

values-prod.yaml:
namespace: kaireon-prod

api:
  image:
    repository: your-registry.internal/kaireon-api
    tag: "v1.0.0"
  replicas: 6
  hpa:
    enabled: true
    minReplicas: 6
    maxReplicas: 30
  readOnlyRootFilesystem: false   # Next.js writes cache to /app/.next/cache
  podSecurityContext:
    runAsUser: 1001
    runAsGroup: 1001
    fsGroup: 1001
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app.kubernetes.io/component: api
  priorityClassName: kaireon-critical

worker:
  image:
    repository: your-registry.internal/kaireon-worker
    tag: "v1.0.0"
  replicas: 3
  keda:
    enabled: true
    minReplicas: 3
    maxReplicas: 20

mlWorker:
  enabled: true                    # required for gradient_boosted training
  image:
    repository: your-registry.internal/kaireon-ml
    tag: "v1.0.0"

database:
  mode: external
  external:
    host: kaireon-db.xxx.eu-west-1.rds.amazonaws.com
    port: 5432
    name: kaireon
    username: kaireon
    existingSecret: kaireon-db-credentials
    secretKey: password
    sslMode: require

redis:
  mode: external
  external:
    host: kaireon-cache.xxx.0001.euw1.cache.amazonaws.com
    port: 6379
    tls: true
    existingSecret: kaireon-redis-credentials

ingress:
  enabled: true
  className: alb
  annotations:
    alb.ingress.kubernetes.io/scheme: internal   # private VPC only
    alb.ingress.kubernetes.io/target-type: ip
  hosts:
    - host: kaireon.internal.example.com
      paths:
        - path: /
          pathType: Prefix

4. Apply the chart

helm upgrade --install kaireon ./helm \
  -n kaireon-prod --create-namespace \
  -f values-prod.yaml \
  --wait --timeout 10m

# Verify
kubectl -n kaireon-prod get pods
kubectl -n kaireon-prod rollout status deployment/kaireon-api
kubectl -n kaireon-prod exec deploy/kaireon-api -- curl -sf localhost:3000/api/health

5. Run migrations

kubectl -n kaireon-prod exec deploy/kaireon-api -- npx prisma db push

Hardening checklist

Before going live, confirm each item:
  • readOnlyRootFilesystem: true where workload permits (requires adding an emptyDir volume for /app/.next/cache on the API pod)
  • runAsNonRoot: true on every pod (enabled by default in the chart)
  • allowPrivilegeEscalation: false + all capabilities dropped (enabled by default)
  • topologySpreadConstraints with whenUnsatisfiable: DoNotSchedule across 3+ zones
  • PodDisruptionBudget preserves N-1 availability during drains (shipped in templates/pdb.yaml)
  • NetworkPolicy blocks all pod-to-pod traffic except allow-listed flows (shipped in templates/networkpolicy.yaml; review for your CNI)
  • Secrets managed by the External Secrets Operator rather than plain Kubernetes Secret objects (enable secrets.provider: externalSecrets in values.yaml)
  • Database SSL mode require or higher
  • Redis TLS enabled with rediss:// and password
  • Admission controller enforcing signed images (cosign + Kyverno / OPA Gatekeeper)
  • Log sink configured (Fluent Bit → CloudWatch / Loki / Splunk)
  • Metrics scrape configured (Prometheus ServiceMonitor — the chart ships Grafana dashboards in helm/dashboards/)
  • Backup policy on the Postgres instance (PITR ≥ 7 days)
  • CONNECTOR_ENCRYPTION_KEY is a random 32-byte value, rotated every 90 days, and stored in your secrets backend
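
For the last item, a conforming key can be generated and verified locally before it is pushed to the secrets backend. A sketch assuming openssl and a POSIX shell; the secret name in the commented command is illustrative:

```shell
# Generate a random 32-byte key, base64-encoded for transport.
KEY=$(openssl rand -base64 32)

# Sanity check: the decoded key must be exactly 32 bytes.
[ "$(printf '%s' "$KEY" | base64 -d | wc -c)" -eq 32 ] && echo "key: 32 bytes"

# Push it to your secrets backend (AWS Secrets Manager shown; adjust to yours):
#   aws secretsmanager create-secret \
#     --name kaireon/prod/connector-encryption-key --secret-string "$KEY"
```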

Air-gapped (Option B) additions

  • Set database.mode: internal and redis.mode: internal — the chart provisions StatefulSets with local storage
  • Set config.EVENT_PUBLISHER: redis and config.INTERACTION_STORE: pg (no cloud-backed stores)
  • Either disable llmExplanationsEnabled at the tenant level, or deploy an in-VPC LLM and configure its endpoint via the AI provider settings (Ollama / vLLM / Bedrock PrivateLink)
  • Mirror the ml-worker Python dependencies to an internal PyPI proxy (the Dockerfile bakes them into the image, so this is only needed if you rebuild)

Upgrades

# Preview what will change
helm diff upgrade kaireon ./helm -n kaireon-prod -f values-prod.yaml

# Apply
helm upgrade kaireon ./helm -n kaireon-prod -f values-prod.yaml --wait

# Run any pending schema migrations
kubectl -n kaireon-prod exec deploy/kaireon-api -- npx prisma db push

A rolling upgrade drains one replica at a time (thanks to the PDB). The API pod’s preStop lifecycle hook sleeps 15s so the load balancer has time to stop routing before SIGTERM.
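
When auditing the rendered manifests, the hook described above appears on the API container roughly as follows (a sketch; the exact form rendered by the chart may differ):

```yaml
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 15"]  # let the LB deregister before SIGTERM
```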

Observability

The chart ships Grafana dashboards in helm/dashboards/. They cover:
  • API overview — request rate, p50/p95/p99 latency, 4xx/5xx split
  • Decision engine — recommend latency breakdown by stage (enrich / compute / filter / score / rank)
  • Decision performance — offer CTR, arbitration weight drift, experiment uplift
  • Model health — AUC trend per model, drift PSI, training sample count
  • Worker queues — BullMQ depth per queue, retry count, DLQ depth
  • Infrastructure — CPU / memory / disk / network per pod
Load these into Grafana via the bundled templates/grafana.yaml (enabled by default) or by importing the JSON files directly.

Support