## Overview
The ML Worker is a standalone Python/FastAPI service that provides scikit-learn-based analysis for KaireonAI’s AI features. It handles computationally intensive tasks — K-Means clustering for segmentation, logistic regression for policy analysis, and TF-IDF for content analysis — that exceed what LLM-based analysis can do accurately.
## When to Use the ML Worker

| Scenario | Without ML Worker | With ML Worker |
|---|---|---|
| Auto-Segmentation | LLM percentile-based grouping | K-Means on full dataset with silhouette scoring |
| Policy Recommender | Heuristic pattern recognition | Logistic regression and statistical analysis |
| Content Intelligence | CTR/CVR heuristics | TF-IDF + Random Forest feature importance |
| Dataset size | Works well under 5K rows | Required for accurate results on 5K+ rows |
The ML Worker is optional. All AI features work without it by falling back to LLM-based analysis. Add it when you need higher accuracy on large datasets.
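To make the segmentation approach concrete, here is a minimal sketch of K-Means with silhouette scoring in scikit-learn. The synthetic data, the candidate `k` range, and the helper name `best_kmeans` are illustrative, not the worker's actual implementation:

```python
# Sketch: pick the number of clusters by silhouette score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_kmeans(X, k_range=range(2, 7), seed=42):
    """Fit K-Means for each candidate k; return (score, model) with the best silhouette."""
    best = None
    for k in k_range:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        score = silhouette_score(X, model.labels_)
        if best is None or score > best[0]:
            best = (score, model)
    return best

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs; silhouette should favor k=2.
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
score, model = best_kmeans(X)
```

The silhouette score rewards tight, well-separated clusters, which is why it works as an automatic check on the chosen segment count.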
## Local Development

### Set up the environment

```bash
cd ml-worker
cp .env.example .env
```

Edit `.env` to set your local database URL:

```bash
DATABASE_URL=postgresql://user:password@localhost:5432/kaireon
```

This must point at the same PostgreSQL database the platform uses.

### Install dependencies

```bash
pip install -r requirements.txt
```

### Start the ML Worker

```bash
python -m uvicorn app.main:app --host 0.0.0.0 --port 8000
```

### Configure the platform

Add to your `platform/.env`:

```bash
ML_WORKER_URL=http://localhost:8000
```

Restart the Next.js dev server to pick up the change.

### Verify the connection

```bash
curl http://localhost:8000/health
```

Expected response:

```json
{"status":"ok","capabilities":["policy_analysis","segmentation","content_analysis"]}
```

In the KaireonAI UI, the ML Worker status badge under AI > Insights should show "Connected".
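If you script this check (for example in CI or a startup hook), a small standard-library poller can wait for the worker to become healthy. The URL and field names mirror the health response above; the `fetch` parameter is only a seam for testing and is an assumption, not part of the worker's API:

```python
# Hypothetical helper: poll /health until the worker reports "ok".
import json
import time
import urllib.request

def wait_for_worker(url="http://localhost:8000/health", timeout=30.0,
                    interval=1.0, fetch=None):
    """Return the capabilities list once the worker is healthy, else raise TimeoutError."""
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=5) as resp:
                return json.loads(resp.read())
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            body = fetch(url)
            if body.get("status") == "ok":
                return body.get("capabilities", [])
        except OSError:
            pass  # worker not up yet; retry until the deadline
        time.sleep(interval)
    raise TimeoutError(f"ML Worker at {url} not healthy after {timeout}s")
```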
## Docker Setup

### Standalone Docker

```bash
docker run -d \
  --name kaireon-ml-worker \
  -p 8000:8000 \
  -e DATABASE_URL=postgresql://user:pass@host:5432/kaireon \
  422500312304.dkr.ecr.us-east-1.amazonaws.com/kaireon-ml:latest
```
### Docker Compose

The platform's docker-compose.yml includes the ML Worker under the `ml` profile:

```bash
# Start everything including the ML Worker
docker compose --profile ml up -d

# Start the platform without the ML Worker
docker compose up -d
```
The ML Worker automatically connects to PostgreSQL through PgBouncer using the same DATABASE_URL as the platform.
## Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| DATABASE_URL | Yes | — | PostgreSQL connection string (same database as the platform) |
| ML_WORKER_PORT | No | 8000 | Port to listen on |
The ML Worker connects to the same PostgreSQL database as the platform to read schema data directly. It does not need Redis or any other external dependencies.
## Kubernetes (Helm)

The Helm chart includes the ML Worker as an optional component. Enable it with:

```bash
helm install kaireon ./helm \
  --namespace kaireon \
  --set mlWorker.enabled=true \
  --set mlWorker.image.repository=422500312304.dkr.ecr.us-east-1.amazonaws.com/kaireon-ml \
  --set mlWorker.image.tag=latest
```
When `mlWorker.enabled=true`, the chart automatically:

- Creates a Deployment and Service for the ML Worker
- Injects ML_WORKER_URL into the API pods so the platform auto-connects
- Creates a ServiceAccount for the ML Worker pods
### Helm Values

| Value | Default | Description |
|---|---|---|
| mlWorker.enabled | false | Enable the ML Worker deployment |
| mlWorker.replicas | 1 | Number of replicas |
| mlWorker.image.repository | 422500312304.dkr.ecr.us-east-1.amazonaws.com/kaireon-ml | Container image |
| mlWorker.image.tag | latest | Image tag |
| mlWorker.resources.requests.cpu | 500m | CPU request |
| mlWorker.resources.requests.memory | 1Gi | Memory request |
| mlWorker.resources.limits.cpu | 2000m | CPU limit |
| mlWorker.resources.limits.memory | 4Gi | Memory limit |
The ML Worker can be memory-intensive during clustering and model training. Allocate at least 2Gi of memory for production workloads with datasets over 100K rows.
## Connecting the Platform

There are two ways to connect the platform to the ML Worker:
### 1. Environment Variable (recommended for local dev and Kubernetes)

Set `ML_WORKER_URL` in the platform's environment. The Helm chart does this automatically when `mlWorker.enabled=true`.

```bash
ML_WORKER_URL=http://localhost:8000
```
### 2. Settings UI (runtime configuration)
- Navigate to Settings > Integrations in the KaireonAI UI
- Find the ML Worker section
- Enter the ML Worker URL
- Click Test Connection to verify
- Save the configuration
The Settings UI configuration takes precedence over the environment variable.
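That precedence rule can be summarized in a one-line resolver. This is a sketch, not the platform's actual code; the settings dict and the `ml_worker_url` key are stand-ins for however the platform stores Settings UI values:

```python
# Sketch of the precedence rule: Settings UI value wins over the env var.
import os

def resolve_ml_worker_url(settings: dict, env=os.environ):
    """Return the effective ML Worker URL, or None if neither source sets it."""
    return settings.get("ml_worker_url") or env.get("ML_WORKER_URL")
```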
## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Health check with capabilities list |
| /analyze/policies | POST | Submit a policy analysis job |
| /analyze/segments | POST | Submit a segmentation job |
| /analyze/content | POST | Submit a content analysis job |
| /status/{job_id} | GET | Poll job status and results |
All analysis endpoints are asynchronous — they return a jobId immediately and process in the background.
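A caller therefore submits once and then polls. The sketch below shows that flow with only the standard library; the `jobId` field comes from the docs above, while the `state` field, its values, and the `http` parameter (a test seam) are assumptions about the response shape:

```python
# Illustrative client for the submit-then-poll job flow.
import json
import time
import urllib.request

def run_analysis(base_url, endpoint, payload, poll_interval=2.0, http=None):
    """Submit a job to an /analyze/* endpoint and block until it finishes."""
    if http is None:
        def http(method, url, body=None):
            data = json.dumps(body).encode() if body is not None else None
            req = urllib.request.Request(url, data=data, method=method,
                                         headers={"Content-Type": "application/json"})
            with urllib.request.urlopen(req) as resp:
                return json.loads(resp.read())
    job = http("POST", f"{base_url}{endpoint}", payload)
    while True:
        status = http("GET", f"{base_url}/status/{job['jobId']}")
        if status.get("state") in ("completed", "failed"):
            return status
        time.sleep(poll_interval)
```

A bounded retry or timeout around the polling loop would be a sensible addition for production callers.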
## Large Dataset Warning Flow
When a dataset contains 5,000 or more rows, the KaireonAI UI shows a confirmation dialog before starting analysis. The dialog provides:
- Accuracy — ML Worker algorithms (K-Means, logistic regression, TF-IDF) are more accurate than LLM pattern matching on large datasets
- Cost estimate — Approximate token count and cost if the user proceeds with LLM analysis
- Speed — ML Worker processes data locally in seconds vs. LLM round-trip latency
The user can choose Use ML Worker or Proceed with LLM. If the ML Worker is not connected, the dialog still appears but explains that LLM analysis will sample the data.
For datasets over 5,000 rows, the ML Worker is strongly recommended. LLM-based analysis samples data (up to 1,000 rows for segmentation) which reduces accuracy. The ML Worker processes the entire dataset.
For details on configuring analysis parameters, see AI Configuration.
## Troubleshooting

| Issue | Solution |
|---|---|
| Worker not detected | Check the URL in Settings > Integrations or verify the ML_WORKER_URL env var. Ensure the worker is running and accessible. |
| Health check fails | Verify DATABASE_URL is correct and the worker can reach PostgreSQL. Check logs with `docker logs kaireon-ml-worker`. |
| Out of memory during analysis | Increase memory limits. For datasets over 500K rows, use at least 4Gi. |
| `ModuleNotFoundError: sklearn` | Run `pip install scikit-learn`; the PyPI package name differs from the import name. |
## Next Steps