## Overview

KaireonAI's AI features (chat assistant, insights, content intelligence, rule builder) can run against any OpenAI-compatible LLM endpoint. This guide covers self-hosted options for environments where external API calls are not permitted.
## Quick Start with Ollama

Ollama is the fastest way to run a local LLM. It runs on macOS, Linux, and Windows.
### 1. Install Ollama

```bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# macOS (Homebrew)
brew install ollama
```
### 2. Pull a Model

```bash
# Recommended for tool calling (AI assistant features)
ollama pull qwen2.5:7b    # 4.7GB, fast, good at tool calling
ollama pull llama3.1      # 4.7GB, general purpose

# For better quality (needs 16GB+ RAM)
ollama pull qwen2.5:14b   # 9GB, excellent quality
ollama pull llama3.1:70b  # 40GB, near-GPT-4 quality (needs 64GB RAM)
```
### 3. Start Ollama

On most platforms Ollama starts its server automatically after install; if it is not running, start it with `ollama serve`. The server listens on http://localhost:11434 by default.

### 4. Configure KaireonAI

Navigate to Settings > AI Configuration and set:
| Setting | Value |
|---|---|
| Provider | `ollama` |
| Model | `qwen2.5:7b` (or your chosen model) |
| Base URL | `http://localhost:11434` |
| API Key | (leave empty for Ollama) |
Or via API:

```bash
# Set AI provider to Ollama
# (replace <your-kaireon-host> with your deployment's hostname)
curl -X PUT https://<your-kaireon-host>/api/v1/platform-settings \
  -H "Content-Type: application/json" \
  -d '{
    "category": "ai",
    "settings": {
      "ai_provider": "ollama",
      "ai_model": "qwen2.5:7b",
      "ai_base_url": "http://localhost:11434",
      "ai_api_key": ""
    }
  }'
```
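The same request body can be assembled and sanity-checked programmatically. A minimal sketch, assuming the payload shape shown in the curl example above (the `build_ai_settings` helper and its validation are illustrative, not part of the KaireonAI API):

```python
# Build the settings payload used by PUT /api/v1/platform-settings.
# The helper and required-key check below are illustrative only.
REQUIRED_KEYS = {"ai_provider", "ai_model", "ai_base_url", "ai_api_key"}

def build_ai_settings(provider: str, model: str, base_url: str,
                      api_key: str = "") -> dict:
    """Assemble the request body for the platform-settings endpoint."""
    settings = {
        "ai_provider": provider,
        "ai_model": model,
        "ai_base_url": base_url.rstrip("/"),  # trailing slash breaks some clients
        "ai_api_key": api_key,
    }
    missing = REQUIRED_KEYS - settings.keys()
    if missing:
        raise ValueError(f"missing settings: {missing}")
    return {"category": "ai", "settings": settings}

payload = build_ai_settings("ollama", "qwen2.5:7b", "http://localhost:11434")
```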
## Other Self-Hosted Options

### vLLM (GPU Server)

Best for production deployments with GPU instances.
```bash
# On a GPU instance (e.g., AWS g5.xlarge)
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
```
Configure in KaireonAI:

- Provider: `openai` (vLLM is OpenAI-compatible)
- Base URL: `http://your-gpu-server:8000/v1`
- Model: `meta-llama/Llama-3.1-8B-Instruct`
### LM Studio (Desktop)

Download from lmstudio.ai, load a model, and start the local server.

Configure in KaireonAI:

- Provider: `lmstudio`
- Base URL: `http://localhost:1234/v1`
- Model: (auto-detected)
### HuggingFace Text Generation Inference

```bash
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct
```

Configure: Provider `openai`, Base URL `http://localhost:8080/v1`
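All three servers above (vLLM, LM Studio, TGI) speak the OpenAI chat-completions protocol, which is why a base URL and model name are all KaireonAI needs. A sketch of the request such a backend receives — it only builds the URL and body, no network call is made, and the helper name is illustrative:

```python
import json

def chat_request(base_url: str, model: str, user_message: str) -> tuple[str, str]:
    """Return the (url, body) pair for an OpenAI-compatible chat completion."""
    # All OpenAI-compatible servers expose POST <base_url>/chat/completions
    url = f"{base_url.rstrip('/')}/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,
    })
    return url, body

url, body = chat_request("http://your-gpu-server:8000/v1",
                         "meta-llama/Llama-3.1-8B-Instruct", "ping")
```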
## Supported Providers

| Provider | Tool Calling | Streaming | Local | Cloud |
|---|---|---|---|---|
| Google (Gemini) | Yes | Yes | No | Yes |
| OpenAI (GPT) | Yes | Yes | No | Yes |
| Anthropic (Claude) | Yes | Yes | No | Yes |
| Ollama | Yes (qwen2.5, llama3.1) | Yes | Yes | No |
| LM Studio | Partial | Yes | Yes | No |
| vLLM | Yes | Yes | Yes | Yes |
| AWS Bedrock | Yes | Yes | No | Yes |
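The matrix above lends itself to a programmatic check, for example rejecting providers that cannot run in an air-gapped deployment. A sketch under the assumption that the table is authoritative (the dict structure and field names are illustrative):

```python
# Capability flags transcribed from the table above; structure is illustrative.
PROVIDERS = {
    "google":    {"tool_calling": True,  "streaming": True, "local": False},
    "openai":    {"tool_calling": True,  "streaming": True, "local": False},
    "anthropic": {"tool_calling": True,  "streaming": True, "local": False},
    "ollama":    {"tool_calling": True,  "streaming": True, "local": True},
    "lmstudio":  {"tool_calling": False, "streaming": True, "local": True},  # partial
    "vllm":      {"tool_calling": True,  "streaming": True, "local": True},
    "bedrock":   {"tool_calling": True,  "streaming": True, "local": False},
}

def local_providers_with_tools() -> list[str]:
    """Providers usable air-gapped with the AI assistant's tool calling."""
    return [name for name, caps in PROVIDERS.items()
            if caps["local"] and caps["tool_calling"]]
```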
## Bring Your Own Key (BYOK)
On the KaireonAI Playground (playground.kaireonai.com), each registered user can configure their own LLM provider:
- Go to Settings > AI Configuration
- Select your preferred provider
- Enter your API key (encrypted at rest, never shared)
- Your key is scoped to your tenant only
API keys are encrypted with AES-256 before storage and are never returned in API responses; only a `****` mask is shown. Keys can be rotated at any time without affecting other tenants.
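Key masking in API responses might look like the following sketch. The function name and exact mask format are assumptions (the behavior described above only guarantees a `****` mask), and real encryption at rest would use a vetted AES-256 implementation, which is out of scope here:

```python
def mask_api_key(key: str) -> str:
    """Return a masked form of an API key safe to show in API responses."""
    if not key:
        return ""
    # Short keys are fully masked; longer keys keep the last 4 characters
    # so users can tell which key is configured (format is an assumption).
    if len(key) <= 8:
        return "****"
    return "****" + key[-4:]
```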
## Model Recommendations

| Use Case | Model | RAM Required | Notes |
|---|---|---|---|
| Dev/testing | qwen2.5:7b | 8GB | Fast, good tool calling |
| Demo | llama3.1 | 8GB | Good general quality |
| Production (self-hosted) | qwen2.5:14b | 16GB | Best quality/speed balance |
| Enterprise | llama3.1:70b | 64GB | Near-cloud quality |
| Cloud (no infra) | gemini-2.5-flash | N/A | Free tier: 20 req/min |
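The table above implies a simple selection rule by available memory. A sketch using the thresholds from the table (the helper name is illustrative):

```python
def recommend_model(ram_gb: int) -> str:
    """Pick a local model from the recommendations table by available RAM."""
    if ram_gb >= 64:
        return "llama3.1:70b"   # near-cloud quality
    if ram_gb >= 16:
        return "qwen2.5:14b"    # best quality/speed balance
    if ram_gb >= 8:
        return "qwen2.5:7b"     # fast, good tool calling
    raise ValueError("at least 8GB RAM is recommended for local models")
```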
## Docker Deployment with Ollama

For Docker-based deployments, add Ollama as a sidecar:
```yaml
# docker-compose.yml
services:
  kaireon-api:
    image: 422500312304.dkr.ecr.us-east-1.amazonaws.com/kaireon-api:latest
    environment:
      - AI_PROVIDER=ollama
      - AI_MODEL=qwen2.5:7b
      - AI_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

  ollama:
    image: ollama/ollama
    volumes:
      - ollama_data:/root/.ollama
    # Pull the model on first start:
    #   docker exec ollama ollama pull qwen2.5:7b

volumes:
  ollama_data:
```