## Overview

KaireonAI’s AI features (chat assistant, insights, content intelligence, rule builder) can run against any OpenAI-compatible LLM endpoint. This guide covers self-hosted options for environments where external API calls are not permitted.

## Quick Start with Ollama
Ollama is the fastest way to run a local LLM. It runs on Mac, Linux, and Windows.

### 1. Install Ollama
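On Linux, Ollama provides an official install script (macOS and Windows users can download the installer from ollama.com):

```shell
# Download and run the official Ollama install script (Linux)
curl -fsSL https://ollama.com/install.sh | sh
```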
### 2. Pull a Model
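For example, to pull the model used in this guide:

```shell
# Download the qwen2.5 7B model to the local Ollama model store
ollama pull qwen2.5:7b
```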
### 3. Start Ollama

Ollama serves its API at http://localhost:11434 by default.
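If Ollama is not already running as a background service, start it manually:

```shell
# Start the Ollama server (listens on localhost:11434 by default)
ollama serve
```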
### 4. Configure in KaireonAI

Navigate to Settings > AI Configuration and set:

| Setting | Value |
|---|---|
| Provider | ollama |
| Model | qwen2.5:7b (or your chosen model) |
| Base URL | http://localhost:11434 |
| API Key | (leave empty for Ollama) |
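Before saving, you can verify the endpoint is reachable with a quick request against Ollama's OpenAI-compatible API (the model name is the one pulled above):

```shell
# Smoke-test the local endpoint; a working setup returns a JSON chat completion
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5:7b",
        "messages": [{"role": "user", "content": "Say hello"}]
      }'
```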
## Other Self-Hosted Options

### vLLM (GPU Server)
Best for production deployments with GPU instances.

- Provider: `openai` (vLLM is OpenAI-compatible)
- Base URL: `http://your-gpu-server:8000/v1`
- Model: `meta-llama/Llama-3.1-8B-Instruct`
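A minimal launch command, assuming vLLM is installed on the GPU server (tuning flags for your hardware are omitted):

```shell
# Serve Llama 3.1 8B Instruct via vLLM's OpenAI-compatible server on port 8000
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
```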
### LM Studio (Desktop)
Download from lmstudio.ai, load a model, and start the local server. Configure in KaireonAI:

- Provider: `lmstudio`
- Base URL: `http://localhost:1234/v1`
- Model: (auto-detected)
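To confirm the LM Studio server is running and see which model it has loaded, query its OpenAI-compatible models endpoint:

```shell
# Lists the model(s) currently served by LM Studio
curl http://localhost:1234/v1/models
```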
### HuggingFace Text Generation Inference

TGI exposes an OpenAI-compatible endpoint, so configure Provider `openai` with Base URL `http://localhost:8080/v1`.
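A sketch of launching TGI via Docker (the image tag and model ID here are illustrative; TGI listens on port 80 inside the container, mapped to 8080 on the host):

```shell
# Run TGI and expose it on localhost:8080
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct
```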
## Supported Providers
| Provider | Tool Calling | Streaming | Local | Cloud |
|---|---|---|---|---|
| Google (Gemini) | Yes | Yes | No | Yes |
| OpenAI (GPT) | Yes | Yes | No | Yes |
| Anthropic (Claude) | Yes | Yes | No | Yes |
| Ollama | Yes (qwen2.5, llama3.1) | Yes | Yes | No |
| LM Studio | Partial | Yes | Yes | No |
| vLLM | Yes | Yes | Yes | Yes |
| AWS Bedrock | Yes | Yes | No | Yes |
## Bring Your Own Key (BYOK)
On the KaireonAI Playground (playground.kaireonai.com), each registered user can configure their own LLM provider:
- Go to Settings > AI Configuration
- Select your preferred provider
- Enter your API key (encrypted at rest, never shared)
- Your key is scoped to your tenant only
API keys are encrypted using AES-256 before storage. They are never returned in API responses; only `****` masking is shown. Keys can be rotated at any time without affecting other tenants.

## Model Recommendations
| Use Case | Model | RAM Required | Notes |
|---|---|---|---|
| Dev/testing | qwen2.5:7b | 8GB | Fast, good tool calling |
| Demo | llama3.1 | 8GB | Good general quality |
| Production (self-hosted) | qwen2.5:14b | 16GB | Best quality/speed balance |
| Enterprise | llama3.1:70b | 64GB | Near-cloud quality |
| Cloud (no infra) | gemini-2.5-flash | N/A | Free tier: 20 req/min |