Overview

KaireonAI’s AI features (chat assistant, insights, content intelligence, rule builder) can run against any OpenAI-compatible LLM endpoint. This guide covers self-hosted options for environments where external API calls are not permitted.

Quick Start with Ollama

Ollama is the fastest way to run a local LLM. It runs on macOS, Linux, and Windows.

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# macOS (Homebrew)
brew install ollama

2. Pull a Model

# Recommended for tool calling (AI assistant features)
ollama pull qwen2.5:7b        # 4.7GB, fast, good at tool calling
ollama pull llama3.1          # 4.7GB, general purpose

# For better quality (needs 16GB+ RAM)
ollama pull qwen2.5:14b       # 9GB, excellent quality
ollama pull llama3.1:70b      # 40GB, near-GPT-4 quality (needs 64GB RAM)

3. Start Ollama

ollama serve
Ollama runs on http://localhost:11434 by default.
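
A quick way to confirm the server is reachable is Ollama's `/api/tags` endpoint, which lists the models you have pulled. A minimal Python sketch (function names are illustrative, not part of KaireonAI):

```python
# Check that Ollama is up and see which models are available locally.
import json
import urllib.request

def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from Ollama's /api/tags response."""
    return [m["name"] for m in payload.get("models", [])]

def list_local_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Fetch the list of pulled models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return parse_model_names(json.load(resp))

# With the server running, list_local_models() returns names like "qwen2.5:7b".
```

If the call raises a connection error, `ollama serve` is not running or is bound to a different address.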

4. Configure in KaireonAI

Navigate to Settings > AI Configuration and set:
| Setting  | Value                             |
|----------|-----------------------------------|
| Provider | ollama                            |
| Model    | qwen2.5:7b (or your chosen model) |
| Base URL | http://localhost:11434            |
| API Key  | (leave empty for Ollama)          |
Or via API:
# Set AI provider to Ollama
curl -X PUT /api/v1/platform-settings \
  -H "Content-Type: application/json" \
  -d '{
    "category": "ai",
    "settings": {
      "ai_provider": "ollama",
      "ai_model": "qwen2.5:7b",
      "ai_base_url": "http://localhost:11434",
      "ai_api_key": ""
    }
  }'
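
After configuring, you can smoke-test the endpoint directly. Ollama exposes an OpenAI-compatible `/v1/chat/completions` route, so a standard chat request works; the helper names and prompt below are illustrative:

```python
# Send a minimal OpenAI-compatible chat request to the configured endpoint.
import json
import urllib.request

def chat_payload(model: str, prompt: str) -> dict:
    """Build a minimal /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(base_url: str, model: str, prompt: str) -> str:
    """POST a chat completion and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# chat("http://localhost:11434", "qwen2.5:7b", "Say hello")
```

A successful reply confirms the model is loaded and the base URL is correct before you exercise the KaireonAI assistant features.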

Other Self-Hosted Options

vLLM (GPU Server)

Best for production deployments with GPU instances.
# On a GPU instance (e.g., AWS g5.xlarge)
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
Configure in KaireonAI:
  • Provider: openai (vLLM is OpenAI-compatible)
  • Base URL: http://your-gpu-server:8000/v1
  • Model: meta-llama/Llama-3.1-8B-Instruct
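
Because vLLM is OpenAI-compatible, you can confirm the served model id matches your configured model name via the standard `/v1/models` endpoint. A small sketch (the server URL is the placeholder from above):

```python
# Verify which model ids an OpenAI-compatible server (e.g. vLLM) is serving.
import json
import urllib.request

def served_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-compatible /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]

def list_models(base_url: str) -> list[str]:
    """GET {base_url}/models and return the served model ids."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return served_model_ids(json.load(resp))

# list_models("http://your-gpu-server:8000/v1")
```

The Model value in KaireonAI must match one of the returned ids exactly, or requests will be rejected.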

LM Studio (Desktop)

Download from lmstudio.ai, load a model, and start the local server. Configure in KaireonAI:
  • Provider: lmstudio
  • Base URL: http://localhost:1234/v1
  • Model: (auto-detected)

HuggingFace Text Generation Inference

docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct
Configure: Provider openai, Base URL http://localhost:8080/v1

Supported Providers

| Provider           | Tool Calling             | Streaming | Local | Cloud |
|--------------------|--------------------------|-----------|-------|-------|
| Google (Gemini)    | Yes                      | Yes       | No    | Yes   |
| OpenAI (GPT)       | Yes                      | Yes       | No    | Yes   |
| Anthropic (Claude) | Yes                      | Yes       | No    | Yes   |
| Ollama             | Yes (qwen2.5, llama3.1)  | Yes       | Yes   | No    |
| LM Studio          | Partial                  | Yes       | Yes   | No    |
| vLLM               | Yes                      | Yes       | Yes   | Yes   |
| AWS Bedrock        | Yes                      | Yes       | No    | Yes   |

Bring Your Own Key (BYOK)

On the KaireonAI Playground (playground.kaireonai.com), each registered user can configure their own LLM provider:
  1. Go to Settings > AI Configuration
  2. Select your preferred provider
  3. Enter your API key (encrypted at rest, never shared)
  4. Your key is scoped to your tenant only
API keys are encrypted using AES-256 before storage. They are never returned in API responses; only a **** mask is shown. Keys can be rotated at any time without affecting other tenants.

Model Recommendations

| Use Case                 | Model            | RAM Required | Notes                      |
|--------------------------|------------------|--------------|----------------------------|
| Dev/testing              | qwen2.5:7b       | 8GB          | Fast, good tool calling    |
| Demo                     | llama3.1         | 8GB          | Good general quality       |
| Production (self-hosted) | qwen2.5:14b      | 16GB         | Best quality/speed balance |
| Enterprise               | llama3.1:70b     | 64GB         | Near-cloud quality         |
| Cloud (no infra)         | gemini-2.5-flash | N/A          | Free tier: 20 req/min      |

Docker Deployment with Ollama

For Docker-based deployments, add Ollama as a sidecar:
# docker-compose.yml
services:
  kaireon-api:
    image: 422500312304.dkr.ecr.us-east-1.amazonaws.com/kaireon-api:latest
    environment:
      - AI_PROVIDER=ollama
      - AI_MODEL=qwen2.5:7b
      - AI_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

  ollama:
    image: ollama/ollama
    volumes:
      - ollama_data:/root/.ollama
    # Pull model on first start:
    # docker exec ollama ollama pull qwen2.5:7b

volumes:
  ollama_data:
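
Instead of pulling the model manually, you can add a one-shot helper service that pulls it through the running Ollama server on startup. This is a sketch: the service name is illustrative, and it assumes the Ollama CLI's `OLLAMA_HOST` environment variable to target the sidecar:

```yaml
  # One-shot service: pulls the model via the ollama sidecar, then exits.
  ollama-pull:
    image: ollama/ollama
    depends_on:
      - ollama
    environment:
      - OLLAMA_HOST=http://ollama:11434
    entrypoint: ["ollama", "pull", "qwen2.5:7b"]
```

The pulled weights land in the `ollama_data` volume owned by the `ollama` service, so they persist across container restarts.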