Experiments - KaireonAI

Experiments list view in the Algorithms module

How experiments work

KaireonAI runs experiments using a champion/challenger pattern, not a generic control/treatment A/B test. The vocabulary maps as follows:

KaireonAI term	What it is
Champion	The model that is currently live. It scores candidates for whichever fraction of traffic isn’t allocated to a challenger or the holdout.
Challenger	A candidate model competing against the champion. Each challenger has a `trafficPct` (0–100) controlling how many requests it scores. Multiple challengers can be active at once.
Holdout	A fraction of customers that bypass the experiment entirely (`holdoutPercent`). They receive a control behavior — typically the baseline non-personalized response — and become the denominator in uplift math.
Variant assignment	The engine deterministically hashes `${experimentKey}:${customerId}` (FNV-1a, salted by experiment key to prevent cross-experiment correlation) to pick one bucket: `__champion__`, a specific challenger key, or `__holdout__`. Assignments are persisted in `variant_assignments` with a 30-day TTL so the same customer is always treated by the same model on every call within that window, even if the experiment’s traffic split is reconfigured.
Uplift	`treatmentConversionRate − holdoutConversionRate` (absolute) or `(treatment − holdout) / holdout` (relative). Significance is a two-proportion z-test with a pooled standard error; the platform reports the z-score, two-tailed p-value, and Wilson 95% CIs per variant.
Auto-promote	When `autoPromote: true`, if a challenger beats champion by `promoteThreshold` (AUC or conversion-rate delta) after `promoteAfterDays`, the platform swaps the challenger into the champion slot.
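
The variant-assignment row above can be sketched as follows. This is a minimal illustration, not the platform's implementation. The source only states that `${experimentKey}:${customerId}` is hashed with FNV-1a. The 32-bit variant, the mapping from hash to a 0–100 point (modulo 10,000), and the bucket layout (holdout first, then champion, then challengers over the remaining traffic) are assumptions. Persisting the result in `variant_assignments` is omitted.

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # standard 32-bit FNV prime


def fnv1a_32(text: str) -> int:
    """32-bit FNV-1a over the UTF-8 bytes of `text`."""
    h = FNV_OFFSET_BASIS
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h


def assign_variant(experiment_key, customer_id, champion_pct,
                   challengers, holdout_percent=0.0):
    """Pick one bucket for a customer.

    `challengers` is a list of (challenger_key, traffic_pct) pairs.
    The bucket layout is an assumption, not documented behavior.
    """
    # Salt with the experiment key, mirroring `${experimentKey}:${customerId}`.
    point = fnv1a_32(f"{experiment_key}:{customer_id}") % 10_000 / 100.0

    # Assumed: the holdout slice is carved off first.
    if point < holdout_percent:
        return "__holdout__"

    # Rescale the remaining traffic to 0..100 so championPct and the
    # challenger percentages (which sum to 100) apply to non-holdout traffic.
    scaled = (point - holdout_percent) / (100.0 - holdout_percent) * 100.0
    if scaled < champion_pct:
        return "__champion__"
    cumulative = champion_pct
    for key, pct in challengers:
        cumulative += pct
        if scaled < cumulative:
            return key
    return "__champion__"  # defensive fallback for rounding at the upper edge
```

Because the hash depends only on the experiment key and customer ID, repeated calls for the same customer return the same bucket as long as the split is unchanged. The persisted assignment is what keeps the bucket stable when the split is reconfigured.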

The full lifecycle is draft → active → paused → completed. Status transitions happen via PUT /api/v1/experiments/{id} and are guarded by RBAC (admin/editor).

End-to-end data flow

   Tenant traffic                 Experiment engine               Persistence
   ─────────────                 ─────────────────              ─────────────
   /recommend  ──►  Sticky-hash customerId
                    against (champion 50% + chl_a 30% + chl_b 10% + holdout 10%)
                              │
                              ├─► champion?  ──► score with model_001
                              ├─► chl_a?      ──► score with model_002       ─► variant_assignments row
                              ├─► chl_b?      ──► score with model_003          (tenantId, customerId,
                              └─► holdout?    ──► return baseline (skip exp)     experimentKey, variantName)
                              │
                              ▼
                       /respond cycle records outcome  ─► interaction_history row

   /experiments/{id}/results  ─►  COUNT(*) per variant + COUNT(positive outcomes) per variant
                                       │
                                       ▼
                              two-proportion z-test
                                       │
                                       ▼
                              { uplift, pValue, significant, Wilson 95% CI, samples }
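
The counting step in the diagram (samples and positive outcomes per variant) can be illustrated with a small tally. This is a sketch under assumptions: the `(variant_name, converted)` row shape is hypothetical and stands in for whatever join the platform performs between `variant_assignments` and `interaction_history`.

```python
from collections import Counter


def tally_by_variant(rows):
    """Return {variant: (samples, conversions)} from (variant, converted) rows."""
    samples = Counter()
    conversions = Counter()
    for variant, converted in rows:
        samples[variant] += 1          # COUNT(*) per variant
        if converted:
            conversions[variant] += 1  # COUNT(positive outcomes) per variant
    return {variant: (samples[variant], conversions[variant]) for variant in samples}
```

The resulting per-variant counts are the inputs to the z-test and Wilson intervals described under Statistical methods.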

Champion/challenger mode

This is the default mode. Traffic is split among the champion and one or more challengers. Each model scores the same candidates but on different customers; outcomes feed back through /respond to compare conversion rates head-to-head. To set it up:

1. Train a candidate model alongside your live champion (see Algorithms & Models). It must reach registryStatus = "production" or "challenger" to be eligible for promotion via auto-promote.
2. POST /api/v1/experiments with:
   - championModelId: the current live model
   - trafficSplit: { championPct: 50 } (the fraction of non-holdout traffic the champion gets)
   - challengers: [{ modelId: "model_xyz", trafficPct: 50 }] (the competing model and its allocation)
   - holdoutPercent: 10 (bypass 10% of customers entirely for measurement)
   - status: "active"

   Validation rule: championPct + sum(challengers[].trafficPct) must equal 100. The endpoint returns 400 otherwise.
3. Fire traffic. Variant assignment is deterministic per customer: the same customerId lands on the same variant on every call, so personalization is consistent.
4. After enough samples (see requiredSampleSize in the results response), check /experiments/{id}/results for statistical significance.
5. Decide the winner. Either update the experiment manually via PUT /api/v1/experiments/{id}, setting status: "completed" and (optionally) swapping the championModelId, or let autoPromote: true handle it after promoteAfterDays if the challenger crosses promoteThreshold.
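
The auto-promote condition in the last step can be sketched as a simple predicate. This is an illustration under assumptions: it treats promoteThreshold as an absolute conversion-rate delta and counts promoteAfterDays from a hypothetical `started_at` timestamp. Whether the platform also requires statistical significance, and which clock it uses, is not stated in the source.

```python
from datetime import datetime, timedelta, timezone


def should_auto_promote(champion_rate, challenger_rate, promote_threshold,
                        started_at, now, promote_after_days):
    """True when the challenger leads by the threshold after the waiting period."""
    waited_long_enough = (now - started_at) >= timedelta(days=promote_after_days)
    beats_champion = (challenger_rate - champion_rate) >= promote_threshold
    return waited_long_enough and beats_champion


# Hypothetical values: a 3-point lead checked 15 days after the start.
start = datetime(2026, 3, 1, tzinfo=timezone.utc)
decision = should_auto_promote(0.08, 0.11, 0.02, start,
                               start + timedelta(days=15), 14)
```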

Shadow mode

Shadow mode is a different mechanism that lives on the model registry itself, not on the Experiment resource. Use it when you want to evaluate a candidate model on real production traffic without changing any decision the customer sees. How it works:

A model with registryStatus = "shadow" (or any model listed in a ScoreNode.shadowModelKeys array) scores every candidate in parallel with the live champion.
Shadow scores are written to the decision trace’s scoringResults[].shadowScores map, keyed by model key.
Shadow scores never enter ranking, never enter /recommend responses, never affect what the customer is shown. They are recording-only.
After enough traffic, you can compare per-customer ranking similarity (Kendall tau, top-K overlap, expected lift on observed outcomes) between champion and shadow off-line.
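
Two of the offline comparisons mentioned above, top-K overlap and Kendall tau, can be sketched in plain Python. This is a minimal illustration rather than the platform's tooling. It uses the tau-a variant (ties count as neither concordant nor discordant) and assumes scores are available as hypothetical `{candidate_id: score}` dictionaries.

```python
from itertools import combinations


def top_k_overlap(champion_ranking, shadow_ranking, k):
    """Fraction of k slots shared by the two models' top-k candidate lists."""
    top_champion = set(champion_ranking[:k])
    top_shadow = set(shadow_ranking[:k])
    return len(top_champion & top_shadow) / k


def kendall_tau(champion_scores, shadow_scores):
    """Kendall tau-a over the candidates scored by both models."""
    shared = sorted(set(champion_scores) & set(shadow_scores))
    concordant = discordant = 0
    for a, b in combinations(shared, 2):
        agreement = ((champion_scores[a] - champion_scores[b])
                     * (shadow_scores[a] - shadow_scores[b]))
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    pairs = len(shared) * (len(shared) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 0.0
```

A tau near 1 means the shadow model orders candidates almost identically to the champion. A tau near 0 or below means the rankings differ substantially and the shadow scores deserve a closer look before any A/B test.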

When to use shadow mode vs champion/challenger:

Question	Use this mode
Will this new model break in production? Does it produce sane scores under real feature distributions?	Shadow — zero customer-facing risk.
Does this new model actually improve conversion compared to the current one?	Champion/challenger — needs real customers in the treatment arm to measure outcomes.
Want to “warm up” a model before A/B’ing it?	Shadow first, then champion/challenger. Shadow surfaces obvious failures cheap; the A/B measures the actual lift.

To promote a model into shadow mode:

Models flow through a strict five-stage registry lifecycle: draft → shadow → challenger → champion → archived. Transitions are made through a dedicated promotion endpoint that enforces transition legality, the “one champion per family” invariant, and an auto-rollback guard, and writes an AuditLog row for every change. There is no direct PATCH on registryStatus; write attempts via PUT /algorithm-models/{id} are ignored.

curl -X POST https://playground.kaireonai.com/api/v1/algorithm-models/$MODEL_ID/promote \
  -H "Content-Type: application/json" \
  -d '{
    "toStatus": "shadow",
    "metricsSnapshot": { "auc_offline": 0.78 }
  }'

Body fields:
- toStatus (required): one of draft/shadow/challenger/champion/archived
- family (optional): grouping key; the “one champion per family” rule fires here
- bypassRollbackGuard (optional): admin escape hatch
- metricsSnapshot (optional): key-value map recorded with the promotion

Returns 409 if the rollback guard trips, 400 for invalid transitions, 404 if the model isn’t in your tenant.

The accompanying read endpoint GET /api/v1/algorithm-models/resolve-lifecycle returns the current champion plus all challengers and shadow models for the tenant, which is useful for diagnosing which models are wired into which lifecycle slot.

Alternatively, attach the model to a specific decision flow’s score node:

{
  "type": "score",
  "config": {
    "scoringStrategy": "propensity",
    "shadowModelKeys": ["candidate-gbm-v4", "candidate-nn-v2"]
  }
}

Shadow scores appear in every decision trace produced by that flow. Read them from decision_traces.scoringResults[].shadowScores or aggregate them via the Decision Traces API.
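
Reading shadow scores out of a trace might look like the sketch below. It assumes a trace has already been fetched and parsed into a Python dict. Only the `scoringResults[].shadowScores` path comes from the description above. Numeric score values and any other field in the sample trace are assumptions.

```python
from collections import defaultdict


def collect_shadow_scores(trace):
    """Group shadow scores by model key across all scoring results in one trace."""
    by_model = defaultdict(list)
    for result in trace.get("scoringResults", []):
        # shadowScores may be missing when no shadow model scored this candidate.
        for model_key, score in (result.get("shadowScores") or {}).items():
            by_model[model_key].append(score)
    return dict(by_model)


# Hypothetical trace fragment, for illustration only.
sample_trace = {
    "scoringResults": [
        {"shadowScores": {"candidate-gbm-v4": 0.71, "candidate-nn-v2": 0.64}},
        {"shadowScores": {"candidate-gbm-v4": 0.12}},
        {},
    ]
}
scores = collect_shadow_scores(sample_trace)
```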

Holdout group

The holdoutPercent field reserves a slice of customers who bypass the experiment entirely. The variant engine sticky-hashes them to __holdout__ and the platform delivers a baseline (no personalization, or whatever the kill_switch fallback is) for those calls. Outcomes still flow through /respond, which gives you the denominator for incrementality math:

incremental conversions per 1000 customers = (treatmentRate − holdoutRate) × 1000
revenue uplift = incremental conversions × average order value
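
As a worked illustration of the two formulas above, the sketch below plugs in the conversion rates from the sample results response (0.08 for treatment, 0.0596 for holdout). The average order value of 50.0 is a made-up figure for the example.

```python
def incremental_conversions_per_1000(treatment_rate, holdout_rate):
    """(treatmentRate - holdoutRate) x 1000."""
    return (treatment_rate - holdout_rate) * 1000


def revenue_uplift(incremental_conversions, average_order_value):
    """incremental conversions x average order value."""
    return incremental_conversions * average_order_value


incremental = incremental_conversions_per_1000(0.08, 0.0596)  # about 20.4
uplift_value = revenue_uplift(incremental, 50.0)              # about 1020 per 1000 customers
```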

A 10% holdout is the standard. Set it lower (5%) if traffic volume is small and you can’t afford to suppress recommendations; higher (20%) if you want tighter CIs on the holdout-side measurement. The holdout group is the SAME for the entire experiment. Don’t confuse it with per-challenger comparisons — those are champion-vs-challenger; the holdout is treatment-vs-no-treatment.

Statistical methods

The results endpoint uses three textbook procedures:

Two-proportion z-test for significance:

p̂_T = treatmentConversions / treatmentSamples
p̂_H = holdoutConversions / holdoutSamples
p̂_pooled = (treatmentConv + holdoutConv) / (treatmentSamples + holdoutSamples)
SE = √(p̂_pooled × (1 − p̂_pooled) × (1/n_T + 1/n_H))
z = (p̂_T − p̂_H) / SE
p_value = 2 × (1 − Φ(|z|))    [two-tailed]
significant ⇔ p_value < α (α = 0.05 at confidenceLevel = 0.95)

Φ is the standard normal CDF, approximated via Abramowitz–Stegun 7.1.26 with input z/√2 (accurate to ~1e-7). The platform’s implementation has been verified against textbook tables: at z = 1.96 it returns p = 0.0500 (matching the canonical 95% threshold), at z = 2.576 it returns p = 0.0100, etc.
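
The procedure above can be reproduced with the following sketch, which is an independent illustration and not the platform's code. It uses the Abramowitz–Stegun 7.1.26 coefficients for erf and the pooled standard error exactly as written in the formulas. The function names and the zero-variance fallback are assumptions.

```python
import math


def erf_as_7_1_26(x):
    """Abramowitz-Stegun 7.1.26 approximation of erf(x), max error about 1.5e-7."""
    sign = -1.0 if x < 0 else 1.0
    x = abs(x)
    t = 1.0 / (1.0 + 0.3275911 * x)
    # Horner form of a1*t + a2*t^2 + a3*t^3 + a4*t^4 + a5*t^5
    poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
               + t * (-1.453152027 + t * 1.061405429))))
    return sign * (1.0 - poly * math.exp(-x * x))


def normal_cdf(z):
    """Standard normal CDF via erf(z / sqrt(2))."""
    return 0.5 * (1.0 + erf_as_7_1_26(z / math.sqrt(2.0)))


def two_proportion_z_test(conv_t, n_t, conv_h, n_h, alpha=0.05):
    """Return (z, two-tailed p-value, significant) for treatment vs holdout."""
    p_t = conv_t / n_t
    p_h = conv_h / n_h
    pooled = (conv_t + conv_h) / (n_t + n_h)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_h))
    if se == 0:
        return 0.0, 1.0, False  # assumed fallback when both rates are 0 or 1
    z = (p_t - p_h) / se
    p_value = 2 * (1 - normal_cdf(abs(z)))
    return z, p_value, p_value < alpha


# Hypothetical counts: 120/1000 treatment conversions vs 90/1000 holdout conversions.
z, p_value, significant = two_proportion_z_test(120, 1000, 90, 1000)
```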

Wilson 95% confidence interval for each variant’s conversion rate. Wilson CIs are preferred over the normal-approximation interval because they remain valid for small samples and rates near 0 or 1.
Required sample size estimate, computed from the baseline rate, the minimum detectable effect (MDE), and the number of variants:

n ≈ ((z_α/2 + z_β)² × p̂(1 − p̂)) / (MDE²)

Reported as requiredSampleSize so operators know when they have enough power to declare a result.
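
The Wilson interval and the sample-size formula can be sketched as below. This is an illustration under assumptions: 80% power (z_β ≈ 0.84) is a conventional default the source does not state, MDE is treated as an absolute rate difference, and the formula is applied as written above, without any adjustment for the number of variants.

```python
import math
from statistics import NormalDist


def wilson_interval(conversions, samples, confidence=0.95):
    """Wilson score interval for a conversion rate."""
    if samples == 0:
        return 0.0, 1.0
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = conversions / samples
    denom = 1 + z * z / samples
    center = (p + z * z / (2 * samples)) / denom
    half = z * math.sqrt(p * (1 - p) / samples
                         + z * z / (4 * samples * samples)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def required_sample_size(baseline_rate, mde, alpha=0.05, power=0.80):
    """n = (z_alpha/2 + z_beta)^2 * p(1 - p) / MDE^2, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = (z_alpha + z_beta) ** 2 * baseline_rate * (1 - baseline_rate) / mde ** 2
    return math.ceil(n)


lower, upper = wilson_interval(416, 5200)        # 8% observed rate
n_needed = required_sample_size(0.06, 0.02)      # hypothetical baseline and MDE
```

Unlike the normal-approximation interval, the Wilson bounds stay inside [0, 1] even when a variant has zero conversions.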

Mode comparison at a glance

Mode	Affects live decisions?	Needs holdout?	Measures lift?	When to use
Champion alone (no experiment)	Yes	No	No	Steady-state operation.
Shadow mode	No — silent recording only	No	No (offline comparison only)	Pre-flight: prove a model is sane on real distributions.
Champion/challenger	Yes — challenger scores real customers	Optional but recommended	Yes, head-to-head between models	Production A/B to pick a winner.
Champion/challenger + Holdout	Yes for treatment arm, no for holdout	Yes	Yes for model-vs-model AND treatment-vs-no-treatment	Want both “which model is best?” AND “is personalization beating baseline?” answered simultaneously.

GET /api/v1/experiments

List all experiments with their champion model and challengers. Supports cursor-based pagination.

Response

{
  "data": [
    {
      "id": "exp_001",
      "key": "cc-propensity-v2-test",
      "name": "Credit Card: Scorecard vs Bayesian",
      "status": "active",
      "trafficSplit": { "championPct": 50 },
      "autoPromote": true,
      "promoteThreshold": 0.02,
      "promoteAfterDays": 14,
      "championModel": { "id": "model_001", "name": "Scorecard v3" },
      "challengers": [
        { "modelId": "model_002", "trafficPct": 50, "model": { "name": "Bayesian v1" } }
      ],
      "createdAt": "2026-03-01T10:00:00.000Z"
    }
  ],
  "pagination": {
    "total": 3,
    "hasMore": false,
    "limit": 50,
    "cursor": null
  }
}
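
Walking the paginated list could look like the sketch below. It relies only on the `data`, `pagination.hasMore`, and `pagination.cursor` fields shown in the response. How the cursor is sent back (query parameter name, headers) is not documented here, so the HTTP call is left to a caller-supplied `fetch_page` function, which is hypothetical.

```python
def iter_experiments(fetch_page):
    """Yield every experiment, following the cursor until hasMore is false.

    `fetch_page(cursor)` must return one parsed response body; pass None
    for the first page.
    """
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page["data"]
        pagination = page["pagination"]
        if not pagination["hasMore"] or pagination["cursor"] is None:
            break
        cursor = pagination["cursor"]
```

Keeping the transport outside the generator means the same loop works with any HTTP client and can be exercised with canned pages.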

POST /api/v1/experiments

Create a new experiment. Traffic split must sum to 100%.

Request Body

Field	Type	Required	Description
`key`	string	Yes	Unique experiment key
`name`	string	Yes	Display name
`description`	string	No	Description
`status`	string	No	One of: `draft`, `active`, `paused`, `archived`. Default: `"draft"`
`championModelId`	string	No	Champion model ID
`trafficSplit`	object	No	`{ championPct: number }`. Default: `{ championPct: 80 }`
`challengers`	array	No	`[{ modelId, trafficPct }]`
`autoPromote`	boolean	No	Auto-promote challenger if it wins. Default: `false`
`promoteThreshold`	number	No	Minimum uplift for auto-promotion. Default: `0.02`
`promoteAfterDays`	number	No	Days to wait before auto-promotion. Default: `14`

Validation

Traffic split must sum to 100%: championPct + sum(challengers[].trafficPct) must equal exactly 100. Returns 400 if not.
Key must be unique per tenant. Duplicate key returns 400.
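
A client-side pre-check for the traffic-split rule can save a round trip. The sketch below mirrors the documented rule (championPct plus all trafficPct values must equal exactly 100) and the 0–100 range stated for trafficPct. The function name and error messages are illustrative, and the server remains the source of truth.

```python
def validate_traffic_split(champion_pct, challengers):
    """Return an error message, or None when the split is valid."""
    for challenger in challengers:
        pct = challenger["trafficPct"]
        if not 0 <= pct <= 100:
            return f"trafficPct {pct} for {challenger.get('modelId')} is outside 0-100"
    total = champion_pct + sum(c["trafficPct"] for c in challengers)
    if total != 100:
        return f"traffic split sums to {total}, expected 100"
    return None


error = validate_traffic_split(50, [{"modelId": "model_002", "trafficPct": 50}])
```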

Example

curl -X POST https://playground.kaireonai.com/api/v1/experiments \
  -H "Content-Type: application/json" \
  -H "X-Tenant-Id: my-tenant" \
  -d '{
    "key": "cc-bayesian-test",
    "name": "Credit Card: Champion vs Bayesian",
    "championModelId": "model_001",
    "trafficSplit": { "championPct": 50 },
    "challengers": [{ "modelId": "model_002", "trafficPct": 50 }],
    "autoPromote": true,
    "promoteThreshold": 0.02,
    "promoteAfterDays": 14
  }'

Response: 201 Created

GET /api/v1/experiments/{id}

Get experiment details with champion and challenger models.

PUT /api/v1/experiments/{id}

Update an experiment. Challengers are replaced entirely if provided.

Request Body

All fields optional. Same as POST fields plus results (object) for storing outcome data.

DELETE /api/v1/experiments/{id}

Delete an experiment and its challengers. Response: 204 No Content

DELETE also works at the collection level: DELETE /api/v1/experiments?id={experimentId}. Both the path parameter and query parameter forms are supported.

GET /api/v1/experiments/{id}/results

Returns uplift analysis and statistical significance for treatment vs holdout. The endpoint first checks for live variant assignment data. If no assignments exist, it falls back to stored JSON results.

Response

{
  "experimentId": "exp_001",
  "experimentName": "Credit Card: Champion vs Bayesian",
  "status": "active",
  "hasResults": true,
  "dataSource": "live",
  "treatment": {
    "samples": 5200,
    "conversions": 416,
    "conversionRate": 0.08,
    "ci95Lower": 0.0728,
    "ci95Upper": 0.0877
  },
  "holdout": {
    "samples": 520,
    "conversions": 31,
    "conversionRate": 0.0596,
    "ci95Lower": 0.0419,
    "ci95Upper": 0.0839
  },
  "uplift": {
    "absolute": 0.0204,
    "relative": 0.3423
  },
  "significance": {
    "zScore": 1.65,
    "pValue": 0.0987,
    "isSignificant": false,
    "confidenceLevel": 0.95
  },
  "requiredSampleSize": 15000,
  "upliftAnalysis": {
    "treatmentConversionRate": 0.08,
    "holdoutConversionRate": 0.0596,
    "uplift": 0.0204,
    "relativeUplift": 0.3423,
    "zScore": 1.65,
    "pValue": 0.0987,
    "significant": false
  },
  "variants": [
    { "label": "Champion", "modelName": "Scorecard v3", "samples": 2600, "conversionRate": 0.079 },
    { "label": "Challenger 1", "modelName": "Bayesian v1", "samples": 2600, "conversionRate": 0.081 }
  ]
}

Statistical Methods

Two-proportion z-test for significance testing (p < 0.05)
Wilson confidence intervals for per-variant conversion rates
Required sample size estimation based on baseline rate and minimum detectable effect

Roles

Endpoint	Allowed Roles
`GET /experiments`	any authenticated
`POST /experiments`	admin, editor
`PUT /experiments/{id}`	admin, editor
`DELETE /experiments/{id}`	admin, editor
`GET /experiments/{id}/results`	any authenticated