
What it does

Pega's ADM Adaptive Boosting trainer is documented as learning near the decision boundary by oversampling marginal predictions. Kaireon’s counterfactual trainer is the open-source equivalent:
  1. Score every row of the labeled training set with the current gradient_boosted model.
  2. Identify marginal rows — predicted probability in [0.5 - marginalBand .. 0.5 + marginalBand] (default band 0.1).
  3. Generate K synthetic neighbors per marginal row by perturbing each numeric feature with gaussian noise scaled to the observed feature standard deviation (default 0.5σ, K=4).
  4. Append the synthetics to the training set; the augmented set is what gets sent to the existing trainGBMRemote() pipeline.
The Python ml-worker is unchanged. Augmentation lives entirely in TypeScript so the trainer stays a thin wire.
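The marginal-band check and gaussian perturbation in steps 2–3 can be sketched as follows. This is an illustrative sketch only; the function names, `Row` shape, and parameters here are assumptions for the example, not the real internals of `augmentWithCounterfactuals`:

```typescript
// Illustrative sketch of steps 2–3. Names and shapes are assumptions.
type Row = Record<string, number>;

// Step 2: a row is marginal when its predicted probability falls
// inside [0.5 - marginalBand, 0.5 + marginalBand] (default band 0.1).
function isMarginal(p: number, marginalBand = 0.1): boolean {
  return p >= 0.5 - marginalBand && p <= 0.5 + marginalBand;
}

// Box-Muller transform: one standard-normal sample from a uniform [0,1) RNG.
function gaussian(rng: () => number): number {
  const u = Math.max(rng(), Number.EPSILON); // avoid log(0)
  const v = rng();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Step 3: perturb each numeric feature with gaussian noise scaled to
// the observed feature standard deviation (default 0.5σ).
function perturbRow(row: Row, stdevs: Row, rng: () => number, scale = 0.5): Row {
  const out: Row = {};
  for (const [key, value] of Object.entries(row)) {
    out[key] = value + gaussian(rng) * stdevs[key] * scale;
  }
  return out;
}
```

Generating K synthetic neighbors per marginal row is then K calls to `perturbRow` with the same stdev table and a shared seeded RNG.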

Honest limits

  • Numeric only. Boolean and categorical features are copied unchanged onto synthetic rows; no perturbation rules are defined for them, since silently perturbing them would corrupt the training data.
  • Deterministic. mulberry32 RNG seeded by options.seed (default 7) so two runs with the same options produce the same synthetic data.
  • Bounded budget. maxSynthetic defaults to 10_000. The function surfaces summary.syntheticRowsAdded so operators can audit how much data was added on each training pass.
  • Not a feature-store substitute. This augments the training set at call time. It does not modify any persisted dataset.
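The determinism guarantee rests on mulberry32, a tiny public-domain 32-bit PRNG. The sketch below is the standard published implementation, shown to illustrate why two runs with the same `options.seed` (default 7) produce identical synthetic rows; how the augmenter actually threads the RNG through is not shown here:

```typescript
// mulberry32: deterministic 32-bit PRNG (standard public implementation).
// Same seed → identical sequence of uniform [0, 1) samples, so the
// synthetic rows are fully reproducible from the seed and inputs.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}
```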

Tenant-settings flag

Operators opt in per-tenant (the augmenter does not run by default; training time and cost rise in proportion to K × marginalCount):
{
  "aiAnalyzerSettings": {
    "ml": {
      "counterfactualTrainingEnabled": true,    // default false
      "marginalBand": 0.1,                      // 0..0.5
      "syntheticPerRow": 4,                     // 1..16
      "maxSynthetic": 10000                     // hard cap
    }
  }
}
The wire into lib/scoring/train.ts consumes these via the existing getImportSettings-pattern reader. With the flag off, the trainer bypasses the augmenter entirely.

API surface

The augmenter is a pure TS function. There is no HTTP endpoint — it runs inline at training time. Callers use it like:
import { augmentWithCounterfactuals } from "@/lib/ml/counterfactual-trainer";
import { trainGBMRemote } from "@/lib/ml-worker-client";
import { scoreGradientBoosted } from "@/lib/scoring/gradient-boosted";

const scorer = (features) => scoreGradientBoosted(currentModel, features).p;
const { augmented, summary } = augmentWithCounterfactuals({
  request,
  scorer,
  options: { marginalBand: 0.1, syntheticPerRow: 4, seed: 7 },
});
console.log("counterfactual augmentation:", summary);

const result = await trainGBMRemote(augmented);

Honest comparison with Pega’s claim

Pega documents Adaptive Boosting’s marginal-emphasis behavior at the algorithm level; the implementation lives behind their proprietary ADM. Kaireon’s augmenter takes the same approach but exposes the auditable function: every synthetic row is reproducible from the seed, the input set, and the scorer. What we don’t claim: Pega’s full ADM pipeline includes binned-Bayes predictor grouping, online incremental updates, and a learning-rate schedule that varies by row. Counterfactual augmentation is one of those mechanisms, not all of them. The composite ADM grade (§3.2 Adaptive learning) reflects this honestly.

Tests

platform/src/lib/ml/__tests__/counterfactual-trainer.test.ts — 8 tests cover: empty-set passthrough, numeric-vs-skipped feature classification, marginal-row identification, no-marginal-row passthrough, maxSynthetic cap, label preservation, boolean-feature unchanged, deterministic seed.