
Documentation Index

Fetch the complete documentation index at: https://docs.kaireonai.com/llms.txt

Use this file to discover all available pages before exploring further.

What it does

The counterfactual trainer is a pre-train hook that sharpens the decision boundary of the gradient_boosted model by augmenting the training set with synthetic rows near low-confidence predictions. It runs entirely in TypeScript before the existing remote-GBM training call, so the Python ml-worker stays unchanged. The augmenter produces an enriched training set plus a deterministic summary describing how many synthetic rows were added and which feature columns were perturbed.
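The exact return shape is not spelled out on this page; the following is a plausible sketch. Only summary.syntheticRowsAdded is documented below — the other field names (marginalRows, perturbedColumns, seed) and the type names are assumptions for illustration.

```typescript
// Hedged sketch of the augmenter's result types. Field names other than
// syntheticRowsAdded are hypothetical, not the library's actual API.
interface AugmentSummary {
  syntheticRowsAdded: number; // documented: lets operators audit added data
  marginalRows: number;       // hypothetical: rows that fell in the marginal band
  perturbedColumns: string[]; // hypothetical: numeric columns that were perturbed
  seed: number;               // hypothetical: echoed back for reproducibility
}

interface AugmentResult<Row> {
  augmented: Row[];           // original rows + synthetic neighbors
  summary: AugmentSummary;
}

// Example of the deterministic summary an operator might log:
const summary: AugmentSummary = {
  syntheticRowsAdded: 120,
  marginalRows: 30,
  perturbedColumns: ["score", "latencyMs"],
  seed: 7,
};
console.log("counterfactual augmentation:", summary);
```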

Honest limits

  • Numeric only. Boolean and categorical features are held as-is on synthetic rows. Perturbation rules for them are undefined, and silently perturbing them would corrupt training data.
  • Deterministic. A mulberry32 RNG is seeded by options.seed (default 7), so two runs with the same options produce the same synthetic data.
  • Bounded budget. maxSynthetic defaults to 10_000, and the function surfaces summary.syntheticRowsAdded so operators can audit how much data was added per training pass.
  • Not a feature-store substitute. This augments the training set at call time. It does not modify any persisted dataset.
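The "numeric only" rule amounts to a column classification pass: a column is perturbable only when every row holds a finite number in it. A minimal sketch, assuming rows are flat records — the helper name classifyColumns is illustrative, not the library's API:

```typescript
type FeatureValue = number | boolean | string;
type FeatureRow = Record<string, FeatureValue>;

// Illustrative sketch: columns where every row holds a finite number are
// perturbable; boolean/categorical columns are copied onto synthetic rows
// unchanged, never perturbed.
function classifyColumns(rows: FeatureRow[]): { perturbable: string[]; heldAsIs: string[] } {
  const perturbable: string[] = [];
  const heldAsIs: string[] = [];
  const columns = rows.length > 0 ? Object.keys(rows[0]) : [];
  for (const col of columns) {
    const allNumeric = rows.every(
      (r) => typeof r[col] === "number" && Number.isFinite(r[col] as number)
    );
    (allNumeric ? perturbable : heldAsIs).push(col);
  }
  return { perturbable, heldAsIs };
}

const { perturbable, heldAsIs } = classifyColumns([
  { latencyMs: 120, isRetry: false, region: "eu-west" },
  { latencyMs: 95, isRetry: true, region: "us-east" },
]);
// perturbable: ["latencyMs"]; heldAsIs: ["isRetry", "region"]
```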

Tenant-settings flag

Operators opt in per-tenant (the augmenter does not run by default, since training time and cost rise in proportion to K × marginalCount):
{
  "aiAnalyzerSettings": {
    "ml": {
      "counterfactualTrainingEnabled": true,    // default false
      "marginalBand": 0.1,                      // 0..0.5
      "syntheticPerRow": 4,                     // 1..16
      "maxSynthetic": 10000                     // hard cap
    }
  }
}
The wiring in lib/scoring/train.ts consumes these via the existing getImportSettings-pattern reader. With the flag off, the trainer bypasses the augmenter entirely.
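A hedged sketch of what that reader might look like, applying the documented defaults and ranges (flag default false, marginalBand 0..0.5, syntheticPerRow 1..16, maxSynthetic hard cap 10_000). The names MlSettings and readMlSettings are assumptions; the real reader follows the existing getImportSettings pattern, whose shape is not shown on this page:

```typescript
// Illustrative settings reader: apply defaults, clamp to documented ranges.
interface MlSettings {
  counterfactualTrainingEnabled: boolean; // default false
  marginalBand: number;                   // clamped to 0..0.5
  syntheticPerRow: number;                // clamped to 1..16
  maxSynthetic: number;                   // hard cap 10_000
}

const clamp = (v: number, lo: number, hi: number) => Math.min(hi, Math.max(lo, v));

function readMlSettings(raw: Partial<MlSettings> | undefined): MlSettings {
  return {
    counterfactualTrainingEnabled: raw?.counterfactualTrainingEnabled ?? false,
    marginalBand: clamp(raw?.marginalBand ?? 0.1, 0, 0.5),
    syntheticPerRow: Math.round(clamp(raw?.syntheticPerRow ?? 4, 1, 16)),
    maxSynthetic: Math.min(raw?.maxSynthetic ?? 10_000, 10_000),
  };
}

// With no tenant override, the flag stays off and the augmenter is bypassed:
const settings = readMlSettings(undefined);
// settings.counterfactualTrainingEnabled === false
```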

API surface

The augmenter is a pure TS function. There is no HTTP endpoint — it runs inline at training time. Callers use it like:
import { augmentWithCounterfactuals } from "@/lib/ml/counterfactual-trainer";
import { trainGBMRemote } from "@/lib/ml-worker-client";
import { scoreGradientBoosted } from "@/lib/scoring/gradient-boosted";

const scorer = (features) => scoreGradientBoosted(currentModel, features).p;
const { augmented, summary } = augmentWithCounterfactuals({
  request,
  scorer,
  options: { marginalBand: 0.1, syntheticPerRow: 4, seed: 7 },
});
console.log("counterfactual augmentation:", summary);

const result = await trainGBMRemote(augmented);

Algorithm — what it does, what it doesn’t

What the augmenter does, step by step:
  1. Score every row of the labeled training set with the current gradient_boosted model via the supplied scorer.
  2. Identify marginal rows — predicted probability in [0.5 - marginalBand .. 0.5 + marginalBand] (default band 0.1).
  3. For each marginal row, generate K synthetic neighbors by perturbing each numeric feature with Gaussian noise scaled to the observed feature standard deviation (default 0.5σ, K=4).
  4. Append the synthetic rows to the training set and return the augmented set with a reproducibility summary.
What the augmenter does not do:
  • It does not perturb boolean or categorical features. Those are held as-is on synthetic rows.
  • It does not perform binned-Bayes predictor grouping or online incremental updates. The augmenter only widens the training set; the gradient-boosted trainer itself does the learning.
  • It does not vary the learning-rate schedule per row. The augmented set is fed to the remote GBM trainer with the same hyperparameters as any other training pass.
  • It does not modify any persisted dataset. Augmentation happens at call time only.
Every synthetic row is reproducible from the seed, the input set, and the scorer.
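The four steps can be sketched end to end. This is a minimal illustration, not the library's implementation: mulberry32 and the default knobs (band 0.1, K=4, 0.5σ noise, seed 7, 10_000 cap) come from this page, while the function and field names below are assumptions.

```typescript
type Row = { features: Record<string, number | boolean | string>; label: number };

// mulberry32: tiny deterministic PRNG — same seed, same synthetic rows.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Box-Muller: two uniform draws → one standard-normal draw.
function gaussian(rng: () => number): number {
  const u = 1 - rng(); // avoid log(0)
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * rng());
}

function augmentSketch(
  rows: Row[],
  scorer: (features: Row["features"]) => number,
  opts = { marginalBand: 0.1, syntheticPerRow: 4, noiseScale: 0.5, maxSynthetic: 10_000, seed: 7 }
): { augmented: Row[]; syntheticRowsAdded: number } {
  const rng = mulberry32(opts.seed);

  // Per-column standard deviation over numeric features (step 3's noise scale).
  const numericCols = rows.length
    ? Object.keys(rows[0].features).filter((c) => rows.every((r) => typeof r.features[c] === "number"))
    : [];
  const std: Record<string, number> = {};
  for (const c of numericCols) {
    const vals = rows.map((r) => r.features[c] as number);
    const mean = vals.reduce((a, b) => a + b, 0) / vals.length;
    std[c] = Math.sqrt(vals.reduce((a, b) => a + (b - mean) ** 2, 0) / vals.length);
  }

  // Steps 1-2: score every row, keep those inside the marginal band.
  const marginal = rows.filter((r) => Math.abs(scorer(r.features) - 0.5) <= opts.marginalBand);

  // Step 3: K Gaussian neighbors per marginal row, numeric columns only,
  // bounded by maxSynthetic. Booleans/categoricals are copied unchanged.
  const synthetic: Row[] = [];
  for (const row of marginal) {
    for (let k = 0; k < opts.syntheticPerRow; k++) {
      if (synthetic.length >= opts.maxSynthetic) break;
      const features = { ...row.features };
      for (const c of numericCols) {
        features[c] = (features[c] as number) + gaussian(rng) * opts.noiseScale * std[c];
      }
      synthetic.push({ features, label: row.label }); // label preserved
    }
  }

  // Step 4: append and report.
  return { augmented: [...rows, ...synthetic], syntheticRowsAdded: synthetic.length };
}
```

Running this twice with the same inputs yields byte-identical synthetic rows, which is what makes summary-based audits meaningful.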

Tests

The counterfactual-trainer test suite ships 8 cases covering: empty-set passthrough, numeric-vs-skipped feature classification, marginal-row identification, no-marginal-row passthrough, maxSynthetic cap, label preservation, boolean-feature unchanged, deterministic seed.