What it does
The counterfactual trainer is a pre-train hook that sharpens the decision boundary of the gradient_boosted model by augmenting the
training set with synthetic rows near low-confidence predictions. It
runs entirely in TypeScript before the existing remote-GBM training
call, so the Python ml-worker stays unchanged.
The augmenter produces an enriched training set and a deterministic
summary describing how many synthetic rows were added and which
feature columns were perturbed.
Honest limits
- Numeric only. Boolean / categorical features are held as-is on synthetic rows. Perturbation rules for them are undefined — silently perturbing them would corrupt training data.
- Deterministic. mulberry32 RNG seeded by options.seed (default 7), so two runs with the same options produce the same synthetic data.
- Bounded budget. maxSynthetic defaults to 10_000. The function surfaces a summary.syntheticRowsAdded count so operators can audit how much data was added per training pass.
- Not a feature-store substitute. This augments the training set at call time. It does not modify any persisted dataset.
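The determinism claim can be illustrated with a minimal sketch of a mulberry32 generator. This is the commonly circulated form of the algorithm, not necessarily the exact code the augmenter ships with, and the variable names are illustrative only.

```typescript
// mulberry32: a small seeded 32-bit PRNG that returns floats in [0, 1).
// Sketch of the widely used public-domain variant; the augmenter's own
// implementation may differ in detail.
function mulberry32(seed: number): () => number {
  let state = seed | 0;
  return () => {
    state = (state + 0x6d2b79f5) | 0;
    let t = Math.imul(state ^ (state >>> 15), 1 | state);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Two generators built from the same seed (7 is the documented default).
const rngA = mulberry32(7);
const rngB = mulberry32(7);
const seqA = [rngA(), rngA(), rngA()];
const seqB = [rngB(), rngB(), rngB()];

// Same seed, same sequence of draws.
console.log(seqA.every((v, i) => v === seqB[i])); // true
```

Because every random draw comes from the seeded generator rather than Math.random, rerunning with the same options should reproduce the same synthetic rows.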
Tenant-settings flag
Operators opt in per-tenant. The augmenter does not run by default, because training time and cost rise proportionally to K × marginalCount.
lib/scoring/train.ts consumes the tenant settings via the existing
getImportSettings-pattern reader. With the flag off, the trainer
bypasses the augmenter entirely.
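To make the K × marginalCount cost concrete, here is a rough estimate of how many synthetic rows one training pass would add. estimateSyntheticRows is a hypothetical helper, not part of the codebase. It assumes the maxSynthetic budget simply truncates the total, which the text above does not spell out.

```typescript
// Hypothetical helper: estimates synthetic rows for one training pass.
// Assumption: the budget cap truncates the total at maxSynthetic.
function estimateSyntheticRows(
  marginalCount: number,
  k: number = 4, // documented default for K
  maxSynthetic: number = 10_000, // documented default budget
): number {
  return Math.min(k * marginalCount, maxSynthetic);
}

console.log(estimateSyntheticRows(1_500)); // 6000
console.log(estimateSyntheticRows(3_000)); // 10000 (12000 requested, capped)
```

Under these assumptions, a tenant with 1,500 marginal rows would train on 6,000 extra rows per pass, which is the figure to weigh before turning the flag on.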
API surface
The augmenter is a pure TS function. There is no HTTP endpoint; it runs inline at training time.
Algorithm: what it does, what it doesn't
What the augmenter does, step by step:
- Score every row of the labeled training set with the current gradient_boosted model via the supplied scorer.
- Identify marginal rows: those with predicted probability in [0.5 - marginalBand, 0.5 + marginalBand] (default band 0.1).
- For each marginal row, generate K synthetic neighbors by perturbing each numeric feature with gaussian noise scaled to the observed feature standard deviation (default 0.5σ, K=4).
- Append the synthetic rows to the training set and return the augmented set with a reproducibility summary.
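The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation. The names Row, Scorer, AugmentOptions, augment, sigmaScale, and perturbedColumns are hypothetical. Box-Muller is an assumed way of drawing gaussian noise. Synthetic rows are assumed to inherit the label of the row they were derived from. The budget is assumed to stop generation once maxSynthetic is reached.

```typescript
// Hypothetical types: the real augmenter's signatures are not shown in this page.
type Value = number | boolean | string;
type Row = { features: Record<string, Value>; label: number };
type Scorer = (row: Row) => number; // predicted probability in [0, 1]
interface AugmentOptions {
  marginalBand?: number; k?: number; sigmaScale?: number;
  maxSynthetic?: number; seed?: number;
}

// Seeded PRNG (common mulberry32 variant), floats in [0, 1).
function mulberry32(seed: number): () => number {
  let state = seed | 0;
  return () => {
    state = (state + 0x6d2b79f5) | 0;
    let t = Math.imul(state ^ (state >>> 15), 1 | state);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Box-Muller transform (assumed): two uniform draws become one standard-normal draw.
function gaussian(rand: () => number): number {
  const u1 = 1 - rand(); // shift to (0, 1] so Math.log never receives 0
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * rand());
}

function augment(rows: Row[], scorer: Scorer, opts: AugmentOptions = {}) {
  const { marginalBand = 0.1, k = 4, sigmaScale = 0.5, maxSynthetic = 10_000, seed = 7 } = opts;
  const first = rows[0];
  if (!first) return { rows: [], summary: { syntheticRowsAdded: 0, perturbedColumns: [] as string[] } };
  const rand = mulberry32(seed);

  // Numeric columns and their observed (population) standard deviations.
  const numericCols = Object.keys(first.features).filter((c) => typeof first.features[c] === "number");
  const std: Record<string, number> = {};
  for (const c of numericCols) {
    const xs = rows.map((r) => r.features[c] as number);
    const mean = xs.reduce((s, x) => s + x, 0) / xs.length;
    std[c] = Math.sqrt(xs.reduce((s, x) => s + (x - mean) ** 2, 0) / xs.length);
  }

  // Score every row and keep those within the band around 0.5.
  const marginal = rows.filter((r) => Math.abs(scorer(r) - 0.5) <= marginalBand);

  // K noisy neighbors per marginal row, stopping at the budget.
  const synthetic: Row[] = [];
  outer: for (const row of marginal) {
    for (let i = 0; i < k; i++) {
      if (synthetic.length >= maxSynthetic) break outer;
      const features: Record<string, Value> = { ...row.features }; // non-numeric values held as-is
      for (const c of numericCols) {
        features[c] = (row.features[c] as number) + gaussian(rand) * sigmaScale * (std[c] ?? 0);
      }
      synthetic.push({ features, label: row.label }); // label inherited (assumption)
    }
  }

  // Append the synthetic rows and report what was done.
  return {
    rows: [...rows, ...synthetic],
    summary: { syntheticRowsAdded: synthetic.length, perturbedColumns: numericCols },
  };
}

// Toy usage: only the first row scores inside the default band [0.4, 0.6].
const data: Row[] = [
  { features: { age: 30, income: 50, vip: true }, label: 1 },
  { features: { age: 40, income: 70, vip: false }, label: 0 },
  { features: { age: 50, income: 90, vip: true }, label: 1 },
];
const toyScorer: Scorer = (r) => (r.features["age"] === 30 ? 0.55 : 0.9);
const out = augment(data, toyScorer);
console.log(out.summary);
```

With the defaults, the single marginal row yields four synthetic neighbors, so the augmented set has seven rows. The vip flag is copied unchanged onto each neighbor, matching the numeric-only limit described earlier.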
What it does not do:
- It does not perturb boolean or categorical features. Those are held as-is on synthetic rows.
- It does not perform binned-Bayes predictor grouping or online incremental updates. The augmenter only widens the training set; the gradient-boosted trainer itself does the learning.
- It does not vary the learning-rate schedule per row. The augmented set is fed to the remote GBM trainer with the same hyperparameters as any other training pass.
- It does not modify any persisted dataset. Augmentation happens at call time only.