Uplift Modeling

Propensity says “will this customer convert?”. Uplift says “will showing this offer cause the conversion, or would they have converted anyway?”. The first ranks sure things (always-takers) at the top — wasting budget on people who’d buy without a touch. The second ranks persuadables at the top — exactly what you want. Kaireon implements two canonical metalearners from Künzel, Sekhon, Bickel & Yu (PNAS 2019): the T-learner and X-learner. Both are exposed via an HTTP endpoint and a small math library that takes pluggable base learners.

The four uplift segments

For each (customer × offer) pair, the CATE τ = E[Y(1) − Y(0) | X = x] and the per-arm conversion rates μ_T = E[Y | X, T=1] and μ_C = E[Y | X, T=0] together classify the customer into one of four segments:
| Segment | When | Decisioning implication |
| --- | --- | --- |
| Persuadable | τ high, μ_C low | Treatment causes the conversion. Rank these first. |
| Sure thing (always-taker) | τ ≈ 0, μ_C high | Would convert anyway. Save the impression. |
| Lost cause (never-taker) | τ ≈ 0, μ_T low | Won’t convert either way. Skip. |
| Sleeping dog (defier) | τ negative | Treatment causes them NOT to convert. Hide the offer. |
A fifth value, uncertain, is returned when a (τ, μ_T, μ_C) triple matches none of the four threshold rules, so the segment field takes one of five values, not four. Pure-propensity scoring cannot separate persuadables from sure things, because both have high μ_T. Separating them requires the control-arm rate μ_C as well, which is what uplift modeling estimates.
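The classification rules in the table can be sketched as a small function. This is a minimal illustration, not the platform's implementation: the threshold values, the `ClassifyConfig` field names, and every segment string other than `persuadable` and `uncertain` are assumptions made for the example.

```typescript
// Hypothetical threshold shape and values, for illustration only.
// The platform's built-in defaults are not documented in this section.
interface ClassifyConfig {
  tauHigh: number;     // tau at or above this counts as a meaningful positive lift
  tauNegative: number; // tau at or below this counts as a harmful effect
  tauNearZero: number; // |tau| at or below this counts as "no effect"
  rateHigh: number;    // conversion rate at or above this is "high"
  rateLow: number;     // conversion rate at or below this is "low"
}

type Segment =
  | "persuadable"
  | "sure_thing"
  | "lost_cause"
  | "sleeping_dog"
  | "uncertain";

const EXAMPLE_THRESHOLDS: ClassifyConfig = {
  tauHigh: 0.1,
  tauNegative: -0.05,
  tauNearZero: 0.02,
  rateHigh: 0.5,
  rateLow: 0.3,
};

function classifySegment(
  tau: number,
  muT: number,
  muC: number,
  cfg: ClassifyConfig = EXAMPLE_THRESHOLDS,
): Segment {
  // Negative effect: treatment suppresses conversion.
  if (tau <= cfg.tauNegative) return "sleeping_dog";
  // Large lift on top of a low control rate: treatment causes the conversion.
  if (tau >= cfg.tauHigh && muC <= cfg.rateLow) return "persuadable";
  const noEffect = Math.abs(tau) <= cfg.tauNearZero;
  // No effect and a high control rate: converts anyway.
  if (noEffect && muC >= cfg.rateHigh) return "sure_thing";
  // No effect and a low treated rate: does not convert either way.
  if (noEffect && muT <= cfg.rateLow) return "lost_cause";
  // None of the four rules matched.
  return "uncertain";
}
```

Under these assumed thresholds, the triple (τ = 0.18, μ_T = 0.42, μ_C = 0.24) lands in the persuadable segment, while a moderate lift such as τ = 0.05 falls through to uncertain.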

T-learner

Fit two regressions on disjoint subsets of interaction_history:
μ_T(x) = E[Y | X = x, T = 1]      // fit on treated rows
μ_C(x) = E[Y | X = x, T = 0]      // fit on control (holdout) rows
τ_T(x) = μ_T(x) − μ_C(x)
Strengths. Trivially decomposable; no propensity model needed.
Weaknesses. Each base learner sees only its own arm's rows, so neither uses the full dataset. The estimate is biased when the arms are heavily imbalanced, which is the realistic decisioning case: most customers are treated and the holdout is small.
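The three formulas above can be sketched with a pluggable base learner. The `Row`, `Predictor`, and `BaseLearner` types and the `fitTLearner` name are hypothetical and are not taken from uplift-cate.ts. The base learner here is deliberately trivial (it predicts the arm's mean outcome and ignores features), so the example stays short.

```typescript
// Hypothetical types for illustration; not the library's actual API.
interface Row {
  x: number[];      // feature vector
  treated: boolean; // true if the offer was shown
  y: number;        // outcome, e.g. 1 = converted, 0 = not
}
type Predictor = (x: number[]) => number;
type BaseLearner = (xs: number[][], ys: number[]) => Predictor;

// Simplest possible base learner: ignores features and predicts the sample mean.
const meanLearner: BaseLearner = (_xs, ys) => {
  const mean = ys.length === 0 ? 0 : ys.reduce((a, b) => a + b, 0) / ys.length;
  return () => mean;
};

function fitTLearner(rows: Row[], learner: BaseLearner = meanLearner) {
  const treated = rows.filter((r) => r.treated);
  const control = rows.filter((r) => !r.treated);
  // One model per arm, each fit only on that arm's rows.
  const muT = learner(treated.map((r) => r.x), treated.map((r) => r.y));
  const muC = learner(control.map((r) => r.x), control.map((r) => r.y));
  // CATE estimate is the difference of the two arm models.
  const tau: Predictor = (x) => muT(x) - muC(x);
  return { muT, muC, tau };
}
```

Swapping `meanLearner` for a real regressor changes only the fitting step; the subtraction that produces τ stays the same.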

X-learner

The X-learner addresses the imbalance weakness with a two-stage fit followed by a propensity-weighted combination.
Stage 1: fit μ_T and μ_C exactly as in the T-learner.
Stage 2: compute imputed treatment effects:
D_T(i) = Y_i − μ_C(X_i)      for each treated observation i
D_C(j) = μ_T(X_j) − Y_j      for each control observation j
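Fit τ_T(x) on (X_T, D_T) and τ_C(x) on (X_C, D_C). Here τ_T is the stage-2 regressor for the treated arm, not the T-learner estimate of the same name above.
Stage 3: combine using the propensity g(x) = P(T=1 | X=x):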
τ_X(x) = g(x) · τ_C(x) + (1 − g(x)) · τ_T(x)
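Strengths. Outperforms the T-learner under imbalanced treatment assignment (Künzel et al. 2019, §3 Lemma 1 and §4 simulations). When most rows are treated, g(x) is large and the weight shifts to τ_C, whose targets D_C are built from μ_T, the model fit on the larger arm. This keeps the estimate stable even when one arm is starved for data.
Weaknesses. Needs a propensity model. Two stages of fitting.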
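The stages above can be sketched as follows. As with any illustration of this kind, the type names and `fitXLearner` are hypothetical rather than the library's API, the base learner is a constant-mean placeholder, and the propensity function is passed in by the caller instead of being fitted.

```typescript
// Hypothetical types for illustration; not the library's actual API.
interface Row {
  x: number[];
  treated: boolean;
  y: number;
}
type Predictor = (x: number[]) => number;
type BaseLearner = (xs: number[][], ys: number[]) => Predictor;

// Placeholder base learner: ignores features and predicts the sample mean.
const meanLearner: BaseLearner = (_xs, ys) => {
  const mean = ys.length === 0 ? 0 : ys.reduce((a, b) => a + b, 0) / ys.length;
  return () => mean;
};

function fitXLearner(
  rows: Row[],
  propensity: Predictor, // g(x) = P(T=1 | X=x), supplied by the caller
  learner: BaseLearner = meanLearner,
) {
  const treated = rows.filter((r) => r.treated);
  const control = rows.filter((r) => !r.treated);
  const xT = treated.map((r) => r.x);
  const xC = control.map((r) => r.x);

  // Stage 1: per-arm outcome models, same as the T-learner.
  const muT = learner(xT, treated.map((r) => r.y));
  const muC = learner(xC, control.map((r) => r.y));

  // Stage 2: imputed treatment effects, then one regressor per arm.
  const dT = treated.map((r) => r.y - muC(r.x)); // D_T(i) = Y_i - mu_C(X_i)
  const dC = control.map((r) => muT(r.x) - r.y); // D_C(j) = mu_T(X_j) - Y_j
  const tauT = learner(xT, dT);
  const tauC = learner(xC, dC);

  // Stage 3: propensity-weighted combination.
  const tau: Predictor = (x) => {
    const g = propensity(x);
    return g * tauC(x) + (1 - g) * tauT(x);
  };
  return { muT, muC, tauT, tauC, tau };
}
```

With constant-mean learners both stage-2 regressors collapse to the same number, so the weighting has no visible effect. The difference between τ_T and τ_C only appears once the base learner actually uses the features.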

API

GET /api/v1/algorithm-models/{modelId}/uplift?customerId={customerId}&method=t_learner
Query params:
| Param | Required | Default | Description |
| --- | --- | --- | --- |
| customerId | yes | | Target customer. |
| method | no | t_learner | One of t_learner, x_learner. |
| offerIds | no | all active | Comma-separated offer IDs to score. |
| mode | no | marginal | marginal (cheap offer-vs-category posteriors) or fitted (real per-row T-/X-learner trained on interaction_history). |
| channelId | no | | Score-time channel context (only used by mode=fitted). |
| direction | no | inbound | Score-time direction: inbound or outbound (only used by mode=fitted). |
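A request URL for these parameters could be assembled as in the sketch below. The `UpliftQuery` interface, the `buildUpliftUrl` helper, and the base URL are assumptions for illustration. Authentication is not covered here and is left out.

```typescript
// Hypothetical helper for illustration; only the path and parameter names
// come from the table above.
interface UpliftQuery {
  customerId: string;
  method?: "t_learner" | "x_learner";
  offerIds?: string[];
  mode?: "marginal" | "fitted";
  channelId?: string;
  direction?: "inbound" | "outbound";
}

function buildUpliftUrl(baseUrl: string, modelId: string, q: UpliftQuery): string {
  // customerId is the only required parameter.
  const params: Array<[string, string]> = [["customerId", q.customerId]];
  if (q.method) params.push(["method", q.method]);
  if (q.offerIds && q.offerIds.length > 0) {
    params.push(["offerIds", q.offerIds.join(",")]); // comma-separated list
  }
  if (q.mode) params.push(["mode", q.mode]);
  if (q.channelId) params.push(["channelId", q.channelId]);
  if (q.direction) params.push(["direction", q.direction]);

  const query = params
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join("&");
  return `${baseUrl}/api/v1/algorithm-models/${encodeURIComponent(modelId)}/uplift?${query}`;
}
```

Omitted optional fields are simply left off the URL, so the server-side defaults in the table apply.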
Response:
{
  "customerId": "cust-abc",
  "modelId": "mdl-123",
  "method": "t_learner",
  "offers": [
    {
      "offerId": "off-platinum",
      "offerName": "Platinum Card",
      "tau": 0.18,
      "muT": 0.42,
      "muC": 0.24,
      "segment": "persuadable",
      "confidence": 0.82,
      "evidenceOffer": 1200,
      "evidenceCategory": 4800
    }
  ],
  "ate": { "ate": 0.05, "n": 8, "sd": 0.12 },
  "classify": { /* segment thresholds */ }
}
confidence is a sample-size heuristic: 1 − exp(−min(n_offer, n_category) / 50). At n=50 → 0.63, n=200 → 0.98.
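The heuristic is easy to reproduce client-side, for example to sanity-check a response. The function name below is hypothetical, and the clamp at zero is an added safeguard rather than documented behavior.

```typescript
// Sample-size heuristic from the formula above: 1 - exp(-min(n_offer, n_category) / 50).
// The function name is hypothetical; clamping negative counts to 0 is an assumption.
function evidenceConfidence(nOffer: number, nCategory: number): number {
  const n = Math.max(0, Math.min(nOffer, nCategory));
  return 1 - Math.exp(-n / 50);
}
```

Because the smaller of the two evidence counts drives the result, a well-observed category cannot compensate for a thinly observed offer.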

Honest scope

The endpoint at /algorithm-models/{id}/uplift uses the platform’s existing ModelAdaptation posteriors as the base learners:
  • μ_T(offer) = offer-scope positiveRate
  • μ_C(offer) = category-scope positiveRate (control proxy: “what this customer would do if shown a different offer in the same category”)
  • g(offer) = offer-evidence / (offer-evidence + category-evidence)
This is the marginal CATE path (mode=marginal, the default): per-offer, not per-customer-features. It’s the right starting point because we already have the data.
Passing ?mode=fitted switches the same endpoint to a real per-row T-learner / X-learner. It fits μ_T / μ_C as separate logistic regressions on the treated vs. control (same-category) subsets of interaction_history, using up to MAX_TRAINING_ROWS = 5000 most-recent outcome rows, and scores them at the request-time context (channelId, direction). For method=x_learner in fitted mode it additionally fits the stage-2 lift regressors and a real propensity model g(x), but only when both arms have ≥ 30 rows; otherwise it falls back to constant stage-2 closures plus the evidence-fraction propensity.
Fitted results are cached in-process (100-entry LRU, 5-minute TTL, keyed on the latest interaction_history timestamp so new training data invalidates the cache).
To plug in entirely custom base learners, call the math library directly; see platform/src/lib/experimentation/uplift-cate.ts. The math is identical; only the model fitting changes.
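The marginal path's three inputs can be sketched from two posterior summaries. The `ScopePosterior` shape, its `evidence` field, the `marginalUplift` name, and the 0.5 fallback for zero evidence are assumptions; only `positiveRate` and the formulas come from the description above.

```typescript
// Hypothetical summary of one posterior scope (offer or category).
interface ScopePosterior {
  positiveRate: number; // observed conversion rate at this scope
  evidence: number;     // number of observations behind the rate (assumed field)
}

function marginalUplift(offer: ScopePosterior, category: ScopePosterior) {
  const muT = offer.positiveRate;    // treated proxy: this offer's rate
  const muC = category.positiveRate; // control proxy: same-category rate
  const total = offer.evidence + category.evidence;
  // Evidence-fraction propensity; 0.5 with no evidence is an assumed fallback.
  const g = total > 0 ? offer.evidence / total : 0.5;
  return { muT, muC, tau: muT - muC, g };
}
```

Fed the figures from the response example (rates 0.42 and 0.24, evidence 1200 and 4800), this gives τ of about 0.18 and g of about 0.2.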

How it improves ranking

When the uplift weight is non-zero, the final decision score becomes:
score = propensity^Wp × relevance^Wr × impact^Wi × emphasis^We × upliftMultiplier^Wu
upliftMultiplier = max(0.01, 0.5 + τ/2)        // τ ∈ [−1, 1] → multiplier ∈ [0.01, 1]
Wu is the uplift weight (default 0 for backward compat). Set it through a Ranking Profile’s uplift weight key (range 0..1) or the inline Score-node formula.upliftWeight (range 0..2); both map to the same upliftWeight term. The profile uplift key is now actually wired into the formula — it was previously documented but stripped by validation. With Wu > 0, the persuadable segment (positive τ) dominates the top of the ranking and sleeping-dogs (negative τ) drop. The engine stamps upliftTau and upliftMultiplier on each candidate’s trace. Test this on a holdout cohort before raising Wu above ~0.1 in production — uplift is a more aggressive ranker than pure propensity and can suppress evergreen offers if mis-tuned.
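The score formula can be worked through in a few lines. This sketch follows the two formulas above term by term; the `Factors` and `Weights` shapes and the function names are hypothetical and do not reflect the engine's internal types.

```typescript
// Hypothetical input shapes for illustration.
interface Factors {
  propensity: number;
  relevance: number;
  impact: number;
  emphasis: number;
  tau: number; // CATE estimate in [-1, 1]
}
interface Weights {
  wp: number;
  wr: number;
  wi: number;
  we: number;
  wu: number; // uplift weight; 0 disables the uplift term
}

// Maps tau in [-1, 1] to a multiplier in [0.01, 1].
function upliftMultiplier(tau: number): number {
  return Math.max(0.01, 0.5 + tau / 2);
}

function decisionScore(f: Factors, w: Weights): number {
  return (
    Math.pow(f.propensity, w.wp) *
    Math.pow(f.relevance, w.wr) *
    Math.pow(f.impact, w.wi) *
    Math.pow(f.emphasis, w.we) *
    Math.pow(upliftMultiplier(f.tau), w.wu) // equals 1 when wu = 0
  );
}
```

With Wu = 0 the last factor is 1 for any τ, which is why the default leaves existing rankings unchanged. With Wu > 0, two candidates that are otherwise identical are ordered by τ.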

Configuration

Per-tenant in Settings → Models → Uplift:
| Setting | Default | Effect |
| --- | --- | --- |
| upliftMethodDefault | t_learner | Default method used when callers don’t pass method; the only persisted tenant-level uplift setting (GET/PUT /api/v1/tenant-settings). |
The classification thresholds (persuadable / sure thing / lost cause / sleeping dog) fall back to a built-in default threshold set and can be supplied per call via the classifyConfig argument. The ranking weight Wu is not a tenant setting — it lives on the scoring config: set it via a Ranking Profile’s uplift weight key (range 0..1) or the Score node’s inline formula.upliftWeight (range 0..2). Both default to 0. Raise carefully.

References

  • Künzel, Sekhon, Bickel, Yu (2019). “Metalearners for estimating heterogeneous treatment effects using machine learning.” PNAS 116(10): 4156–4165. The canonical reference for both T-learner and X-learner. pnas.org/doi/10.1073/pnas.1804597116 · arxiv.org/abs/1706.03461
  • Wager & Athey (2018). “Estimation and Inference of Heterogeneous Treatment Effects using Random Forests.” Journal of the American Statistical Association 113(523): 1228–1242. arxiv.org/abs/1510.04342. Causal forests, the obvious next step after the metalearners.
  • Dudík, Langford, Li (2011). “Doubly Robust Policy Evaluation and Learning.” ICML. arxiv.org/abs/1103.4601. Off-policy evaluator for contextual bandits — pairs naturally with X-learner.
  • Radcliffe & Surry (2011). “Real-World Uplift Modelling with Significance-Based Uplift Trees.” Stochastic Solutions white paper. Practitioner-style intro; useful framing for marketing teams.

Code

  • Library: platform/src/lib/experimentation/uplift-cate.ts
  • HTTP route: platform/src/app/api/v1/algorithm-models/[id]/uplift/route.ts
  • Population-level uplift (pre-existing, two-proportion z-test): platform/src/lib/experimentation/uplift.ts