

What ships

  • platform/perf/recommend-100rps.ts: TS-first 100 RPS / 5 min sustained load test against POST /api/v1/recommend. No new dep — uses node:http.
  • platform/perf/recommend-5krps.ts: Wrapper that re-invokes the 100rps script with --rps 5000 --duration-sec 300. Operator-pending — running 5K RPS reliably needs a multi-node k6 cloud cluster or a self-hosted load fleet. The TS script is the auditable spec; the operator publishes the baseline.
  • platform/perf/recommend.js: Original k6 scenario for 1K / 5K / 10K RPS. Recommended way to actually drive load at those scales — the TS scripts are the source-of-truth scenario definition.
  • platform/perf/baselines/<date>-recommend-<scenario>.json: One file per published baseline. Lexical sort is chronological because of the date prefix.
  • platform/perf/compare-baselines.mjs: CI regression gate. Reads the latest two baselines and fails on p95 > +10 % or error_rate > +0.5 pp.
  • tools/qa/decisioning-bench/harness/run.mjs: Decisioning-quality bench harness — now supports --concurrency N for parallel customer scans.

Running the 100 RPS scenario

# Start the dev server in another terminal
cd platform && npm run dev

# In a fresh terminal, point the script at it
TENANT_ID=<your-tenant-uuid> API_KEY=<your-api-key> \
    npx tsx platform/perf/recommend-100rps.ts \
        --target http://localhost:3000

# Default flags: --rps 100 --duration-sec 300. Output JSON path:
#   platform/perf/baselines/<UTC-date>-recommend-100rps.json
The output is also printed to stdout so the operator can eyeball the percentiles before committing.
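
For orientation, the core of a fixed-rate node:http driver looks roughly like the sketch below. This is a minimal illustration, not the real recommend-100rps.ts: flag parsing, warm-up, full percentile math, and the baseline-file write are omitted, and the request payload is an assumption.

// Minimal sketch of a fixed-rate load driver over node:http.
// NOT the real recommend-100rps.ts; the payload shape is an assumption.
import http from "node:http";

const rps = 100;
const durationSec = 300;
const latencies: number[] = [];
let attempted = 0;
let errors = 0;

function fireOne(): void {
  attempted++;
  const started = process.hrtime.bigint();
  const req = http.request(
    "http://localhost:3000/api/v1/recommend",
    { method: "POST", headers: { "content-type": "application/json" } },
    (res) => {
      res.resume(); // drain the body so the socket is reusable
      res.on("end", () => {
        latencies.push(Number(process.hrtime.bigint() - started) / 1e6);
        if ((res.statusCode ?? 500) >= 400) errors++;
      });
    },
  );
  req.on("error", () => errors++);
  req.end(JSON.stringify({ tenantId: process.env.TENANT_ID })); // assumed payload
}

let ticks = 0;
const timer = setInterval(() => {
  for (let i = 0; i < rps; i++) fireOne(); // one batch per second = fixed RPS
  if (++ticks >= durationSec) {
    clearInterval(timer);
    setTimeout(() => { // grace period for in-flight requests
      latencies.sort((a, b) => a - b);
      const p95 = latencies[Math.floor(latencies.length * 0.95)];
      console.log(JSON.stringify({
        latencyMs: { p95 },
        totals: { errorRate: errors / attempted },
      }));
    }, 2000);
  }
}, 1000);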

Publishing a baseline

  1. Run the scenario above against a stable environment (no concurrent dev work; warm caches). Numbers from a cold-start dev server are not representative of production.
  2. Inspect the JSON. Sanity-check latencyMs.p95 and totals.errorRate against your service’s SLOs (see the illustrative shape after this list).
  3. Commit the baseline file under platform/perf/baselines/.
  4. Open a PR. The CI step Perf-baseline regression gate (#24) will compare your new baseline against the previous one and fail the build if either threshold is breached.
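
For reference, the gate only depends on two fields. The sketch below shows the minimal shape in TypeScript for clarity; anything beyond latencyMs.p95 and totals.errorRate is an assumption, and all numbers are invented.

// Minimal baseline shape the gate depends on. Everything beyond
// latencyMs.p95 and totals.errorRate is an assumption; numbers are invented.
interface Baseline {
  latencyMs: { p95: number };    // milliseconds
  totals: { errorRate: number }; // fraction in [0, 1], not a percentage
}

const example: Baseline = {
  latencyMs: { p95: 182.4 },
  totals: { errorRate: 0.002 }, // 0.2 %
};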

CI gate semantics

platform/perf/compare-baselines.mjs reads the two most recent baseline files matching *-recommend-100rps.json (lexical sort → chronological because of the date prefix), parses them, and computes:
p95DeltaPct  = ((current.latencyMs.p95 - previous.latencyMs.p95) / previous.latencyMs.p95) * 100
errDeltaPp   = (current.totals.errorRate - previous.totals.errorRate) * 100
The build fails when either delta exceeds the configured threshold:
  • p95 latency growth: +10 % by default; override with --p95-pct-tolerance
  • error rate growth: +0.5 percentage points by default; override with --error-rate-pp-tolerance
Edge cases:
  • Zero baselines or directory missing. Pass with a [notice] log. First-time setup — the gate is a no-op until the operator publishes a baseline.
  • One baseline. Pass with a “no previous to compare” log.
  • Malformed JSON. Fail. A silent regression cannot slip through bad data.
  • Missing latencyMs.p95 or totals.errorRate field. Fail.
The gate is read-only — CI never generates a baseline (a CI runner is not load-representative; using it would yield noisy thresholds). Baselines are produced by operators against a representative environment.
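
Put together, the gate's decision logic is roughly the following sketch, assuming the baseline shape above. The real compare-baselines.mjs is the authority; flag handling and exact log wording here are illustrative.

// Illustrative condensation of the gate's decision logic; not the real
// compare-baselines.mjs. Flag parsing and exact log wording are assumed.
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

const DIR = "platform/perf/baselines";
const P95_PCT_TOLERANCE = 10; // --p95-pct-tolerance default
const ERR_PP_TOLERANCE = 0.5; // --error-rate-pp-tolerance default

let files: string[] = [];
try {
  files = readdirSync(DIR).filter((f) => f.endsWith("-recommend-100rps.json")).sort();
} catch {
  console.log("[notice] baselines directory missing; gate is a no-op");
  process.exit(0);
}
if (files.length === 0) { console.log("[notice] zero baselines; gate is a no-op"); process.exit(0); }
if (files.length === 1) { console.log("no previous to compare"); process.exit(0); }

// Lexical sort is chronological thanks to the date prefix, so the last two
// entries are previous and current. Malformed JSON throws here, which fails
// the build: the gate fails closed on bad data.
const [previous, current] = files.slice(-2).map((f) => JSON.parse(readFileSync(join(DIR, f), "utf8")));
for (const v of [previous?.latencyMs?.p95, current?.latencyMs?.p95,
                 previous?.totals?.errorRate, current?.totals?.errorRate]) {
  if (typeof v !== "number") { console.error("missing required field; failing"); process.exit(1); }
}

const p95DeltaPct = ((current.latencyMs.p95 - previous.latencyMs.p95) / previous.latencyMs.p95) * 100;
const errDeltaPp = (current.totals.errorRate - previous.totals.errorRate) * 100;
if (p95DeltaPct > P95_PCT_TOLERANCE || errDeltaPp > ERR_PP_TOLERANCE) {
  console.error(`regression: p95 ${p95DeltaPct.toFixed(1)} %, error rate ${errDeltaPp.toFixed(2)} pp`);
  process.exit(1);
}
console.log("within tolerance");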

Honest residual: 5K RPS is operator-pending

The user direction for #24 says explicitly: “5K RPS script ships but stays operator-pending (no AWS provisioning). If 5K RPS isn’t actually run, the baseline JSON file is operator-pending and the 3.1 grade reflects that honestly.” That is the case here. recommend-5krps.ts is the auditable load profile; running it reliably requires either:
  • A multi-node k6 cloud cluster (preferred — k6 already has the worker-pool primitives the TS script lacks for this scale), or
  • A self-hosted load fleet (one box per ~500 RPS budget).
When the 5K baseline is published, drop it under platform/perf/baselines/<date>-recommend-5krps.json and either run the gate manually with --scenario recommend-5krps or extend the CI step to invoke both scenarios.
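
For reference, one plausible shape for the wrapper, assuming it simply shells out to the 100rps driver with overridden flags; the real recommend-5krps.ts may be structured differently, and the TARGET env var below is invented for illustration.

// Plausible sketch of recommend-5krps.ts as a thin wrapper; the real script
// may differ. Re-invokes the 100rps driver with the 5K flags listed above.
import { spawnSync } from "node:child_process";

const result = spawnSync(
  "npx",
  [
    "tsx", "platform/perf/recommend-100rps.ts",
    "--rps", "5000", "--duration-sec", "300",
    "--target", process.env.TARGET ?? "http://localhost:3000", // TARGET is an assumed env var
  ],
  { stdio: "inherit" },
);
process.exit(result.status ?? 1);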

Decisioning-bench --concurrency

tools/qa/decisioning-bench/harness/run.mjs now accepts an optional --concurrency N flag (default 1 = sequential, matching prior behaviour):
node tools/qa/decisioning-bench/harness/run.mjs \
    --dataset banking-cards \
    --target http://localhost:3000 \
    --concurrency 8
Internally this runs N async loops over a shared queue of holdout customers. Aggregation order remains deterministic — predictions is sorted by score before AUC + fairness computation. The output JSON gains a top-level concurrency: <N> field so downstream comparisons don’t conflate sequential vs parallel runs.
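
The pattern is roughly the sketch below; runWithConcurrency and scanCustomer are illustrative names, not the harness's real API.

// Illustrative worker-pool pattern: N async loops pull from one shared queue.
// runWithConcurrency and scanCustomer are invented names, not run.mjs internals.
async function runWithConcurrency<T, R>(
  items: T[],
  concurrency: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const queue = [...items];
  const results: R[] = [];
  const loops = Array.from({ length: concurrency }, async () => {
    // Completion order is nondeterministic, which is why the harness sorts
    // predictions by score before the AUC + fairness computation.
    for (let item = queue.shift(); item !== undefined; item = queue.shift()) {
      results.push(await worker(item));
    }
  });
  await Promise.all(loops);
  return results;
}

// Hypothetical usage over the holdout set:
// const predictions = await runWithConcurrency(holdoutCustomers, 8, scanCustomer);
// predictions.sort((a, b) => b.score - a.score);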

Roadmap

  • Auto-generate a <date>-recommend-100rps.json in CI on push to main against a long-running staging environment, so the gate catches regressions on every PR. Today’s manual-publish flow is the honest first cut.
  • Extend the gate to compare medians + p99 + actual achieved RPS, not just p95 + error rate. Trade-off is signal-vs-noise: short-window 100 RPS samples can show large p99 swings on cold caches.