Skip to main content
Alert rules monitor platform metrics against thresholds across a rolling time window. When a rule triggers, the platform fans out a notification to every destination the rule references.
Automatic firing requires the cron to be wired. Alert rules are evaluated only when POST /api/cron/tick is invoked. During pilot / initial deployment the cron is not wired to AWS EventBridge by default, so rules are defined but dormant — they won’t fire on their own.To run an evaluation on demand (for development, manual triggering, or smoke-testing a newly created rule), hit /api/cron/tick yourself — see Triggering evaluation manually below. To enable automatic every-minute evaluation, follow EventBridge Setup when you’re ready.
Every rule ships through the same pipeline:
  1. A caller (EventBridge, a manual curl, or any other scheduler) hits /api/cron/tick with the shared CRON_TOKEN.
  2. The tick iterates tenants and calls evaluateAllAlertRules per tenant.
  3. The evaluator computes the observed metric value over windowMinutes, compares against threshold using operator, and — if triggered — fans out a notification to every destination listed in channels (as long as the rule is outside its cooldownMinutes).
  4. lastFiredAt is updated and an audit log entry is written.

Supported metrics

MetricUnitWindow semantics
acceptance_raterate (0–1)positives / impressions over the window
ctrrate (0–1)clicks (click + convert) / impressions
revenuecurrency unitssum of outcome.conversionValue
selection_frequencyrate (0–1)decisions with at least one selected offer / total decision traces
latency_p99milliseconds99th percentile of DecisionTrace.totalLatencyMs
degraded_scoring_raterate (0–1)traces with degradedScoring=true / total traces
Each observation is computed against two windows: the current window (ending now) and the baseline window (same width, immediately preceding). The evaluator uses the baseline to derive severity — a bigger breach yields a higher severity tier.

Operators

OperatorMeaning
gtobserved > threshold
ltobserved < threshold
gteobserved ≥ threshold
lteobserved ≤ threshold
eqobserved = threshold

Severity

When a rule fires, the evaluator derives severity from the ratio |observed − threshold| / |threshold| (falling back to baseline when threshold = 0):
  • ≥ 1.0critical
  • ≥ 0.5warning
  • else → info
Severity is passed to the notification payload so adapters render the right visual treatment (e.g., themeColor in Teams, severity emoji in Slack, colored band in ops email).

Cooldown

Every rule has a cooldownMinutes knob. After a rule fires, subsequent evaluations that would otherwise trigger are recorded with status = "cooldown" and no notification is dispatched until lastFiredAt + cooldownMinutes has passed. This prevents paging storms when a metric bounces across the threshold.

Configure a rule

Open Settings → Alert Rules and click New Rule. Fields:
  • Name — free text; included in notification titles.
  • Metric — one of the supported metrics above.
  • Operator / Threshold — comparison to evaluate.
  • Window (minutes) — observation window.
  • Cooldown (minutes) — minimum gap between consecutive fires.
  • Destinations — multi-select of Notification Destinations; every selected destination receives the alert on fire.
  • Enabled — toggle to pause evaluation without deleting the rule.
Rules must reference at least one destination. If you delete a destination, rules pointing at it will log delivery_failed on their next fire.
Creating a rule in the UI does not cause it to start firing on its own. Until /api/cron/tick is invoked (manually or via EventBridge), the rule sits idle. See Triggering evaluation manually and EventBridge Setup.

Example rule payloads

A rule that pages when p99 decision latency crosses 500ms over a 10-minute window:
{
  "name": "p99 latency spike",
  "metric": "latency_p99",
  "operator": "gt",
  "threshold": 500,
  "windowMinutes": 10,
  "cooldownMinutes": 30,
  "channels": ["11111111-2222-3333-4444-555555555555"],
  "enabled": true
}
A rule that alerts when acceptance rate drops below 5% over 5 minutes:
{
  "name": "Acceptance rate drop",
  "metric": "acceptance_rate",
  "operator": "lt",
  "threshold": 0.05,
  "windowMinutes": 5,
  "cooldownMinutes": 60,
  "channels": [
    "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
    { "type": "webhook", "target": "https://legacy.example.com/hook" }
  ]
}
Each string in channels is a NotificationProvider UUID. The legacy {type, target} shape remains supported for backward compatibility; prefer provider IDs for new rules.

Rule lifecycle

  • status = "ok" — last evaluation did not trigger.
  • status = "fired" — last evaluation triggered and at least one destination accepted the dispatch.
  • status = "cooldown" — last evaluation triggered but the rule is still within cooldown.
  • status = "delivery_failed" — last evaluation triggered but every destination returned a failure.
  • status = "unsupported_metric" — metric name is not recognized (the rule never fires until fixed).
A rule that is never evaluated stays at whatever status value it last had (or the default). Dormant rules don’t transition states on their own — only a tick evaluation can move them.

Triggering evaluation manually

When the cron is not wired to EventBridge (e.g., during pilot, local development, or to smoke-test a newly created rule), you can invoke the evaluator directly:
curl -X POST https://your-deployment/api/cron/tick \
  -H "x-cron-token: $CRON_TOKEN"
This runs a single evaluation pass across every enabled rule for every tenant. Each invocation is independent — rules still respect cooldownMinutes, so two calls back-to-back will not double-fire. Response:
{
  "ok": true,
  "tenantsProcessed": 3,
  "rulesEvaluated": 12,
  "rulesFired": 2,
  "errors": [],
  "durationMs": 187,
  "timestamp": "2026-04-17T12:34:56.789Z"
}

Wire automatic evaluation

When you’re ready for rules to evaluate on a cadence without manual invocation, follow EventBridge Setup. That page is marked optional on purpose — automatic firing is a pilot graduation step, not a prerequisite.