1. SLO Summary
| Service Area | SLI | SLO Target | Measurement Window |
|---|---|---|---|
| Availability | Successful HTTP responses / total | 99.9% | Rolling 30 days |
| Decision Latency P99 | 99th percentile response time | < 200 ms | Rolling 30 days |
| Decision Latency P95 | 95th percentile response time | < 100 ms | Rolling 30 days |
| API Error Rate | 5xx responses / total responses | < 0.1% | Rolling 30 days |
| Pipeline Success | Successful pipeline runs / total | > 99% | Rolling 7 days |
2. SLI Definitions
2.1 Availability
Definition: The proportion of valid HTTP requests that return a non-5xx response code, measured at the load balancer (ALB/Ingress). Includes: all requests to /api/v1/* endpoints and the Next.js frontend.
Excludes: Health check probes (/api/health, /api/ready), synthetic monitoring requests.
Formula: Availability = (valid requests returning non-5xx) / (total valid requests) × 100%
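Measured in Prometheus, the availability SLI might look like the following. This is a sketch: the metric name `http_requests_total` and the `code` label are assumptions about the exporter, and in practice a long-range ratio like this would be computed via recording rules rather than an ad hoc 30-day query.

```promql
# Availability over the trailing 30 days: non-5xx responses / all responses
(
  sum(rate(http_requests_total{code!~"5.."}[30d]))
  /
  sum(rate(http_requests_total[30d]))
) * 100
```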
2.2 Decision Latency
Definition: Server-side duration from request receipt to response write for decision endpoints (/api/v1/decisions, /api/v1/offers/score).
Measurement point: Application-level histogram recorded via OpenTelemetry, exported to Prometheus.
Formulas: P99 = 99th percentile of the decision-endpoint latency histogram over the window; P95 = 95th percentile of the same histogram.
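In PromQL these percentiles are computed with `histogram_quantile` over the histogram buckets. The metric name `http_server_duration_seconds_bucket` and the `route` label below are assumptions about the OpenTelemetry exporter configuration; substitute the actual names.

```promql
# P99 decision latency across the two decision endpoints
histogram_quantile(0.99, sum by (le) (
  rate(http_server_duration_seconds_bucket{route=~"/api/v1/(decisions|offers/score)"}[5m])))

# P95 decision latency, same buckets
histogram_quantile(0.95, sum by (le) (
  rate(http_server_duration_seconds_bucket{route=~"/api/v1/(decisions|offers/score)"}[5m])))
```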
2.3 API Error Rate
Definition: The proportion of all API responses that return a 5xx status code.
Formula: Error Rate = (5xx responses) / (total responses) × 100%
2.4 Pipeline Success Rate
Definition: The proportion of pipeline executions that complete without error, measured from the pipeline orchestrator.
Formula: Pipeline Success Rate = (successful pipeline runs) / (total pipeline runs) × 100%
3. Prometheus Alerting Rules
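A minimal sketch of multi-window burn-rate alerts for the availability SLO, in the style of the Google SRE Workbook. The metric name `http_requests_total`, the `code` label, and the severity label values are assumptions; the 14.4× and 6× factors are the conventional fast-burn and slow-burn thresholds for a 99.9% target (0.1% error budget).

```yaml
groups:
  - name: slo-availability
    rules:
      # Fast burn: paging alert when the budget is burning >14x the
      # sustainable rate (would exhaust a 30-day budget in ~2 days).
      - alert: AvailabilityFastBurn
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Availability error budget burning >14x baseline"
      # Slow burn: ticket-level alert for sustained elevated burn.
      - alert: AvailabilitySlowBurn
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[6h]))
            / sum(rate(http_requests_total[6h]))
          ) > (6 * 0.001)
        for: 30m
        labels:
          severity: ticket
        annotations:
          summary: "Availability error budget burning >6x baseline"
```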
4. Error Budget Policy
4.1 Budget Calculation
Each SLO has an associated error budget equal to 1 − SLO target over the measurement window.
| SLO | Target | Error Budget (30 days) |
|---|---|---|
| Availability | 99.9% | 43.2 minutes of downtime |
| API Error Rate | 0.1% | 1 in 1,000 requests may fail |
| Pipeline Success | 99% | 1 in 100 pipeline runs may fail |
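The budget arithmetic behind the table above is straightforward: the budget is (1 − target) multiplied by the window duration. A small sketch:

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed downtime for an availability-style SLO.

    budget = (1 - target) * window duration
    """
    return (1 - slo_target) * window_days * 24 * 60

# 99.9% over 30 days -> roughly 43.2 minutes, matching the table
print(error_budget_minutes(0.999, 30))
```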
4.2 Budget States
| Budget Remaining | State | Actions |
|---|---|---|
| > 50% | Green | Normal development velocity. Feature work proceeds normally. |
| 25% - 50% | Yellow | Increase monitoring. Defer risky deployments. |
| 5% - 25% | Orange | Freeze non-critical deployments. Prioritize reliability work. |
| < 5% | Red | Full deployment freeze. All engineering effort on reliability. |
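The state thresholds above can be encoded directly, for example in a deployment-gate check. This sketch assumes the boundary values (exactly 50%, 25%, 5%) fall into the more restrictive state, which the table leaves ambiguous:

```python
def budget_state(remaining_pct: float) -> str:
    """Map remaining error budget (as a percentage) to a policy state."""
    if remaining_pct > 50:
        return "Green"   # normal development velocity
    if remaining_pct >= 25:
        return "Yellow"  # defer risky deployments
    if remaining_pct >= 5:
        return "Orange"  # freeze non-critical deployments
    return "Red"         # full deployment freeze
```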
4.3 Budget Exhaustion Protocol
When the error budget for any SLO is fully exhausted within the measurement window:
- Immediate: Halt all non-emergency deployments to production.
- Within 1 hour: Conduct a rapid incident review to identify contributing factors.
- Within 24 hours: Publish a brief written analysis with remediation items.
- Ongoing: Reliability improvements take priority until budget recovers above 25%.
- Exception: Security patches and data-loss-prevention fixes are always permitted.
5. Escalation Procedures
5.1 Severity Levels
| Severity | Criteria | Response Time | Resolution Target |
|---|---|---|---|
| P0 | Platform unavailable or decision latency > 1s for all users | 5 minutes | 1 hour |
| P1 | SLO breach in progress, single component degraded | 15 minutes | 4 hours |
| P2 | Error budget consumption elevated but SLO not yet breached | 1 hour | 24 hours |
| P3 | Cosmetic or minor degradation, no SLO impact | Next business day | 1 week |
5.2 Escalation Chain
5.3 Communication
| Audience | Channel | Frequency During Incident |
|---|---|---|
| Engineering | #incidents Slack | Continuous updates |
| Stakeholders | Email / status page | Every 30 minutes for P0/P1 |
| Customers | Status page | Initial post + resolution |
6. Dashboards
Maintain the following Grafana dashboards to track SLO health:
| Dashboard | Contents |
|---|---|
| SLO Overview | All SLIs with current values, budget remaining, trend lines |
| Decision Performance | Latency histograms, throughput, error breakdown by endpoint |
| Pipeline Health | Success/failure rates, duration distributions, queue depth |
| Error Budget Burn | Burn rate charts per SLO with projected exhaustion dates |
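The projected exhaustion date on the burn dashboard follows from a constant-rate extrapolation: time to exhaustion = remaining budget / burn rate. A sketch, where burn rate is expressed as a multiple of the sustainable rate (1.0 means the budget lasts exactly the full window):

```python
def days_to_exhaustion(remaining_fraction: float,
                       burn_rate: float,
                       window_days: int = 30) -> float:
    """Days until the error budget runs out at the current burn rate."""
    if burn_rate <= 0:
        return float("inf")  # budget is not being consumed
    return remaining_fraction * window_days / burn_rate

# Full budget burning at 2x the sustainable rate -> gone in 15 days
print(days_to_exhaustion(1.0, 2.0))
```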
7. Review Cadence
- Weekly: Review SLO dashboards in team standup. Note any yellow/orange states.
- Monthly: Publish SLO report to stakeholders. Adjust targets if consistently over- or under-performing.
- Quarterly: Evaluate whether SLO targets remain appropriate for current business needs. Propose revisions through the architecture review process.