This document defines the Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budget policies for the KaireonAI Next-Best-Action platform.
1. SLO Summary
| Service Area | SLI | SLO Target | Measurement Window |
|---|---|---|---|
| Availability | Successful HTTP responses / total | 99.9% | Rolling 30 days |
| Decision Latency P99 | 99th percentile response time | < 200 ms | Rolling 30 days |
| Decision Latency P95 | 95th percentile response time | < 100 ms | Rolling 30 days |
| API Error Rate | 5xx responses / total responses | < 0.1% | Rolling 30 days |
| Pipeline Success | Successful pipeline runs / total | > 99% | Rolling 7 days |
2. SLI Definitions
2.1 Availability
Definition: The proportion of valid HTTP requests that return a non-5xx response code, measured at the load balancer (ALB/Ingress).
Includes: All requests to /api/v1/* endpoints and the Next.js frontend.
Excludes: Health check probes (/api/health, /api/ready), synthetic monitoring requests.
Formula:
availability = 1 - (count of 5xx responses / count of total responses)
Error budget: At 99.9% over 30 days, the platform tolerates approximately 43.2 minutes of total downtime, or equivalent partial degradation, per month.
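The downtime figure follows from simple arithmetic: the budget fraction (1 − SLO target) times the window length. A minimal sketch (function name is illustrative):

```python
def downtime_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime for an availability SLO over a rolling window."""
    # Error budget fraction = 1 - SLO target; scale by window length in minutes.
    return (1 - slo) * window_days * 24 * 60

print(round(downtime_budget_minutes(0.999), 1))  # -> 43.2 minutes per 30 days
```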
2.2 Decision Latency
Definition: Server-side duration from request receipt to response write for decision endpoints (/api/v1/decisions, /api/v1/offers/score).
Measurement point: Application-level histogram recorded via OpenTelemetry, exported to Prometheus.
Formulas:
p99_latency = histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{route=~"/api/v1/decisions.*"}[5m])) by (le))
p95_latency = histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{route=~"/api/v1/decisions.*"}[5m])) by (le))
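histogram_quantile() estimates the quantile by locating the bucket that contains the target rank and interpolating linearly inside it. A minimal Python sketch of that estimation, with made-up bucket counts for illustration:

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative Prometheus-style buckets.

    buckets: sorted list of (upper_bound, cumulative_count); the last
    upper_bound may be float("inf"). Mirrors PromQL's histogram_quantile:
    find the bucket containing rank q * total, then interpolate linearly.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # rank falls in the +Inf bucket
            in_bucket = count - prev_count
            frac = (rank - prev_count) / in_bucket if in_bucket else 0.0
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical 5-minute window: 1,000 decision requests across latency buckets (seconds).
buckets = [(0.05, 600), (0.1, 900), (0.2, 990), (0.5, 1000), (float("inf"), 1000)]
print(round(histogram_quantile(0.95, buckets), 3))  # p95 ≈ 0.156 s
print(round(histogram_quantile(0.99, buckets), 3))  # p99 = 0.2 s, right at the SLO
```

Because the estimate interpolates within bucket boundaries, its accuracy depends on bucket resolution near the SLO thresholds (here, boundaries at or near 100 ms and 200 ms).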
2.3 API Error Rate
Definition: The proportion of all API responses that return a 5xx status code.
Formula:
error_rate = sum(rate(http_responses_total{status=~"5.."}[5m])) / sum(rate(http_responses_total[5m]))
2.4 Pipeline Success Rate
Definition: The proportion of pipeline executions that complete without error, measured from the pipeline orchestrator.
Formula:
pipeline_success = sum(increase(pipeline_runs_total{status="completed"}[7d])) / sum(increase(pipeline_runs_total{status=~"completed|failed"}[7d]))
3. Prometheus Alerting Rules
groups:
  - name: kaireon-slo-alerts
    rules:
      # --- Availability ---
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_responses_total{status=~"5.."}[5m]))
            /
            sum(rate(http_responses_total[5m]))
          ) > 0.001
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "API error rate exceeds 0.1% SLO"
          description: "Current error rate is {{ $value | humanizePercentage }}. SLO target is < 0.1%."
          runbook: "docs/ops/runbooks/high-error-rate.md"

      - alert: AvailabilityBurnRateFast
        expr: |
          (
            1 - (sum(rate(http_responses_total{status!~"5.."}[5m])) / sum(rate(http_responses_total[5m])))
          ) > 0.01
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Fast burn: error budget consumption rate is critical"
          description: "A sustained 1% error rate is a 10x burn rate; at this pace the 30-day error budget is exhausted in about 3 days."

      - alert: AvailabilityBurnRateSlow
        expr: |
          (
            1 - (sum(rate(http_responses_total{status!~"5.."}[1h])) / sum(rate(http_responses_total[1h])))
          ) > 0.002
        for: 30m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Slow burn: error budget is being consumed above normal rate"

      # --- Decision Latency ---
      - alert: DecisionLatencyP99High
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{route=~"/api/v1/decisions.*"}[5m])) by (le)
          ) > 0.200
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Decision P99 latency exceeds 200ms SLO"
          description: "Current P99 latency is {{ $value | humanizeDuration }}."
          runbook: "docs/ops/runbooks/high-latency.md"

      - alert: DecisionLatencyP95High
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{route=~"/api/v1/decisions.*"}[5m])) by (le)
          ) > 0.100
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Decision P95 latency exceeds 100ms SLO"
          description: "Current P95 latency is {{ $value | humanizeDuration }}."

      # --- Pipeline Success ---
      - alert: PipelineFailureRateHigh
        expr: |
          (
            sum(rate(pipeline_runs_total{status="failed"}[1h]))
            /
            sum(rate(pipeline_runs_total[1h]))
          ) > 0.01
        for: 15m
        labels:
          severity: warning
          team: data-engineering
        annotations:
          summary: "Pipeline failure rate exceeds 1% SLO"
          description: "Current failure rate is {{ $value | humanizePercentage }}."
          runbook: "docs/ops/runbooks/pipeline-failure.md"
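The burn-rate thresholds can be sanity-checked with simple arithmetic: burn rate is the observed error rate divided by the budget fraction (1 − SLO), and time-to-exhaustion is the window divided by the burn rate. A quick sketch:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    return error_rate / (1 - slo)

def hours_to_exhaustion(error_rate: float, slo: float, window_days: int = 30) -> float:
    """Hours until the window's budget is gone at a constant error rate."""
    return window_days * 24 / burn_rate(error_rate, slo)

print(round(burn_rate(0.01, 0.999)))            # fast-burn threshold: 10x burn
print(round(hours_to_exhaustion(0.01, 0.999)))  # ~72 hours at a sustained 1% error rate
print(round(burn_rate(0.002, 0.999)))           # slow-burn threshold: 2x burn
```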
4. Error Budget Policy
4.1 Budget Calculation
Each SLO has an associated error budget equal to 1 - SLO target over the measurement window.
| SLO | Target | Error Budget (30 days) |
|---|---|---|
| Availability | 99.9% | 43.2 minutes of downtime |
| API Error Rate | 0.1% | 1 in 1,000 requests may fail |
| Pipeline Success | 99% | 1 in 100 pipeline runs may fail |
4.2 Budget States
| Budget Remaining | State | Actions |
|---|---|---|
| > 50% | Green | Normal development velocity. Feature work proceeds normally. |
| 25% - 50% | Yellow | Increase monitoring. Defer risky deployments. |
| 5% - 25% | Orange | Freeze non-critical deployments. Prioritize reliability work. |
| < 5% | Red | Full deployment freeze. All engineering effort on reliability. |
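The state thresholds translate directly into a lookup; a minimal sketch (handling of the exact 25% and 50% boundaries is an assumption, since the table leaves them ambiguous):

```python
def budget_state(remaining: float) -> str:
    """Map the fraction of error budget remaining (0.0-1.0) to a policy state."""
    if remaining > 0.50:
        return "green"    # normal development velocity
    if remaining >= 0.25:
        return "yellow"   # increase monitoring, defer risky deployments
    if remaining >= 0.05:
        return "orange"   # freeze non-critical deployments
    return "red"          # full deployment freeze

print(budget_state(0.60))  # green
print(budget_state(0.10))  # orange
```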
4.3 Budget Exhaustion Protocol
When the error budget for any SLO is fully exhausted within the measurement window:
- Immediate: Halt all non-emergency deployments to production.
- Within 1 hour: Conduct a rapid incident review to identify contributing factors.
- Within 24 hours: Publish a brief written analysis with remediation items.
- Ongoing: Reliability improvements take priority until budget recovers above 25%.
- Exception: Security patches and data-loss-prevention fixes are always permitted.
5. Escalation Procedures
5.1 Severity Levels
| Severity | Criteria | Response Time | Resolution Target |
|---|---|---|---|
| P0 | Platform unavailable or decision latency > 1s for all users | 5 minutes | 1 hour |
| P1 | SLO breach in progress, single component degraded | 15 minutes | 4 hours |
| P2 | Error budget consumption elevated but SLO not yet breached | 1 hour | 24 hours |
| P3 | Cosmetic or minor degradation, no SLO impact | Next business day | 1 week |
5.2 Escalation Chain
L1: On-call engineer (PagerDuty primary)
|-- 15 min no ack --> L2: On-call engineer (PagerDuty secondary)
|-- 30 min no ack --> L3: Engineering manager
|-- 1 hour P0 unresolved --> L4: VP Engineering + stakeholder notification
5.3 Communication
| Audience | Channel | Frequency During Incident |
|---|---|---|
| Engineering | #incidents Slack | Continuous updates |
| Stakeholders | Email / status page | Every 30 minutes for P0/P1 |
| Customers | Status page | Initial post + resolution |
6. Dashboards
Maintain the following Grafana dashboards to track SLO health:
| Dashboard | Contents |
|---|---|
| SLO Overview | All SLIs with current values, budget remaining, trend lines |
| Decision Performance | Latency histograms, throughput, error breakdown by endpoint |
| Pipeline Health | Success/failure rates, duration distributions, queue depth |
| Error Budget Burn | Burn rate charts per SLO with projected exhaustion dates |
7. Review Cadence
- Weekly: Review SLO dashboards in team standup. Note any yellow/orange states.
- Monthly: Publish SLO report to stakeholders. Adjust targets if consistently over- or under-performing.
- Quarterly: Evaluate whether SLO targets remain appropriate for current business needs. Propose revisions through the architecture review process.