This document defines the Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budget policies for the KaireonAI Next-Best-Action platform.

1. SLO Summary

| Service Area | SLI | SLO Target | Measurement Window |
|---|---|---|---|
| Availability | Successful HTTP responses / total | 99.9% | Rolling 30 days |
| Decision Latency P99 | 99th percentile response time | < 200 ms | Rolling 30 days |
| Decision Latency P95 | 95th percentile response time | < 100 ms | Rolling 30 days |
| API Error Rate | 5xx responses / total responses | < 0.1% | Rolling 30 days |
| Pipeline Success | Successful pipeline runs / total | > 99% | Rolling 7 days |

2. SLI Definitions

2.1 Availability

Definition: The proportion of valid HTTP requests that return a non-5xx response code, measured at the load balancer (ALB/Ingress).

Includes: all requests to /api/v1/* endpoints and the Next.js frontend.

Excludes: health check probes (/api/health, /api/ready) and synthetic monitoring requests.

Formula:

```
availability = 1 - (count of 5xx responses / count of total responses)
```
Error budget: At 99.9% over 30 days, the platform tolerates 43.2 minutes of total downtime, or equivalent partial degradation, per month.
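The downtime arithmetic generalizes to any target and window. A minimal sketch (the helper below is illustrative, not part of the platform):

```python
# Hypothetical helper: error budget expressed as minutes of full
# downtime for a given SLO target over a measurement window.

def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of total downtime the budget permits per window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes

print(allowed_downtime_minutes(0.999))    # 43.2 minutes over 30 days
print(allowed_downtime_minutes(0.99, 7))  # 100.8 minutes over 7 days
```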

2.2 Decision Latency

Definition: Server-side duration from request receipt to response write for decision endpoints (/api/v1/decisions, /api/v1/offers/score).

Measurement point: Application-level histogram recorded via OpenTelemetry, exported to Prometheus.

Formulas (aggregated across instances by bucket boundary, matching the alerting rules in section 3):

```promql
p99_latency = histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{route=~"/api/v1/decisions.*"}[5m])) by (le))
p95_latency = histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{route=~"/api/v1/decisions.*"}[5m])) by (le))
```
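For intuition: histogram_quantile estimates a percentile by locating the cumulative ("le") bucket containing the target rank and interpolating linearly within it. A simplified Python sketch (not the Prometheus implementation; the bucket data is invented):

```python
# Simplified model of quantile estimation from cumulative histogram
# buckets, as Prometheus-style histograms record them.

def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """buckets: sorted (upper_bound_seconds, cumulative_count) pairs."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + fraction * (bound - prev_bound)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Invented cumulative counts for le = 0.05, 0.1, 0.2, 0.4 seconds:
buckets = [(0.05, 700), (0.1, 950), (0.2, 990), (0.4, 1000)]
print(estimate_quantile(0.95, buckets))  # 0.1  (950 of 1000 requests <= 100 ms)
print(estimate_quantile(0.99, buckets))  # 0.2
```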

2.3 API Error Rate

Definition: The proportion of all API responses that return a 5xx status code. Formula (aggregated across instances, matching the alerting rules in section 3):

```promql
error_rate = sum(rate(http_responses_total{status=~"5.."}[5m])) / sum(rate(http_responses_total[5m]))
```

2.4 Pipeline Success Rate

Definition: The proportion of pipeline executions that complete without error, measured from the pipeline orchestrator. Formula:

```promql
pipeline_success = count(pipeline_runs{status="completed"}) / count(pipeline_runs{status=~"completed|failed"})
```
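A minimal sketch of this SLI computed over a window of run records (the record format is invented for illustration; still-running jobs are excluded, as in the formula above):

```python
# Hypothetical run statuses for a 7-day window.
runs = ["completed"] * 693 + ["failed"] * 5 + ["running"] * 2

# Only finished runs count toward the SLI denominator.
finished = [r for r in runs if r in ("completed", "failed")]
success_rate = finished.count("completed") / len(finished)

print(f"{success_rate:.4f}")   # 0.9928
print(success_rate >= 0.99)    # True -> within the 7-day SLO
```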

3. Prometheus Alerting Rules

```yaml
groups:
  - name: kaireon-slo-alerts
    rules:
      # --- Availability ---
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_responses_total{status=~"5.."}[5m]))
            /
            sum(rate(http_responses_total[5m]))
          ) > 0.001
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "API error rate exceeds 0.1% SLO"
          description: "Current error rate is {{ $value | humanizePercentage }}. SLO target is < 0.1%."
          runbook: "docs/ops/runbooks/high-error-rate.md"

      - alert: AvailabilityBurnRateFast
        expr: |
          (
            1 - (sum(rate(http_responses_total{status!~"5.."}[5m])) / sum(rate(http_responses_total[5m])))
          ) > 0.01
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Fast burn: error budget consumption rate is critical"
          description: "Error rate above 1% burns the 30-day budget at more than 10x the sustainable rate; sustained, the monthly budget is exhausted in under 3 days."

      - alert: AvailabilityBurnRateSlow
        expr: |
          (
            1 - (sum(rate(http_responses_total{status!~"5.."}[1h])) / sum(rate(http_responses_total[1h])))
          ) > 0.002
        for: 30m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Slow burn: error budget is being consumed above normal rate"
          description: "Error rate above 0.2% burns the 30-day budget at more than 2x the sustainable rate; sustained, the monthly budget is exhausted in about 15 days."

      # --- Decision Latency ---
      - alert: DecisionLatencyP99High
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{route=~"/api/v1/decisions.*"}[5m])) by (le)
          ) > 0.200
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Decision P99 latency exceeds 200ms SLO"
          description: "Current P99 latency is {{ $value | humanizeDuration }}."
          runbook: "docs/ops/runbooks/high-latency.md"

      - alert: DecisionLatencyP95High
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{route=~"/api/v1/decisions.*"}[5m])) by (le)
          ) > 0.100
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Decision P95 latency exceeds 100ms SLO"
          description: "Current P95 latency is {{ $value | humanizeDuration }}."

      # --- Pipeline Success ---
      - alert: PipelineFailureRateHigh
        expr: |
          (
            sum(rate(pipeline_runs_total{status="failed"}[1h]))
            /
            sum(rate(pipeline_runs_total[1h]))
          ) > 0.01
        for: 15m
        labels:
          severity: warning
          team: data-engineering
        annotations:
          summary: "Pipeline failure rate exceeds 1% SLO"
          description: "Current failure rate is {{ $value | humanizePercentage }}."
          runbook: "docs/ops/runbooks/pipeline-failure.md"
```
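The thresholds in the two availability burn-rate alerts translate to budget-exhaustion times as follows. A back-of-envelope sketch in plain Python (illustrative, not PromQL):

```python
# Burn rate = observed error rate / budget rate (1 - SLO target).
# A burn rate of 1.0 spends the budget exactly over the full window.

SLO_BUDGET = 0.001        # 1 - 0.999: fraction of requests that may fail
WINDOW_HOURS = 30 * 24    # 30-day rolling window

def burn_rate(error_rate: float) -> float:
    return error_rate / SLO_BUDGET

def hours_to_exhaustion(error_rate: float) -> float:
    """Hours until the full 30-day budget is consumed at this rate."""
    return WINDOW_HOURS / burn_rate(error_rate)

print(burn_rate(0.01))             # ~10x  (fast-burn threshold)
print(hours_to_exhaustion(0.01))   # ~72 hours, i.e. about 3 days
print(burn_rate(0.002))            # ~2x   (slow-burn threshold)
print(hours_to_exhaustion(0.002))  # ~360 hours, i.e. about 15 days
```

This is the standard multi-window burn-rate pattern: the fast window pages on rapid consumption, while the slow window catches a sustained low-grade leak.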

4. Error Budget Policy

4.1 Budget Calculation

Each SLO has an associated error budget equal to 1 - (SLO target) over the measurement window.

| SLO | Target | Error Budget |
|---|---|---|
| Availability | 99.9% | 43.2 minutes of downtime per 30 days |
| API Error Rate | < 0.1% | 1 in 1,000 requests may fail (30-day window) |
| Pipeline Success | > 99% | 1 in 100 pipeline runs may fail (7-day window) |
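Budget remaining can be computed directly from the window's traffic. A sketch (the function name and the traffic figures are hypothetical):

```python
# Remaining budget = unspent fraction of the allowed bad events.
# Goes negative once the budget is exhausted.

def budget_remaining(total: int, bad: int, slo: float) -> float:
    """Fraction of the error budget left over the measurement window."""
    allowed_bad = (1 - slo) * total
    return 1 - bad / allowed_bad

# Example: 2M requests in the window, 800 returned 5xx, 99.9% SLO.
remaining = budget_remaining(2_000_000, 800, 0.999)
print(f"{remaining:.0%}")  # 60% of the budget left -> Green state
```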

4.2 Budget States

| Budget Remaining | State | Actions |
|---|---|---|
| > 50% | Green | Normal development velocity. Feature work proceeds normally. |
| 25% - 50% | Yellow | Increase monitoring. Defer risky deployments. |
| 5% - 25% | Orange | Freeze non-critical deployments. Prioritize reliability work. |
| < 5% | Red | Full deployment freeze. All engineering effort on reliability. |
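The state thresholds above reduce to a simple lookup; a sketch (function name is illustrative):

```python
# Maps remaining budget fraction (0.0-1.0) to the states in section 4.2.

def budget_state(remaining: float) -> str:
    if remaining > 0.50:
        return "Green"
    if remaining > 0.25:
        return "Yellow"
    if remaining > 0.05:
        return "Orange"
    return "Red"

print(budget_state(0.60))  # Green
print(budget_state(0.30))  # Yellow
print(budget_state(0.10))  # Orange
print(budget_state(0.02))  # Red
```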

4.3 Budget Exhaustion Protocol

When the error budget for any SLO is fully exhausted within the measurement window:
  1. Immediate: Halt all non-emergency deployments to production.
  2. Within 1 hour: Conduct a rapid incident review to identify contributing factors.
  3. Within 24 hours: Publish a brief written analysis with remediation items.
  4. Ongoing: Reliability improvements take priority until budget recovers above 25%.
  5. Exception: Security patches and data-loss-prevention fixes are always permitted.

5. Escalation Procedures

5.1 Severity Levels

| Severity | Criteria | Response Time | Resolution Target |
|---|---|---|---|
| P0 | Platform unavailable or decision latency > 1 s for all users | 5 minutes | 1 hour |
| P1 | SLO breach in progress; single component degraded | 15 minutes | 4 hours |
| P2 | Error budget consumption elevated but SLO not yet breached | 1 hour | 24 hours |
| P3 | Cosmetic or minor degradation; no SLO impact | Next business day | 1 week |

5.2 Escalation Chain

```text
L1: On-call engineer (PagerDuty primary)
    |-- 15 min no ack --> L2: On-call engineer (PagerDuty secondary)
    |-- 30 min no ack --> L3: Engineering manager
    |-- 1 hour P0 unresolved --> L4: VP Engineering + stakeholder notification
```

5.3 Communication

| Audience | Channel | Frequency During Incident |
|---|---|---|
| Engineering | #incidents Slack | Continuous updates |
| Stakeholders | Email / status page | Every 30 minutes for P0/P1 |
| Customers | Status page | Initial post + resolution |

6. Dashboards

Maintain the following Grafana dashboards to track SLO health:
| Dashboard | Contents |
|---|---|
| SLO Overview | All SLIs with current values, budget remaining, trend lines |
| Decision Performance | Latency histograms, throughput, error breakdown by endpoint |
| Pipeline Health | Success/failure rates, duration distributions, queue depth |
| Error Budget Burn | Burn-rate charts per SLO with projected exhaustion dates |

7. Review Cadence

  • Weekly: Review SLO dashboards in team standup. Note any yellow/orange states.
  • Monthly: Publish SLO report to stakeholders. Adjust targets if consistently over- or under-performing.
  • Quarterly: Evaluate whether SLO targets remain appropriate for current business needs. Propose revisions through the architecture review process.