Flagship Case Study

SRE Automation Pipeline

Google-style site reliability engineering applied to a 4-node homelab — because even personal infrastructure deserves production-grade reliability.

Problem

Manual monitoring doesn't scale. Even on a 4-node homelab, I was spending time SSH-ing into machines to check if services were healthy. When something broke, I'd discover it hours later. No structured way to track reliability, no history of incidents, no way to prove operational maturity to a hiring manager.

Real constraint: This is a homelab, not a cloud environment. No Datadog, no PagerDuty, no SaaS monitoring. Everything had to be built from scratch, self-hosted, and cost $0.

Architecture

Snapshots → SLI Sources → SLI Compute → Error Budget
                                                 ↓
                               Burn Rate → Alerts → Incidents
                                                 ↓
                  Timeline → Evidence → Postmortem → Dashboard

Each box is a Python module. The pipeline runs on every SLO evaluation tick and feeds into the safety gatekeeper to block risky automation when reliability is degraded.
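The gatekeeper check can be sketched roughly as follows. This is a minimal illustration, not the project's actual API: `SloStatus`, `allow_action`, and the 14x / 10% cutoffs are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class SloStatus:
    burn_rate_1h: float      # observed burn rate over the last hour
    budget_remaining: float  # fraction of error budget left (0.0-1.0)

def allow_action(status: SloStatus, risky: bool) -> bool:
    """Block risky automation while reliability is degraded."""
    if not risky:
        return True
    # Deny risky actions during a fast burn or when the budget is nearly spent.
    return status.burn_rate_1h < 14 and status.budget_remaining > 0.1
```

Routine actions always pass; risky ones are denied the moment the fast-burn condition trips, which is what kept automation from piling onto an active incident.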

Implementation

SLO Engine
  • sli_sources.py — Load raw signals from snapshots
  • sli_compute.py — Compute SLIs per rolling window
  • budget.py — Track error budgets (allowed/consumed)
  • burn_rate.py — Multi-window burn rate alerting
  • slo_eval.py — Unified evaluation orchestrator
  • slo_render.py — Markdown + text report rendering
  • slo_publish.py — Dashboard + Telegram publishing

Incident Commander
  • incident_state.py — Lifecycle (create/update/close)
  • incident_manager.py — CLI + auto-trigger engine
  • incident_render.py — Postmortem generation
  • Timeline with timestamped events
  • Evidence pack linking
  • Cooldown to prevent alert storms
  • Auto-close after configurable TTL
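
The cooldown and auto-close behaviors above can be sketched as follows. Function names and durations are illustrative assumptions, not the project's actual values.

```python
COOLDOWN_S = 10 * 60        # suppress duplicate incident creation for 10 minutes
AUTO_CLOSE_TTL_S = 60 * 60  # auto-close after 1 hour of sustained health

def should_open(last_opened: float, now: float) -> bool:
    """Only open a new incident once the cooldown has elapsed."""
    return now - last_opened >= COOLDOWN_S

def should_auto_close(healthy_since: float, now: float) -> bool:
    """Close an incident once the service has been healthy for the TTL."""
    return now - healthy_since >= AUTO_CLOSE_TTL_S
```

The cooldown is what prevents alert storms: a flapping service produces one incident with many timeline events rather than many incidents.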

Results

  • 6 SLOs tracked across 5 rolling windows
  • 38 tests passing
  • 13m average incident time-to-resolution (TTR)
  • SLO targets: Gateway 99.9%, Ollama 99.5%, Dashboard 99%, Node connectivity 99.5%
  • Burn rate thresholds: fast (14x/1h), medium (6x/6h), slow (2x/24h)
  • Incidents auto-created within 2 minutes of SLO degradation
  • Postmortems auto-generated with root cause, timeline, lessons, action items

Lessons Learned

  • SLOs without error budgets are just dashboards. The budget math is what makes SLOs actionable — it tells you when to stop shipping and start fixing.
  • Burn rate > threshold alerting. A single threshold fires too late or too often. Multi-window burn rates catch both fast spikes and slow degradation.
  • Postmortems are documentation you'll actually read. They're not just for compliance — they're the best way to learn from incidents and build organizational memory.
  • Safety gates save you from yourself. The gatekeeper blocking risky actions during high burn prevented me from making incidents worse.

Next Steps

  • Add latency SLIs (not just availability)
  • Integrate with Telegram for alert delivery
  • Build a weekly SLO review report for trend analysis

SRE dashboard showing SLO status bars, burn-rate gauges, and incident count for the 4-node homelab cluster.

Redacted sample (structure accurate; sensitive values removed).

What to look at: SLO budget bars per service, burn-rate gauges across alert windows, and active incident count with average TTR.

Evidence: SLO Configuration (sanitized)

# slo_targets.yaml (excerpt)
gateway_availability:
  target: 0.999
  windows: [1h, 6h, 24h, 7d, 30d]
  burn_thresholds:
    fast:   { rate: 14, window: 1h }
    medium: { rate: 6,  window: 6h }
    slow:   { rate: 2,  window: 24h }

ollama_availability:
  target: 0.995
  windows: [1h, 6h, 24h, 7d, 30d]
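
Once parsed (e.g. with PyYAML's `safe_load`), entries like these can be sanity-checked before the evaluator uses them. A minimal validator sketch; the function name and error messages are hypothetical:

```python
def validate_slo(name: str, cfg: dict) -> list[str]:
    """Return a list of config problems for one SLO entry (empty = valid)."""
    errors = []
    target = cfg.get("target")
    # Targets are fractions like 0.999, never percentages or integers.
    if not (isinstance(target, float) and 0.0 < target < 1.0):
        errors.append(f"{name}: target must be a fraction in (0, 1)")
    if not cfg.get("windows"):
        errors.append(f"{name}: at least one rolling window is required")
    # A burn rate of 1x is normal consumption; thresholds must exceed it.
    for tier, t in (cfg.get("burn_thresholds") or {}).items():
        if t.get("rate", 0) <= 1:
            errors.append(f"{name}: {tier} burn rate must exceed 1x")
    return errors
```

Failing fast on a malformed target (e.g. `99.9` instead of `0.999`) is cheaper than debugging a burn-rate alert that can never fire.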

False positives: 0

Related artifacts: Proof Pack →