SRE Automation Pipeline
Google-style site reliability engineering applied to a 4-node homelab — because even personal infrastructure deserves production-grade reliability.
Problem
Manual monitoring doesn't scale. Even on a 4-node homelab, I was spending time SSH-ing into machines to check whether services were healthy, and when something broke I'd discover it hours later. There was no structured way to track reliability, no incident history, and no way to demonstrate operational maturity to a hiring manager.
Real constraint: This is a homelab, not a cloud environment. No Datadog, no PagerDuty, no SaaS monitoring. Everything had to be built from scratch, self-hosted, and cost $0.
Architecture
```
Burn Rate → Alerts → Incidents
     ↓
Timeline → Evidence → Postmortem → Dashboard
```
Each box is a Python module. The pipeline runs on every SLO evaluation tick and feeds into the safety gatekeeper to block risky automation when reliability is degraded.
Implementation
SLO Engine
- `sli_sources.py` — Load raw signals from snapshots
- `sli_compute.py` — Compute SLIs per rolling window
- `budget.py` — Track error budgets (allowed/consumed)
- `burn_rate.py` — Multi-window burn rate alerting
- `slo_eval.py` — Unified evaluation orchestrator
- `slo_render.py` — Markdown + text report rendering
- `slo_publish.py` — Dashboard + Telegram publishing
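The budget math at the heart of the engine is simple. This is a minimal sketch of what `budget.py` computes per rolling window; the function name and return shape are illustrative, not the project's actual API:

```python
def error_budget(target: float, total: int, failures: int) -> dict:
    """Error-budget accounting for one rolling window (illustrative sketch).

    target   -- SLO target, e.g. 0.999 for 99.9% availability
    total    -- number of checks observed in the window
    failures -- number of failed checks in the window
    """
    allowed = (1.0 - target) * total  # failures the SLO permits this window
    consumed = failures / allowed if allowed else float("inf")
    return {
        "allowed_failures": allowed,
        "observed_failures": failures,
        "budget_consumed": consumed,              # 1.0 = budget exhausted
        "budget_remaining": max(0.0, 1.0 - consumed),
    }

# A 99.9% target over 10,000 checks allows 10 failures;
# 4 observed failures consume 40% of the budget.
print(error_budget(0.999, 10_000, 4))
```

This is what makes an SLO actionable: "budget_consumed" rising toward 1.0 is a signal to stop shipping, independent of whether any single check is currently failing.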
Incident Commander
- `incident_state.py` — Lifecycle (create/update/close)
- `incident_manager.py` — CLI + auto-trigger engine
- `incident_render.py` — Postmortem generation
- Timeline with timestamped events
- Evidence pack linking
- Cooldown to prevent alert storms
- Auto-close after configurable TTL
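The lifecycle features above can be sketched in a few dozen lines. This is a hypothetical simplification of what `incident_state.py` and `incident_manager.py` do; the class names and defaults are assumptions, not the project's real code:

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Incident:
    slo: str
    opened_at: float
    events: list = field(default_factory=list)  # timestamped timeline
    closed: bool = False


class IncidentManager:
    """Auto-trigger engine sketch: cooldown suppresses alert storms,
    TTL auto-closes stale incidents."""

    def __init__(self, cooldown_s: float = 300, ttl_s: float = 3600):
        self.cooldown_s = cooldown_s  # min gap between auto-created incidents
        self.ttl_s = ttl_s            # auto-close after this long
        self.last_opened: dict[str, float] = {}
        self.active: dict[str, Incident] = {}

    def trigger(self, slo: str, now: float) -> Incident | None:
        """Open an incident unless one is active or the cooldown applies."""
        if slo in self.active:
            return None
        if now - self.last_opened.get(slo, float("-inf")) < self.cooldown_s:
            return None  # alert-storm suppression
        inc = Incident(slo=slo, opened_at=now)
        inc.events.append((now, "auto-created on SLO degradation"))
        self.active[slo] = inc
        self.last_opened[slo] = now
        return inc

    def sweep(self, now: float) -> None:
        """Auto-close incidents older than the TTL."""
        for slo, inc in list(self.active.items()):
            if now - inc.opened_at >= self.ttl_s:
                inc.closed = True
                inc.events.append((now, "auto-closed after TTL"))
                del self.active[slo]
```

Passing `now` explicitly instead of calling the clock inside the class keeps the lifecycle deterministic and testable, which matters when the postmortem renderer replays the timeline.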
Results
- SLO targets: Gateway 99.9%, Ollama 99.5%, Dashboard 99%, Node connectivity 99.5%
- Burn rate thresholds: fast (14x/1h), medium (6x/6h), slow (2x/24h)
- Incidents auto-created within 2 minutes of SLO degradation
- Postmortems auto-generated with root cause, timeline, lessons, action items
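The multi-window thresholds above translate directly into code. A minimal sketch, using the 14x/6x/2x tiers from the results (the function names are illustrative, not the project's API):

```python
# Alert tiers from the Results section: (burn-rate threshold, lookback window).
THRESHOLDS = {"fast": (14, "1h"), "medium": (6, "6h"), "slow": (2, "24h")}


def burn_rate(error_rate: float, target: float) -> float:
    """How many times faster than 'sustainable' the budget is burning.
    A burn rate of 1 exhausts the budget exactly at the period's end."""
    return error_rate / (1.0 - target)


def evaluate(windows: dict, target: float) -> list:
    """Return the alert tiers whose window currently exceeds its threshold.

    windows maps a window label ("1h", "6h", "24h") to the observed
    error rate over that window.
    """
    fired = []
    for tier, (threshold, window) in THRESHOLDS.items():
        if burn_rate(windows[window], target) >= threshold:
            fired.append(tier)
    return fired


# Gateway at 99.9%: a 2% error rate over the last hour is a 20x burn,
# tripping the fast tier even while longer windows still look healthy.
print(evaluate({"1h": 0.02, "6h": 0.004, "24h": 0.001}, target=0.999))
# → ['fast']
```

This is why multi-window beats a single threshold: the fast tier catches spikes within the hour, while the slow tier catches a 2x burn that would quietly exhaust the monthly budget in two weeks.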
Lessons Learned
- SLOs without error budgets are just dashboards. The budget math is what makes SLOs actionable — it tells you when to stop shipping and start fixing.
- Burn rate > threshold alerting. A single threshold fires too late or too often. Multi-window burn rates catch both fast spikes and slow degradation.
- Postmortems are documentation you'll actually read. They're not just for compliance — they're the best way to learn from incidents and build organizational memory.
- Safety gates save you from yourself. The gatekeeper blocking risky actions during high burn prevented me from making incidents worse.
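The gatekeeper check itself can be very small. A hypothetical sketch of the idea (the risk tiers and the reuse of the 2x/14x burn thresholds are assumptions, not the project's actual gate logic):

```python
def allow_action(action_risk: str, max_burn_rate: float) -> bool:
    """Veto risky automation while the error budget is burning.

    action_risk   -- "low", "medium", or "high" (illustrative tiers)
    max_burn_rate -- worst burn rate across all SLOs and windows
    """
    if max_burn_rate >= 14:
        return False  # fast-burn territory: an incident is live, hands off
    if max_burn_rate >= 2 and action_risk != "low":
        return False  # budget burning: only safe read-style actions allowed
    return True


print(allow_action("high", max_burn_rate=0.5))  # True  -- healthy
print(allow_action("high", max_burn_rate=5.0))  # False -- degraded
print(allow_action("low", max_burn_rate=5.0))   # True  -- safe probe ok
```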
Next Steps
- Add latency SLIs (not just availability)
- Integrate with Telegram for alert delivery
- Build a weekly SLO review report for trend analysis
Redacted sample (structure accurate; sensitive values removed).
What to look at: SLO budget bars per service, burn-rate gauges across alert windows, and active incident count with average TTR.
Evidence: SLO Configuration (sanitized)
```yaml
# slo_targets.yaml (excerpt)
gateway_availability:
  target: 0.999
  windows: [1h, 6h, 24h, 7d, 30d]
  burn_thresholds:
    fast:   { rate: 14, window: 1h }
    medium: { rate: 6,  window: 6h }
    slow:   { rate: 2,  window: 24h }

ollama_availability:
  target: 0.995
  windows: [1h, 6h, 24h, 7d, 30d]
```
Related artifacts: Proof Pack →