Flagship Case Study

SRE Automation Pipeline

Google-style site reliability engineering applied to a 4-node homelab — because even personal infrastructure deserves production-grade reliability.

Problem

Manual monitoring doesn't scale. Even on a 4-node homelab, I was spending time SSH-ing into machines to check if services were healthy. When something broke, I'd discover it hours later. No structured way to track reliability, no history of incidents, no way to prove operational maturity to a hiring manager.

Real constraint: This is a homelab, not a cloud environment. No Datadog, no PagerDuty, no SaaS monitoring. Everything had to be built from scratch, self-hosted, and cost $0.

Architecture

Snapshots → SLI Sources → SLI Compute → Error Budget
                                                 ↓
                               Burn Rate → Alerts → Incidents
                                                 ↓
                  Timeline → Evidence → Postmortem → Dashboard

Each box is a Python module. The pipeline runs on every SLO evaluation tick and feeds into the safety gatekeeper to block risky automation when reliability is degraded.
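The gatekeeper check can be sketched roughly as follows. This is a minimal illustration, not the project's actual API: `SloStatus`, `allow_action`, and the 14x / 10% cutoffs are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class SloStatus:
    burn_rate_1h: float      # observed burn rate over the last hour
    budget_remaining: float  # fraction of error budget left (0.0-1.0)

def allow_action(status: SloStatus, risky: bool) -> bool:
    """Block risky automation while reliability is degraded."""
    if not risky:
        return True
    # Deny risky actions during a fast burn or when the budget is nearly spent.
    return status.burn_rate_1h < 14 and status.budget_remaining > 0.1
```

Routine actions always pass; risky ones are denied the moment the fast-burn condition trips, which is what kept automation from piling onto an active incident.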

Implementation

SLO Engine
  • sli_sources.py — Load raw signals from snapshots
  • sli_compute.py — Compute SLIs per rolling window
  • budget.py — Track error budgets (allowed/consumed)
  • burn_rate.py — Multi-window burn rate alerting
  • slo_eval.py — Unified evaluation orchestrator
  • slo_render.py — Markdown + text report rendering
  • slo_publish.py — Dashboard + Telegram publishing

Incident Commander
  • incident_state.py — Lifecycle (create/update/close)
  • incident_manager.py — CLI + auto-trigger engine
  • incident_render.py — Postmortem generation
  • Timeline with timestamped events
  • Evidence pack linking
  • Cooldown to prevent alert storms
  • Auto-close after configurable TTL
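
The cooldown and auto-close behaviors above can be sketched as follows. Function names and durations are illustrative assumptions, not the project's actual values.

```python
COOLDOWN_S = 10 * 60        # suppress duplicate incident creation for 10 minutes
AUTO_CLOSE_TTL_S = 60 * 60  # auto-close after 1 hour of sustained health

def should_open(last_opened: float, now: float) -> bool:
    """Only open a new incident once the cooldown has elapsed."""
    return now - last_opened >= COOLDOWN_S

def should_auto_close(healthy_since: float, now: float) -> bool:
    """Close an incident once the service has been healthy for the TTL."""
    return now - healthy_since >= AUTO_CLOSE_TTL_S
```

The cooldown is what prevents alert storms: a flapping service produces one incident with many timeline events rather than many incidents.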

Results

  • 6 SLOs tracked across 5 rolling windows
  • 38 tests passing
  • 13m average incident time-to-resolution (TTR)
  • SLO targets: Gateway 99.9%, Ollama 99.5%, Dashboard 99%, Node connectivity 99.5%
  • Burn rate thresholds: fast (14x/1h), medium (6x/6h), slow (2x/24h)
  • Incidents auto-created within 2 minutes of SLO degradation
  • Postmortems auto-generated with root cause, timeline, lessons, action items

Lessons Learned

  • SLOs without error budgets are just dashboards. The budget math is what makes SLOs actionable — it tells you when to stop shipping and start fixing.
  • Burn rate > threshold alerting. A single threshold fires too late or too often. Multi-window burn rates catch both fast spikes and slow degradation.
  • Postmortems are documentation you'll actually read. They're not just for compliance — they're the best way to learn from incidents and build organizational memory.
  • Safety gates save you from yourself. The gatekeeper blocking risky actions during high burn prevented me from making incidents worse.

Next Steps

  • Add latency SLIs (not just availability)
  • Integrate with Telegram for alert delivery
  • Build a weekly SLO review report for trend analysis

SRE dashboard showing SLO status bars, burn-rate gauges, and incident count for the 4-node homelab cluster.

Redacted sample (structure accurate; sensitive values removed).

What to look at: SLO budget bars per service, burn-rate gauges across alert windows, and active incident count with average TTR.

Evidence: SLO Configuration (sanitized)

# slo_targets.yaml (excerpt)
gateway_availability:
  target: 0.999
  windows: [1h, 6h, 24h, 7d, 30d]
  burn_thresholds:
    fast:   { rate: 14, window: 1h }
    medium: { rate: 6,  window: 6h }
    slow:   { rate: 2,  window: 24h }

ollama_availability:
  target: 0.995
  windows: [1h, 6h, 24h, 7d, 30d]
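
Once parsed (e.g. with PyYAML's `safe_load`), entries like these can be sanity-checked before the evaluator uses them. A minimal validator sketch; the function name and error messages are hypothetical:

```python
def validate_slo(name: str, cfg: dict) -> list[str]:
    """Return a list of config problems for one SLO entry (empty = valid)."""
    errors = []
    target = cfg.get("target")
    # Targets are fractions like 0.999, never percentages or integers.
    if not (isinstance(target, float) and 0.0 < target < 1.0):
        errors.append(f"{name}: target must be a fraction in (0, 1)")
    if not cfg.get("windows"):
        errors.append(f"{name}: at least one rolling window is required")
    # A burn rate of 1x is normal consumption; thresholds must exceed it.
    for tier, t in (cfg.get("burn_thresholds") or {}).items():
        if t.get("rate", 0) <= 1:
            errors.append(f"{name}: {tier} burn rate must exceed 1x")
    return errors
```

Failing fast on a malformed target (e.g. `99.9` instead of `0.999`) is cheaper than debugging a burn-rate alert that can never fire.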

False positives: 0

Related artifacts: Proof Pack →