The Alert Fatigue Problem
In my first week as a DevOps engineer at Google, I received 47 alert emails. Most were false alarms. By day three, I was ignoring all of them. By day five, a real issue slipped through because it looked like every other noisy alert.
Bad monitoring is worse than no monitoring. It creates a "boy who cried wolf" scenario where real issues get lost in the noise.
The Three-Alert Rule
Here's my controversial opinion: if you have more than three types of alerts, you're probably doing it wrong. Focus on what actually matters:
- Is it down? (Your service isn't responding)
- Is it slow? (Response times are awful)
- Is it broken? (Error rates are spiking)
Everything else is nice to know, but shouldn't wake you up at 3 AM.
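If you want that rule to live somewhere more durable than a blog post, here's a minimal sketch of it as data. The names, conditions, and thresholds are mine, not a standard; the point is that the list has exactly three entries.

// The only three alerts worth paging on. Names, conditions, and
// thresholds here are illustrative -- tune them to your own service.
type AlertRule struct {
	Name       string
	Condition  string
	PageOnCall bool
}

var coreAlerts = []AlertRule{
	{Name: "is-it-down", Condition: "health check failing for 2+ minutes", PageOnCall: true},
	{Name: "is-it-slow", Condition: "p95 latency above 500ms for 10+ minutes", PageOnCall: true},
	{Name: "is-it-broken", Condition: "error rate above 5% for 5+ minutes", PageOnCall: true},
}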
Start with the Golden Signals
Google's Site Reliability Engineering book describes four golden signals: latency, traffic, errors, and saturation. I simplified them to three for my team, because we're busy humans, not robots.
Latency (Speed)
// Simple latency monitoring in Go
import (
	"log"
	"net/http"
	"time"
)

func instrumentHandler(handler http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		handler(w, r)
		duration := time.Since(start)

		// Flag any single request slower than 500ms.
		// (A true 95th-percentile alert needs a histogram -- see the sketch below.)
		if duration > 500*time.Millisecond {
			log.Printf("SLOW REQUEST: %s took %v", r.URL.Path, duration)
		}
	}
}
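The snippet above only flags individual slow requests; it can't actually tell you the 95th percentile. If you're already running Prometheus (it shows up in the tools list later), a histogram gets you a real p95 to alert on. This is a sketch assuming the prometheus/client_golang library; the metric name is my own.

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// requestDuration records every request's latency; Prometheus computes the
// p95 from the buckets, so the alert lives in your alerting rules, not in
// your handler.
var requestDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "Request latency by path.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"path"},
)

func init() {
	prometheus.MustRegister(requestDuration)
}

func instrumentHandlerProm(handler http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		handler(w, r)
		requestDuration.WithLabelValues(r.URL.Path).Observe(time.Since(start).Seconds())
	}
}

The alert itself then lives in your Prometheus rules, something like histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5.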
Error Rate
// Count errors vs. successful requests
var (
	mu            sync.Mutex
	totalRequests int
	errorRequests int
)

func trackRequest(statusCode int) {
	mu.Lock()
	defer mu.Unlock()

	totalRequests++
	if statusCode >= 400 {
		errorRequests++
	}

	// Don't alert on a tiny sample: one failure out of five requests
	// is not a 20% "spike".
	if totalRequests < 100 {
		return
	}

	errorRate := float64(errorRequests) / float64(totalRequests)
	if errorRate > 0.05 { // 5% error rate threshold
		sendAlert("Error rate spike: %.2f%%", errorRate*100) // sendAlert: your alerting hook (a Slack webhook sketch appears later)
	}
}
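One caveat with counting since process start: after a few million requests, a fresh spike barely moves the overall ratio. A crude but effective fix is to reset the counters on a timer so the rate reflects recent traffic. This sketch builds on the variables above; in practice you'd more likely ship the counts to Prometheus and alert on rate() there.

// Reset the counters every minute so the error rate reflects
// recent traffic, not everything since the process started.
func startErrorRateWindow() {
	go func() {
		for range time.Tick(time.Minute) {
			mu.Lock()
			totalRequests = 0
			errorRequests = 0
			mu.Unlock()
		}
	}()
}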
Dashboards for Humans
Most monitoring dashboards look like NASA mission control. They're impressive but useless during an outage when you need answers fast.
My Dashboard Philosophy:
- Big numbers: Status should be visible from across the room
- Traffic lights: Green = good, Yellow = watch, Red = fix now
- Time ranges that matter: Last hour, last day, last week
- One-click drilldown: Click to see what's actually broken
"If you need to squint at your dashboard to understand if things are working, your dashboard needs work."
Logs That Tell Stories
Stop logging everything. Start logging stories. When something breaks, you want to understand what happened, not dig through 10,000 debug statements.
Good Logging Practice:
// Instead of this:
slog.Debug("Processing user data")
slog.Debug("Validating input")
slog.Debug("Calling database")
slog.Debug("Returning response")

// Do this (using Go's structured logger, log/slog):
slog.Info("Processing login request",
	"user_id", userID,
	"ip", clientIP,
	"duration", duration)

slog.Error("Login failed",
	"user_id", userID,
	"reason", "invalid_password",
	"attempts_today", attemptCount)
Health Checks That Actually Work
A health check that just returns "OK" is useless. Test the things that actually matter for your service to work.
// db and cache are assumed to be your package-level database and cache clients.
func healthCheck(w http.ResponseWriter, r *http.Request) {
	status := struct {
		Database string `json:"database"`
		Cache    string `json:"cache"`
		API      string `json:"external_api"`
		Overall  string `json:"status"`
	}{}

	// Check database connection
	if err := db.Ping(); err != nil {
		status.Database = "failing"
		status.Overall = "degraded"
	} else {
		status.Database = "healthy"
	}

	// Check cache -- a cache failure might not be critical
	if err := cache.Ping(); err != nil {
		status.Cache = "failing"
		if status.Overall == "" {
			status.Overall = "degraded"
		}
	} else {
		status.Cache = "healthy"
	}

	// External API check omitted here; it follows the same pattern as the database check.
	status.API = "not_checked"

	if status.Overall == "" {
		status.Overall = "healthy"
	}

	// A non-200 status lets load balancers and uptime checks see the problem too.
	w.Header().Set("Content-Type", "application/json")
	if status.Overall != "healthy" {
		w.WriteHeader(http.StatusServiceUnavailable)
	}
	json.NewEncoder(w).Encode(status)
}
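Wiring it up is one line; the /healthz path is a common convention, not a requirement.

func main() {
	http.HandleFunc("/healthz", healthCheck)
	log.Fatal(http.ListenAndServe(":8080", nil))
}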
Alert Fatigue Solutions
Smart Grouping
If five things break at once, send one alert, not five. Most cascading failures have a root cause.
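In code, grouping can be as simple as buffering alerts for a short window and flushing them as one message. The 30-second window and the send callback below are assumptions for the sketch, not a recommendation.

import (
	"fmt"
	"strings"
	"sync"
	"time"
)

// AlertBuffer collects alert messages and flushes them as one combined
// notification. A real deduper would also key on root cause or service.
type AlertBuffer struct {
	mu      sync.Mutex
	pending []string
}

func (b *AlertBuffer) Add(msg string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending = append(b.pending, msg)
}

func (b *AlertBuffer) Start(send func(string)) {
	go func() {
		for range time.Tick(30 * time.Second) {
			b.mu.Lock()
			if len(b.pending) > 0 {
				send(fmt.Sprintf("%d alerts in the last 30s:\n%s",
					len(b.pending), strings.Join(b.pending, "\n")))
				b.pending = nil
			}
			b.mu.Unlock()
		}
	}()
}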
Escalation Paths
Not every alert needs to page the on-call engineer immediately. Try:
- Send to Slack channel (team sees it)
- If not acknowledged in 15 minutes, email on-call
- If not acknowledged in 30 minutes, call on-call
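Roughly, that ladder looks like the sketch below. notifySlack, emailOnCall, and callOnCall are placeholders for whatever integrations you actually use; the acknowledged check between steps is the part that matters.

// Escalate until a human acknowledges the alert.
// notifySlack, emailOnCall, and callOnCall are placeholder hooks.
func escalate(alert string, acknowledged func() bool) {
	notifySlack(alert) // step 1: low friction, the whole team sees it

	time.Sleep(15 * time.Minute)
	if acknowledged() {
		return
	}
	emailOnCall(alert) // step 2: 15 minutes in, target the on-call directly

	time.Sleep(15 * time.Minute)
	if acknowledged() {
		return
	}
	callOnCall(alert) // step 3: 30 minutes in, actually ring a phone
}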
Time-Based Sensitivity
A 30-second outage at 3 PM is different from one at 3 AM. Adjust your thresholds based on business impact.
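One low-tech way to get there is to pick thresholds off the clock. The hours and values below are illustrative only; what counts as "business hours" is yours to define.

// Tighter latency threshold during business hours, looser overnight.
// Hours and values are made up for illustration.
func latencyThreshold(now time.Time) time.Duration {
	hour := now.Hour()
	if hour >= 9 && hour < 18 { // business hours: customers are watching
		return 500 * time.Millisecond
	}
	return 2 * time.Second // overnight: let small blips slide
}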
Tools That Don't Suck
You don't need enterprise monitoring solutions. Start simple:
- Uptime monitoring: UptimeRobot or Pingdom
- Error tracking: Sentry (best free tier ever)
- Basic metrics: Prometheus + Grafana
- Log aggregation: Start with structured logging
- Alerting: PagerDuty or Slack webhooks
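If you start with Slack webhooks, the sendAlert hook used earlier can be tiny. Slack's incoming webhooks accept a JSON body with a text field; the URL below is a placeholder for whatever your workspace generates.

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Placeholder -- use your own incoming webhook URL, and keep it out of source control.
var slackWebhookURL = "https://hooks.slack.com/services/..."

// sendAlert posts a formatted message to a Slack incoming webhook.
func sendAlert(format string, args ...interface{}) {
	payload, _ := json.Marshal(map[string]string{
		"text": fmt.Sprintf(format, args...),
	})
	resp, err := http.Post(slackWebhookURL, "application/json", bytes.NewReader(payload))
	if err != nil {
		return // alerting must never take down the thing it's watching
	}
	resp.Body.Close()
}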
The Human Side of Monitoring
Remember that alerts interrupt real humans with real lives. Design your monitoring like you're designing a user interface – optimize for the human experience, not just technical correctness.
Good Alert Principles:
- Every alert should be actionable
- Include context in the alert message
- Link to runbooks or debugging guides
- Set clear expectations for response time
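In practice, those four principles collapse into sending a structured payload instead of a bare string. A minimal sketch, with field names of my own choosing:

// An alert that arrives with everything the responder needs.
type Alert struct {
	Summary    string        // what is wrong: "Checkout error rate at 7% (threshold 5%)"
	Action     string        // what to do first: "Check recent deploys to checkout-service"
	RunbookURL string        // link to the debugging guide
	RespondBy  time.Duration // expected response time, e.g. 30 * time.Minute
}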
The Kawaii DevOps Philosophy
Good monitoring should make your team feel confident and supported, not stressed and overwhelmed. When alerts fire, the response should be "I know exactly what to do" not "oh no, what's broken now?"
My monitoring setup has cute names (our database health check is called "db-chan") and friendly error messages. Why? Because 3 AM debugging is stressful enough without hostile tooling.
Start Small, Iterate
Don't try to build the perfect monitoring system on day one. Start with basic uptime checks and one dashboard. Add complexity only when you understand what questions you're trying to answer.
The best monitoring system is the one your team actually uses during incidents. Keep it simple, keep it human, and keep iterating based on real experiences.
Your future 3 AM self will thank you for building monitoring that helps instead of hinders. And maybe you'll even get to keep your kawaii pajamas on while fixing things.