
Monitoring That Doesn't Make You Cry

How to set up observability that actually helps your team sleep better at night. No complex dashboards required.

The Alert Fatigue Problem

My first week as a DevOps engineer at Google, I received 47 alert emails. Most were false alarms. By day three, I was ignoring all alerts. By day five, a real issue slipped through because it looked like every other noisy alert.

Bad monitoring is worse than no monitoring. It creates a "boy who cried wolf" scenario where real issues get lost in the noise.

The Three-Alert Rule

Here's my controversial opinion: if you have more than three types of alerts, you're probably doing it wrong. Focus on what actually matters:

  1. Is it down? (Your service isn't responding)
  2. Is it slow? (Response times are awful)
  3. Is it broken? (Error rates are spiking)

Everything else is nice to know, but shouldn't wake you up at 3 AM.
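
To make that concrete, here's a minimal sketch of how I think about those three alerts as data. The names, durations, and the idea of a Pages flag are purely illustrative (the 500ms and 5% numbers are the same ones used later in this post), not a real alerting API:

// The only three alerts worth waking up for. Thresholds are examples;
// tune them to what your users actually notice.
type Alert struct {
    Name      string
    Condition string
    Pages     bool // true = allowed to wake a human at 3 AM
}

var coreAlerts = []Alert{
    {Name: "is-it-down", Condition: "health check failing for 2 minutes", Pages: true},
    {Name: "is-it-slow", Condition: "p95 latency above 500ms for 5 minutes", Pages: true},
    {Name: "is-it-broken", Condition: "error rate above 5% for 5 minutes", Pages: true},
}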

Start with the Golden Signals

Google's Site Reliability Engineering book talks about four golden signals. I simplified it to three for my team, because we're busy humans, not robots.

Latency (Speed)

// Simple latency monitoring in Go
func instrumentHandler(handler http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        handler(w, r)
        duration := time.Since(start)
        
        // Log any request slower than 500ms; alert on the p95 over a window, not on a single slow request
        if duration > 500*time.Millisecond {
            log.Printf("SLOW REQUEST: %s took %v", r.URL.Path, duration)
        }
    }
}
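
Logging individual slow requests is the simplest thing that works. If you want an actual 95th percentile without standing up a metrics backend, here's a rough sketch using only the standard library; the 1,000-sample window is an arbitrary choice of mine:

import (
    "sort"
    "sync"
    "time"
)

// latencyWindow keeps the most recent request durations in memory so we
// can compute a rough p95. Fine for modest traffic; switch to a real
// metrics library (e.g. Prometheus histograms) as volume grows.
type latencyWindow struct {
    mu        sync.Mutex
    durations []time.Duration
}

func (lw *latencyWindow) record(d time.Duration) {
    lw.mu.Lock()
    defer lw.mu.Unlock()
    lw.durations = append(lw.durations, d)
    if len(lw.durations) > 1000 { // keep only the most recent samples
        lw.durations = lw.durations[len(lw.durations)-1000:]
    }
}

func (lw *latencyWindow) p95() time.Duration {
    lw.mu.Lock()
    defer lw.mu.Unlock()
    if len(lw.durations) == 0 {
        return 0
    }
    sorted := append([]time.Duration(nil), lw.durations...)
    sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
    return sorted[len(sorted)*95/100]
}

Call record(duration) from instrumentHandler and only page when p95() stays above 500ms for a few minutes; one slow request is never worth waking anyone for.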

Error Rate

// Count errors vs. successful requests
var (
    mu            sync.Mutex // handlers run concurrently, so guard the counters
    totalRequests = 0
    errorRequests = 0
)

// sendAlert is whatever notification hook you use (Slack webhook, email, pager).
func trackRequest(statusCode int) {
    mu.Lock()
    defer mu.Unlock()

    totalRequests++
    if statusCode >= 400 {
        errorRequests++
    }

    // Note: this rate is cumulative since startup, so it never "recovers"
    // after a bad stretch. A rolling window (sketched below) behaves better.
    errorRate := float64(errorRequests) / float64(totalRequests)
    if errorRate > 0.05 { // 5% error rate threshold
        sendAlert("Error rate spike: %.2f%%", errorRate*100)
    }
}
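
Here's what that rolling window might look like. This is a minimal sketch assuming the same hypothetical sendAlert helper and nothing beyond the standard library:

import (
    "sync"
    "time"
)

// Per-second buckets covering the last minute, so the error rate
// reflects recent traffic instead of everything since startup.
type bucket struct {
    sec    int64
    total  int
    errors int
}

type windowedTracker struct {
    mu      sync.Mutex
    buckets [60]bucket
}

func (t *windowedTracker) track(statusCode int) {
    t.mu.Lock()
    defer t.mu.Unlock()

    now := time.Now().Unix()
    b := &t.buckets[now%60]
    if b.sec != now {
        *b = bucket{sec: now} // this slot last held an older second; reset it
    }
    b.total++
    if statusCode >= 400 {
        b.errors++
    }

    // Only count buckets from the last 60 seconds.
    total, errors := 0, 0
    for _, bk := range t.buckets {
        if now-bk.sec < 60 {
            total += bk.total
            errors += bk.errors
        }
    }
    if total >= 100 && float64(errors)/float64(total) > 0.05 {
        sendAlert("Error rate spike: %.2f%%", float64(errors)/float64(total)*100)
    }
}

The total >= 100 guard is there so a single failed request during a quiet minute doesn't trip the 5% threshold.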

Dashboards for Humans

Most monitoring dashboards look like NASA mission control. They're impressive but useless during an outage when you need answers fast.

My Dashboard Philosophy:

"If you need to squint at your dashboard to understand if things are working, your dashboard needs work."

Logs That Tell Stories

Stop logging everything. Start logging stories. When something breaks, you want to understand what happened, not dig through 10,000 debug statements.

Good Logging Practice:

// Using the standard library's structured logger (log/slog).
// Instead of this:
slog.Debug("Processing user data")
slog.Debug("Validating input")
slog.Debug("Calling database")
slog.Debug("Returning response")

// Do this:
slog.Info("Processing login request",
    "user_id", userID,
    "ip", clientIP,
    "duration", duration)

slog.Error("Login failed",
    "user_id", userID,
    "reason", "invalid_password",
    "attempts_today", attemptCount)

Health Checks That Actually Work

A health check that just returns "OK" is useless. Test the things that actually matter for your service to work.

func healthCheck(w http.ResponseWriter, r *http.Request) {
    status := struct {
        Database string `json:"database"`
        Cache    string `json:"cache"`
        API      string `json:"external_api"` // check critical upstream APIs the same way
        Overall  string `json:"status"`
    }{}
    
    // Check database connection
    if err := db.Ping(); err != nil {
        status.Database = "failing"
        status.Overall = "degraded"
    } else {
        status.Database = "healthy"
    }
    
    // Check cache
    if err := cache.Ping(); err != nil {
        status.Cache = "failing"
        // Cache failure might not be critical
        if status.Overall == "" {
            status.Overall = "degraded"
        }
    } else {
        status.Cache = "healthy"
    }
    
    if status.Overall == "" {
        status.Overall = "healthy"
    }
    
    w.Header().Set("Content-Type", "application/json")
    if status.Overall != "healthy" {
        // Most uptime checkers and load balancers only look at the status
        // code, so reflect problems there, not just in the JSON body.
        w.WriteHeader(http.StatusServiceUnavailable)
    }
    json.NewEncoder(w).Encode(status)
}
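
Wire it up like any other handler (the /healthz path is just my habit; use whatever path your uptime checker expects):

http.HandleFunc("/healthz", healthCheck)

Point an external uptime check at that endpoint and alert number one ("is it down?") comes almost for free.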

Alert Fatigue Solutions

Smart Grouping

If five things break at once, send one alert, not five. Most cascading failures have a root cause.
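
A minimal sketch of that idea, assuming the same hypothetical sendAlert helper from earlier; the 30-second window is my own arbitrary choice:

import (
    "strings"
    "sync"
    "time"
)

// alertGrouper buffers alerts for a short window and sends one summary,
// so a single root cause doesn't page you five separate times.
type alertGrouper struct {
    mu      sync.Mutex
    pending []string
}

func (g *alertGrouper) Add(msg string) {
    g.mu.Lock()
    defer g.mu.Unlock()
    g.pending = append(g.pending, msg)
    if len(g.pending) == 1 {
        // First alert in a while: wait briefly for related failures, then flush.
        time.AfterFunc(30*time.Second, g.flush)
    }
}

func (g *alertGrouper) flush() {
    g.mu.Lock()
    msgs := g.pending
    g.pending = nil
    g.mu.Unlock()
    sendAlert("%d related alerts:\n%s", len(msgs), strings.Join(msgs, "\n"))
}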

Escalation Paths

Not every alert needs to page the on-call engineer immediately. Try a tiered path (sketched in code after the list):

  1. Send to Slack channel (team sees it)
  2. If not acknowledged in 15 minutes, email on-call
  3. If not acknowledged in 30 minutes, call on-call
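
Here's a rough sketch of that flow. notifySlack, emailOnCall, and callOnCall are placeholders for whatever integrations you actually have, and the acknowledged channel would be fed by your ack button or bot:

import "time"

// escalate walks an alert up the chain, stopping as soon as someone acknowledges it.
func escalate(alert string, acknowledged <-chan struct{}) {
    notifySlack(alert) // step 1: the whole team can see it

    select {
    case <-acknowledged:
        return
    case <-time.After(15 * time.Minute):
        emailOnCall(alert) // step 2: still nobody has acknowledged
    }

    select {
    case <-acknowledged:
        return
    case <-time.After(15 * time.Minute): // 30 minutes in total
        callOnCall(alert) // step 3: actually wake someone up
    }
}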

Time-Based Sensitivity

A 30-second outage at 3 PM is different from one at 3 AM. Adjust your thresholds based on business impact.
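
As a sketch, this is all it takes to make a latency threshold time-aware; the business-hours window and the relaxed overnight value are assumptions you'd tune to your own traffic:

import "time"

// latencyThreshold relaxes the paging threshold outside business hours,
// when a brief slowdown affects fewer users.
func latencyThreshold(now time.Time) time.Duration {
    if hour := now.Hour(); hour >= 9 && hour < 18 {
        return 500 * time.Millisecond // business hours: be strict
    }
    return 2 * time.Second // overnight: only page for real pain
}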

Tools That Don't Suck

You don't need an enterprise monitoring suite on day one. Start simple: basic uptime checks, the three alerts above, and a single dashboard will carry a small team a long way.

The Human Side of Monitoring

Remember that alerts interrupt real humans with real lives. Design your monitoring like you're designing a user interface – optimize for the human experience, not just technical correctness.

Good Alert Principles:

  1. Every alert should be actionable: if there's nothing for a human to do, it's a notification, not a page.
  2. One root cause should produce one alert, not five.
  3. Escalate gradually; give the team channel a chance before anyone's phone rings.
  4. Reserve 3 AM pages for the three questions that matter: is it down, is it slow, is it broken?

The Kawaii DevOps Philosophy

Good monitoring should make your team feel confident and supported, not stressed and overwhelmed. When alerts fire, the response should be "I know exactly what to do" not "oh no, what's broken now?"

My monitoring setup has cute names (our database health check is called "db-chan") and friendly error messages. Why? Because 3 AM debugging is stressful enough without hostile tooling.

Start Small, Iterate

Don't try to build the perfect monitoring system on day one. Start with basic uptime checks and one dashboard. Add complexity only when you understand what questions you're trying to answer.

The best monitoring system is the one your team actually uses during incidents. Keep it simple, keep it human, and keep iterating based on real experiences.

Your future 3 AM self will thank you for building monitoring that helps instead of hinders. And maybe you'll even get to keep your kawaii pajamas on while fixing things.