Alert Threshold Calculator
Calculate optimal alert thresholds based on baseline metrics
Understanding Alert Thresholds
Setting appropriate alert thresholds is crucial for effective monitoring. Too sensitive, and you'll suffer from alert fatigue. Too lenient, and you'll miss critical issues.
Two-Tier Alert System
Warning Alerts
Indicate potential issues that need attention but aren't immediately critical:
- Response: Investigate during business hours
- Escalation: Email, Slack notification
- Purpose: Early detection, trend analysis
- Example: Latency 1.5x normal, CPU at 70%
Critical Alerts
Indicate severe issues requiring immediate action:
- Response: Immediate investigation, 24/7
- Escalation: PagerDuty, phone call, SMS
- Purpose: Prevent/mitigate outages
- Example: Latency 2x normal, CPU at 90%
Threshold Strategies
Static Thresholds
Fixed values based on known limits or requirements:
Pros: Simple, predictable, easy to understand
Cons: Doesn't adapt to patterns, may miss gradual degradation
Best for: Hard limits (disk space, memory), SLO targets
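A static threshold check can be sketched in a few lines. The metric names and limit values below are illustrative, not tied to any particular monitoring tool:

```python
# Fixed limits keyed by metric; values are illustrative examples.
STATIC_THRESHOLDS = {
    "disk_usage_pct": {"warning": 80, "critical": 90},
    "memory_usage_pct": {"warning": 75, "critical": 90},
}

def evaluate_static(metric: str, value: float) -> str:
    """Return the alert tier for a metric against its fixed limits."""
    limits = STATIC_THRESHOLDS[metric]
    if value >= limits["critical"]:
        return "critical"
    if value >= limits["warning"]:
        return "warning"
    return "ok"
```

Because the limits never move, this approach is trivial to reason about, but it will not notice a disk that creeps from 40% to 79% over a month.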
Dynamic Thresholds (Statistical)
Based on historical data and standard deviation:
Pros: Adapts to normal patterns, catches anomalies
Cons: More complex, requires historical data
Best for: Traffic patterns, latency, error rates
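One common way to derive dynamic thresholds is mean plus k standard deviations over a window of historical samples. A minimal sketch, with the sigma multipliers as assumed defaults:

```python
import statistics

def dynamic_thresholds(history, warn_sigma=2.0, crit_sigma=3.0):
    """Derive thresholds from historical samples as mean + k * stddev.

    history: a sequence of recent metric samples (e.g. latencies in ms).
    The sigma multipliers are illustrative defaults, not a standard.
    """
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    return {
        "warning": mean + warn_sigma * stdev,
        "critical": mean + crit_sigma * stdev,
    }
```

For example, `dynamic_thresholds([100, 105, 98, 110, 102])` returns thresholds a couple of standard deviations above the recent mean, so they track the service's own normal behavior instead of a hand-picked number.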
Rate of Change
Alert on rapid changes rather than absolute values:
Pros: Catches sudden problems, works across scales
Cons: May miss slow degradation
Best for: Error rate spikes, traffic surges
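A rate-of-change check compares consecutive samples instead of absolute values. The percent-change cutoffs below are assumed for illustration:

```python
def rate_of_change_alert(previous, current, warn_pct=50.0, crit_pct=100.0):
    """Tier an alert on the percent change between two samples.

    warn_pct / crit_pct are illustrative cutoffs: e.g. a 60% jump
    warns, a 150% jump is critical, regardless of absolute scale.
    """
    if previous == 0:
        # Any activity after zero is an infinite relative change.
        return "critical" if current > 0 else "ok"
    change_pct = abs(current - previous) / previous * 100
    if change_pct >= crit_pct:
        return "critical"
    if change_pct >= warn_pct:
        return "warning"
    return "ok"
```

Because the comparison is relative, the same rule works for a service doing 10 requests per second and one doing 10,000, which is the "works across scales" property noted above.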
Sensitivity Levels
Low Sensitivity
- Warning: 2x baseline
- Critical: 3x baseline
- Use for: Stable systems, established services
- Benefit: Fewer false positives
- Risk: May miss gradual degradation
Medium Sensitivity
- Warning: 1.5x baseline
- Critical: 2x baseline
- Use for: Most production systems
- Benefit: Balanced approach
- Risk: Moderate alert volume
High Sensitivity
- Warning: 1.2x baseline
- Critical: 1.5x baseline
- Use for: Critical systems, new deployments
- Benefit: Early detection
- Risk: More false positives, alert fatigue
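The sensitivity tiers above can be expressed as a small calculator that maps a baseline to warning and critical thresholds. The multiplier table mirrors the values listed above; everything else is a sketch:

```python
# Multipliers taken from the sensitivity levels described above.
SENSITIVITY = {
    "low":    {"warning": 2.0, "critical": 3.0},
    "medium": {"warning": 1.5, "critical": 2.0},
    "high":   {"warning": 1.2, "critical": 1.5},
}

def thresholds_from_baseline(baseline: float, sensitivity: str = "medium"):
    """Scale a measured baseline into warning/critical thresholds."""
    multipliers = SENSITIVITY[sensitivity]
    return {tier: baseline * m for tier, m in multipliers.items()}
```

For a service with a 200 ms P95 baseline at medium sensitivity, this yields a 300 ms warning and a 400 ms critical threshold.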
Metric-Specific Guidelines
Latency/Response Time
- Baseline: P95 or P99 latency
- Warning: 1.5x baseline
- Critical: 2x baseline or SLO violation
- Direction: Above threshold
Error Rate
- Baseline: Normal error rate (often < 0.1%)
- Warning: 50% of error budget consumed
- Critical: 100% of error budget or SLO violation
- Direction: Above threshold
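Tying error-rate alerts to the error budget can be sketched as follows. The budget is derived from the SLO target (e.g. a 99.9% SLO leaves a 0.1% error budget); the warn/critical fractions follow the guideline above:

```python
def error_budget_consumed(error_rate: float, slo: float) -> float:
    """Fraction of the error budget the current error rate consumes.

    slo: target success ratio, e.g. 0.999 leaves a 0.001 error budget.
    """
    budget = 1.0 - slo
    return error_rate / budget

def error_rate_tier(error_rate, slo, warn_at=0.5, crit_at=1.0):
    """Warn at 50% of budget consumed, critical at 100%, per the
    guideline above; the exact fractions are a policy choice."""
    consumed = error_budget_consumed(error_rate, slo)
    if consumed >= crit_at:
        return "critical"
    if consumed >= warn_at:
        return "warning"
    return "ok"
```

With a 99.9% SLO, a 0.06% error rate consumes roughly 60% of the budget and warns, while a 0.2% error rate blows through the budget entirely.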
CPU/Memory Usage
- Warning: 70-80% utilization
- Critical: 90% utilization
- Direction: Above threshold
- Note: Consider sustained usage (5+ minutes)
Throughput
- Warning: 30% below baseline
- Critical: 50% below baseline
- Direction: Below threshold
- Note: Account for time-of-day patterns
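Throughput is the one metric above that alerts below its baseline, so the comparison flips. A sketch using the 30%/50% drop guidelines above:

```python
def throughput_tier(current, baseline, warn_drop=0.30, crit_drop=0.50):
    """Throughput alerts fire when traffic falls BELOW baseline.

    warn_drop / crit_drop are fractions of baseline lost, matching
    the 30% / 50% guideline above; baseline should ideally be the
    expected value for this time of day, not a single global number.
    """
    if current <= baseline * (1 - crit_drop):
        return "critical"
    if current <= baseline * (1 - warn_drop):
        return "warning"
    return "ok"
```

With a baseline of 1,000 requests per minute, 650 warns and 400 is critical; using a time-of-day baseline avoids paging every night when traffic naturally dips.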
Best Practices
- Start conservative: Begin with low sensitivity and adjust based on experience
- Use duration: Require condition to persist (e.g., 5 minutes) to avoid flapping
- Consider context: Different thresholds for different times (peak vs off-peak)
- Document clearly: Explain why thresholds were chosen
- Review regularly: Adjust based on system changes and alert patterns
- Alert on symptoms, not root causes: alert on slow responses, not on high CPU
- Make alerts actionable: Include runbooks and context
- Track alert quality: Monitor false positive rates
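The "use duration" practice above, requiring a condition to persist before firing, can be sketched with a sliding window of breach checks. This is an illustrative helper, not a real alerting API:

```python
from collections import deque

class SustainedCondition:
    """Fire only when a condition holds for `window` consecutive
    checks, which suppresses flapping from single bad samples."""

    def __init__(self, window: int = 5):
        self.window = window
        self.samples = deque(maxlen=window)

    def observe(self, breached: bool) -> bool:
        """Record one check; return True only once the breach has
        persisted for the full window."""
        self.samples.append(breached)
        return len(self.samples) == self.window and all(self.samples)
```

With one check per minute and `window=5`, this is the "persist for 5 minutes" rule: a single latency spike records a breach but never pages.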
Avoiding Alert Fatigue
- Every alert should be actionable
- Critical alerts should wake someone up
- Warning alerts can wait for business hours
- Mute or fix noisy alerts immediately
- Use aggregation to reduce duplicate alerts
- Implement alert dependencies (don't alert on everything when DB is down)
- Regular alert audit and cleanup
Quick Guide
Alert Tiers
- Info: FYI, no action needed
- Warning: Investigate soon
- Critical: Immediate action
Response Times
- Warning: < 4 hours
- Critical: < 15 minutes
Common Mistakes
- Too many alerts (fatigue)
- Non-actionable alerts
- No duration requirement
- Alerting on causes not symptoms
- Same threshold for all times
- Not reviewing/tuning alerts