Understanding Error Rates & SLOs
Error rates measure how often a service fails, and Service Level Objectives (SLOs) set the reliability target those measurements are judged against. Together they help teams balance innovation velocity with system stability.
Key Concepts
Error Rate
The percentage of requests that result in errors. Calculated as:
Error Rate = (Error Requests / Total Requests) × 100
Common error types include:
- 5xx server errors
- 4xx client errors (only when they indicate a fault in the service rather than a bad request from the caller)
- Timeouts
- Connection failures
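The formula and error types above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the `RequestOutcome` record, its field names, and the `count_4xx` switch are hypothetical, and treating a request with no response as an error is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class RequestOutcome:
    """One request as observed by the service (hypothetical record shape)."""
    status: Optional[int] = None      # HTTP status code, or None if no response arrived
    timed_out: bool = False
    connection_failed: bool = False


def is_error(outcome: RequestOutcome, count_4xx: bool = False) -> bool:
    """Decide whether a single request counts as an error."""
    if outcome.timed_out or outcome.connection_failed:
        return True
    if outcome.status is None:
        return True  # assumption: no response at all is a failure
    if 500 <= outcome.status <= 599:
        return True
    # 4xx responses are counted only when the caller opts in.
    return count_4xx and 400 <= outcome.status <= 499


def error_rate(outcomes: Iterable[RequestOutcome], count_4xx: bool = False) -> float:
    """Error Rate = (Error Requests / Total Requests) x 100."""
    total = 0
    errors = 0
    for outcome in outcomes:
        total += 1
        errors += is_error(outcome, count_4xx)
    if total == 0:
        return 0.0  # assumption: no traffic is reported as a 0% error rate
    return 100 * errors / total


sample = [
    RequestOutcome(status=200),
    RequestOutcome(status=200),
    RequestOutcome(status=503),
    RequestOutcome(status=404),
    RequestOutcome(timed_out=True),
]
print(error_rate(sample))                  # counts the 503 and the timeout
print(error_rate(sample, count_4xx=True))  # also counts the 404
```

Keeping the 4xx decision behind an explicit flag makes the "depending on context" choice visible in code rather than buried in a query.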
Success Rate (SLI)
The percentage of requests that succeed. For a request-based service, this is the Service Level Indicator (SLI) that the SLO is measured against:
Success Rate = 100% - Error Rate
SLO (Service Level Objective)
The target value for your SLI, measured over a defined window (for example, 30 days). Common SLO targets:
- 99.9%: 1 error per 1,000 requests allowed
- 99.95%: 1 error per 2,000 requests allowed
- 99.99%: 1 error per 10,000 requests allowed
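The "1 error per N requests" figures follow directly from the target: N is the reciprocal of the allowed failure fraction. A small sketch of that arithmetic, using a hypothetical helper name and exact fractions to avoid floating-point rounding:

```python
from fractions import Fraction


def requests_per_allowed_error(slo_percent: str) -> Fraction:
    """Number of requests that corresponds to one allowed error at this SLO."""
    allowed_failure_fraction = (100 - Fraction(slo_percent)) / 100
    if allowed_failure_fraction <= 0:
        raise ValueError("an SLO of 100% or more leaves no room for errors")
    return 1 / allowed_failure_fraction


for target in ("99.9", "99.95", "99.99"):
    n = int(requests_per_allowed_error(target))
    print(f"{target}%: 1 error per {n:,} requests allowed")
```

The SLO is passed as a string so that `Fraction` parses the decimal exactly; passing a float such as `99.95` would carry binary rounding error into the result.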
Error Budget
The allowed amount of errors derived from your SLO:
Error Budget = (100% - SLO Target) × Total Requests
For example, with a 99.9% SLO and 1 million requests:
- Error budget = 0.1% × 1,000,000 = 1,000 allowed errors
- If you have 500 errors, 50% of budget is consumed
- If you have 1,500 errors, you've exceeded budget by 50%
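The worked example above can be reproduced with a short sketch. The function names are hypothetical, and exact fractions are used so the percentages come out without rounding noise:

```python
from fractions import Fraction


def error_budget(slo_percent: str, total_requests: int) -> Fraction:
    """Error Budget = (100% - SLO Target) x Total Requests."""
    return (100 - Fraction(slo_percent)) / 100 * total_requests


def budget_consumed_percent(errors: int, slo_percent: str, total_requests: int) -> Fraction:
    """Share of the error budget already used, as a percentage."""
    budget = error_budget(slo_percent, total_requests)
    if budget <= 0:
        raise ValueError("no error budget available for this SLO and request count")
    return Fraction(errors) / budget * 100


budget = error_budget("99.9", 1_000_000)
print(f"allowed errors: {int(budget):,}")
for errors in (500, 1_500):
    consumed = budget_consumed_percent(errors, "99.9", 1_000_000)
    print(f"{errors:,} errors -> {float(consumed):.0f}% of budget consumed")
```

A consumed value above 100% means the budget is overspent; 150% corresponds to the "exceeded budget by 50%" case in the example.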
Using Error Budgets
Budget Remaining: Deploy Freely
When the error budget is healthy (< 50% consumed):
- Push new features aggressively
- Take calculated risks
- Focus on innovation
- Experiment with new technologies
Budget Low: Slow Down
When the error budget is running low (50-80% consumed):
- Increase code review rigor
- Require more testing
- Implement canary deployments
- Monitor deployments closely
Budget Exhausted: Freeze
When the error budget is exhausted (100% or more consumed):
- Pause feature development
- Focus entirely on reliability
- Perform root cause analysis of errors
- Implement fixes and improvements
- Add monitoring and alerting
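The three stages above can be expressed as a small lookup, which is handy for dashboards or deployment gates. This is a sketch with hypothetical names. The text does not say what happens between 80% and 100% consumed, so the sketch keeps that range in the slow-down stage, and it treats exactly 100% consumed as exhausted; both choices are assumptions.

```python
from enum import Enum


class BudgetPolicy(Enum):
    DEPLOY_FREELY = "deploy freely"
    SLOW_DOWN = "slow down"
    FREEZE = "freeze"


def policy_for(consumed_percent: float) -> BudgetPolicy:
    """Map error budget consumption (in percent) to a development policy."""
    if consumed_percent < 0:
        raise ValueError("consumed_percent cannot be negative")
    if consumed_percent < 50:
        return BudgetPolicy.DEPLOY_FREELY
    if consumed_percent < 100:
        # 50-80% is the documented slow-down band; extending it up to
        # 100% is an assumption made to avoid an undefined gap.
        return BudgetPolicy.SLOW_DOWN
    # Assumption: reaching 100% counts as exhausted, not only exceeding it.
    return BudgetPolicy.FREEZE


for consumed in (20, 65, 90, 150):
    print(f"{consumed}% consumed -> {policy_for(consumed).value}")
```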
Setting SLO Targets
Factors to Consider
- User expectations: What do users need?
- Business impact: Cost of downtime or errors
- Current performance: What are you achieving now?
- Competition: What do competitors offer?
- Dependencies: Reliability of external services
Common SLO Targets by Service Type
Consumer Web Applications
- 99.9% - Standard tier
- 99.95% - Premium tier
Enterprise SaaS
- 99.9% - Basic plans
- 99.95% - Professional plans
- 99.99% - Enterprise plans
Financial Services
- 99.95% - Minimum acceptable
- 99.99% - Standard
- 99.999% - Mission-critical
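To get a feel for how far apart these tiers are, it can help to translate each target into allowed downtime. The sketch below assumes a time-based reading of the SLO and a 30-day window; both are illustrative assumptions, and the helper name is hypothetical.

```python
from fractions import Fraction

WINDOW_MINUTES = 30 * 24 * 60  # assumption: 30-day measurement window


def allowed_downtime_minutes(slo_percent: str, window_minutes: int = WINDOW_MINUTES) -> float:
    """Minutes of unavailability the SLO permits within the window."""
    allowed_fraction = (100 - Fraction(slo_percent)) / 100
    return float(allowed_fraction * window_minutes)


for target in ("99.9", "99.95", "99.99", "99.999"):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}%: {minutes:.3f} minutes of downtime per 30 days")
```

Each additional nine cuts the allowance by a factor of ten, which is why the highest tiers are usually reserved for services whose failures are very costly.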
Best Practices
- Set SLOs based on user experience, not technical limits
- Make error budgets visible to the entire team
- Use an error budget policy to guide development pace
- Track error budget burn rate for early warning
- Include all error types that impact users
- Review and adjust SLOs quarterly
- Alert when the error budget is depleting rapidly
- Document what counts as an error
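Burn rate tracking, mentioned above, is often defined as the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. The sketch below uses that common definition; the function names and the 30-day window are assumptions, not something this guide prescribes.

```python
from fractions import Fraction


def burn_rate(errors: int, total_requests: int, slo_percent: str) -> Fraction:
    """Observed error fraction divided by the error fraction the SLO allows."""
    allowed = (100 - Fraction(slo_percent)) / 100
    if allowed <= 0:
        raise ValueError("an SLO of 100% or more has no error budget to burn")
    if total_requests == 0:
        return Fraction(0)  # assumption: no traffic burns no budget
    return Fraction(errors, total_requests) / allowed


def days_until_exhausted(rate: Fraction, window_days: int = 30) -> float:
    """Days a full budget would last at a constant burn rate (assumed 30-day window)."""
    if rate <= 0:
        return float("inf")
    return float(window_days / rate)


# 50 errors in 10,000 requests is a 0.5% error rate against a 0.1% allowance.
rate = burn_rate(50, 10_000, "99.9")
print(f"burn rate: {float(rate):.1f}x")
print(f"budget lasts about {days_until_exhausted(rate):.1f} days at this pace")
```

An alert threshold on the burn rate gives earlier warning than waiting for the consumed percentage to cross a line, because it reacts to the pace of errors rather than their running total.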
Measuring Error Rates
Different approaches for different contexts:
- Request-based: Percentage of failed requests (APIs, web services)
- Time-based: Percentage of time the system is available (uptime)
- User-based: Percentage of users experiencing errors
- Event-based: Percentage of events processed successfully (queues, streams)
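The choice of approach changes the number you report, even on identical data. The following sketch computes a request-based and a user-based error rate from the same hypothetical event list, where each event is a `(user_id, succeeded)` pair; the data and function names are made up for illustration.

```python
from typing import List, Tuple

Event = Tuple[str, bool]  # (user_id, succeeded), a hypothetical record shape


def request_error_rate(events: List[Event]) -> float:
    """Percentage of requests that failed."""
    if not events:
        return 0.0
    failed = sum(1 for _, ok in events if not ok)
    return 100 * failed / len(events)


def user_error_rate(events: List[Event]) -> float:
    """Percentage of distinct users who saw at least one failure."""
    users = {user for user, _ in events}
    if not users:
        return 0.0
    affected = {user for user, ok in events if not ok}
    return 100 * len(affected) / len(users)


events = [
    ("alice", True), ("alice", True), ("alice", False),
    ("bob", True), ("bob", True),
    ("carol", True), ("carol", True), ("carol", True),
    ("dave", True), ("dave", True),
]
print(f"request-based: {request_error_rate(events):.1f}%")  # 1 of 10 requests failed
print(f"user-based:    {user_error_rate(events):.1f}%")     # 1 of 4 users affected
```

Here a single failed request yields a 10% request-based error rate but a 25% user-based one, so the SLO document should state which measure it refers to.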