Split HTTP error metrics into separate 4xx and 5xx counters#959
Split HTTP error metrics into separate 4xx and 5xx counters#959
Conversation
Signed-off-by: Radoslav Dimitrov <[email protected]>
|
@claude - hey, can you review that |
|
Claude finished @rdimitrov's task in 54s —— View job Code ReviewTasks
SummaryThis PR successfully splits HTTP error metrics into separate 4xx and 5xx counters, improving observability by distinguishing client errors from server errors. The implementation is clean and follows Prometheus best practices. ✅ What's Good1. Clear separation of concerns
2. Correct implementation in middleware ( if statusCode >= 500 {
metrics.ErrorCount.Add(ctx.Context(), 1, metric.WithAttributes(attrs...))
} else if statusCode >= 400 {
metrics.ClientErrorCount.Add(ctx.Context(), 1, metric.WithAttributes(attrs...))
}The logic correctly handles the full range of status codes:
3. Updated descriptions
🔍 Observations1. Metric naming convention
This is a breaking change for existing dashboards/alerts that rely on Impact:
2. Alternative naming consideration
This would make the naming more symmetric, though the current naming follows the common convention where "errors" implies server errors. 💡 Suggestions1. Consider documenting the breaking change Since this changes the semantics of an existing metric, it would be helpful to:
2. Optional: Add a combined metric for backward compatibility If you want to maintain backward compatibility, you could add a third metric that combines both: // TotalErrorCount tracks all errors (4xx + 5xx)
TotalErrorCount metric.Int64CounterHowever, given that this is an internal registry and the split provides better observability, the breaking change seems justified. ✨ VerdictLGTM with minor documentation suggestion! The implementation is correct, clean, and improves observability. The only consideration is documenting the breaking change for anyone monitoring the old Great work on improving the metrics! 🎉 |
|
@rdimitrov In order to identify server issues, we can use status_code label in metric Generally standard is to use status_code to distinguish different kind of issues instead of creating separate metric until we are facing high cardinality issues with single metric. |
@pree-dew - Thanks for the feedback! 🙏 That does make the PR redundant so I'll close it 👍 Is there something on the dashboard side that we should do? Also as a side note what do you think about moving the dashboards to a more gitopsy way of managing them, i.e. configmaps stored in the repo? |
|
@rdimitrov I will add panel for server side and client side error on the dashboard to help with this issue. |
Thanks!
Sounds good! 🙏 |
Motivation and Context
The following PR split the metrics into separate 4xx and 5xx counters
Details:
How Has This Been Tested?
Breaking Changes
Types of changes
Checklist
Additional context