Skip to content

Split HTTP error metrics into separate 4xx and 5xx counters#959

Closed
rdimitrov wants to merge 1 commit intomainfrom
split-metrics
Closed

Split HTTP error metrics into separate 4xx and 5xx counters#959
rdimitrov wants to merge 1 commit intomainfrom
split-metrics

Conversation

@rdimitrov
Copy link
Member

Motivation and Context

The following PR split the metrics into separate 4xx and 5xx counters

Details:

  • Separate mcp_registry_http_errors_total to only count 5xx server errors
  • Add new mcp_registry_http_client_errors_total counter for 4xx client errors
  • Previously all 4xx and 5xx were counted together under errors making it harder to distinguish real server issues from client validation errors (e.g. 422s were 93% of "errors")

How Has This Been Tested?

Breaking Changes

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

Signed-off-by: Radoslav Dimitrov <[email protected]>
@rdimitrov
Copy link
Member Author

@claude - hey, can you review that

@claude
Copy link
Contributor

claude bot commented Feb 15, 2026

Claude finished @rdimitrov's task in 54s —— View job


Code Review

Tasks

  • Read and analyze the changed files
  • Review the metrics implementation
  • Provide comprehensive feedback

Summary

This PR successfully splits HTTP error metrics into separate 4xx and 5xx counters, improving observability by distinguishing client errors from server errors. The implementation is clean and follows Prometheus best practices.

✅ What's Good

1. Clear separation of concerns

  • The new ClientErrorCount metric properly tracks 4xx errors
  • The existing ErrorCount metric now correctly tracks only 5xx errors
  • Metric names follow Prometheus naming conventions (http.errors for server errors, http.client_errors for client errors)

2. Correct implementation in middleware (router.go:80-84)

if statusCode >= 500 {
    metrics.ErrorCount.Add(ctx.Context(), 1, metric.WithAttributes(attrs...))
} else if statusCode >= 400 {
    metrics.ClientErrorCount.Add(ctx.Context(), 1, metric.WithAttributes(attrs...))
}

The logic correctly handles the full range of status codes:

  • 5xx → ErrorCount
  • 4xx → ClientErrorCount
  • 2xx/3xx → no error metric (correct)

3. Updated descriptions

  • Comments and metric descriptions clearly document what each counter tracks
  • This helps with metric interpretation in monitoring dashboards

🔍 Observations

1. Metric naming convention
The new metric names are:

  • mcp_registry.http.errors (5xx only)
  • mcp_registry.http.client_errors (4xx only)

This is a breaking change for existing dashboards/alerts that rely on mcp_registry_http_errors_total (Prometheus exported name). The old metric counted both 4xx and 5xx together.

Impact:

  • Good: More precise alerting (e.g., alert only on 5xx server errors)
  • ⚠️ Breaking: Existing queries for mcp_registry_http_errors_total will now only show 5xx errors
  • 📊 Migration needed: Dashboards/alerts need updating to use both metrics if they want total error count

2. Alternative naming consideration
The metric names could be more parallel:

  • Current: http.errors and http.client_errors
  • Alternative: http.server_errors and http.client_errors

This would make the naming more symmetric, though the current naming follows the common convention where "errors" implies server errors.

💡 Suggestions

1. Consider documenting the breaking change

Since this changes the semantics of an existing metric, it would be helpful to:

  • Add a note in the PR description about the breaking change
  • Consider if this warrants a changelog entry
  • Check if any existing dashboards/alerts need updating

2. Optional: Add a combined metric for backward compatibility

If you want to maintain backward compatibility, you could add a third metric that combines both:

// TotalErrorCount tracks all errors (4xx + 5xx)
TotalErrorCount metric.Int64Counter

However, given that this is an internal registry and the split provides better observability, the breaking change seems justified.

✨ Verdict

LGTM with minor documentation suggestion!

The implementation is correct, clean, and improves observability. The only consideration is documenting the breaking change for anyone monitoring the old http_errors_total metric. The separation of 4xx and 5xx errors will make it much easier to distinguish actual server problems from client validation errors.

Great work on improving the metrics! 🎉


| Branch: split-metrics

@pree-dew
Copy link
Contributor

@rdimitrov In order to identify server issues, we can use status_code label in metric
mcp_registry_http_errors_total{status_code=~"5.*"}

Generally standard is to use status_code to distinguish different kind of issues instead of creating separate metric until we are facing high cardinality issues with single metric.

@rdimitrov
Copy link
Member Author

@rdimitrov In order to identify server issues, we can use status_code label in metric mcp_registry_http_errors_total{status_code=~"5.*"}

Generally standard is to use status_code to distinguish different kind of issues instead of creating separate metric until we are facing high cardinality issues with single metric.

@pree-dew - Thanks for the feedback! 🙏 That does make the PR redundant so I'll close it 👍 Is there something on the dashboard side that we should do? Also as a side note what do you think about moving the dashboards to a more gitopsy way of managing them, i.e. configmaps stored in the repo?

@rdimitrov rdimitrov closed this Feb 15, 2026
@pree-dew
Copy link
Contributor

@rdimitrov I will add panel for server side and client side error on the dashboard to help with this issue.
I think, if we have very limited access to dashboard via UI then we can continue with that as it doesn’t change so frequently and allows quick fixes to noise, allow easy testing and give quick feedback for changes, config for dashboard might make it bit hard as there is no visible way to check the configuration apart from local.

@rdimitrov
Copy link
Member Author

@rdimitrov I will add panel for server side and client side error on the dashboard to help with this issue.

Thanks!

I think, if we have very limited access to dashboard via UI then we can continue with that as it doesn’t change so frequently and allows quick fixes to noise, allow easy testing and give quick feedback for changes, config for dashboard might make it bit hard as there is no visible way to check the configuration apart from local.

Sounds good! 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants