Conversation
Adds support for OpenTelemetry tracing, as described in the included docs. Custom instrumentation is not yet supported.
|
Review requested:
|
legendecas
left a comment
There was a problem hiding this comment.
AFAICT, it would create confusion that the built-in support does not work with the @opentelemetry/api package, which is intended to be the global provider for otel components, like propagators and exporters.
|
I personally find the opentelemetry apis really cumbersome to work with. Maybe there is something better that can be done. But I also concur that some kind of high level compatibility api would need to exist. |
|
It would be really nice to get ahold of https://www.npmjs.com/package/otel and add this new core module without the prefix. |
| otel.start({ endpoint: 'http://collector.example.com:4318' }); | ||
|
|
||
| // ... application code ... | ||
| otel.stop(); |
There was a problem hiding this comment.
More a style nit... feel free to ignore. I find this start()/stop() pattern a bit cumbersome. I'd be happier something like...
const stop = otel.start();
stop();which also allows for...
{
using session = otel.start();
// automatically stops
}There was a problem hiding this comment.
While using is great, I don't think it's actually appropriate here. This is for enable and disabling an entire subsystem. stop() would probably actually be pretty rare, probably only used for some kind of error recovery / process-teardown. It definitely doesn't fit the usual use case for using, where we'd expect a resource to be disposed at the end of a variable scope.
| * `options` {Object} | ||
| * `endpoint` {string} The OTLP collector endpoint URL. | ||
| **Default:** `'http://localhost:4318'`. | ||
| * `filter` {string|string\[]} Optional comma-separated string or array of |
There was a problem hiding this comment.
Nit: can we pick one? I'd prefer just string[].
There was a problem hiding this comment.
It's the way it is for consistency with the environment variable. It could be an array here in the API and then comma-separated in the environment variable.
Agree that an OpenTelemetry SDK is not built-in to the Node.js instrumentation. However, this implementation is not fully compliant with the OpenTelemetry design. The instrumentation should not implement the OpenTelemetry SDK components, like exporters and tracer SDK. The three OpenTelemetry layers, API, SDK, and instrumentations, have their clear scopes and allows open customization without any vendor lock-in. That is, instrumentations will use the API to create spans and metrics, but do not directly depend on the SDK. This allows instrumentations to be vendor-agnostic. And observability vendors could provide their SDK (or just an exporter) to do whatever propriatary works, like exporting. I think a Node.js built-in |
Qard
left a comment
There was a problem hiding this comment.
Some small questions/ideas, but otherwise LGTM!
| ### `OTEL_SERVICE_NAME` | ||
|
|
||
| Standard OpenTelemetry environment variable used to set the service name in | ||
| exported resource attributes. Defaults to `node-<pid>`. |
There was a problem hiding this comment.
We already have to load the package.json during startup, don't we? So perhaps we could make this use the name field from it, if present, and fallback to that?
| let timeOriginNs; | ||
| function getTimeOriginNs() { | ||
| if (timeOriginNs === undefined) { | ||
| // getTimeOriginTimestamp() returns ms since Unix epoch. | ||
| // Multiply by 1e6 to get nanoseconds. | ||
| const originMs = getTimeOriginTimestamp(); | ||
| timeOriginNs = BigInt(MathRound(originMs * 1e6)); | ||
| } | ||
| return timeOriginNs; | ||
| } | ||
|
|
||
| function hrTimeToNanos() { | ||
| const relativeMs = now(); | ||
| const ns = getTimeOriginNs() + BigInt(MathRound(relativeMs * 1e6)); | ||
| return `${ns}`; | ||
| } |
There was a problem hiding this comment.
I'm wondering a bit about the perf of this. Did you try a high/low pair like process.hrtime() returns to be able to do the 64-bit timestamp rather than this string approach? I'm a bit worried about the perf implications of lots of bigint timestamps being generated all the time. 🤔
There was a problem hiding this comment.
The OTEL hrtime, as far as I have been able to determine, is not compatible with process.hrtime. It is based upon performance.timeOrigin -- https://github.com/newrelic/node-newrelic/blob/39d1d9a59ca467d71d4a7db4cf437d743a24939b/lib/otel/normalize-timestamp.js#L56-L85
There was a problem hiding this comment.
Aside from performance concerns, is it a good idea to depend entirely on a nonmonotonic clock? This may result in negative duration or out of order spans.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #61907 +/- ##
==========================================
+ Coverage 89.73% 89.75% +0.01%
==========================================
Files 675 680 +5
Lines 204674 205883 +1209
Branches 39330 39565 +235
==========================================
+ Hits 183667 184781 +1114
- Misses 13283 13387 +104
+ Partials 7724 7715 -9
🚀 New features to boost your workflow:
|
| urlStr + '/v1/traces'; | ||
| } | ||
|
|
||
| const parsed = new URL(urlStr); |
There was a problem hiding this comment.
This condition could be simplified to the following:
const parsed = new URL('/v1/traces', endpoint);Tests with the Node REPL:
> new URL("/v1/traces", "http://localhost:4318/hello?foo=bar").href
'http://localhost:4318/v1/traces'
> new URL("/v1/traces", "http://localhost:4318/").href
'http://localhost:4318/v1/traces'
> new URL("/v1/traces", "http://localhost:4318").href
'http://localhost:4318/v1/traces'
> new URL("/v1/traces", "http://localhost:4318/v1").href
'http://localhost:4318/v1/traces'
> new URL("/v1/traces", "http://localhost:4318/v1/traces").href
'http://localhost:4318/v1/traces'
> new URL("/v1/traces", "http://localhost:4318/v1/traces/").href
'http://localhost:4318/v1/traces'|
I have a lot of thoughts about this so I'll try to keep it succinct and direct. In the interest of full disclosure, I have been one of the maintainers of OTel JS for years and that will likely color some of my opinions. I am not on the OTel Governance Committee and the opinions here should be taken as my own and not as the official stance of the project. I'll start with the things I like. First, I think having built-in support for OTel is great, and focusing on HTTP first makes sense. The lack of interoperability with existing instrumentation means the use of this will be very limited for now, but that's probably fine. Anyone who needs other instrumentation has the option to use the existing otel libraries. Second, i think it is reasonable to only support OTLP. Supporting other export formats is a well intentioned part of OTel, but it was a lot more important in the earlier days when OTLP support was limited. Now all major vendors support OTLP, and the collector serves as a mature routing layer for people who need to use other formats. I would prefer to have OTLP protobuf as an option since it is by far the most used OTLP format, but JSON makes sense as the de facto standard format for JS remote calls and I understand the desire to avoid the protobuf dependency. Third, I do see the value in a reimplementation which is node-specific and avoids the need for various platform-specific workarounds. Because this is shipped with node itself, there is no need to retain workarounds for old node versions or non-node platforms like browsers. Finally, this seems really easy to use. The lack of configuration, customization, and manual instrumentation definitely makes this simpler to understand and use for end users, and because it is built in there is no need to make sure your otel dependency versions align with each other. I do have a couple issues with the way this is implemented. This seems to have completely done away with most of the features required by the OTel specification. For example, it is missing all of the normal extension points (configuration, propagators, samplers, id generators, span processors, exporters, resource, entities). I'm not sure if this was an unintentional side-effect of vibe coding a replacement SDK or an intentional choice to "avoid bloat," but it really undercuts a lot of the value of OTel. I might even go so far as to say it's not quite accurate to call this "OpenTelemetry support" without those things. It might be better to call this "OTLP span export" or something like that. At the very least I'd make sure you reach out to the OTel Governance Committee to obtain permission to use the OpenTelemetry trademark. I'm not a lawyer and not on the GC, but this is an area where I think it pays to be careful. Finally, and this is my biggest concern, I agree with @legendecas that this seems like it would fracture the otel ecosystem on node. I am most concerned about the complete lack of interoperability with existing otel libraries. It seems to me that there is definitely a way that this could be done that would retain interoperability and allow a gradual transition to a built in implementation without such a fracture. I understand that obviously node can't take a dependency on an NPM module, but I'm certain the otel JS SDK would not have any issue taking a dependency on built-in modules where the experience could be improved for users. |
|
The PR description probably could have used a bit more details. Sorry about that. First, let me address compatibility/fracturing and support (or lack thereof) for Second, I agree that calling this PR "OpenTelemetry support" could be misleading if users expect full API support. I'm happy to rename things as appropriate, or move the programmatic activation to some other module (though, I kept it as a new module here knowing that I'd eventually want to export a |
| The tracing subsystem automatically propagates [W3C Trace Context][] across HTTP | ||
| boundaries: | ||
|
|
||
| * **Incoming requests**: The `traceparent` header is read from incoming HTTP |
There was a problem hiding this comment.
I think this should either include tracestate handling otherwise it's not complete from W3C spec point of view.
There was a problem hiding this comment.
With tracestate unused by this internally it may be as simple as propagating it with the span so it can be attached on client spans. No need to parse it for now.
|
|
||
| This module is only available under the `node:` scheme. It requires the | ||
| `--experimental-otel` CLI flag or one of the `NODE_OTEL` / `NODE_OTEL_ENDPOINT` | ||
| environment variables. |
There was a problem hiding this comment.
I think we should clearly mention the limitations of the current state like
- incompatible with OTel components from the OpenTelemetry project
- trace.id/span.id duplication
- likely overwrite of W3C tracecontext headers in outgoing requests
I guess a user has to decide to use either the one or the other OTel variant and avoid to use both in the same node.js process.
|
|
||
| > Stability: 1 - Experimental | ||
|
|
||
| When set to a non-empty value, enables the built-in OpenTelemetry tracing |
There was a problem hiding this comment.
really any value? e.g. having NODE_OTEL=disable or NODE_OTEL=0 to enable it seems a bit uncommon.
Quite some other node env vars check on the value 1.
| Uint8Array, | ||
| } = primordials; | ||
|
|
||
| const { randomFillSync } = require('internal/crypto/random'); |
There was a problem hiding this comment.
Is it required to use cryptographically strong pseudorandom here? Or is the perf impact don't care?
There was a problem hiding this comment.
Not required to be crypto random by spec
| } | ||
| } | ||
|
|
||
| function flush() { |
There was a problem hiding this comment.
Shouldn't flush return a promise to allow users to wait until all data is sent? e.g. before shutdown.
| } | ||
|
|
||
| if (isModuleEnabled('node:http')) { | ||
| sub('http.server.request.start', onHttpServerRequestStart); |
There was a problem hiding this comment.
better to check is it's active and module enabled before subscribing
|
|
||
| // Generate W3C traceparent header value. | ||
| // Format: {version}-{trace-id}-{span-id}-{trace-flags} | ||
| inject() { |
There was a problem hiding this comment.
maybe rename to getTraceParent(). This function doesn't inject anything as of now.
| if (typeof value === 'boolean') { | ||
| return { boolValue: value }; | ||
| } | ||
| return { stringValue: String(value) }; |
There was a problem hiding this comment.
Is there intentionally no array support?
| lastExportWarningTime = now; | ||
| const suffix = exportFailureCount > 1 ? | ||
| ` (${exportFailureCount} total failures)` : ''; | ||
| process.emitWarning( |
There was a problem hiding this comment.
Maybe otel events should be emitted on the otel object instead process.
|
|
||
| const transport = parsed.protocol === 'https:' ? https : http; | ||
|
|
||
| const req = transport.request({ |
There was a problem hiding this comment.
consider to use an http agent to keep the connection alive for some time.
| } | ||
| } | ||
|
|
||
| function sendToCollector(body) { |
There was a problem hiding this comment.
There is no throttling, so if a lot spans are ended and/or connection is slow quite a lot parallel requests might be created.
| request[kSpan] = span; | ||
|
|
||
| const storage = getSpanStorage(); | ||
| storage.enterWith(span); |
There was a problem hiding this comment.
Why do request[kSpan] if you're also using AsyncLocalStorage? Is this just a performance optimization to avoid calling ALS again on response finish?
A very welcome goal. We are extremely interested in this moving forward. We haven't even technically released/announced our support for bridging OTEL signals into our agent, and we have already had complaints about the insane bloat the OTEL dependencies add as disk size, and complaints about the memory impacts of just loading them. But we really don't have the time to go off and re-implement the OTEL API+SDK just to elide all of the things the reference implementations bring in. If we can set up a GitHub project of issues to resolve, that would make it far easier for those interested to contribute and get this feature fully implemented. |
Adds support for OpenTelemetry tracing, as described in the included docs. Custom instrumentation is not yet supported.
This PR is intended to lay the groundwork for future PRs where API support for custom instrumentations is added. The OpenTelemetry SDK is explicitly not used, mainly to avoid bloat.
A programmatic API to enable and disable tracing is added as a new module,
node:otel, but I'm very open to suggestions in terms of where to add it instead, keeping in mind that the intent is to eventually expose an API for custom instrumentation alongside it.Disclaimer: As an experiment, Claude Code was used to build this PR, based on a hand-written prototype I built (but did not publish) about a year ago. Everything was thoroughly reviewed by me prior to PR submission, including some hand-edits.