feat(medcat-trainer): Add OTEL tracing. Add docker build cache#364
feat(medcat-trainer): Add OTEL tracing. Add docker build cache#364alhendrickson merged 11 commits intomainfrom
Conversation
mart-r
left a comment
There was a problem hiding this comment.
Looks good to me.
A comment about all the dependencies. I.e since I don't think they're main dependencies, we may want to group them together as an optional extra. But not sure its worth it.
| "django-health-check>=3.22", | ||
| "requests>=2.31", | ||
| "opentelemetry-sdk>=1.40.0", | ||
| "opentelemetry-distro[otlp]>=0.61b0", |
There was a problem hiding this comment.
Can these perhaps be set up as an optional extra? I.e deep-telemetry or something? And if that's enabled, you get all those along with all the detailed telemetry instead of the basic ones?
That would - of course - involve setting up another env var for this, reading it, and changing the install command based on it in the Dockerfile - so may not be worth it.
There was a problem hiding this comment.
Hey I get this idea, but I don't think it's really worth it it in this exact scenario
I 100% agree with doing this if this if this was a library. However here I kind of don't want to support trainer being installed in any other way outside of the docker image we build, and I'd rather not make different docker images for these deps. Looking around most things just bake it into the image (eg keycloak, solr, prefect), they all just have it ready to be turned on in the image). The libraries promise to be small and noop though
In future I'd like to add tracing into the medcat library itself, so when that happens then I'd definitely go with your idea and make them an optional dependency there. May go with langfuse & would really not want to bring that dep in for the 99.999% of people that dont need it.
There was a problem hiding this comment.
My main idea was to split into multiple images (with and without these extras). But you're right, it doesn't really make a lot of sense for us to manage multiple different types of images for the same thing (at least not at this time or for this specific reason).
With that said, we could still separate these to an optional extra. And since we control the way the image is built, we can just always include them.
Doesn't really change much other than just separating a bunch of these dependencies into a separate list. While this might be a little confusing as well, to me it would clear up the visible dependencies of the project. I.e the project doesn't necessarily depend on all of these. They're nice to haves, not necessities. We'll bake them in anyway, but it might be nice to separate the true dependencies (i.e something this wouldn't work without) from the nice-to-haves.
But this is still a matter of opinion. Having a separate list for packages that always get included anyway might also become a burden (i.e "which list do I add this new thing to?").
There was a problem hiding this comment.
Cool! Yeah I see your point here, it makes a lot of sense to separate these a bit
I've updated here - I've put them into a different group (noting to include the deps directly imported), but still hardcoded in the image to include the extra ones. I'm thinking this is the mid path you're looking for
I've tried to follow https://opentelemetry.io/docs/concepts/instrumentation/libraries/#libraries-should-only-use-the-opentelemetry-api with the required deps. It says to bring in only opentelemetry-api at version 1.0.0 or higher. To note, this is likely to be the pattern followed if we were to also do this inside medcat, the 2 deps would probably be added for all users but they promise that they do nothing unless the opentelemetry-sdk is included too.
There was a problem hiding this comment.
I've tried to follow https://opentelemetry.io/docs/concepts/instrumentation/libraries/#libraries-should-only-use-the-opentelemetry-api with the required deps. It says to bring in only opentelemetry-api at version 1.0.0 or higher. To note, this is likely to be the pattern followed if we were to also do this inside medcat, the 2 deps would probably be added for all users but they promise that they do nothing unless the opentelemetry-sdk is included too.
This might not be as relevant here as this is trainer - it's not a library someone else will install. But certainly will be for future MedCAT use cases.
Looks good to me overall!
…-service calls - cleanup pyproject.toml
…-service calls - cleanup pyproject.toml
…-service calls - cleanup pyproject.toml -uv lock
…-service calls - split deps into observability optional group
| torch = { index = "pytorch-cpu" } | ||
|
|
||
| [[tool.uv.index]] | ||
| name = "pytorch-cpu" |
There was a problem hiding this comment.
For future reference - This was added to fix this error. Open to this being the wrong approach, but this does fix it at least. Can't see a wheel being used in medcat-v2 to best understand what the right pattern is for it - but it feels reasonable to have to opt in to using that index, instead of eveyrthing trying to use it by default
(medcattrainer-webapp) vscode ➜ .../CogStack/cogstack-nlp/medcat-trainer/webapp (main) $ uv add opentelemetry-sdk
Resolved 105 packages in 2.68s
error: Distribution `markupsafe==3.0.3 @ registry+https://download.pytorch.org/whl/cpu/` can't be installed because it doesn't have a source distribution or wheel for the current platform
hint: You're using CPython 3.13 (`cp313`), but `markupsafe` (v3.0.3) only has wheels with the following Python implementation tag: `cp314`
There was a problem hiding this comment.
I think trainer is unlikely to be running in GPU enabled situations (at least for now). Plus, this will also make the image smaller since it won't be installing all the CUDA stuff.
With that said, it's one of those things I've seen before. Where uv picks a version for you to use that's not actually usable in your environment. If I recall, this is because uv (in order to be as quick as it is at resolving dependencies) will skip some (many? not sure) checks regarding specific versions that it picks for usage.
To me, it feels like a lot of newer python stuff (like uv) is built around people who are always on the latest python version. E.g when you build a new website from scratch that depends on very few things other than your web framework. And for those this will not be an issue. But because we depend on medcat, which in turn depends on a dozen other things, which brings in a lot of transitive dependencies as well, we end up a few versions behind on python.
What does this do?
This adds optional tracing to medcat-trainer
It makes a span for every API call recieved, and every request out that is made. Can optionally enable all the sql calls but it's super noisy.
I've also added more spans/detail/events for calls to SOLR, Medcat library, and medcat-service calls
Example
See calls to medcat-service. The trace propagates through so can see the underlying call to

cat.get_entitiesis actually coming from the remote serviceCall to SOLR - showing solr returns 404, but our code has a JSONDecodeError
