feat(greenhouse): adds dashboards for alerts and plugins#1737

Open
olandr wants to merge 33 commits into main from feat/dashboards-issue-1302

Conversation

@olandr
Member

@olandr olandr commented Jan 16, 2026

This adds two dashboards: Alerts and Plugin. The main design drivers for both dashboards have been to (i) use the greenhouse_* metrics, (ii) be a 1-1 mapping of the Greenhouse Alerts, and (iii) give more context to the alerts than simply a number.

To keep the dashboards usable despite the number of params (e.g. clusterNames, nodes, jobs), they include UX elements that allow filtering by some params.


Signed-off-by: Simon Olander [email protected]

@olandr olandr marked this pull request as ready for review January 16, 2026 15:59
@olandr olandr requested a review from a team as a code owner January 16, 2026 15:59
@olandr
Member Author

olandr commented Jan 16, 2026

Plugins

[3 images: Plugins dashboard screenshots]

@olandr
Member Author

olandr commented Jan 16, 2026

Alerts

[4 images: Alerts dashboard screenshots]

(chore): retain catalog secret err in inventory status

(chore): revert error propagation in source/objectReadiness



(chore): reflect aggregated error in statusCondition



(chore): format aggregated errs in catalog



(chore): increase e2e timeout to 3m



(chore): move error struct to source

Signed-off-by: abhijith-darshan <[email protected]>
@IvoGoman
Contributor

Hey Simon, thanks for driving this.

For a few panel groups in the alerts dashboard I see some redundancy with existing dashboards. The Operator Alerts on the Alerts Plugin were modelled after the controller runtime metrics dashboard. Same for the Proxy alerts, which are covered by the Proxy Overview.

When writing the original issue I was thinking more in the direction of having an overview by Greenhouse resource / group of resources.

As an example, take the Team Alerts. This could be its own Organization dashboard, showing the overall status of the organisation (e.g. greenhouse_organization_ready & greenhouse_scim_access_ready). Then Team & Team RBAC panels could show the absolute number of TeamMembers and the status of the TeamRoleBindings, all with filters for Organization, Team & Cluster.

The plugin dashboard is going in this direction. For the panels Plugin Reconciliation is Constantly Failing & Plugins Not Ready for over 15 minutes, how about modelling these as Time Series Charts? The dashboard could then give the operator, who is looking at the overall status or investigating an alert, the change over time instead of showing the same information as the alert. The alert should already have a link to Prometheus with the exact query of the alert.
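
A rough sketch of what such a time series panel could look like in the dashboard JSON, generated here with Python for brevity. The metric name `greenhouse_plugin_ready`, the `clusterName` label, and the `$cluster` variable are assumptions for illustration, not names confirmed in this PR. Only the `timeseries` panel type and the `expr`/`legendFormat` target fields follow Grafana's dashboard JSON model.

```python
import json


def timeseries_panel(title, expr, legend):
    """Build a minimal Grafana time series panel as a plain dict."""
    return {
        "type": "timeseries",
        "title": title,
        "datasource": {"type": "prometheus"},
        "targets": [
            # One Prometheus query per target; legendFormat names each series.
            {"refId": "A", "expr": expr, "legendFormat": legend},
        ],
    }


# Assumed metric/label names: count of plugins reporting not-ready per cluster.
not_ready = timeseries_panel(
    title="Plugins Not Ready",
    expr='count by (clusterName) (greenhouse_plugin_ready{clusterName=~"$cluster"} == 0)',
    legend="{{clusterName}}",
)

print(json.dumps(not_ready, indent=2))
```

Plotting the count over time, rather than the current value, is what lets an operator see when plugins started failing and whether the situation is improving.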

abhijith-darshan and others added 3 commits January 19, 2026 16:12
(chore): increase e2e timeout to 5m in actions



(chore): e2e workflow - revert to main branch

Signed-off-by: abhijith-darshan <[email protected]>
(chore): fix lint

Signed-off-by: abhijith-darshan <[email protected]>
* Authorization webhook with insecure config (WIP)

On-behalf-of: @SAP [email protected]

* fix authz chart, setup insecure authorization webhook server

On-behalf-of: @SAP [email protected]

* sets the local image for the authz deployment

On-behalf-of: @SAP [email protected]

* apply authz service manifest in local setup, set up the service as NodePort

On-behalf-of: @SAP [email protected]

* mount custom resolv config for kube-apiserver in kind cluster to use CoreDNS, use ClusterIP service for authz

On-behalf-of: @SAP [email protected]

* enable secure mTLS connection to authorization webhook from apiserver

On-behalf-of: @SAP [email protected]

* handle authorization of requests for greenhouse resources based on owned-by label; add serviceaccount with rbac to authz

On-behalf-of: @SAP [email protected]

* add greenhouse prefix to authz resources, tidy up logs

On-behalf-of: @SAP [email protected]

* filter verbs in requests redirected to authorizer webhook

On-behalf-of: @SAP [email protected]

* update verbs in authz webhook matchConditions

On-behalf-of: @SAP [email protected]

* add authz certs generation to setup-manager target, remove custom resolv.conf and use NodePort service with loopback IP instead

On-behalf-of: @SAP [email protected]

* disable authz by default in greenhouse chart, revert chart versions, fix lint suggestions

On-behalf-of: @SAP [email protected]

* extract code to init func in authz

On-behalf-of: @SAP [email protected]

* fix helm lint

On-behalf-of: @SAP [email protected]

* use the client.Client with RESTMapper instead of dynamic client in authz

On-behalf-of: @SAP [email protected]

* put authz-certs target as dependency to make setup-manager

On-behalf-of: @SAP [email protected]

* template authz deployment and service values; replace authz modification in local setup with chart overrides; remove unnecessary match condition

On-behalf-of: @SAP [email protected]

* feat: add insecure option for running authz server, update chart

On-behalf-of: @SAP [email protected]

* set v1.33.4 for e2e test action, remove v1.32.8 from nightly e2e

On-behalf-of: @SAP [email protected]

* set v1.33.4 for e2e tests; fix unhandled error

On-behalf-of: @SAP [email protected]

* revert v1.32.8 version to e2e matrix for remote cluster

On-behalf-of: @SAP [email protected]

* change admin cluster config for e2e tests action to not include authorization webhook

On-behalf-of: @SAP [email protected]

* fixes after code review, change env to AUTHZ_TLS_DISABLED with default false, set insecure authz webhook for local setup

On-behalf-of: @SAP [email protected]
@olandr
Member Author

olandr commented Jan 21, 2026

> Hey Simon, thanks for driving this.
>
> For a few panel groups in the alerts dashboard I see some redundancy with existing dashboards. The Operator Alerts on the Alerts Plugin were modelled after the controller runtime metrics dashboard. Same for the Proxy alerts, which are covered by the Proxy Overview.

Agreed, the other dashboards look a lot nicer. I deleted these duplicates.

> When writing the original issue I was thinking more in the direction of having an overview by Greenhouse resource / group of resources.
>
> As an example, take the Team Alerts. This could be its own Organization dashboard, showing the overall status of the organisation (e.g. greenhouse_organization_ready & greenhouse_scim_access_ready). Then Team & Team RBAC panels could show the absolute number of TeamMembers and the status of the TeamRoleBindings, all with filters for Organization, Team & Cluster.

I added a new dashboard called "Organization". To me it is a nice start, but it looks a bit empty tbh.
[image]

> The plugin dashboard is going in this direction. For the panels Plugin Reconciliation is Constantly Failing & Plugins Not Ready for over 15 minutes, how about modelling these as Time Series Charts? The dashboard could then give the operator, who is looking at the overall status or investigating an alert, the change over time instead of showing the same information as the alert. The alert should already have a link to Prometheus with the exact query of the alert.

That sounds like a good idea to me as well. I started with the alert angle to have something to start with, but I agree that it is redundant, as you can render the exact alert in Prometheus/Thanos.

The only drawback of a time series chart, however, is that it looks a bit cluttered. I agree that it gives an operator way more information than a simple "boolean" table, but it is not easy to render it in a nice-looking way.
[image]

@trouaux
Contributor

trouaux commented Jan 21, 2026

[Screenshot 2026-01-21 at 5:01:29 PM]

Those gno plugins are not part of obs-eu-nl-1.

IvoGoman and others added 3 commits January 27, 2026 09:19
(chore): e2e test to verify label propagation



(chore): catalog propagate only owned-by label



(chore): remove extraction test cases



(chore): check suspended status condition



(chore): remove nested eventually

Signed-off-by: abhijith-darshan <[email protected]>
Flux is deployed into the local environment by default
Expressions can be used by default as well
@olandr
Member Author

olandr commented Feb 2, 2026

> [Screenshot 2026-01-21 at 5:01:29 PM] Those gno plugins are not part of obs-eu-nl-1.

They are, according to the metrics 🤔

@olandr
Member Author

olandr commented Feb 2, 2026

Handing over this PR to someone else for the time being.

Zaggy21 and others added 6 commits February 2, 2026 17:10
* fix: set HelmReconcileFailedCondition if HelmRelease creation fails for Plugin

On-behalf-of: @SAP [email protected]

* fix flaky E2E test from catalog scenario

On-behalf-of: @SAP [email protected]
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
This helper will update the version comment to use semver in case of
just a major version being used
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
@uwe-mayer uwe-mayer linked an issue Feb 3, 2026 that may be closed by this pull request
1 task
Zaggy21 and others added 10 commits February 3, 2026 17:39
…very tool to helm for helm controller tests (#1772)

On-behalf-of: @SAP [email protected]
currently the Plugin controller will set .spec.test.enable=false
this is overwritten by flux and results in updates to the helmrelease
(chore): add PD and CPD API type methods



(chore): add ready condition to printer columns

remove description from printer column



(chore): unified methods for PD and CPD controllers

utils unifies setupManager with watches for both CPD and PD controllers

Generic PluginDefinition interface for extracting spec, chart name, setting conditions, re-use in other controllers

unifies helm repository and helm chart resource create / update for PD and CPD



(chore): use unified methods in PD and CPD controllers



(chore): update test cases



(chore): refactor ready condition



(chore): simplify method signature



(chore): fix lint errors



(chore): force reconcile until kustomization ready



(chore): use update instead of patch for annotations



(chore): remove nesting eventually



(chore): rename enqueue func



(chore): remove nested eventually



(chore): apply suggestions from code review



(chore): remove redundant predicateFunc



(chore): remove SetUnknownCondition and rename ChartName() to FluxHelmChartResourceName()



(chore): initialize conditions



(chore): pass gomega to underlying helper



(chore): use owns and controller reference on helm charts



(chore): rename enqueue event handler method

Signed-off-by: abhijith-darshan <[email protected]>
(chore): removes unused manifests



(chore): adds license headers



(chore): adds local registry ca mount to source-controller



(chore): adds local registry repository const



(chore): rename org name



(chore): rebase Makefile

Signed-off-by: abhijith-darshan <[email protected]>
(chore): use CreateOrPatch instead of update



(chore): use slices sort

Signed-off-by: abhijith-darshan <[email protected]>
* fix: predicates for Plugins in PluginPresetController

On-behalf-of: @SAP [email protected]

* move IgnoreDeletingResources to its own predicate func, fix linter suggestion

On-behalf-of: @SAP [email protected]

* remove unnecessary DeleteFunc line from predicate, add UID check to the test

On-behalf-of: @SAP [email protected]
* fix(deps): update kubernetes packages

* adapt webhooks to Generic Validator and Defaulter from controller-runtime v0.23.0

On-behalf-of: @SAP [email protected]

* adapt events to the events API migration introduced by controller-runtime v0.23.0

On-behalf-of: @SAP [email protected]

* upgrade mocks with mockery

On-behalf-of: @SAP [email protected]

* set events.k8s.io group to events

On-behalf-of: @SAP [email protected]

---------

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Zaggy21 <[email protected]>
* (chore): separate authz commands from root Makefile

Signed-off-by: abhijith-darshan <[email protected]>

* Apply suggestions from code review

Co-authored-by: Krzysztof Zagorski <[email protected]>

---------

Signed-off-by: abhijith-darshan <[email protected]>
Co-authored-by: Krzysztof Zagorski <[email protected]>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
@ibakshay ibakshay self-assigned this Feb 9, 2026
Zaggy21 and others added 7 commits February 10, 2026 09:20
* enclose oauth clients fetching in Eventually block, fix assertions

On-behalf-of: @SAP [email protected]

* fix flaky teamrolebinding test

On-behalf-of: @SAP [email protected]
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
This bumps dexidp to the commit fixing the issue with net.URL > 1.25.2, 1.24.8.
There is currently no new release; this is to resolve security issues
in the GHAS.
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
@ibakshay ibakshay requested a review from a team as a code owner February 10, 2026 12:33
@github-actions github-actions bot added size/L, documentation, core-apis, dependencies, size/XXL and removed size/XXL labels Feb 10, 2026
Contributor

@IvoGoman IvoGoman left a comment

Thanks for the big rework @olandr 🎉
A couple of things that would be cool:

Organization

  • The Organization variable does not behave as intended. Can you use greenhouse_organization_ready as the series selector? (This only being greenhouse will be addressed soon.)
  • Number of Team Members would be nice as a Time Series chart with {{team}} in the legend.

For the other two, let's get them in, use them, and understand where we can tweak them.
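
The two Organization suggestions above could be sketched as follows, as Python that emits dashboard JSON fragments. `greenhouse_organization_ready` and the `{{team}}` legend come from this review; the `organization` label name and the `greenhouse_team_members_count` metric are assumptions and may not match the real metric names.

```python
import json


def organization_variable():
    """Query variable whose options come from greenhouse_organization_ready."""
    return {
        "name": "organization",
        "type": "query",
        "datasource": {"type": "prometheus"},
        # label_values(metric, label) lists the distinct label values of the series.
        # The label name "organization" is an assumption.
        "query": "label_values(greenhouse_organization_ready, organization)",
        "includeAll": True,
        "multi": True,
    }


def team_members_target():
    """Time series target with one line per team, named via {{team}}."""
    return {
        "refId": "A",
        # Hypothetical metric name; swap in the real team member metric.
        "expr": 'sum by (team) (greenhouse_team_members_count{organization=~"$organization"})',
        "legendFormat": "{{team}}",
    }


fragments = {"variable": organization_variable(), "target": team_members_target()}
print(json.dumps(fragments, indent=2))
```

Scoping the variable to a single metric keeps its options limited to organizations that actually report that series.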

Development

Successfully merging this pull request may close these issues.

[FEAT] - Greenhouse resource dashboard

7 participants