├── .gitignore ├── 2021-11_Steinberg └── README.md ├── 2022-09_Hirschegg └── README.md ├── 2023-05_Leverkusen ├── README.md ├── gardener-node-agent │ ├── design.drawio.svg │ └── readme.md └── masterful-shoot │ ├── MasterfulShoot.png │ └── summary.md ├── 2023-11_Schelklingen ├── README.md ├── discussions │ ├── gardener_node_agent_rolling_update.md │ └── rolling_update_proposal.drawio.svg └── gardener-operator-deploys-gardenlet │ └── gardenlet-via-gardener-operator.png ├── 2024-05_Schelklingen └── README.md ├── 2024-12_Schelklingen └── README.md ├── 2025-06_Schelklingen ├── README.md ├── cluster-network-observability │ └── prometheus-traffic-chart.png └── otel-transport-shoot-metrics │ ├── otel-flow.png │ └── prometheus-chart.png └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | .idea 2 | -------------------------------------------------------------------------------- /2021-11_Steinberg/README.md: -------------------------------------------------------------------------------- 1 | # Hack The Garden 11/2021 Wrap Up 2 | 3 | We, the gardener and metal-stack team, were thinking of organizing a hackathon together since autumn 2019, but then the pandemic made this impossible. 4 | After all, we managed to arrange such a hackathon this November in the beautiful Tyrolean Alps in an even more beautiful hut, the [Mesnerhof-C](https://www.mesnerhof-c.at/). 5 | 6 | This location is equipped with all you need for a serious hacking event, like kicker, dart, excellent WLAN, projector and lots of space to hang around and hack together. 7 | It is also a perfect place to go outside hiking between intense sessions to get your brain and body refreshed. 8 | 9 | ## Preparation 10 | 11 | First we collected a list of projects suitable for implementation during this hackathon and then voted for the winners. 12 | Two major topics have been identified and a list of smaller topics which could be done in small teams in `fast tracks`. 13 | 14 | The topics we would implement with two bigger teams were: 15 | 16 | - Implement the ability to create shoot clusters without worker nodes, aka `kubernetes control plane as a service`. 17 | - Allow shoot cluster users to specify the kubernetes version per worker pool. 18 | 19 | And the fast track contents have been: 20 | 21 | - allow shoot clusters on metal-stack to be hibernated 22 | - port gardener dev setup to the `darwin/arm64` platform 23 | - migrate the metal-stack mini-lab away from vagrant to containerlabs 24 | - simplify the gardener local-setup and allow developers to run a fully functional garden setup locally without any cloud provider (kind cluster provider) 25 | 26 | ### Kubernetes Control plane as a Service 27 | 28 | Or in other words create a shoot cluster without any worker nodes. This reduces the amount of required fields in the shoot spec dramatically. 29 | Use case for this would be to offer just a kubernetes api for applications that just rely on the kubernetes api. Metal-Stack would use this feature to create "Firewall as a Service". Or if you want to add servers as nodes to a control plane which are not controlled by gardener. 30 | 31 | The following portions of the shoot spec become optional when using the feature: 32 | 33 | - Infrastructure 34 | - Networking 35 | - Worker 36 | 37 | Also, workerless mode doesn't require a lot of components that are usually deployed (i.e. `coredns`, `vpn`, etc). 
It runs a very stripped-down version of the Kubernetes control plane with only the following services: kube-apiserver, kube-controller-manager and etcd. 38 | 39 | This stream was implemented to a point where it was basically usable, and a lot of open points were identified. 40 | As we all agreed that the goal is worth the effort, we decided to write a dedicated GEP (Gardener Enhancement Proposal) which describes the goals and non-goals of this feature. The PR of the proposal will follow soon. 41 | 42 | ### Kubernetes Version per Worker Pool 43 | 44 | This is a feature request coming from different groups of cluster users. One scenario which makes this feature invaluable is clusters that require careful update operations. 45 | Storage or big database workloads would like to migrate their pods gradually to worker nodes which are already on the next version of Kubernetes while keeping the rest of the workloads running on the current version. Think of a blue/green deployment, but for Kubernetes workloads within a single cluster. 46 | 47 | This feature was completely implemented and we raised two pull requests: 48 | 49 | - kube-proxy Deployment must match kubelet version: [https://github.com/gardener/gardener/pull/4969](https://github.com/gardener/gardener/pull/4969) 50 | - Kubernetes version configurable per worker pool: [https://github.com/gardener/gardener/pull/4971](https://github.com/gardener/gardener/pull/4971) 51 | 52 | The first PR was raised because during development we identified an issue in the way Gardener updates the workers and the system components: 53 | the Kubernetes version skew policy requires that the `kube-proxy` running on a worker node is at the exact same version as the `kubelet`. The first PR ensures that. 54 | 55 | The second PR allows shoot cluster users to set the Kubernetes version per worker pool. If not set, the worker pool simply inherits the version from the control plane as before. 56 | A worker pool may specify the same Kubernetes version as the control plane and may later stay up to two minor versions behind the control plane version (see the example manifest at the end of this wrap-up). 57 | 58 | ### Fast Tracks 59 | 60 | Progress was also made during the fast track sessions: 61 | 62 | - allow shoot clusters on metal-stack to be hibernated [https://github.com/metal-stack/gardener-extension-provider-metal/pull/221](https://github.com/metal-stack/gardener-extension-provider-metal/pull/221) 63 | - port the gardener dev setup to the `darwin/arm64` platform [https://github.com/gardener/gardener/pull/4955](https://github.com/gardener/gardener/pull/4955) 64 | - migrate the metal-stack mini-lab from vagrant to containerlabs [https://github.com/metal-stack/mini-lab/pull/74](https://github.com/metal-stack/mini-lab/pull/74) 65 | - simplify the gardener local-setup and allow developers to run a fully functional garden setup locally without any cloud provider [https://github.com/gardener/gardener/issues/5024](https://github.com/gardener/gardener/issues/5024) 66 | 67 | ## Conclusion 68 | 69 | This is the beauty of open-source collaboration! I can't imagine such smooth and friendly work towards a shared goal happening any other way. 70 | I'm looking forward to repeating this type of hackathon on a regular basis because I think there is no more fun and productive way of working. 71 | My impression is that this feeling is common to all attendees. 
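To illustrate the per-worker-pool version feature described above, a heavily abbreviated `Shoot` manifest sketch is shown below. Treat the exact field layout as an assumption derived from the proposal in the PRs above rather than the final API, and note that real worker entries carry additional mandatory fields (machine type, image, zones, etc.):

```yaml
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: blue-green-demo
spec:
  kubernetes:
    version: "1.22.4"        # control plane version
  provider:
    workers:
    - name: pool-stable
      kubernetes:
        version: "1.21.8"    # pinned: may lag up to two minor versions behind the control plane
      minimum: 3
      maximum: 5
    - name: pool-next
      # no version set: this pool inherits the control plane version (1.22.4)
      minimum: 1
      maximum: 3
```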
72 | -------------------------------------------------------------------------------- /2022-09_Hirschegg/README.md: -------------------------------------------------------------------------------- 1 | # Hack The Garden 09/2022 Wrap Up 2 | 3 | ## IPv6 4 | 5 | - There is an open PR which adapts some assumptions about the IPv4 CIDRs in `gardener/gardener` (the changes are rather minimal). 6 | - Shoot creation currently stops at 83%. 7 | - The `DNSRecord` extension needs to be adapted to support `AAAA` records. 8 | - The focus was on the local setup only for now (which comes with its own issues). **As a follow-up, we want to try the same with real infrastructures**. 9 | - **Everybody is invited to try the IPv6 setup at home (where you have IPv6 internet access) and provide feedback in the open PR.** 10 | - With IPv6, the network CIDRs cannot be defined before shoot creation (which is different compared to how it works with IPv4 today). **We need to discuss the target picture.** 11 | 12 |
13 | 14 | ## Canary Rollout 15 | 16 | - There is a PoC which marks `Seed`s as canaries and works with both extensions and new gardenlet versions. 17 | - A new freeze controller would stop gardenlets from continuing reconciling `Shoot`s in case too many clusters became unhealthy after the update. 18 | - The `ManagedSeed` controller now also uses `ControllerDeployment`s for rolling out gardenlets. 19 | - There are still some open questions, e.g. "how to validate provider extensions" (given that a seed cluster is typically only responsible for one provider) and "how to determine the canary seeds". 20 | - **We have to discuss the approach and solve the open questions before writing up a GEP**. 21 | 22 |
23 | 24 | ## Registry Cache Extension 25 | 26 | - There is a new controller (`gardener/gardener-extension-registry-cache`) for `Extension` resources which deploys the Docker registries (one per upstream repository) into the shoot cluster. 27 | - The `containerd` configuration for the mirrors is stored in a `ConfigMap` and picked up by a `DaemonSet` from there which applies it and restarts the `containerd.` 28 | - The `DaemonSet` pods are quite privileged and **we need to discuss how to improve security concerns here** (don't use it in production for now). 29 | - **Some unit and integration tests (and docs) are missing and have to be added**, however Prow is already connected to it. 30 | - There is an issue with `containerd` which only allows mirrors to be configured from one file. 31 | - There was also a bug in `gardener/gardener` which prevented additional `containerd` imports from working properly. 32 | - Another issue in `gardener/gardener` was found regarding removing the extension from shoot clusters (still needs to be investigated). 33 | - Potentially, we should refactor the current implementation to rely on `OperatingSystemConfig` webhooks. 34 | 35 |
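To make the mirror mechanism from the registry cache section above more concrete, a minimal sketch of such a `ConfigMap` is shown below. The object name, namespace, and the cache service address are assumptions for illustration and the extension's actual manifest layout may differ; the embedded snippet uses containerd's classic `config.toml` mirror syntax:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: containerd-registry-mirrors   # hypothetical name
  namespace: kube-system
data:
  # Snippet merged into /etc/containerd/config.toml by the DaemonSet before it restarts containerd.
  mirrors.toml: |
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
      endpoint = ["http://registry-docker-io.kube-system.svc.cluster.local:5000"]
```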
36 | 37 | ## Access Control For Shoot Clusters 38 | 39 | - The open PR from Gerrit in `gardener/gardener` was closed in favor of an extension developed by StackIT. 40 | - **It needs to be adapted to a new approach using `EnvoyFilter`s instead of `AuthorizationPolicy`s** (otherwise, we would risk `kube-apiserver` outages). 41 | - End-users should not be bothered with specifying the "right" IPs (source IPs vs. proxy protocol). For this, **the provider extension need to be adapted to make this convenient for the users.** (This is only for convenience - the extension can operate on its own). 42 | - The 23T colleagues created a new GitHub organization called `gardener-community` which might be a good place to host this new extension. 43 | 44 |
45 | 46 | ## Cilium Support For Metal Stack 47 | 48 | - The Cilium extension was enhanced to support some more configuration options which make Cilium now possible on Metal Stack. 49 | - In addition, Cilium was tested successfully with `provider-local` (works out of the box). 50 | 51 |
52 | 53 | ## Kubelet Server Certificate 54 | 55 | - The kubelet was still using a self-signed server certificate (valid for `1y`) and was now adapted to create `CertificateSigningRequest` resources (same process compared to what it does to retrieve its client certificate). 56 | - There is a new controller in `gardener-resource-manager` which approves such CSRs so that `kube-controller-manager` can issue the certificate. 57 | - With this, the server certificate is only valid for `30d` and rotated regularly. 58 | - Documentation and tests are still missing and need to be adapted. 59 | - OpenShift follows a similar approach (they have their own CSR approver controller). 60 | 61 |
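For reference, the switch on the kubelet side of the server-certificate change described above is a single field in the upstream `KubeletConfiguration` API. With it enabled, the kubelet files a `CertificateSigningRequest` using the `kubernetes.io/kubelet-serving` signer, which the new approver controller in `gardener-resource-manager` approves so that `kube-controller-manager` issues the certificate. This is a sketch of the relevant field, not the exact Gardener-rendered configuration:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Request the serving certificate via a CSR instead of generating a self-signed one.
serverTLSBootstrap: true
```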
62 | 63 | ## Replacement Of The Last Remaining Shell Script In Metal Stack 64 | 65 | - Awesome 66 | 67 |
68 | 69 | ## Resource Manager Health Check Watches 70 | 71 | - Previously, the health controller was polling periodically which caused quite some delays in reporting the actual status of the world. 72 | - This prolonged various operations unnecessarily. 73 | - Now, it watches all resources it detects in the `status` of the `ManagedResource` which makes it reacting MUCH faster. 🎉 74 | - This only works for the `ManagedResource`s in the seed since we currently disable the cache for shoot clusters (however, enabling it comes with quite more memory usage, so we have to check). 75 | -------------------------------------------------------------------------------- /2023-05_Leverkusen/README.md: -------------------------------------------------------------------------------- 1 | # Hack The Garden 05/2023 Wrap Up 2 | 3 | ## 🕵️ Introduction Of `gardener-node-agent` 4 | 5 | **Problem Statement:** `cloud-config-downloader` is an ever-growing shell script running on all worker nodes of all shoot clusters. It is templated via Golang and has a high complexity and development burden. It runs every `60s` and checks whether new systemd units or files have to be downloaded. There are several scalability drawbacks due to less flexibility with shell scripts compared to a controller-based implementation, for example unnecessary restarts of systemd units (e.g., `kubelet`) just because the shell script changed (which often results in short interrupts/hick-ups for end-users). 6 | 7 | **Motivation/Benefits**: 💰 Reduction of costs due to less traffic, 📈 better scalability due to less API calls, ⛔️ prevented interrupts/hick-ups due to undesired `kubelet` restarts, 👨🏼‍💻 improved developer productivity, 🔧 reduced operation complexity 8 | 9 | **Achievements:** A new Golang implementation for the `gardener-node-agent` based on `controller-runtime` has been developed next to the current `cloud-config-downloader` shell script. It gets deployed onto the nodes via an initial `gardener-node-init` script executed via user data which only downloads the `gardener-node-agent` binary from the OCI registry. From then on, only the `node-agent` runs as systemd unit and manages everything related to the node, including the `kubelet` and its own binaries (self-upgrade is possible). 10 | 11 | **Next Steps:** Open an issue explaining the motivation about the change. Add missing integration tests, documentation, and reasonably split the changes into smaller, manageable, reviewable PRs. Add dependencies between config files and corresponding units such that only units get restarted when there are relevant changes. 12 | 13 | **Code**: https://github.com/metal-stack/gardener/tree/gardener-node-agent, https://github.com/metal-stack/gardener/tree/gardener-node-agent-2 14 | 15 |
16 | 17 | ## 🌐 IPv6 On Cloud Provider 18 | 19 | **Problem Statement:** With [GEP-21](https://github.com/gardener/gardener/issues/7051), we have started to tackle IPv6 in the local setup. What needs to be done on cloud providers has not been covered or concretely evaluated so far, however this is obviously where eventually we have to go to. 20 | 21 | **Motivation/Benefits**: 🔭 Future-proofness of Gardener, ✨ support more use-cases/scenarios 22 | 23 | **Achievements:** With the help of an IPv6-only seed cluster provisioned via `kubeadm` on a GCP dev box, an IPv6-only shoot cluster could successfully be created. However, this required quite a lot of manual steps that were hard to automate for now. As expected, qualities on the various infrastructures differ quite a lot which increases the complexity of the entire topic. 24 | 25 | **Next Steps:** The many manually performed steps must be documented. The changes already prepared by StackIT for the local setup in the scope of GEP-21 have to be merged (no open PR yet). The infrastructure providers have to be extended to incorporate the current learnings and to achieve an automated setup. 26 | Generally, we have to discuss whether we really want to go to IPv6-only clusters. Instead, it is more likely that we require dual-stack clusters so that we don't have to setup separate landscapes just to provision IPv6 clusters. This comes with its own complexity and has not been tackled yet. 27 | 28 | **Code:** https://github.com/stackitcloud/gardener/tree/hackathon-ipv6 29 | 30 |
31 | 32 | ## 🌱 Bootstrapping "Masterful Clusters" aka "Autonomous Shoots" 33 | 34 | **Problem Statement:** Gardener requires an initial cluster such that it can be deployed and spin up further seed or shoot clusters. This initial cluster can only be setup via different means than `Shoot` clusters, hence it has different qualities, hence it makes the landscapes more heterogenous, hence it eventually results in additional operational complexity. 35 | 36 | **Motivation/Benefits**: 🔗 Drop third-party tooling dependency, 🔧 reduced operation complexity, ✨ support more use-cases/scenarios 37 | 38 | **Achievements:** Gardener produces the needed static pod manifests for the control plane components (e.g., `kube-apiserver`) via existing code and puts them into `ConfigMap`s in the shoot namespace in the seed (eventually, the goal is to add them as units/files in the `OperatingSystemConfig`). Hence, an initial cluster for bootstrapping a first temporary seed and shoot would still be required. However, the idea is to drop it again after control plane has been moved into the created shoot cluster. 39 | 40 | **Next Steps:** There are still many open questions and issues, i.e. how to apply the static pods for the control plane components automatically? How to get valid kubeconfigs for the scenario where `kube-apiserver` runs as static pod in the cluster itself? Generally, much more further brainstorming how to design the overall concept is required, and more people need to be involved. We should document all our current findings for future explorations. 41 | 42 | **Code**: https://github.com/MartinWeindel/gardener/tree/initial-cluster-poc 43 | 44 |
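For readers unfamiliar with the mechanism referenced above: a static pod is simply a manifest dropped into the kubelet's manifest directory, which the kubelet runs without any API server involvement — which is exactly why it can bootstrap a control plane. A heavily abbreviated, illustrative example follows; the real manifests generated by Gardener carry many more flags, certificates, and volume mounts:

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml — heavily abbreviated sketch
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-apiserver
    image: registry.k8s.io/kube-apiserver:v1.26.0   # illustrative version
    command:
    - kube-apiserver
    - --etcd-servers=https://127.0.0.1:2379         # etcd runs as a static pod next to it
    - --secure-port=443
    # ...certificate, admission, and service-account flags omitted
```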
45 | 46 | ## 🔑 Garden Cluster Access For Extensions In Seed Clusters 47 | 48 | **Problem Statement:** Extensions run in seed clusters and have no access to the garden cluster. In some scenarios, such access is required to cover certain use-cases. There are extension implementations that work around this limitation by re-using `gardenlet`'s kubeconfig for talking to the garden cluster. This is obviously bad since it prevents auditability. 49 | 50 | **Motivation/Benefits**: 🛡️ Increased security due to prevention of privilege escalation in the garden cluster by extensions, 👀 better auditing and transparency for garden access, 💡 basis for dropping technical debts (`extensions.gardener.cloud/v1alpha1.Cluster` resource), ✨ support more use-cases/scenarios 51 | 52 | **Achievements:** The `tokenrequestor` controller, part of `gardener-resource-manager`, has been made reusable and is now also registered as part of `gardenlet`. This way, when `gardenlet`'s `ControllerInstallation` controller deploys extensions, it can automatically create dedicated "access token secrets" for the garden cluster. Its `tokenrequestor` controller can then easily provision and maintain the tokens in the extension namespaces in the seed cluster. All `Deployment`s and similar workload resources automatically get a volume mount for the garden kubeconfig and an environment variable pointing to the location of the kubeconfig. The RBAC privileges for all such extensions are restricted at least as strictly as `gardenlet`'s, i.e. only resources related to the seed cluster they are responsible for can be accessed (see the sketch below for what such an access token secret could look like). 53 | 54 | **Next Steps:** Collect feedback in the proposal issue. Technically, the changes are ready and only integration tests and documentation are missing, but let's wait for more feedback on the issue first. 55 | 56 | **Issue**: [gardener/gardener#8001](https://github.com/gardener/gardener/issues/8001) 57 | 58 | **Code:** https://github.com/rfranzke/gardener/tree/hackathon/extensions-garden-access 59 | 60 |
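To make the mechanism more tangible, an "access token secret" for the garden cluster could look roughly like the sketch below. The label and annotation keys are the ones the existing seed-scoped `tokenrequestor` reacts to; the exact keys, names, and target namespace used for the garden-cluster variant are assumptions here:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: garden-access-extension          # hypothetical name
  namespace: extension-provider-local    # extension namespace in the seed
  labels:
    resources.gardener.cloud/purpose: token-requestor
  annotations:
    serviceaccount.resources.gardener.cloud/name: extension-provider-local
    serviceaccount.resources.gardener.cloud/namespace: seed-my-seed   # namespace in the garden cluster (assumed)
type: Opaque
# The tokenrequestor controller populates .data.token and keeps it refreshed;
# the extension Deployment mounts it together with a generic kubeconfig.
```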
61 | 62 | ## 💾 Replacement Of `ShootState`s With Data In Backup Buckets 63 | 64 | **Problem Statement:** The `ShootState`s are used to persist secrets and relevant extension data (infrastructure state/machine status, etc.) such that they can be restored into the new seed cluster during a shoot control plane migration. However, this resource results in quite a lot of frequent updates (hence network I/O and costs) on both the seed and garden cluster API servers since every change in the shoot worker nodes must be persisted (to prevent losing nodes during a migration). Besides, the approach has scalability concerns since the data does not really qualify for being stored in ETCD. 65 | 66 | **Motivation/Benefits**: 💰 Reduction of costs due to less traffic, 📈 better scalability due to less API calls, 😅 simplified code and handling of the Shoot control plane migration process due to elimination of the `ShootState` resource and controllers 67 | 68 | **Achievements:** Two new resources in the `extensions.gardener.cloud` API group were introduced: `BackupUpload` and `BackupDownload`. Corresponding controller implementations were added to the extensions library and to `provider-local`. The control plane migration process was adapted to use the new resources for uploading and downloading the relevant shoot state information to and from object storage buckets. A flaw in the `BackupEntry` API design related to the deletion of entries was discovered and an appropriate redesign was decided on. 69 | 70 | **Next Steps:** Technically, tests and documentation are still needed. Also, we should regularly upload (every hour or so) a backup of `ShootState`. It probably makes sense to write up the proposal in a GEP and agree on the API changes. 71 | 72 | **GEP Document**: [gardener/gardener#8012](https://github.com/gardener/gardener/pull/8012) 73 | 74 | **Issue**: [gardener/gardener#8073](https://github.com/gardener/gardener/issues/8073) 75 | 76 | **Code:** https://github.com/rfranzke/gardener/tree/hackathon/shootstate-s3 77 | 78 |
79 | 80 | ## 🔐 Introducing `InternalSecret` Resource In Gardener API 81 | 82 | **Problem Statement:** End-users can read or write `Secret`s in their project namespaces in the garden cluster. This prevents Gardener components from storing certain "Gardener-internal" secrets related to the `Shoot` in the respective project namespace (since end-users would have access to them). Also, the Gardener API server uses `ShootState`s for the `adminkubeconfig` subresource of `Shoot`s (needed for retrieving the client CA key used for signing short-lived client certificates). 83 | 84 | **Motivation/Benefits**: ✨ Support more use-cases/scenarios, 😅 drop workarounds and dependencies on other resources in the garden cluster 85 | 86 | **Achievements:** The Gardener `core.gardener.cloud` API gets a new `InternalSecret` resource which is very similar to the `core/v1.Secret`. The secrets manager has been made generic such that it can handle both `core/v1.Secret`s and `core.gardener.cloud/v1beta1.InternalSecret`s. `gardenlet` populates the client CA to an `InternalSecret` in the project namespace. This can be used by the API server to issue client certificates for the `adminkubeconfig` subresource and opens up for dropping the `ShootState` API. 87 | 88 | **Next Steps:** Collect feedback in the proposal issue. Adapt missing tests and documentation, but otherwise implementation is ready. 89 | 90 | **Issue**: [gardener/gardener#7999](https://github.com/gardener/gardener/issues/7999) 91 | 92 | **Code**: https://github.com/rfranzke/gardener/tree/hackathon/internal-secrets 93 | 94 |
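As a rough illustration of the new resource described above, an `InternalSecret` is meant to look and feel like a `core/v1` `Secret`, just in the Gardener API group so that project end-users do not get access to it by default. The shape is sketched from the summary; the concrete key names are illustrative:

```yaml
apiVersion: core.gardener.cloud/v1beta1
kind: InternalSecret
metadata:
  name: ca-client                  # illustrative: the shoot's client CA kept away from end-users
  namespace: garden-my-project     # project namespace in the garden cluster
type: Opaque
data:
  ca.crt: LS0tLS1CRUdJTi...        # base64, truncated for the example
  ca.key: LS0tLS1CRUdJTi...
```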
95 | 96 | ## 🤖 Moving `machine-controller-manager` Deployment Responsibility To `gardenlet` 97 | 98 | **Problem Statement:** The `machine-controller-manager` is deployed by all provider extensions even though it is always the same manifest/specification. The only provider-specific part is the sidecar container which implements the interface to the cloud provider API. This increases development efforts since changes have to be replicated again and again for all extensions. 99 | 100 | **Motivation/Benefits**: 👨🏼‍💻 Improved developer productivity, 🔧 reduced operation complexity 101 | 102 | **Achievements:** `gardenlet` takes over deployment of the generic part of `machine-controller-manager`, and provider extensions inject their specific sidecar containers via a webhook. The MCM version is managed centrally in `gardener/gardener`, i.e. provider extensions are now only responsible for their specific sidecar image. 103 | 104 | **Next Steps:** Add missing tests, adapt documentation, and open a pull request. All provider extensions have to be revendored and adapted after this PR has been merged and released. 105 | 106 | **Issue**: [gardener/gardener#7594](https://github.com/gardener/gardener/issues/7594) 107 | 108 | **Code**: https://github.com/rfranzke/gardener/tree/hackathon/mcm 109 | 110 |
111 | 112 | ## 🎯 Improved E2E Test Accuracy For Local Control Plane Migration 113 | 114 | **Problem Statement:** Essential logic for migrating worker nodes during a shoot control plane migration was not tested/skipped in the local scenario which is the basis for e2e tests in Gardener. This effectively increased development effort and complexity. 115 | 116 | **Motivation/Benefits**: 👨🏼‍💻 Improved developer productivity 117 | 118 | **Achievements:** `provider-local` now makes use of the essential logic in the extensions library which increases confidence for changes. The related PR has been opened and got merged already. 119 | 120 | **Next Steps:** None. 121 | 122 | **Code**: https://github.com/gardener/gardener/pull/7981 123 | 124 |
125 | 126 | ## 🏗️ Moving `apiserver-proxy-pod-mutator` Webhook Into `gardener-resource-manager` 127 | 128 | **Problem Statement:** The `kube-apiserver` pods of shoot clusters were running a webhook server as sidecar container which was used to inject an environment variable into workload pods for preventing unnecessary network hops. While all this is still quite valid, the webhook server should not be ran as a sidecar container in the `kube-apiserver` pod as this violates best practices and comes which other downsides (e.g., when the webhook server needs to be scaled up or down, the entire `kube-apiserver` pod must be restarted). 129 | 130 | **Motivation/Benefits**: 👨🏼‍💻 Improved developer productivity, ⛔️ prevented interrupts/hick-ups due to undesired `kube-apiserver` restarts 131 | 132 | **Achievements:** The logic of the webhook server was moved into `gardener-resource-manager` which already serves other similar webhook handlers for workload pods in shoot clusters. Missing integration tests and documentation has been added. The PR has been opened and is now under review. 133 | 134 | **Next Steps:** None. 135 | 136 | **Code:** https://github.com/gardener/gardener/pull/7980 137 | 138 |
139 | 140 | ## 🗝️ ETCD Encryption For Custom Resources 141 | 142 | **Problem Statement:** Currently, only `Secret`s are encrypted [at rest](https://kubernetes.io/docs/tasks/administer-cluster/encrypt-data/) in the ETCDs of shoots. However, users might want to configure more resources that shall be encrypted in ETCD since they might contain confidential data similar to `Secret`s. Also, when `gardener-operator` deploys the Gardener control plane, additional resources like `ControllerRegistration`s or `ShootState`s must be encrypted as well. 143 | 144 | **Motivation/Benefits**: 🔏 Improve cluster security, ✨ support new use cases/scenarios, 🔁 reuse code for the Gardener control plane management via `gardener-operator` 145 | 146 | **Achievements:** The `Shoot` and `Garden` APIs were augmented to allow configuration of to-be-encrypted resources. First preparations in the corresponding handling in code were taken and an approach for implementing the feature was decided on. 147 | 148 | **Next Steps:** Continue the implementation as part of the `gardener-operator` story. The complex part is to orchestrate dynamically added or removed resources such that they either become encrypted or decrypted. 149 | 150 | **Issue:** [gardener/gardener#4606](https://github.com/gardener/gardener/issues/4606) 151 | 152 | **Code**: https://github.com/rfranzke/gardener/tree/hackathon/etcd-encryption 153 | -------------------------------------------------------------------------------- /2023-05_Leverkusen/gardener-node-agent/design.drawio.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
[design.drawio.svg source omitted — diagram text labels only: Control Plane; OSC ("store imagevector of node-agent-tar"); MCM; Worker; Registry (kubelet, gardener-node-agent); gardener-node-agent bootstrap; gardener-node-init ("pull gardener-node-agent", "creates own systemd-unit"); gardener-node-agent (reconciles: node, kubelet, osc, self update); nodeagent API; Watch; Node; Secret]
-------------------------------------------------------------------------------- /2023-05_Leverkusen/gardener-node-agent/readme.md: -------------------------------------------------------------------------------- 1 | # Gardener Node Agent 2 | 3 | This track finally implemented what had been on the list of ideas since the first hackathon in 2021. The goal was to get rid of two important components which are responsible for bootstrapping a machine into a worker node, the `cloud-config-downloader` and the `cloud-config-executor`, both written in `bash` that is even templated via Golang. The sheer complexity of these two scripts, combined with scalability and performance issues, urges their removal. 4 | 5 | ## Basic Design 6 | 7 | The basic idea for removing the `cloud-config-downloader` is as follows: 8 | 9 | - Write a very small bash script called `gardener-node-init`, which is carried with the cloud-init data and whose sole purpose is to download and start a Kubernetes controller, the `gardener-node-agent`, which is responsible for the logic of turning the machine into a worker node. 10 | - The Kubernetes controller on the machine, the `gardener-node-agent`, then watches Kubernetes resources and, depending on the object and the changes to it, reconciles the worker. 11 | 12 | ## Architecture 13 | 14 | ![Design](design.drawio.svg) 15 | 16 | TODO: description 17 | 18 | ## Gains 19 | 20 | With the new architecture we gain a lot; the most important gains are described here. 21 | 22 | ### Developer Productivity 23 | 24 | Because we all develop in Go day by day, writing business logic in `bash` is difficult, hard to maintain, and almost impossible to test. Getting rid of almost all `bash` scripts which are currently in use for this very important part of the cluster creation process will increase the speed of adding new features and removing bugs. 25 | 26 | ### Speed 27 | 28 | Until now, the `cloud-config-downloader` runs in a loop every 60s to check whether something changed on the shoot which requires modifications on the worker node. This produces a lot of unneeded traffic on the API server and wastes time; it can take up to 60s until a desired modification is started on the worker node. 29 | By using controller-runtime we can watch the `node`, the `OSC` in the `secret`, and the shoot-access-token in the `secret`. Only if any of these objects changes does the required action take effect, and then it does so immediately. 30 | This will speed up operations and reduce the load on the API server of the shoot dramatically. 31 | 32 | ### Scalability 33 | 34 | Currently, the `cloud-config-downloader` adds a random wait time before restarting the `kubelet` in case the `kubelet` was updated or a configuration change was made to it. This is required to reduce the load on the API server and the traffic on the internet uplink. It also reduces the overall downtime of the services in the cluster because every `kubelet` restart puts a node into `NotReady` state for several seconds, which eventually interrupts service availability. 35 | 36 | ~~~ 37 | TODO: The `gardener-node-agent` could do this in a much more intelligent way because it watches the `node` object. The gardenlet could add an annotation which tells the `gardener-node-agent` to wait before restarting the kubelet in a coordinated manner. The coordination could happen in chunks of nodes, waiting for each chunk to finish before starting the next one. An equal time spread is also possible. 
38 | ~~~ 39 | 40 | The decision was made to keep the existing jitter mechanism which calculates the kubelet-download-and-restart-delay-seconds on the controller itself. 41 | 42 | ### Correctness 43 | 44 | The configuration of the `cloud-config-downloader` is currently done by placing a file for every configuration item on the disk of the worker node. This was done because parsing the content of a single file and using it as a value in `bash` reduces to something like `VALUE=$(cat /the/path/to/the/file)`. Simple, but it lacks validation, type safety and whatnot. 45 | With the `gardener-node-agent` we introduce a new API which is stored in the `gardener-node-agent` `secret` and persisted on disk in a single YAML file for comparison with the previously known state. This brings all the benefits of type-safe configuration. 46 | Because the actual and previous configurations are compared, files and units that were removed from the `OSC` are also removed and stopped on the worker. 47 | 48 | ### Availability 49 | 50 | Previously the `cloud-config-downloader` simply restarted the `systemd-units` on every change to the `OSC`, regardless of which of the services changed. The `gardener-node-agent` first checks which systemd-unit was changed and only restarts those. This removes unneeded `kubelet` restarts. 51 | 52 | ## Pull Requests 53 | 54 | In order to bring this work into `master` of Gardener as fast and as smoothly as possible, we need to split the work into smaller pieces which are easy to review and gradually introduce the new feature. 55 | 56 | We propose the following sets of pull requests: 57 | 58 | - [ ] [Make Decoder aware of plaintext encoding](https://github.com/gardener/gardener/pull/7993) — this was found to be missing during the implementation of the gardener-node-agent. 59 | - [ ] Introduce `gardener-node-agent` and `gardener-node-init` with the required API, push a container image with the binary inside to the registry, do not enable their execution (@majst01, @vknabel first Ginkgo tests) 60 | - [ ] Put the compiled `OSC` into the secret which is downloaded by the worker bootstrap process, no consumer yet (@Gerrit91) 61 | - [ ] Enable downloading of the `gardener-node-init` behind a feature-gate, check whether the OS extensions manipulate the cloud-config-downloader 62 | If they do some manipulation regarding cloud-config-downloader, try to remove it or enable gardener-node-init instead 63 | sample: https://github.com/gardener/gardener-extension-os-gardenlinux/pull/41 64 | - [ ] Disable cloud-config-downloader, cloud-config-executor 65 | - [ ] Remove cloud-config-downloader, cloud-config-executor 66 | 67 | Next Steps: 68 | 69 | - [ ] Create an umbrella issue with this content (@majst01) 70 | - [ ] Write a documentation PR 71 | - [ ] Figure out how difficult it would be to add an extra build job to publish gardener-node-agent and potentially kubelet as an OCI image, organize a short meeting with @Christian Cwienk, @rfranzke 72 | - [ ] Future Task: Implement configuration immutability by adding a suffix to the config file and the reference to the config file in the systemd-unit, e.g. for the 
kubelet; as long as gardener-node-agent is beta, restart the kubelet on every OSC change 73 | - [ ] Review usage of configuration file read/write in every controller and document the reasoning 74 | - [ ] Silence all TODOs 75 | 76 | 77 | ## Contributors 78 | 79 | - Tim Ebert 80 | - Valentin Knabel 81 | - Gerrit Schwerthelm 82 | - Robin Schneider 83 | - Maximilian Geberl 84 | - Stefan Majer -------------------------------------------------------------------------------- /2023-05_Leverkusen/masterful-shoot/MasterfulShoot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gardener-community/hackathon/385653c42ca25a25256678418cda2b66f10561be/2023-05_Leverkusen/masterful-shoot/MasterfulShoot.png -------------------------------------------------------------------------------- /2023-05_Leverkusen/masterful-shoot/summary.md: -------------------------------------------------------------------------------- 1 | # An Evaluation of Masterful Shoots 2 | 3 | ## General concept 4 | 5 | For a Gardener installation, the setup of an initial Kubernetes cluster is crucial, as Gardener runs on Kubernetes itself. Thus, it was decided to evaluate for the hackathon in May 2023 how we could potentially set up and maintain "initial clusters in Gardener style", a.k.a. masterful shoots. The general idea was to create a Kubernetes control plane via static `Pod`s on machines in an existing `Shoot` cluster and migrate the control plane from its `Seed` cluster to the static control plane, i.e. the `Shoot` becomes its own `Seed`. For this purpose, the existing code in gardener/gardener was modified so that not only the `Shoot`'s control plane on the `Seed` was built, but the static `Pod` manifests were also written to a script (stored in a `ConfigMap`) which should create the static control plane when executed on the machine. In the final design, this script was expected to be executed during the bootstrap phase of machines with special taints, which would allow for the specification of a worker group in a `Shoot` spec that hosts the control plane of the `Shoot` itself. However, this design idea is still in the conceptual phase and was not implemented during the hackathon. Also, the migration of the control plane components from the `Seed` cluster to the static `Pod`s is not implemented, as we faced several issues during evaluation. 6 | 7 | ## Control plane migration from `Seed` to static `Pod`s 8 | 9 | Once the `kube-apiserver` and `etcd` are started as static `Pod`s, the `kube-controller-manager` and `kube-scheduler` need to be connected to the `kube-apiserver`. Consequently, a valid kubeconfig for the `kube-controller-manager` and `kube-scheduler` is required. However, the kubeconfig which is used on the `Seed` cluster for this purpose cannot simply be reused in the static `Pod` control plane, as it is based on a `ServiceAccount` which does not exist in the empty `etcd` which was deployed via the static `Pod`. Creating the exact same `ServiceAccount` is not straightforward, since the corresponding `token` contains a UID for the `ServiceAccount` which cannot be set during creation. To circumvent this issue, we tried to back up the `etcd` in the `Seed` cluster and restore it in the static control plane. However, a full backup also brought all deployments etc. to the static control plane, which were not functional for various reasons. 
10 | Consequently, we decided to fetch a kubeconfig for the static `kube-apiserver` with a user from the `system:masters` group and used it for the connectivity setup of the `kube-controller-manager` and `kube-scheduler`. Of course, this was a last-resort hack to keep going in the evaluation phase. 11 | With the running control plane components, we deployed `kube-proxy`, `calico` and `coredns` into the empty control plane. The manifests for these deployments were obtained from the `ManagedResource` `Secrets` which were still available in the initial `Seed` cluster. 12 | At this stage, it is possible to run workloads on the `Shoot` cluster via interaction with the static `Pod` control plane. 13 | 14 | ## Outlook 15 | 16 | Even though we put a lot of effort into the investigation of the control plane migration, we were not able to complete the task, as the `Shoot` with a static `Pod` control plane will not really behave like a `Shoot` whose control plane is hosted on a `Seed`. The reasons are manifold. Most notably, we were not able to design a concept for making the `Shoot` its own `Seed`. Consequently, we cannot really show a proof of concept, as it is still unclear what the final conceptual design looks like. Therefore, any ideas regarding the general architecture are highly welcome 🙂. 17 | 18 | ## Resources 19 | 20 | See [the initial-cluster-poc branch on Martin Weindel's Gardener fork](https://github.com/MartinWeindel/gardener/tree/initial-cluster-poc) for the very early-stage ideas of this work item. 21 | -------------------------------------------------------------------------------- /2023-11_Schelklingen/README.md: -------------------------------------------------------------------------------- 1 | # Hack The Garden 11/2023 Wrap Up 2 | 3 | ## 🏛️ ARM Support For OpenStack Extension 4 | 5 | **Problem Statement:** Today, the [OpenStack extension](https://github.com/gardener/gardener-extension-provider-openstack) does not support shoot clusters with nodes based on the ARM architecture. However, some OpenStack installations support virtual machines with ARM, so Gardener should be able to provision them (similar to how it's already possible on AWS, for example). 6 | 7 | **Motivation/Benefits**: ✨ Support more use cases/scenarios. 8 | 9 | **Achievements:** To support multiple architectures, `provider-openstack` required enhancement for mapping machine images to the respective architecture (e.g., x86 nodes get a different OS image than ARM nodes). This capability has been implemented and validated successfully on STACKIT's OpenStack installation (see the sketch after this section for how architecture-aware mappings surface in the `CloudProfile`). 10 | 11 | **Next Steps:** Review and merge the PR. 12 | 13 | **Code**: [gardener/gardener-extension-provider-openstack#690](https://github.com/gardener/gardener-extension-provider-openstack/pull/690) 14 | 15 |
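A minimal sketch of how architecture-aware image mappings surface in the generic `core.gardener.cloud` `CloudProfile` API is shown below; the concrete mapping to OpenStack image IDs lives in the provider-specific `providerConfig` and may be shaped differently in the merged PR. Names, versions, and flavors here are assumptions, and real entries carry more fields (CPU, memory, usability flags):

```yaml
apiVersion: core.gardener.cloud/v1beta1
kind: CloudProfile
metadata:
  name: openstack-example            # hypothetical profile
spec:
  type: openstack
  machineImages:
  - name: gardenlinux
    versions:
    - version: 934.8.0               # illustrative version
      architectures: [amd64, arm64]  # offered for both CPU architectures
  machineTypes:
  - name: g_c2_m4                    # x86 flavor
    architecture: amd64
    cpu: "2"
    memory: 4Gi
  - name: g_c2_m4_arm                # hypothetical ARM flavor name
    architecture: arm64
    cpu: "2"
    memory: 4Gi
```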
16 | 17 | ## 🛡️ Make [ACL Extension](https://github.com/stackitcloud/gardener-extension-acl) Production-Ready 18 | 19 | **Problem Statement:** The [ACL extension](https://github.com/stackitcloud/gardener-extension-acl) for restricting shoot cluster API server access via IP allow-lists only had support for the OpenStack infrastructure and single `istio-ingressgateway`s (i.e., it did neither support HA control planes nor the `ExposureClass` feature of Gardener). 20 | 21 | **Motivation/Benefits**: 🛡️ Increased cluster security, ✨ support more use cases/scenarios. 22 | 23 | **Achievements:** The correct `istio-ingressgateway` is now identified which supports HA control planes and `ExposureClass`es. To allow for generic cloud provider support, the idea is to extend the `extensions.gardener.cloud/v1alpha1.InfrastructureStatus` API with a new field for publishing the egress IPs of the shoot cluster used for communicating with the control plane. Then, the ACL extension can just use this field and no longer needs a big `switch` statement with special code for the various cloud providers. All other provider extensions (AWS, Azure, etc.) have been adapted to publish their IPs to the new fields. However, there are some issues with GCP with dynamic IP allocation for NAT gateways (worker node scale-up might add new IPs - mitigation is to force users to use pre-allocated IPs when they want to use the ACL extension). Azure also has issues when no NAT gateways are used (random egress IPs are used in the case - mitigation is to force users to configure NAT gatways when they want to use the ACL extension). 24 | 25 | **Next Steps:** Finish, review and merge the open PRs. Open a proposal issue for extending the `extensions.gardener.cloud/v1alpha1.InfrastructureStatus` API. Verify extension with different cloud providers. Consider how to add validation for preventing misconfigurations. 26 | 27 | **Code**: https://github.com/kon-angelo/gardener/tree/hackathon-acl, https://github.com/stackitcloud/gardener-extension-acl/tree/feature/SupportForMultipleIstioNamespaces, [stackitcloud/gardener-extension-acl#28](https://github.com/stackitcloud/gardener-extension-acl/pull/28), [stackitcloud/gardener-extension-acl#31](https://github.com/stackitcloud/gardener-extension-acl/pull/31), [stackitcloud/gardener-extension-acl#35](https://github.com/stackitcloud/gardener-extension-acl/pull/35) 28 | 29 |
30 | 31 | ## 🕵️ Continuation Of `gardener-node-agent` 32 | 33 | **Problem Statement:** `cloud-config-downloader` is an ever-growing shell script running on all worker nodes of all shoot clusters. It is templated via Golang and has a high complexity and development burden. It runs every `60s` and checks whether new systemd units or files have to be downloaded. There are several scalability drawbacks due to less flexibility with shell scripts compared to a controller-based implementation, for example unnecessary restarts of systemd units (e.g., `kubelet`) just because the shell script changed (which often results in short interrupts/hick-ups for end-users). 34 | 35 | **Motivation/Benefits**: 💰 Reduction of costs due to less traffic, 📈 better scalability due to less API calls, ⛔️ prevented interrupts/hick-ups due to undesired `kubelet` restarts, 👨🏼‍💻 improved developer productivity, 🔧 reduced operation complexity. 36 | 37 | **Achievements:** In the [previous Hackathon](https://github.com/gardener-community/hackathon/tree/main/2023-05_Leverkusen#%EF%B8%8F-introduction-of-gardener-node-agent), `gardener-node-agent` has been introduced. Thus far, we hadn't managed to put it into use since a few features were still missing. In this Hackathon, we continued working on those, e.g., by letting it write `Lease` objects for health/liveness checks (similar to how the `kubelet` reports its liveness via `Lease` objects in the `kube-node-leases` namespace). Additionally, two remaining large shell scripts for `{kubelet,containerd}-monitor` units have been refactored to Golang controllers. Furthermore, we had [a discussion](https://github.com/gardener-community/hackathon/pull/3) about making `kubelet` restarts across the entire fleet of nodes more robust. 38 | 39 | **Next Steps:** Merge the open PRs, integrate `gardener-node-agent` into `gardenlet` (behind a feature gate), and start rolling it out gradually. 40 | 41 | **Issue:** [gardener/gardener#8023](https://github.com/gardener/gardener/issues/8023) 42 | 43 | **Code**: [gardener/gardener#8767](https://github.com/gardener/gardener/pull/8767), [gardener/gardener#8786](https://github.com/gardener/gardener/pull/8786) 44 | 45 |
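To picture the heartbeat mechanism mentioned above: each node agent would maintain a `Lease` analogous to the kubelet's node lease. The namespace and naming below are assumptions for illustration, not necessarily what the merged implementation uses:

```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: gardener-node-agent-shoot-worker-0   # assumed: one Lease per node, named after it
  namespace: kube-system
spec:
  holderIdentity: gardener-node-agent-shoot-worker-0
  leaseDurationSeconds: 40                   # a stale renewTime marks the agent as unhealthy
  renewTime: "2023-11-16T10:00:00.000000Z"   # refreshed by the agent on every heartbeat
```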
46 | 47 | ## 🧑🏼‍🌾 Deploy `gardenlet`s Through Custom Resource Via `gardener-operator` 48 | 49 | **Problem Statement:** `gardener-operator` can only deploy the Gardener control plane components (API server, controller-manager, etc.). `gardenlet`s must be deployed manually to target seed clusters (typically, via the Helm chart). When the `gardener-operator` can reach such seed clusters network-wise, it should be possible to make it easily deploy `gardenlet`s via a new `operator.gardener.cloud/v1alpha1.Gardenlet` custom resource. 50 | 51 | **Motivation/Benefits**: 👨🏼‍💻 Improved developer productivity, 🔧 reduced operation complexity. 52 | 53 | **Achievements:** A new `Gardenlet` CRD has been introduced and [`gardenlet'`s `ManagedSeed` controller](https://github.com/gardener/gardener/tree/master/pkg/gardenlet/controller/managedseed) code has been made reusable such that `gardener-operator` can also instantiate it. The local scenario with `kind` clusters is not yet fully working, but conceptually, the approach has been proven promising. 54 | 55 | **Next Steps:** Finish the implementation and the local scenario, add missing unit/integration tests and documentation. 56 | 57 | **Issue:** [gardener/gardener#8802](https://github.com/gardener/gardener/issues/8802) 58 | 59 | **Code**: https://github.com/metal-stack/gardener/tree/hackathon-operator-gardenlet 60 | 61 |
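Conceptually, such a resource could look like the sketch below. The API group and kind are taken from the summary above, while the field names are guessed by analogy to the `ManagedSeed`'s `gardenlet` section and will likely differ in the final implementation:

```yaml
apiVersion: operator.gardener.cloud/v1alpha1
kind: Gardenlet
metadata:
  name: gardenlet-aws-eu1          # hypothetical seed name
  namespace: garden
spec:
  deployment:
    replicaCount: 2
    image:
      tag: v1.84.0                 # hypothetical gardenlet version
  config:                          # inline GardenletConfiguration, as in ManagedSeed
    apiVersion: gardenlet.config.gardener.cloud/v1alpha1
    kind: GardenletConfiguration
    seedConfig:
      metadata:
        name: aws-eu1
```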
62 | 63 | ## 🦅 Shoot Control Plane Live Migration (Without Downtime) 64 | 65 | **Problem Statement:** Today, shoot control plane migrations cause a temporary downtime of the shoot cluster since ETCD must be backed-up and deleted before it can be restored in a new seed cluster. During this time frame, the API server is obviously destroyed as well. Workload in the shoot cluster itself continues to run, but cannot be reconciled/scaled/updated/... 66 | 67 | **Motivation/Benefits**: ⛔️ Prevented downtime during control plane migration, ✨ support more use cases/scenarios (think "seed draining/shoot evictions" or simplified seed cluster deletions). 68 | 69 | **Achievements:** The basic idea is to span new ETCD members in the destination seed cluster that join the existing ETCD cluster in the source seed. This way, a gradual move to the destination seed could be established without the necessity to terminate the ETCD cluster. The requirements for such scenario were analysed - changes to [`etcd-druid`](https://github.com/gardener/etcd-druid) and [`etcd-backup-restore`](https://github.com/gardener/etcd-backup-restore) would be required to allow joining new ETCD members into an existing cluster. Manual experiments were performed successfully (including adding multiple load balancers, new DNS records, and mutating the cluster-internal DNS resolution). Another `kube-apiserver` instance was successfully started on the destination seed, effectively spanning the control plane over multiple seeds. In addition, first steps for implementing the ideas in `gardenlet` were performed. 70 | 71 | **Next Steps:** Multiple `gardenlet`s (on source and target seed) need to coordinate migration, this is completely different from what we have today. Discuss the mechanism and implications of this. Also, ETCD experts should be involved for checking/refining the design. Questions, learnings, and ideas should be noted down as basis for further investigations and discussions after the Hackathon. It needs to be decided whether the topic is worth being pushed forward. 72 | 73 | **Repo**: For more details see our small [gardener-control-plane-live-migration](https://github.com/ScheererJ/gardener-control-plane-live-migration) notes repo. 74 | 75 |
76 | 77 | ## 🗄️ Stop Vendoring Third-Party Code In `vendor` Folder 78 | 79 | **Problem Statement:** The `vendor` folder in the root of Go modules contains a copy of all third-party code the module depends on. This blows up the repository and source code releases, makes reviewing pull requests harder because many different files are changed, and creates merge conflicts for many files when both `master` and a PR change dependencies. Committing the `vendor` folder to version control systems is discouraged with newer versions of Golang. 80 | 81 | **Motivation/Benefits**: 👨🏼‍💻 Improved developer productivity by removing clutter from PRs and speeding up `git` operations. 82 | 83 | **Achievements:** The custom `gomegacheck` linter was replaced with the upstream `ginkgolinter` part of `golangci-lint`. With the help of [`vgopath`](https://github.com/ironcore-dev/vgopath), the code and protobuf generation was adapted to run outside the `GOPATH` and without vendoring. The deprecated `go-to-protobuf` generator still requires the `vendor` directory (works with the `vgopath` approach). We don't intend to get rid of it for now as `kubernetes/kubernetes` is still using it as well. 84 | 85 | **Next Steps:** The failing tests in the PR have to be fixed. The https://github.com/gardener/gardener/tree/master/hack/tools/logcheck should be moved into a [dedicated repository](https://github.com/gardener/golangci-logcheck). Verification with one example extension is needed to prevent missing on side-effects to other systems (e.g., Concourse). 86 | 87 | **Code**: https://github.com/afritzler/gardener/tree/enh/remove-vendoring, [gardener/gardener#8769](https://github.com/gardener/gardener/pull/8769), [gardener/gardener#8775](https://github.com/gardener/gardener/pull/8775) 88 | 89 |
90 | 91 | ## 🔍 Generic Extension For Shoot Cluster Audit Logs 92 | 93 | **Problem Statement:** Audit logs of shoot clusters need to be managed outside of Gardener (no built-in/out-of-the-box solution available). Every community member has developed their own closed-source implementations of an audit log extension. Can we find a generic approach that works for everybody and allows for different backends? 94 | 95 | **Motivation/Benefits**: 👨‍👩‍👧‍👦 Harmonize various proprietary implementations, 📦 provide out-of-the-box solution for Gardener community. 96 | 97 | **Achievements:** [A new design](https://github.com/metal-stack/gardener-extension-audit/blob/main/architecture.drawio.svg) has been proposed for reworking the existing implementation (contributed by x-cellent) to be more reliable and reusable: A new `StatefulSet` is added to the shoot control plane that can receive the API server's audit logs via an audit webhook. The webhook backend's logs can be collected via `fluent-bit` and transported to a desired sink from there. The backend basically acts as an audit log buffer. The first steps for implementing the new design were finished and collecting the audit logs in the buffer works. 98 | 99 | **Next Steps:** Implement the missing parts of the design, and move the repository to the [`gardener-community`](https://github.com/gardener-community) GitHub organization. 100 | 101 | **Code**: https://github.com/metal-stack/gardener-extension-audit 102 | 103 |
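For context on the design described above: the `kube-apiserver` decides which events reach such a webhook backend via an audit policy, and it is pointed at the backend through its `--audit-policy-file` and `--audit-webhook-config-file` flags. A minimal, purely illustrative policy (a production policy would be far more selective) looks like this; the buffer `StatefulSet` and `fluent-bit` sink from the proposed design are not shown:

```yaml
# Record request metadata for everything sent to the audit webhook backend.
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
```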
104 | 105 | ## 🚛 Rework [Shoot Flux Extension](https://github.com/stackitcloud/gardener-extension-shoot-flux) 106 | 107 | **Problem Statement:** The flux extension is currently unmaintained and has hacky access to the garden cluster (using `gardenlet`'s kubeconfig secret). In addition, users cannot configure the flux installation (e.g., version) and bootstrap configuration (e.g., which `git` repository to connect to) per shoot, but only once per garden project. The extension generates an SSH key secret per shoot which needs to be added as a deploy key to the repository afterwards. This is all quite unintuitive and outdated. 108 | 109 | **Motivation/Benefits**: 📦 Simple delivery of software to shoot clusters, ✨ support more use cases/scenarios. 110 | 111 | **Achievements:** The overall goal should be to hook a shoot into your fleet management with a single apply: Configure everything in the `Shoot` spec once (see the sketch below) and deploy everything using pure GitOps afterwards. The extension has been brought to the current state of the art (upgraded tools/dependencies, introduced linter and skaffold-based dev setup). The `providerConfig` API was designed and implemented to replace the old project-wide `ConfigMap` approach. The controller has been reworked to deploy and bootstrap flux according to the `providerConfig`. With this, SSH keys/credentials for accessing `GitRepository`s are now properly handled. Also, the extension no longer needs access to the garden cluster. 112 | 113 | **Next Steps:** Implement missing unit tests, perform final cleanups of go packages, charts, and entrypoint. Add documentation and test control plane migration. Implement reconciliation of the flux installation and resources in addition to the initial one-time bootstrap. Move the extension to the [`gardener-community`](https://github.com/gardener-community) GitHub organization and switch from GitHub Actions to Gardener's Prow. 114 | 115 | **Code**: https://github.com/stackitcloud/gardener-extension-shoot-flux 116 | 117 |
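The envisioned "single apply" could then look roughly like the following `Shoot` excerpt. The extension type name and all `providerConfig` fields are assumptions for illustration, not the final API of the reworked extension:

```yaml
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: fleet-member
spec:
  extensions:
  - type: shoot-flux                                       # assumed extension type name
    providerConfig:
      apiVersion: flux.extensions.gardener.cloud/v1alpha1  # hypothetical group/version
      kind: FluxConfig
      fluxVersion: v2.1.2                                  # which flux release to install
      gitRepository:
        url: ssh://git@github.com/my-org/fleet.git         # repository to bootstrap from
        branch: main                                       # credentials handled by the extension
```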
118 | 119 | ## 🤖 Auto-Update Skaffold Dependencies 120 | 121 | **Problem Statement:** Today, `make check` fails when the dependencies in the `skaffold{-operator}.yaml` file are outdated. As human, the list has to be maintained manually, i.e., new dependencies have to be added and old dependencies have to be removed manually. 122 | 123 | **Motivation/Benefits**: 👨🏼‍💻 Improved developer productivity. 124 | 125 | **Achievements:** A new script `hack/update-skaffold-deps.sh` has been added for automatically updating Skaffold dependencies for the binaries. It can be run in case `make check` reports failures. 126 | 127 | **Code**: [gardener/gardener#8766](https://github.com/gardener/gardener/pull/8766) 128 | 129 |
130 | 131 | ## 🪟 Discussion: Air-Gapped Shoot Clusters 132 | 133 | **Problem Statement:** Customers are interested in limiting internet access for shoot clusters. Today, the clusters need access to the internet by default for bootstrapping nodes, pulling images from the container registry, resolving DNS names, etc. 134 | 135 | **Motivation/Benefits**: 🛡️ Increased cluster security, ✨ support more use cases/scenarios. 136 | 137 | **Achievements:** Ideas and the underlying problems were discussed. The conclusion is that it would be possible to restrict internet access by hosting a container registry with the required images, and by either running the seed cluster in a restricted environment or by explicitly allow-listing the access to the seed networks. Overall, the concrete goals of the interested customers are not perfectly clear yet. 138 | 139 | **Next Steps:** Note down all the thoughts and collect feedback from the discussion with customers. 140 | -------------------------------------------------------------------------------- /2023-11_Schelklingen/discussions/gardener_node_agent_rolling_update.md: -------------------------------------------------------------------------------- 1 | # Would it make sense to have a rolling update of the gardener-node-agent? 2 | 3 | The point of this discussion was to share opinions on how we can improve on certain downsides of the currently implemented "update jitter" for applying the gardener-node-agent configuration. 4 | 5 | ## Summary 6 | 7 | Downsides of the jitter approach are: 8 | 9 | - Kubelet restarts can happen on several nodes at the same time 10 | - The jitter periods also do not seem to be distributed evenly (as shown in a GitHub comment) 11 | - The node becomes `NotReady` on kubelet restart, leading to a brief traffic interruption for this specific node (e.g., on metal-stack, MetalLB withdraws the route announcements to this node) 12 | - Restarting kubelets in parallel puts pressure on the API server 13 | - When an update goes wrong, there is no way to stop the update process, so that theoretically a node meltdown can occur (e.g., when a kubelet restart on a node leads to a node freeze for whatever reason, the updates on the other nodes will be carried out anyway) 14 | 15 | Together we decided that the common problem we want to tackle first is that updates can occur in parallel. For this, the simpler approaches presented in the following section can serve as a source of inspiration. 16 | 17 | A more complete solution that can also stop updates may come into play once the gardener-node-agent is in place. For now, we want to keep the complexity as low as possible. 18 | 19 | We also agreed that we should collect metrics on kubelet/systemd unit restarts on the nodes, as this allows us to better judge our proposed solutions. 20 | 21 | ## Approaches 22 | 23 | ### Lease Solution (Serialized Updates instead of jitter) 24 | 25 | The GNA acquires a lease resource and can then apply its configuration changes. This effectively leads to serialized execution of the updates.
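A minimal sketch of how such lease-based serialization could look with client-go's leader election helpers; the lease name/namespace and the `applyConfigurationChanges` function are made up for illustration and do not reflect actual gardener-node-agent code.

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// One Lease shared by all gardener-node-agents of a worker group: only the
	// current holder applies its configuration changes; all others wait.
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{Name: "gna-config-update", Namespace: "kube-system"}, // hypothetical name
		Client:    client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			Identity: os.Getenv("NODE_NAME"),
		},
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   30 * time.Second,
		RenewDeadline:   20 * time.Second,
		RetryPeriod:     5 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				applyConfigurationChanges() // placeholder for reconciling files/units and restarting the kubelet
				cancel()                    // release the lease so the next node can proceed
			},
			OnStoppedLeading: func() {},
		},
	})
}

func applyConfigurationChanges() {
	// Placeholder: the real gardener-node-agent would apply the OSC here.
	time.Sleep(10 * time.Second)
}
```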
26 | 27 | - (+) Small implementation burden 28 | - (-) Cannot stop the rollout if something goes wrong 29 | 30 | ### Smarter Delay Algorithm 31 | 32 | - (+) Does not introduce any additional resources for GNA; only GNA has to implement an algorithm 33 | - (-) Cannot stop the rollout if something goes wrong 34 | 35 | ### Sophisticated Idea: Orchestrated update through gardener-node-agent-controller 36 | 37 | - (+) Can stop the update process (protects against node meltdown scenarios) 38 | - (+) Good observability 39 | - (+) Set health can contribute to the overall shoot state 40 | - (-) Complexity 41 | - (-) Bigger architectural change 42 | - (o) If an update is stopped, updates are no longer distributed to other nodes until the original problem gets fixed 43 | - (o) Certain similarities to Kubernetes `*Set` resources and `Pod`s 44 | 45 | ![Rough Sketch](rolling_update_proposal.drawio.svg) 46 | -------------------------------------------------------------------------------- /2023-11_Schelklingen/discussions/rolling_update_proposal.drawio.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
(The SVG markup of `rolling_update_proposal.drawio.svg` is not readable in this dump; only its text labels survive. The diagram shows a seed cluster with the shoot namespace, in which `gardenlet` deploys `machine-controller-manager` and a proposed `gardener-node-agent-controller`. The controller watches the `Worker`, a GNA CRD, and the OSC, deploys and orchestrates a `gardener-node-agent-set` (containing the GNA template and tracking ready/progressing GNAs), and manages per-node `gardener-node-agent` CRs (`gardener-node-agent-node-1/2/3`) referencing the workergroup OSC secrets. On each shoot node (`node-1/2/3`), a `gardener-node-agent` reconciles `systemd` and thereby the `kubelet` and `containerd` units. The embedded "Rolling Update Algorithm" note reads: evaluate the current status (healthy, progressing, and unhealthy GNAs); if there are unhealthy GNAs, do nothing except update the set status; if there are progressing GNAs, do nothing; if all are healthy, determine via the template checksum which GNAs already have the latest config applied; if all configs are the latest, done; otherwise, update the config of outdated GNAs according to a max-unavailable/max-surge strategy.)
913 | -------------------------------------------------------------------------------- /2023-11_Schelklingen/gardener-operator-deploys-gardenlet/gardenlet-via-gardener-operator.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gardener-community/hackathon/385653c42ca25a25256678418cda2b66f10561be/2023-11_Schelklingen/gardener-operator-deploys-gardenlet/gardenlet-via-gardener-operator.png -------------------------------------------------------------------------------- /2024-05_Schelklingen/README.md: -------------------------------------------------------------------------------- 1 | # Hack The Garden 05/2024 Wrap Up 2 | 3 | ## 🗃️ OCI Helm Release Reference For `ControllerDeployment`s 4 | 5 | **Problem Statement:** Today, `ControllerDeployment`s contain the base64-encoded, gzip'ed, tar'ed raw Helm chart inside [their specification](https://github.com/gardener/gardener/blob/master/example/25-controllerdeployment.yaml#L11). Such an API blows up the backing ETCD unnecessarily, and it is error-prone/cumbersome to maintain. 6 | 7 | **Motivation/Benefits**: 🔧 Reduced operational complexity, 🔄 enabled reusability for other scenarios. 8 | 9 | **Achievements:** The `core.gardener.cloud/v1` API has been introduced for the `ControllerDeployment` which provides a more mature API and also supports OCI repository-based Helm chart references. It is also more extensible (hence, it could also support other deployment options in the future, e.g., `kustomize`). Based on this foundation, it is now possible to specify the URL of an OCI repository containing the Helm chart. `gardenlet`'s `ControllerInstallation` controller fetches the Helm chart from there and installs it as usual. 10 | 11 | **Next Steps:** Currently, we don't cache the downloaded OCI Helm charts (i.e., in every reconciliation, we pull them again). This might need to get optimized to keep network traffic under control. Unit tests have to be written for the OCI registry puller, and the PR has to be opened. 12 | 13 | **Issue:** https://github.com/gardener/gardener/issues/9773 14 | 15 | **Code/Pull Requests**: https://github.com/stackitcloud/gardener/tree/controllerdeployment, https://github.com/gardener/gardener/pull/9771 16 | 17 |
18 | 19 | ## 👨🏼‍💻 `gardener-operator` Local Development Setup With `gardenlet`s 20 | 21 | **Problem Statement:** Today, there are two development setups for Gardener: The first, most common one is based on Gardener's [control plane Helm chart](https://github.com/gardener/gardener/tree/master/charts/gardener/controlplane) and other [custom manifests](https://github.com/gardener/gardener/tree/master/example/gardener-local/etcd) to bring up Gardener and a seed cluster. The second one uses [`gardener-operator`](https://github.com/gardener/gardener/blob/master/docs/concepts/operator.md) and the [`operator.gardener.cloud/v1alpha1.Garden`](https://github.com/gardener/gardener/blob/master/example/operator/20-garden.yaml) resource to bring up a full-fledged garden cluster. However, this setup does not bring up a seed cluster, hence creating a `Shoot` is not possible. Generally, it would be better if we could harmonize the scenarios such that we only have to maintain one solution. 22 | 23 | **Motivation/Benefits**: 👨🏼‍💻 Improved developer productivity, 🧪 increased output qualification. 24 | 25 | **Achievements:** A new Skaffold configuration has been created which deploys `gardener-operator` and its `Garden` CRD, and later the `gardenlet` which registers the garden cluster as a seed cluster. This also includes the basic Gardener configuration (`CloudProfile`s, `Project`s, etc.) and the registration of the `provider-local` extension. With this setup, it is now possible to create `Shoot`s as well. 26 | 27 | **Next Steps:** Optimizing the readiness probes to speed up the reconciliation times should be considered. 28 | 29 | **Code/Pull Requests**: https://github.com/gardener/gardener/pull/9763 30 | 31 |
32 | 33 | ## 👨🏻‍🌾 Extensions For Garden Cluster Via `gardener-operator` 34 | 35 | **Problem Statement:** A Gardener installation usually needs additional and tedious preparation tasks to be done outside "the Kubernetes world", e.g. creating storage buckets for ETCD backups or managing DNS records for the API server and the ingress controller. All of those could actually be automated via `gardener-operator` to reduce operational complexity. These tasks even overlap with requirements that have already been implemented for shoot clusters by Gardener extensions, i.e., the code is already available and could be reused. 36 | 37 | **Motivation/Benefits**: 🔧 Reduced operational complexity, ✨ support more use cases/scenarios, 📦 provide out-of-the-box solution for Gardener community. 38 | 39 | **Achievements:** The `Garden` controller has been augmented to deploy `extensions.gardener.cloud/v1alpha1.{BackupBucket,DNSRecord}` resources as part of its reconciliation flow. A new `operator.gardener.cloud/v1alpha1.Extension` CRD has been introduced to register extensions on `gardener-operator` level (the specification is similar to `Controller{Registration,Deployment}`s). Several new controllers have been added to reconcile the new CRDs - the concepts are very similar to what already happens for extensions in the seed cluster and the related existing code in `gardenlet`. In case the garden cluster is a seed cluster at the same time, multiple instances of the same extension are needed. This requires that we prevent simultaneous reconciliations of the same extension object by different extension controllers. For this purpose, a `class` field has been added to the extension APIs, and extensions can be configured accordingly to restrict their watches to only objects of a specific `class`. 40 | 41 | **Next Steps:** Deployment of extension admission components is still missing. Also, validation of the `operator.gardener.cloud/v1alpha1.Extension` as well as tests and documentation are missing. In the future, all `BackupBucket` and `DNSRecord` extension controllers must be adapted such that they support the scenario of running in the garden cluster. 42 | 43 | **Issue:** https://github.com/gardener/gardener/issues/9635 44 | 45 | **Code/Pull Requests**: https://github.com/metal-stack/gardener/commits/hackathon-gardener-operator-extensions/ 46 | 47 |
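To give an idea of the `class` concept described above, an extension controller could filter its watch with a controller-runtime predicate roughly like the following; the accessor for the `class` value (here a label key) is hypothetical and only serves as illustration.

```go
package extension

import (
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// HasClass returns a predicate that lets a controller only react to extension
// objects of the given class (e.g. "garden" vs. "seed"), so that two instances
// of the same extension running side by side do not reconcile each other's objects.
func HasClass(class string) predicate.Predicate {
	return predicate.NewPredicateFuncs(func(obj client.Object) bool {
		return classOf(obj) == class
	})
}

func classOf(obj client.Object) string {
	// Illustrative only: a real implementation would type-assert to the
	// extensions.gardener.cloud object and read its class field; the label
	// key used here is made up.
	return obj.GetLabels()["extensions.gardener.cloud/class"]
}
```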
48 | 49 | ## 🪄 Gardenlet Self-Upgrades For Unmanaged `Seed`s 50 | 51 | **Problem Statement:** In order to keep `gardenlet`s in unmanaged seeds up-to-date (i.e., in seeds which are not shoot clusters, like the "root cluster"), the `gardenlet` Helm chart must be regularly deployed to them. This requires network connectivity to such clusters, which can be challenging if they reside behind a firewall or in restricted environments. Similar challenges might arise for the to-be-developed [autonomous shoot clusters](https://github.com/gardener/gardener/issues/2906) (see also the [topic summary](https://github.com/gardener-community/hackathon/blob/main/2023-05_Leverkusen/README.md#-bootstrapping-masterful-clusters-aka-autonomous-shoots) from one of the previous Hackathons). It would be much simpler if `gardenlet` could keep itself up-to-date, based on configuration read from the garden cluster. 52 | 53 | **Motivation/Benefits**: 🔧 Reduced operational complexity, ✨ support more use cases/scenarios. 54 | 55 | **Achievements:** A new `seedmanagement.gardener.cloud/v1alpha1.Gardenlet` resource has been introduced whose specification is similar to the `seedmanagement.gardener.cloud/v1alpha1.ManagedSeed` resource. It allows specifying deployment values (replica count, resource requests, etc.) as well as the `gardenlet`'s component configuration (feature gates, seed spec, etc.). In addition, the `Gardenlet` object must contain a URL to an OCI registry storing `gardenlet`'s Helm chart. A new controller within `gardenlet` watches such resources, and if needed, downloads the Helm chart and applies it with the provided configuration to its own cluster. 56 | 57 | **Next Steps:** Write unit tests and documentation (including a guide that can be followed when a `gardenlet` needs to be deployed to a new unmanaged soil/seed cluster). Open the pull request. 58 | 59 | **Code/Pull Requests**: https://github.com/metal-stack/gardener/commits/hackathon-gardenlet-self-upgrade/ 60 | 61 |
62 | 63 | ## 🦺 Type-Safe Configurability In `OperatingSystemConfig` For `containerd`, DNS, NTP, etc. 64 | 65 | **Problem Statement:** Some Gardener extensions have to manipulate the `OperatingSystemConfig` for shoot worker nodes by changing `containerd` configuration, DNS or NTP servers, or similar. Currently, the API does not support first-class fields for such operations. Providing Bash scripts that apply the respective configs as part of `systemd` units is the only option. This makes the development process rather tedious and error-prone. 66 | 67 | **Motivation/Benefits**: 👨🏼‍💻 Improved developer productivity. 68 | 69 | **Achievements:** The `OperatingSystemConfig` API has been augmented to support the `containerd`-config related use-cases existing today. `gardener-node-agent` has been adapted to evaluate the new fields and apply the respective configs. Managing DNS and/or NTP servers probably needs to be handled by the OS extensions directly so that this information is already available during machine bootstrapping (i.e., before `gardener-node-agent` starts). The `os-gardenlinux`, `runtime-gvisor`, and `registry-cache` extensions have been adapted to the newly introduced `containerd`-config API. This allowed deleting a large portion of Bash scripts and related `systemd` units or `DaemonSet`s manipulating the host file system. Note that tackling this entire topic only became possible because we have developed [`gardener-node-agent`](https://github.com/gardener/gardener/blob/master/docs/concepts/node-agent.md), [an achievement from one of the previous Hackathons](https://github.com/gardener-community/hackathon/blob/main/2023-05_Leverkusen/README.md#%EF%B8%8F-introduction-of-gardener-node-agent). 70 | 71 | **Next Steps:** The API adaptations in `gardener/gardener` have to be finalized first. This includes adding unit tests, augmenting the extension developer documentation, and polishing the code. Once merged and released, the work in the extensions can continue. The DNS/NTP server configuration requirements need to be discussed separately since the concept above does not fit well. 72 | 73 | **Issue:** https://github.com/gardener/gardener/issues/8929 74 | 75 | **Code/Pull Requests**: https://github.com/metal-stack/gardener/tree/enh.osc-api, https://github.com/gardener/gardener-extension-os-gardenlinux/pull/169, https://github.com/Gerrit91/gardener-extension-runtime-gvisor/tree/hackathon-improved-osc-api, https://github.com/timuthy/gardener-extension-registry-cache/tree/enh.osc-registry-poc 76 | 77 |
78 | 79 | ## 👮 Expose Shoot API Server In [Tailscale](https://tailscale.com/) VPN 80 | 81 | **Problem Statement:** The most common ways to secure a shoot cluster are to [apply ACLs](https://github.com/stackitcloud/gardener-extension-acl) or to use an `ExposureClass` which exposes the API server only within a corporate network. However, managing the ACL configuration can become difficult with a growing number of participants (needed IP addresses), especially in a dynamic environment and work-from-home scenarios. `ExposureClass`es might not be possible because no corporate network is available. A Tailscale-based VPN, however, is a scalable and manageable alternative. 82 | 83 | **Motivation/Benefits**: 🛡️ Increased cluster security, ✨ support more use cases/scenarios. 84 | 85 | **Achievements:** A document has been compiled which explains how a shoot owner can expose their API server within a Tailscale VPN. Writing an extension or any code does not make sense for this topic. For each Tailnet, only one API server can be exposed. 86 | 87 | **Next Steps:** The documentation shall be published on https://gardener.cloud and submitted to the Tailscale newsletter (they are calling for content/success stories). 88 | 89 | **Code/Pull Requests**: https://gardener.cloud/docs/guides/administer-shoots/tailscale/ 90 | 91 |
92 | 93 | ## ⌨️ Rewrite [gardener/vpn2](https://github.com/gardener/vpn2) From Bash To Golang 94 | 95 | **Problem Statement:** Currently, the VPN components mostly consist of Bash scripts which are hard to maintain and easy to break (since they are untested). [Similar to how we have refactored other Bash scripts in Gardener to Golang](https://github.com/gardener-community/hackathon/blob/main/2023-05_Leverkusen/README.md#%EF%B8%8F-introduction-of-gardener-node-agent), we could increase developer productivity by eliminating the scripts in favor of testable Golang code. 96 | 97 | **Motivation/Benefits**: 👨🏼‍💻 Improved developer productivity, 🧪 increased output qualification. 98 | 99 | **Achievements:** All functionality has been rewritten in Golang, taking all integration scenarios with Gardener into account. The validation and tests in both the local and an OpenStack-based environment were successful. The pull requests (already including unit tests) have been opened for `gardener/vpn2` and the integration in `gardener/gardener`. 100 | 101 | **Next Steps:** The documentation needs to be adapted. Minor cosmetics and cleanups have to be performed. In the future, moving the new Golang code to `gardener/gardener` should be considered. This could enable Gardener e2e tests of the VPN-related code (currently, a new image/release of `gardener/vpn2` is a prerequisite). 102 | 103 | **Code/Pull Requests**: https://github.com/gardener/vpn2/pull/84, https://github.com/gardener/gardener/pull/9774 104 | 105 |
106 | 107 | ## 🕳️ Pure IPv6-Based VPN Tunnel 108 | 109 | **Problem Statement:** Today, the shoot pod, service and node network CIDRs may not overlap with the hard-coded VPN network CIDR. Hence, users are limited to certain ranges (even if they would like to have another network design). Instead of making the VPN network CIDR configurable on seed level, a proper solution is to switch the VPN tunnel to a pure IPv6-based network which can transport both IPv4 and IPv6 traffic. This is a more mature solution because it lifts the mentioned CIDR restriction for IPv4 scenarios. For IPv6, the restriction still exists; however, this is not a real concern since the address space is far less scarce than in the IPv4 world. 110 | 111 | **Motivation/Benefits**: 🏗️ Lift restrictions, ✨ support more use cases/scenarios. 112 | 113 | **Achievements:** VPN is configured to use IPv6 addresses only (even if the seed and shoot are IPv4-based). This lifts the above-mentioned restriction. 114 | 115 | **Next Steps:** After `vpn2` has been [refactored to Golang](#⌨%EF%B8%8F-Rewrite-gardenervpn2-From-Bash-To-Golang), adapt the changes in the draft PR linked below, and make adaptations to support the high-availability case. 116 | 117 | **Issue:** https://github.com/gardener/gardener/pull/9597#discussion_r1572358733 118 | 119 | **Code/Pull Requests**: https://github.com/gardener/vpn2/pull/83 120 | 121 |
122 | 123 | ## 👐 Harmonize Local VPN Setup With Real-World Scenario 124 | 125 | **Problem Statement:** In the local development setup, the VPN tunnel check performed by `gardenlet` (port-forward check) does not detect a broken VPN tunnel, because either `kube-apiserver` (HA clusters) or `vpn-seed-server` (non-HA clusters) route requests to the `kubelet` API directly via the seed's pod network. When the VPN connection is broken, `kubectl port-forward` and `kubectl logs` continue to work, while `kubectl top no` (`APIServices`, `Webhooks`, etc.) is broken. We should strive towards resolving this discrepancy between the local setup and real-world scenarios regarding the VPN connection to prevent bugs by validating the real setup in e2e tests. 126 | 127 | **Motivation/Benefits**: 🧪 Increased output qualification. 128 | 129 | **Achievements:** `provider-local` has been augmented to dynamically create Calico's `IPPool` resources. These are used for allocating IP addresses for the shoot worker pods according to the specified node CIDR in `.spec.networking.nodes`. This way, the VPN components are configured correctly to route traffic from control plane components to shoot kubelets via the tunnel. This aligns the local scenario with the real-world situation. 130 | 131 | **Next Steps:** Review and merge the opened pull request. 132 | 133 | **Issue:** https://github.com/gardener/gardener/issues/9604 134 | 135 | **Code/Pull Requests**: https://github.com/gardener/machine-controller-manager-provider-local/pull/42, https://github.com/gardener/gardener/pull/9752 136 | 137 |
138 | 139 | ## 🐝 Support Cilium `v1.15+` For HA `Shoot`s 140 | 141 | **Problem Statement:** Cilium `v1.15` does not consider [`StatefulSet` labels in `NetworkPolicy`s](https://docs.cilium.io/en/v1.15/operations/performance/scalability/identity-relevant-labels/). Unfortunately, Gardener uses `statefulset.kubernetes.io/pod-name` in `Service`s/`NetworkPolicy`s to address individual `vpn-seed-server` pods for highly-available `Shoot`s. Therefore, the VPN tunnel does not work for such `Shoot`s in case the seed cluster runs Cilium `v1.15` or higher. 142 | 143 | **Motivation/Benefits**: 🏗️ Lift restrictions, ✨ support more use cases/scenarios. 144 | 145 | **Achievements:** A prototype has been developed making the `Service`s for the `vpn-seed-server` [headless](https://kubernetes.io/docs/concepts/services-networking/service/#headless-services). Thereby, it is no longer required that the `StatefulSet` labels address individual `Pod`s. This simplifies the `NetworkPolicy`s and makes them work again with the mentioned Cilium release. 146 | 147 | **Next Steps:** Decide on the final implementation of the solution within the Gardener networking team (there are alternate possible solutions/ideas). 148 | 149 | **Code/Pull Requests**: https://github.com/ScheererJ/gardener/tree/enhancement/ha-vpn-headless 150 | 151 |
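The core of the prototype is turning the per-pod `Service` into a headless one, so that individual `vpn-seed-server` pods can be addressed via stable DNS names instead of `statefulset.kubernetes.io/pod-name` selectors. A minimal sketch of such a Service object in Go (names and port are illustrative, not the productive manifest):

```go
package vpn

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// headlessVPNService returns a headless Service for the vpn-seed-server
// StatefulSet. With ClusterIP "None", every pod gets a stable DNS name
// (<pod>.<service>.<namespace>.svc), so no pod-name label selector is needed.
func headlessVPNService(namespace string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "vpn-seed-server", Namespace: namespace},
		Spec: corev1.ServiceSpec{
			ClusterIP: corev1.ClusterIPNone,
			Selector:  map[string]string{"app": "vpn-seed-server"},
			Ports:     []corev1.ServicePort{{Name: "openvpn", Port: 1194}},
		},
	}
}
```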
152 | 153 | ## 🍞 Compression For `ManagedResource` `Secret`s 154 | 155 | **Problem Statement:** The Kubernetes resources applied to garden, seed, or shoot clusters are stored in raw format (YAML) in `Secret`s within the garden runtime or seed cluster. These `Secret`s quickly grow in size, leading to considerable load for the ETCD cluster as well as network I/O (= costs) for basically all controllers watching `Secret`s. 156 | 157 | **Motivation/Benefits**: 💰 Reduction of costs due to less traffic, 📈 better scalability due to less data volume. 158 | 159 | **Achievements:** By leveraging the [Brotli compression algorithm](https://de.wikipedia.org/wiki/Brotli), we have been able to reduce the size of all `Secret`s by roughly an order of magnitude. This should have a great impact on network I/O and related costs. 160 | 161 | **Next Steps:** Unit tests have to be written, and most unit tests for components deployed by `gardener/gardener` and extensions have to be adapted. The PR has to be opened. 162 | 163 | **Code/Pull Requests**: https://github.com/metal-stack/gardener/tree/hackathon-mr-compression 164 | 165 |
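For illustration, a small sketch of how Brotli compression/decompression of serialized manifests could look in Go, using the `github.com/andybalholm/brotli` package; the concrete key/annotation layout inside the `ManagedResource` `Secret`s is omitted here.

```go
package compression

import (
	"bytes"
	"io"

	"github.com/andybalholm/brotli"
)

// Compress compresses serialized Kubernetes manifests before they are stored
// in a ManagedResource Secret.
func Compress(manifests []byte) ([]byte, error) {
	var buf bytes.Buffer
	w := brotli.NewWriterLevel(&buf, brotli.BestCompression)
	if _, err := w.Write(manifests); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// Decompress restores the original manifests when a controller reads the Secret.
func Decompress(data []byte) ([]byte, error) {
	return io.ReadAll(brotli.NewReader(bytes.NewReader(data)))
}
```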
166 | 167 | ## 🚛 Making [Shoot Flux Extension](https://github.com/stackitcloud/gardener-extension-shoot-flux) Production-Ready 168 | 169 | **Problem Statement:** [Continuing the track](https://github.com/gardener-community/hackathon/blob/main/2023-11_Schelklingen/README.md#-rework-shoot-flux-extension) of promoting the Flux extension to "production-ready" status, two main features are currently missing: Firstly, the Flux components were only installed once but never reconciled/updated afterwards. Secondly, for some scenarios it might be necessary to provide additional `Secret`s (e.g., SOPS or encryption keys). 170 | 171 | **Motivation/Benefits**: 🏗️ Lift restrictions, ✨ support more use cases/scenarios. 172 | 173 | **Achievements:** A new `syncMode` field has been added to the extension's provider config API which can be used to control the reconciliation behaviour. In addition, such extra `Secret` resources are now synced into the cluster so that Flux can use them to decrypt/access resources. 174 | 175 | **Next Steps:** Unit tests have to be implemented and the PR has to be opened. Generally, in order to release `v1.0.0`, the documentation and the `README.md` should be reworked. 176 | 177 | **Code/Pull Requests**: https://github.com/stackitcloud/gardener-extension-shoot-flux/tree/sync-mode 178 | 179 |
180 | 181 | ## 🧹 Move `machine-controller-manager-provider-local` Repository Into `gardener/gardener` 182 | 183 | **Problem Statement:** The [`machine-controller-manager-provider-local`](https://github.com/gardener/machine-controller-manager-provider-local) implementation used by [`gardener-extension-provider-local`](https://github.com/gardener/gardener/blob/master/docs/extensions/provider-local.md) is currently maintained in a different GitHub repository. This makes related maintenance and development tasks more complicated and tedious. 184 | 185 | **Motivation/Benefits**: 👨🏼‍💻 Improved developer productivity. 186 | 187 | **Achievements:** The contents of the repository have been moved to `gardener/gardener`, and Skaffold has been augmented to dynamically build both the `node` image (used for local shoot worker node pods) and the `machine-controller-manager-provider-local` image (used for managing these worker node pods). Unfortunately, Skaffold does not support all features needed, so we had to introduce a few workarounds to make the e2e flow work. Still, the move will alleviate certain detours needed during development. 188 | 189 | **Next Steps:** Once the opened PR has been merged, the [`machine-controller-manager-provider-local`](https://github.com/gardener/machine-controller-manager-provider-local) repository should be archived/deleted. 190 | 191 | **Code/Pull Requests**: https://github.com/gardener/gardener/pull/9782 192 | 193 |
194 | 195 | ## 🗄️ Stop Vendoring Third-Party Code In OS Extensions 196 | 197 | **Problem Statement:** Similar to [the last Hackathon's achievement](https://github.com/gardener-community/hackathon/blob/main/2023-11_Schelklingen/README.md#%EF%B8%8F-stop-vendoring-third-party-code-in-vendor-folder) regarding dropping the `vendor` folder in `gardener/gardener`, vendoring third-party code should also be avoided in the [OS extensions](https://github.com/gardener/gardener/blob/master/extensions/README.md#operating-system). 198 | 199 | **Motivation/Benefits**: 👨🏼‍💻 Improved developer productivity by removing clutter from PRs and speeding up `git` operations. 200 | 201 | **Achievements:** Two out of the four OS extensions maintained in the `gardener` GitHub organization have been adapted. 202 | 203 | **Next Steps:** Replicate the work for [`os-ubuntu`](https://github.com/gardener/gardener-extension-os-ubuntu) and [`os-coreos`](https://github.com/gardener/gardener-extension-os-coreos). 204 | 205 | **Code/Pull Requests**: https://github.com/gardener/gardener-extension-os-gardenlinux/pull/170, https://github.com/gardener/gardener-extension-os-suse-chost/pull/145 206 | 207 |
208 | 209 | ## 📦 Consider Embedded Files For Local Image Builds 210 | 211 | **Problem Statement:** Currently, Gardener images are not automatically rebuilt by Skaffold for local development in case a file embedded into the code using [Golang's `embed` feature](https://pkg.go.dev/embed) changes. The reason is that we use `go list` to compute all package dependencies of the Gardener binaries, but `go list` cannot detect embedded files. Hence, they are not part of the dependency lists in the `skaffold.yaml` for the Gardener binaries. This makes the development process of these files tedious. 212 | 213 | **Motivation/Benefits**: 👨🏼‍💻 Improved developer productivity. 214 | 215 | **Achievements:** The [`hack` script](https://github.com/gardener/gardener/blob/master/hack/check-skaffold-deps-for-binary.sh) ([an achievement of a previous Hackathon](https://github.com/gardener-community/hackathon/blob/main/2023-11_Schelklingen/README.md#-auto-update-skaffold-dependencies)) computing the binary dependencies has been augmented to detect the embedded files and make them part of the list of dependencies. 216 | 217 | **Next Steps:** Review and merge the pull request. 218 | 219 | **Code/Pull Requests**: https://github.com/gardener/gardener/pull/9778 220 | 221 |
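As a reminder of why this matters: files referenced via `//go:embed` are compiled into the binary but do not show up as package imports, so a plain `go list`-based dependency computation cannot see them. A minimal example (the file paths are hypothetical):

```go
package charts

import "embed"

// The templates directory is baked into the binary at build time. Changing
// one of these YAML files changes the binary, even though `go list` does not
// report the files as dependencies.
//
//go:embed templates/*.yaml
var Templates embed.FS

// Read returns the content of an embedded template file.
func Read(name string) ([]byte, error) {
	return Templates.ReadFile("templates/" + name)
}
```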
222 | 223 | ![ApeiroRA](https://apeirora.eu/assets/img/BMWK-EU.png) 224 | -------------------------------------------------------------------------------- /2024-12_Schelklingen/README.md: -------------------------------------------------------------------------------- 1 | # Hack The Garden 12/2024 Wrap Up 2 | 3 | ## 🌐 IPv6 Support On [IronCore](https://github.com/ironcore-dev) 4 | 5 | **Problem Statement:** IPv6-only shoot clusters are not yet supported on IronCore, yet it is clear that we eventually have to support it, given the limited scalability and phase-out of IPv4. 6 | 7 | **Motivation/Benefits**: ✨ Support more use cases/scenarios. 8 | 9 | **Achievements:** Due to incompatibilities of the recent Gardener version with the extension for IronCore, which had to be fixed first, the progress was delayed a bit. Eventually, we were able to create dual-stack shoot clusters. However, there are still quite some open issues, e.g., `Service`s of type `LoadBalancer` do not work for them, node-to-node communication does not work for IPv6 traffic, and a few more. 10 | 11 | **Next Steps:** Now that a shoot cluster can at least be created (reconciles to 100%), the issues can be investigated and hopefully fixed one-by-one. 12 | 13 | **Code/Pull Requests:** https://github.com/ironcore-dev/machine-controller-manager-provider-ironcore/pull/436, https://github.com/ironcore-dev/gardener-extension-provider-ironcore/pull/669, https://github.com/ironcore-dev/cloud-provider-ironcore/pull/473 14 | 15 |
16 | 17 | ## 🔁 Version Classification Lifecycle In `CloudProfile`s 18 | 19 | **Problem Statement:** Kubernetes or machine image versions typically go through different classifications whilst being offered to users via `CloudProfile`s. Usually, they start with `preview`, transition to `supported`, and eventually to `deprecated` before they expire. Today, human operators have to manually facilitate these transitions and change the respective `CloudProfile`s at the times when they want to promote or demote a version. The goal of this task is to allow predefining the timestamps when a certain classification is reached. This not only lifts manual effort from the operators, but also makes the transitions more predictable/plannable for end-users. 20 | 21 | **Motivation/Benefits**: 🔧 Reduced operational complexity, 🤩 improved user experience. 22 | 23 | **Achievements:** After a lot of brainstorming, we decided that it makes sense to write a GEP to get more opinions on what the API should look like. The current tendency is to maintain the lifecycle times for the classifications in the `spec`, while a controller would then maintain the currently active classification in the `status` for clients to read. 24 | 25 | **Next Steps:** The GEP has been filed and is currently in review. There is a PoC implementation according to the description, so once the design has been approved, the work can be finalized and the pull requests can be prepared. 26 | 27 | **Code/Pull Requests:** https://github.com/gardener/gardener/pull/10982, https://github.com/metal-stack/gardener/pull/9 28 | 29 |
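To illustrate the spec/status split, a controller could resolve the currently active classification from a list of (classification, start time) pairs roughly like this; the type names are made up for this sketch and do not anticipate the final API of the GEP.

```go
package lifecycle

import "time"

// Stage is a hypothetical lifecycle entry: the classification becomes active
// at StartTime and stays active until the next stage starts.
type Stage struct {
	Classification string // e.g. "preview", "supported", "deprecated", "expired"
	StartTime      time.Time
}

// CurrentClassification returns the classification that is active at the given
// point in time (stages are expected to be sorted by StartTime); a controller
// would write this value into the CloudProfile status.
func CurrentClassification(stages []Stage, now time.Time) string {
	current := ""
	for _, stage := range stages {
		if !stage.StartTime.After(now) {
			current = stage.Classification
		}
	}
	return current
}
```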
30 | 31 | ## 💡 Gardener SLIs: Shoot Cluster Creation/Deletion Times 32 | 33 | **Problem Statement:** We lack observability in our end-to-end testing of whether a change introduces a regression in terms of prolonged shoot cluster creation or deletion times. The goal is to (a) expose these metrics via `gardenlet`, and (b) collect and push them to a [Prometheus](https://prometheus.io/) instance running in our [Prow cluster](https://prow.gardener.cloud/). From there, we can centrally observe the SLIs over time and potentially even define alerting rules. 34 | 35 | **Motivation/Benefits**: 👨🏼‍💻 Improved developer productivity, 🧪 increased output qualification. 36 | 37 | **Achievements:** A new Prometheus instance in the Prow cluster now collects the metrics for cluster creation/deletion times as well as times for individual tasks in the large `Shoot` reconciliation flow. Dashboards nicely show the data for service providers to evaluate and reason about. While the metrics are mainly collected in the CI system for e2e test executions, productive deployments of `gardenlet`s also expose them. Hence, we now also get some data for real systems. 38 | 39 | **Code/Pull Requests:** https://github.com/gardener/gardener/pull/10964, https://github.com/gardener/gardener/pull/10967, https://github.com/gardener/ci-infra/pull/2807 40 | 41 |
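A hedged sketch of how such a duration metric could be exposed with `prometheus/client_golang`; the metric and label names are illustrative, not the ones actually added to `gardenlet`.

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// shootOperationDuration tracks how long shoot creations/deletions take.
// Buckets range from one minute up to roughly one hour.
var shootOperationDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "shoot_operation_duration_seconds",
		Help:    "Duration of shoot cluster operations in seconds.",
		Buckets: prometheus.ExponentialBuckets(60, 2, 7),
	},
	[]string{"operation"},
)

func init() {
	prometheus.MustRegister(shootOperationDuration)
}

// ObserveOperation is called by the reconciler once a shoot operation has finished.
func ObserveOperation(operation string, start time.Time) {
	shootOperationDuration.WithLabelValues(operation).Observe(time.Since(start).Seconds())
}
```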
42 | 43 | ## 🛡️ Enhanced `Seed` Authorizer With Label/Field Selectors 44 | 45 | **Problem Statement:** The `Seed` Authorizer is used to restrict `gardenlet` to only access resources which belong to the `Seed` it is responsible for. However, until Kubernetes `v1.31`, there was no technical possibility to limit its visibility of the resources it watches (e.g., `Shoot`s). This means that all `gardenlet`s are able to see all `Shoot`s even if they don't belong to their `Seed`. With Kubernetes `v1.31`, a new feature allows passing label/field selectors to authorizers, which finally allows us to enforce that `gardenlet` uses the respective selectors. 46 | 47 | **Motivation/Benefits**: 🛡️ Increased cluster security. 48 | 49 | **Achievements:** We enabled the selector enforcement for the `Shoot`, `ControllerInstallation`, `Bastion`, and `Gardenlet` APIs. Unfortunately, it requires enabling the new `AuthorizeWithSelectors` feature gate on both `{kube,gardener}-apiserver`. We have also introduced a new feature gate in `gardener-admission-controller` (serving the `Seed` authorizer webhook) in order to be able to enforce the selector (this can only work with Kubernetes `v1.31` when the feature gates are enabled, or with `v1.32`+ where the feature gates got promoted to beta). On our way, we have discovered more opportunities to tune the authorizer, for example shrinking `gardenlet`'s permissions for other `Seed` resources. 50 | 51 | **Next Steps:** A PoC implementation has been prepared, yet it has to be finalized and the mentioned feature gate has to be introduced. There are separate/independent work branches for the other improvements we discovered, and they also have to be finalized. 52 | 53 | **Code/Pull Requests:** https://github.com/rfranzke/gardener/tree/improve-seed-authz 54 | 55 |
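On the client side, the enforcement boils down to `gardenlet` always attaching a seed-scoped selector to its lists/watches, roughly like the following sketch (assuming Gardener's generated clientset and the `spec.seedName` field selector for `Shoot`s; paths and names are from memory and not authoritative):

```go
package main

import (
	"context"
	"fmt"

	gardenversioned "github.com/gardener/gardener/pkg/client/core/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/gardenlet-kubeconfig")
	if err != nil {
		panic(err)
	}
	gardenClient := gardenversioned.NewForConfigOrDie(cfg)

	// Only list/watch Shoots scheduled onto this gardenlet's seed. Since
	// Kubernetes v1.31, the selector is part of the authorization attributes,
	// so the Seed authorizer webhook can reject requests without it.
	opts := metav1.ListOptions{
		FieldSelector: fields.OneTermEqualSelector("spec.seedName", "my-seed").String(),
	}

	shoots, err := gardenClient.CoreV1beta1().Shoots(metav1.NamespaceAll).List(context.Background(), opts)
	if err != nil {
		panic(err)
	}
	fmt.Println("visible shoots:", len(shoots.Items))
}
```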
56 | 57 | ## 🔑 Bring Your Own ETCD Encryption Key Via Key Management Systems 58 | 59 | **Problem Statement:** Gardener uses the AESCBC algorithm together with a generated encryption key to encrypt data in ETCD of shoot clusters. This key is stored as a `Secret` in the seed cluster. Yet, some users prefer to manage the encryption key on their own via a key management system (e.g., Vault or AWS KMS). Kubernetes already supports [various KMS providers](https://kubernetes.io/docs/tasks/administer-cluster/kms-provider/). The goal is to allow configuring an external KMS and key in the `Shoot` specification to support this feature. 60 | 61 | **Motivation/Benefits**: 🛡️ Increased cluster security. 62 | 63 | **Achievements:** We came up with a new API called `ControlPlaneEncryptionConfig` in the `extensions.gardener.cloud` API group. This way, we can write new extensions for arbitrary external KMS plugins. In the `Shoot` API, the mentioned encryption config can be passed via `.spec.kubernetes.kubeAPIServer.encryptionConfig`. The respective extension is then responsible for creating the [`EncryptionConfiguration` for `kube-apiserver`](https://kubernetes.io/docs/tasks/administer-cluster/kms-provider/#encrypting-your-data-with-the-kms-provider-kms-v1) and injecting the KMS plugin sidecar container into the `kube-apiserver` `Deployment`. We have also developed concepts for migration to and from the current AESCBC provider, and for KMS key rotation. 64 | 65 | **Next Steps:** Due to the size and criticality of the feature, we decided to write a GEP to align on the overall approach. The general concept has been validated in a PoC and looks promising. 66 | 67 | **Code/Pull Requests:** https://github.com/plkokanov/gardener/tree/hackathon-kms 68 | 69 |
70 | 71 | ## ⚖️ Load Balancing For Calls To `kube-apiserver`s 72 | 73 | **Problem Statement:** TLS connections are terminated by the individual shoot API server instances. After establishing a TLS connection, all requests sent over this connection end up on the same API server instance. Hence, the Istio ingress gateway performs L4 load balancing only, but it doesn't distribute individual requests across API server instances. This often overloads a single API server pod while others are much under-utilized. 74 | 75 | **Motivation/Benefits**: 📈 Better scalability, 🤩 improved user experience. 76 | 77 | **Achievements:** We have switched to Istio terminating the TLS connections, which allows properly load-balancing the requests across the `kube-apiserver` pods. A load generator nicely demonstrates that this indeed improves the situation compared to the current implementation. 78 | 79 | **Next Steps:** The work has to be finalized and the pull request has to be opened. 80 | 81 | **Issue:** https://github.com/gardener/gardener/issues/8810 82 | 83 | **Code/Pull Requests:** https://github.com/oliver-goetz/gardener/tree/hack/distribute-apiserver-requests 84 | 85 |
86 | 87 | ## 🪴 Validate PoC For In-Place Node Updates Of Shoot Clusters 88 | 89 | **Problem Statement:** GEP-31 focuses on allowing updates of Kubernetes minor versions and/or machine image versions without requiring the deletion and recreation of the nodes. This aims to minimize the overhead traditionally associated with node replacement, and it offers an alternative approach to updates, particularly important for physical machines or bare-metal nodes. There is already a proof-of-concept implementation, and the goal of this task was to validate it on a real infrastructure/system. 90 | 91 | **Motivation/Benefits**: ✨ Support more use cases/scenarios. 92 | 93 | **Achievements:** After fixing a few missing pieces in the PoC implementation that have to be incorporated into the work branch, we managed to perform an in-place update for a Kubernetes minor version change on a bare-metal infrastructure. It can be configured per worker pool with a new `updateStrategy` field. 94 | 95 | **Next Steps:** Unfortunately, we were lacking a functional version of Gardenlinux in order to test the in-place updates for OS versions. Hence, this part is still open and should be validated once an image is available. 96 | 97 | **Issue:** https://github.com/gardener/gardener/issues/10219 98 | 99 | **Code/Pull Requests:** https://github.com/gardener/gardener/pull/10828 100 | 101 |
102 | 103 | ## 🚀 Prevent `Pod` Scheduling Issues Due To Overscaling 104 | 105 | **Problem Statement:** The Vertical Pod Autoscaler sometimes recommends resource requirements exceeding the allocatable resources of the largest nodes. This results in some pods becoming unschedulable (which eventually can result in downtimes). The goal is to prevent this from happening, thus ensuring properly running pods. 106 | 107 | **Motivation/Benefits**: 🔧 Reduced operational complexity, ✨ support more use cases/scenarios. 108 | 109 | **Achievements:** We came up with three different approaches, but all of them have their drawbacks. Details about them can be read in the linked `hackmd.io` document. We decided to try augmenting the upstream VPA implementation with global "max allowed" flags for `vpa-recommender`. In addition, we prepared the needed changes in Gardener to consume it. 110 | 111 | **Next Steps:** Work with the autoscaling community to make sure the proposed pull requests get accepted and merged. 112 | 113 | **Issue:** https://hackmd.io/GwTNubtZTg-D1mNhzV5VOw 114 | 115 | **Code/Pull Requests:** https://github.com/kubernetes/autoscaler/pull/7560, https://github.com/ialidzhikov/gardener/commits/enh/seed-and-shoot-vpa-max-allowed, https://github.com/gardener/gardener/pull/10413 116 | 117 |
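The essence of the "global max allowed" idea is a simple clamp of the recommendation against a configured ceiling, roughly like this illustrative helper (not the actual `vpa-recommender` code):

```go
package capping

import (
	corev1 "k8s.io/api/core/v1"
)

// CapRecommendation clamps a recommended resource list to a configured global
// maximum (e.g. the allocatable resources of the largest node type), so that
// recommendations can never exceed what any node could provide.
func CapRecommendation(recommended, maxAllowed corev1.ResourceList) corev1.ResourceList {
	capped := corev1.ResourceList{}
	for name, value := range recommended {
		if limit, ok := maxAllowed[name]; ok && value.Cmp(limit) > 0 {
			capped[name] = limit.DeepCopy()
			continue
		}
		capped[name] = value.DeepCopy()
	}
	return capped
}
```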
118 | 119 | ## 💪🏻 Prevent Multiple `systemd` Unit Restarts On Reconciliation Errors 120 | 121 | **Problem Statement:** Restarting `systemd` units, especially the `kubelet.service` unit, can be quite costly/counter-productive. When `gardener-node-agent` fails during its reconciliation of files and units on the worker nodes, it starts again from scratch and potentially restarts units multiple times until the entire reconciliation loop finally succeeds. 122 | 123 | **Motivation/Benefits**: 🤩 Improved user experience. 124 | 125 | **Achievements:** Instead of only persisting its last applied `OperatingSystemConfig` specification at the end of a successful reconciliation, `gardener-node-agent` now updates its state immediately after the reconciliation of a file or unit. This way, it doesn't perform such changes again if the reconciliation fails in the meantime. 126 | 127 | **Issue:** https://github.com/gardener/gardener/issues/10972 128 | 129 | **Code/Pull Requests:** https://github.com/gardener/gardener/pull/10969 130 | 131 |
132 | 133 | ## 🤹‍♂️ Trigger Nodes Rollout Individually Per Worker Pool During Credentials Rotation 134 | 135 | **Problem Statement:** Today, triggering a shoot cluster credentials rotation results in an immediate rollout of all worker nodes in the cluster. Some scenarios require more control over this rollout, particularly the option to trigger it individually per worker pool at different times. 136 | 137 | **Motivation/Benefits**: ✨ Support more use cases/scenarios, 🏗️ lift restrictions. 138 | 139 | **Achievements:** A concept has been developed, proposed, and implemented in a prototype. The idea is to introduce a new, special operation annotation for starting a credentials rotation without triggering worker pool rollouts. The user can then roll the nodes (via another operation annotation) at their convenience. Once all pools have been rolled out, the rotation can then be completed as usual. 140 | 141 | **Next Steps:** The PoC looks promising, yet the work has to be finished (tests, feature gate, documentation still missing) before a pull request can be opened. 142 | 143 | **Issue:** [https://github.com/gardener/gardener/issues/10121](https://github.com/gardener/gardener/issues/10121#issuecomment-2515479242) 144 | 145 | **Code/Pull Requests:** https://github.com/rfranzke/gardener/tree/individual-node-rollout-rotation 146 | 147 |
148 | 149 | ## ⛓️‍💥 E2E Test Skeleton For Autonomous Shoot Clusters 150 | 151 | **Problem Statement:** GEP-28 proposes to augment Gardener functionality with support for managing autonomous shoot clusters. The GEP was merged a while ago, and we also crafted a skeleton for the to-be-developed `gardenadm` binary already. However, the needed e2e test infrastructure is missing and should be established before the implementation of the business logic is started. 152 | 153 | **Motivation/Benefits**: 👨🏼‍💻 Improved developer productivity. 154 | 155 | **Achievements:** We introduced new `make` targets to create a local development setup based on KinD and machine pods (like for regular shoot clusters). The `gardenadm` binary is built and transported to these machine pods. New e2e tests in `gardener/gardener` execute the various commands of `gardenadm` and assert that it behaves as expected. 156 | 157 | **Issue:** https://github.com/gardener/gardener/issues/2906 158 | 159 | **Code/Pull Requests:** https://github.com/gardener/gardener/pull/10977, https://github.com/gardener/ci-infra/pull/2827 160 | 161 |
162 | 163 | ## ⬆️ Deploy Prow Via Flux 164 | 165 | **Problem Statement:** Prow is a system for Gardener's CI and automation. So far, it has been deployed using manually crafted deployment scripts. Switching to Flux, which is a cloud-native solution for continuous delivery based on GitOps, allows eliminating these scripts and implementing industry best practices. 166 | 167 | **Motivation/Benefits**: 👨🏼‍💻 Improved developer productivity. 168 | 169 | **Achievements:** The custom deployment script has been almost eliminated and replaced with Kustomizations and Helm releases that are managed by Flux. 170 | 171 | **Next Steps:** The monitoring configurations for the Prow cluster still need to be migrated. However, we prefer to create a new deployment from scratch (since we are not too happy with the current solution, i.e., we don't want to just "migrate" it over to Flux but use the opportunity to renovate it). 172 | 173 | **Code/Pull Requests:** https://github.com/gardener/ci-infra/pull/2812, https://github.com/gardener/ci-infra/pull/2813, https://github.com/gardener/ci-infra/pull/2814, https://github.com/gardener/ci-infra/pull/2816, https://github.com/gardener/ci-infra/pull/2817, https://github.com/gardener/ci-infra/pull/2818, and many more... 😉 174 | 175 |
176 | 177 | ## 🚏 Replace `TopologyAwareHints` With `ServiceTrafficDistribution` 178 | 179 | **Problem Statement:** The `TopologyAwareHints` feature uses allocatable CPUs for distributing the traffic according to the configured hints. This forced us to introduce custom code (webhook on `EndpointSlice`s) in Gardener to mitigate this limitation. The new `ServiceTrafficDistribution` feature just uses the availability zone without taking the CPUs into account; hence, it will eventually allow us to get rid of our custom code. 180 | 181 | **Motivation/Benefits**: 👨🏼‍💻 Improved developer productivity. 182 | 183 | **Achievements:** Kubernetes clusters of version `v1.31` and higher now use the new `ServiceTrafficDistribution` feature. Yet, until this is our lowest supported Kubernetes version, we have to keep our custom code (to also support older versions). 184 | 185 | **Issue:** https://github.com/gardener/gardener/issues/10421 186 | 187 | **Code/Pull Requests:** https://github.com/gardener/gardener/pull/10973 188 | 189 |
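For reference, opting a Service into the new behaviour is a one-field change on `ServiceSpec` (sketch assuming Kubernetes `v1.30+` client libraries; the service name and port are placeholders):

```go
package networking

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

// topologyAwareService prefers endpoints in the client's zone without looking
// at allocatable CPUs, replacing the old TopologyAwareHints-based approach.
func topologyAwareService(name, namespace string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: corev1.ServiceSpec{
			Selector: map[string]string{"app": name},
			Ports:    []corev1.ServicePort{{Port: 443}},
			// Equivalent to corev1.ServiceTrafficDistributionPreferClose.
			TrafficDistribution: ptr.To("PreferClose"),
		},
	}
}
```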
190 | 191 | ## 🪪 Support More Use-Cases For `TokenRequestor` 192 | 193 | **Problem Statement:** Currently, `gardener-resource-manager`'s `TokenRequestor` controller is not capable of injecting the current CA bundle into access secrets. This requires operators to use the generic token kubeconfig; however, this is not possible in all use cases. For example, Flux needs to read a `Secret` with a self-contained kubeconfig and cannot mount it. Hence, operators had to work around this and manually fetch the CA bundle first before crafting the kubeconfig for the controller. Another use-case is to make it watch additional namespaces. 194 | 195 | **Motivation/Benefits**: 👨🏼‍💻 Improved developer productivity, ✨ support more use cases/scenarios. 196 | 197 | **Achievements:** Access secrets annotated with `serviceaccount.resources.gardener.cloud/inject-ca-bundle` now either get a new `.data."bundle.crt"` field with the CA bundle or, when they contain a kubeconfig, have the CA bundle injected directly into the kubeconfig's server configuration. This enables new features and allows operators to get rid of workarounds and technical debt. 198 | 199 | **Code/Pull Requests:** https://github.com/gardener/gardener/pull/10988 200 | 201 |
202 | 203 | ## 🫄 `cluster-autoscaler`'s `ProvisioningRequest` API 204 | 205 | **Problem Statement:** The `cluster-autoscaler` community introduced the new `ProvisioningRequest` API. This enables users to provide a pod template and ask the autoscaler to either provision a new node or to find out whether the pod would fit into the existing cluster without any scale-up. The new API can help users to ensure all-or-nothing semantics. Without this API, users have to create the pod, which might then remain stuck in the `Pending` state. 206 | 207 | **Motivation/Benefits**: ✨ Support more use cases/scenarios. 208 | 209 | **Achievements:** A new field in the `Shoot` specification has been added to enable users to activate this API. 210 | 211 | **Next Steps:** Finalize unit/integration tests and open the pull request for review. 212 | 213 | **Issue:** https://github.com/gardener/gardener/issues/10962 214 | 215 | **Code/Pull Requests:** https://github.com/tobschli/gardener/tree/ca-provisioning-api 216 | 217 |
218 | 219 | ## 👀 Watch `ManagedResource`s In `Shoot` Care Controller 220 | 221 | **Problem Statement:** The `Shoot` care controller performs regular health checks for various aspects of the shoot cluster. These are run periodically based on configuration. This approach has the downside that failing conditions will not be reported in the meantime, i.e., they have to wait for the next sync period before becoming visible in the `Shoot` status. 222 | 223 | **Motivation/Benefits**: 🔧 Reduced operational complexity, 🤩 improved user experience. 224 | 225 | **Achievements:** Similar to [this previous effort in the 2022/09 Hackathon](https://github.com/gardener-community/hackathon/blob/main/2022-09_Hirschegg/README.md#resource-manager-health-check-watches), we introduced a WATCH for `ManagedResource`s in the controller, which causes a re-evaluation of the health checks once a relevant condition has changed. This way, the `Shoot` status reflects health changes immediately. 226 | 227 | **Code/Pull Requests:** https://github.com/gardener/gardener/pull/10987 228 | 229 |
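A sketch of how such a watch could be wired with controller-runtime; the controller name and the predicate are simplified compared to the actual care controller, and mapping a `ManagedResource` back to its `Shoot` is omitted.

```go
package care

import (
	resourcesv1alpha1 "github.com/gardener/gardener/pkg/apis/resources/v1alpha1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// conditionsChanged only lets update events through when the conditions of a
// ManagedResource changed, so health checks are re-evaluated immediately
// instead of waiting for the next sync period.
var conditionsChanged = predicate.Funcs{
	UpdateFunc: func(e event.UpdateEvent) bool {
		oldMR, okOld := e.ObjectOld.(*resourcesv1alpha1.ManagedResource)
		newMR, okNew := e.ObjectNew.(*resourcesv1alpha1.ManagedResource)
		return okOld && okNew && !apiequality.Semantic.DeepEqual(oldMR.Status.Conditions, newMR.Status.Conditions)
	},
}

// AddToManager wires the watch; the real controller additionally maps each
// ManagedResource to the Shoot it belongs to before enqueueing.
func AddToManager(mgr manager.Manager, r reconcile.Reconciler) error {
	return builder.ControllerManagedBy(mgr).
		Named("shoot-care").
		For(&resourcesv1alpha1.ManagedResource{}, builder.WithPredicates(conditionsChanged)).
		Complete(r)
}
```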
230 | 231 | ## 🐢 Cluster API Provider For Gardener 232 | 233 | **Problem Statement:** Users regularly ask to use the community-standard APIs for cluster management. While this is generally questionable (see https://github.com/gardener/gardener/blob/master/docs/concepts/cluster-api.md), this request keeps recurring every now and then. It could help adoption if Gardener supported the cluster API, even if it is just about forwarding the data to the actual Gardener API server. 234 | 235 | **Motivation/Benefits**: ✨ Support more use cases/scenarios. 236 | 237 | **Achievements:** It is possible to deploy or delete a shoot cluster via the cluster API (the manifest is "just forwarded" to the Gardener API). There are still some rough edges, but the general approach seems to work (i.e., we haven't discovered technical dead-ends). Yet, it is unclear what the benefit of using this is, since it merely proxies the manifests through. 238 | 239 | **Next Steps:** There are still quite a few open points before this should be used productively. We have to decide how to continue this effort. 240 | 241 | **Issue:** https://github.com/gardener/gardener/blob/master/docs/concepts/cluster-api.md 242 | 243 | **Code/Pull Requests:** https://github.com/metal-stack/cluster-api-provider-gardener 244 | 245 |
246 | 247 | ## 👨🏼‍💻 Make `cluster-autoscaler` Work In Local Setup 248 | 249 | **Problem Statement:** Currently, the `cluster-autoscaler` is not working in the local setup because we were not setting the `nodeTemplate` in the `MachineClass`, so the `cluster-autoscaler` could not learn about the resource capacity of the nodes. In addition, the `Node` object reported incorrect data for allocatable resources. It also lacked a provider ID, which prevented the `cluster-autoscaler` from associating the `Node` with a `Machine` resource. 250 | 251 | **Motivation/Benefits**: 👨🏼‍💻 Improved developer productivity. 252 | 253 | **Achievements:** We crafted a custom `cloud-controller-manager` for `provider-local` such that the provider ID can be populated into the `Node` object. In addition, we now properly populate the `nodeTemplate` in the `MachineClass`. 254 | 255 | **Next Steps:** We still have to figure out how to make the `Node` report its allocatable resources properly. 256 | 257 | **Code/Pull Requests:** https://github.com/ialidzhikov/cloud-provider-local, https://github.com/ialidzhikov/gardener/tree/fix/cluster-autoscaler-provider-local 258 | 259 |
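For context, the piece of a `cloud-controller-manager` that provides the provider ID is the `InstancesV2` implementation from `k8s.io/cloud-provider`: the cloud node controller writes the returned metadata into the `Node`'s `.spec.providerID`, which the `cluster-autoscaler` needs in order to associate `Node`s with `Machine`s. The sketch below is illustrative and not the actual `cloud-provider-local` code; in particular, the `local://` provider-ID scheme is an assumption.

```go
package local

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	cloudprovider "k8s.io/cloud-provider"
)

// instancesV2 is an illustrative InstancesV2 implementation for a local provider.
type instancesV2 struct{}

var _ cloudprovider.InstancesV2 = &instancesV2{}

// InstanceExists decides whether a Node's backing instance still exists.
// In the local setup we simply assume every registered node exists.
func (i *instancesV2) InstanceExists(ctx context.Context, node *corev1.Node) (bool, error) {
	return true, nil
}

// InstanceShutdown reports whether the backing instance is shut down.
func (i *instancesV2) InstanceShutdown(ctx context.Context, node *corev1.Node) (bool, error) {
	return false, nil
}

// InstanceMetadata returns the metadata the cloud node controller uses to initialize the
// Node, most importantly the provider ID.
func (i *instancesV2) InstanceMetadata(ctx context.Context, node *corev1.Node) (*cloudprovider.InstanceMetadata, error) {
	return &cloudprovider.InstanceMetadata{
		ProviderID:   fmt.Sprintf("local://%s", node.Name), // hypothetical scheme
		InstanceType: "local",
	}, nil
}
```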
260 | 261 | ## 🧹 Use Structured Authorization In Local KinD Cluster 262 | 263 | **Problem Statement:** Due to changes in `kubeadm`, we had to introduce a workaround for enabling the seed authorizer in the local KinD clusters. This slows down the creation of the cluster and hence all e2e tests. 264 | 265 | **Motivation/Benefits**: 👨🏼‍💻 Improved developer productivity. 266 | 267 | **Achievements:** With the new [Structured Authorization](https://kubernetes.io/docs/reference/access-authn-authz/authorization/#using-configuration-file-for-authorization) feature of Kubernetes, we were able to get rid of the workaround and speed up the KinD cluster creation dramatically. 268 | 269 | **Issue:** https://github.com/gardener/gardener/issues/10421 270 | 271 | **Code/Pull Requests:** https://github.com/gardener/gardener/pull/10984 272 | 273 |
274 | 275 | ## 🧹 Drop Internal Versions From Component Configuration APIs 276 | 277 | **Problem Statement:** In order to follow the same approach as for the regular Gardener APIs, we currently maintain an internal and an external version of the component configurations. However, we concluded that the internal version actually has no benefit and just causes more maintenance effort during development. 278 | 279 | **Motivation/Benefits**: 👨🏼‍💻 Improved developer productivity. 280 | 281 | **Achievements:** We started working on removing the internal version, but due to its widespread use in the code base, we weren't able to finish it yet. 282 | 283 | **Next Steps:** The effort has to be continued after the Hackathon so that a pull request can be opened. 284 | 285 | **Issue:** https://github.com/gardener/gardener/issues/11043 286 | 287 | **Code/Pull Requests:** https://github.com/timebertt/gardener/tree/config-apis 288 | 289 |
290 | 291 | ## 🐛 Fix Non-Functional Shoot Node Logging In Local Setup 292 | 293 | **Problem Statement:** We discovered that the shoot node logging had not been working in the local development setup for quite a while. 294 | 295 | **Motivation/Benefits**: 👨🏼‍💻 Improved developer productivity. 296 | 297 | **Achievements:** We have identified the reason for the problem (it probably broke when we introduced `NetworkPolicy`s in the `garden` namespace). In the meantime, `Ingress`es (like the one for Vali, which receives the shoot node logs) are exposed via Istio. Yet, in the local setup the traffic was still sent to the `nginx-ingress-controller`, for which the network path was blocked. By switching it to Istio, we were able to fix the problem. 298 | 299 | **Issue:** https://github.com/gardener/gardener/issues/10916 300 | 301 | **Code/Pull Requests:** https://github.com/gardener/gardener/pull/10991 302 | 303 | ## 🧹 No Longer Generate Empty `Secret` For `reconcile` `OperatingSystemConfig`s 304 | 305 | **Problem Statement:** After the introduction of `gardener-node-agent`, the `OperatingSystemConfig` controller no longer needs to generate a `Secret` when the `purpose` is `reconcile`. Yet, it was still doing this, effectively creating an empty `Secret` which was not used at all (and just "polluted" the system). 306 | 307 | **Motivation/Benefits**: 👨🏼‍💻 Improved developer productivity. 308 | 309 | **Achievements:** `gardenlet` has been adapted to no longer read the `Secret`, and the respective creation in the `OperatingSystemConfig` controller has been deprecated. It can be removed after a few releases to ensure backwards-compatibility. 310 | 311 | **Next Steps:** Wait until the deprecation period has expired, and then finally clean up the `Secret` generation from the `OperatingSystemConfig` controller. 312 | 313 | **Code/Pull Requests:** https://github.com/rfranzke/gardener/tree/osc-controller-cloudconfig-secret 314 | 315 |
316 | 317 | ## 🖥️ Generic Monitoring Extension 318 | 319 | **Problem Statement:** For quite a while, we have known that different parties/users of Gardener have different requirements w.r.t. monitoring (or observability in general). While a few things have been made configurable in the past, many others have not. This required users to write custom extensions that manipulate configuration. This approach has some limitations and cannot properly realize all use cases, i.e., we need a better way of supporting them without introducing more technical debt. 320 | 321 | **Motivation/Benefits**: 👨🏼‍💻 Improved developer productivity. 322 | 323 | **Achievements:** We talked about the requirements and brainstormed how/whether the monitoring aspect of Gardener can be externalized. We decided that it's best to write the requirements down in a clear way so that they can be used as input for follow-up discussions. 324 | 325 | **Next Steps:** Discuss the requirements with the monitoring experts of all affected parties and align on an approach that works for most use cases. 326 | 327 | **Issue:** https://github.com/gardener/gardener/issues/10985 328 | 329 |
330 | 331 | ![ApeiroRA](https://apeirora.eu/assets/img/BMWK-EU.png) 332 | -------------------------------------------------------------------------------- /2025-06_Schelklingen/README.md: -------------------------------------------------------------------------------- 1 | # Hack The Garden 06/2025 Wrap Up 2 | 3 | ## ⚡️Replace OpenVPN with Wireguard 4 | 5 | **Problem Statement:** The Gardener VPN implementation between control and data plane currently uses OpenVPN, which is a well-established but somewhat old solution for VPNs. Wireguard is a relatively new, but well-liked contender in the VPN space. It could be possible to replace OpenVPN with Wireguard. As we do not want to spin up a load balancer per control plane (or use one port per control plane) a reverse proxy like [mwgp](https://github.com/apernet/mwgp) is required. 6 | 7 | **Motivation/Benefits:** 🚀 Modernize VPN stack, ⚡️ improved performance and simplicity. 8 | 9 | **Achievements:** 10 | 11 | **Next Steps:** 12 | 13 | **Code/Pull Requests:** 14 | * https://github.com/metal-stack/gardener/tree/wireguard 15 | * https://github.com/metal-stack/vpn2/tree/wireguard 16 | * https://github.com/majst01/mwgp 17 | 18 |
19 | 20 | ## ⛳️ Make `gardener-operator` Single-Node Ready 21 | 22 | **Problem Statement:** By default, when Gardener is deployed, some of its components are deployed for high availability, assuming multiple nodes in the cluster. This is unnecessary in single-node clusters and can even hinder the deployment of Gardener there. In bare-metal scenarios, sometimes only a single node is available, so e.g. multiple replicas of some components are not needed. 23 | 24 | **Motivation/Benefits:** 🧩 Enable lightweight/single-node deployments, 🛠️ reduce resource overhead. 25 | 26 | **Achievements:** 27 | 28 | **Next Steps:** 29 | 30 | **Code/Pull Requests:** 31 | * https://github.com/gardener/gardener-extension-provider-gcp/pull/1052 32 | * https://github.com/gardener/gardener-extension-provider-openstack/pull/1042 33 | * https://github.com/fluent/fluent-operator/pull/1616 34 | * https://github.com/afritzler/cortex/commit/4000c188086fe383d314efeb40a663f49aa8b35b 35 | * https://github.com/afritzler/cortex/commit/616c0b80d90d036f4275636e1a3be9c5f2aac9e5 36 | * https://github.com/afritzler/vali/commit/35fbce152f783f33fdc2066e09d60ec2ba56b562 37 | * https://github.com/afritzler?tab=packages&repo_name=vali 38 | * https://github.com/gardener/gardener/pull/12248 39 | 40 |
41 | 42 | ## 📡 OpenTelemetry Transport for `Shoot` Metrics 43 | 44 | **Problem Statement:** Today, the shoot metrics are collected by the control plane Prometheus using the kube-apiserver `/proxy` endpoint, without any ability to fine-tune the collected metric sets. Since we are introducing OpenTelemetry collector instances on the shoots (as a replacement for the valitail service) and on the seeds in the shoot control plane namespaces, the goal is to try out collecting the shoot metrics via these OpenTelemetry collector instances, which also gives us the opportunity to filter and fine-tune the metric sets. This story is part of the Observability 2.0 initiative. 45 | 46 | **Motivation/Benefits:** 📊 Flexible and modern metrics collection, 🔍 improved observability. 47 | 48 | **Achievements:** 49 | 50 | **Next Steps:** 51 | 52 | **Code/Pull Requests:** 53 | 54 | ![OpenTelemetry transport flow](./otel-transport-shoot-metrics/otel-flow.png) 55 | 56 | ![Prometheus chart using OpenTelemetry metrics](./otel-transport-shoot-metrics/prometheus-chart.png) 57 | 58 |
59 | 60 | ## 🔬 Cluster Network Observability 61 | 62 | **Problem Statement:** It might be beneficial to get deeper insights into the traffic of a Kubernetes cluster. For example, traffic across availability zone boundaries may incur increased latency or monetary costs. There are tools, e.g. https://github.com/microsoft/retina, which make it possible to gain more detailed insights into the pod network, but they may lack some features like availability zone tracking (see https://github.com/microsoft/retina/issues/1179). 63 | 64 | **Motivation/Benefits:** 👁️‍🗨️ Enhanced network visibility, 📈 actionable insights for optimization. 65 | 66 | **Achievements:** 67 | 68 | **Next Steps:** 69 | 70 | **Issue:** https://github.com/microsoft/retina/issues/1654 71 | 72 | **Code/Pull Requests:** https://github.com/microsoft/retina/pull/1657 73 | 74 | ![Cluster Network Observability – Prometheus Traffic Chart](./cluster-network-observability/prometheus-traffic-chart.png) 75 | 76 |
77 | 78 | ## 📝 Signing of `ManagedResource` Secrets 79 | 80 | **Problem Statement:** The secrets of `ManagedResource`s are currently used as-is by the gardener-resource-manager. This could allow a bad actor to manipulate these secrets and deploy resources with the permissions of the gardener-resource-manager. To prevent this potential privilege-escalation scenario, we want to sign the secrets of `ManagedResource`s with a key that is only known to the gardener-resource-manager. This way, it can verify that the secrets it receives have not been manipulated by a bad actor. 81 | 82 | **Motivation/Benefits:** 🔒 Improved security and integrity for managed resources. 83 | 84 | **Achievements:** 85 | 86 | **Next Steps:** 87 | 88 | **Code/Pull Requests:** https://github.com/gardener/gardener/pull/12247 89 | 90 |
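Purely as an illustration of the idea (the linked pull request contains the actual design, which may differ), such a signature could be an HMAC over the secret's data, computed with a key only the gardener-resource-manager knows and stored in an annotation that is verified before the contained objects are applied. The annotation key and helper names below are made up.

```go
package signing

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// signatureAnnotation is a hypothetical annotation key used for this sketch.
const signatureAnnotation = "resources.gardener.cloud/signature"

// sign computes a deterministic HMAC-SHA256 over the secret's data keys and values.
func sign(secret *corev1.Secret, key []byte) string {
	mac := hmac.New(sha256.New, key)

	keys := make([]string, 0, len(secret.Data))
	for k := range secret.Data {
		keys = append(keys, k)
	}
	sort.Strings(keys) // iterate in a deterministic order

	for _, k := range keys {
		mac.Write([]byte(k))
		mac.Write(secret.Data[k])
	}
	return hex.EncodeToString(mac.Sum(nil))
}

// Verify checks that the signature stored on the secret matches its current data, i.e.
// that the ManagedResource secret has not been tampered with by a third party.
func Verify(secret *corev1.Secret, key []byte) bool {
	expected := secret.Annotations[signatureAnnotation]
	return hmac.Equal([]byte(expected), []byte(sign(secret, key)))
}
```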
91 | 92 | ## 🧰 Migrate Control Plane Reconciliation of Provider Extensions to `ManagedResource`s 93 | 94 | **Problem Statement:** Currently, we deploy control-plane components using the chart applier instead of `ManagedResource`s. This creates some issues: for example, if we want to scale a component, we have to do it "manually", e.g. scaling a controller down to zero replicas needs to be done imperatively. 95 | 96 | **Motivation/Benefits:** 🔄 Simplified operations, ⚙️ improved scalability and automation. 97 | 98 | **Achievements:** 99 | 100 | **Next Steps:** 101 | 102 | **Issue:** 103 | * https://github.com/gardener/gardener/issues/12250 (stretch goal) 104 | 105 | **Code/Pull Requests:** 106 | * https://github.com/gardener/gardener/pull/12251 107 | * https://github.com/gardener/gardener/compare/master...metal-stack:gardener:controlplane-objects-provider-interface (stretch goal) 108 | 109 |
110 | 111 | ## ✨ Dashboard Usability Improvements 112 | 113 | **Problem Statement:** The Gardener Dashboard assumes static, non-configurable defaults for e.g. Shoot values, which may not be suitable for all deployment scenarios. Some points that could be improved: value defaulting (landscape scope), e.g. autoscaler min/max replicas; overrides (project scope), e.g. optional labels for Shoots that can be used as a display name to overcome the project name length limit; hiding UI elements (landscape scope), e.g. Control Plane HA; adding new UI elements (stretch goal), which would require an extensibility concept for the dashboard. 114 | 115 | **Motivation/Benefits:** 🖥️ Improved user experience, 🛠️ more flexible and customizable dashboard. 116 | 117 | **Achievements:** 118 | 119 | **Next Steps:** 120 | 121 | **Issue:** https://github.com/gardener/dashboard/issues/2469 122 | 123 | **Code/Pull Requests:** 124 | * Project Title – https://github.com/gardener/dashboard/pull/2470 125 | * Value Defaulting – https://github.com/gardener/dashboard/pull/2476 126 | * Hide UI Elements – https://github.com/gardener/dashboard/pull/2478 127 | 128 |
129 | 130 | ## ⚖️ Cluster-internal L7 Load-Balancing Endpoints for `kube-apiserver`s 131 | 132 | **Problem Statement:** In the last hackathon, we created L7 load-balancing for the external endpoints of the Gardener kube-apiservers (Shoots & Virtual Garden). However, cluster-internal traffic, e.g. from gardener-resource-manager and gardener-controller-manager, accesses the internal Kubernetes services directly, thereby skipping Istio and thus the L7 load-balancing. We noticed, at least for gardener-controller-manager, that it could generate some load on the gardener-apiserver. Thus, it would be nice to have cluster-internal load-balancing, too. We don't want to use the external endpoint since, depending on the infrastructure, this could create additional external traffic. 133 | 134 | **Motivation/Benefits:** ⚖️ Better resource distribution, 🚦 improved reliability for internal traffic. 135 | 136 | **Achievements:** 137 | 138 | **Next Steps:** 139 | 140 | **Issue:** https://github.com/gardener/gardener/issues/8810 141 | 142 | **Code/Pull Requests:** https://github.com/gardener/gardener/pull/12260 143 | 144 |
145 | 146 | ## 📜 Documentation Revamp 147 | 148 | **Problem Statement:** Usually, the individual pieces of our documentation are of high quality and helpful. However, we regularly receive complaints about the structure and discoverability of our documentation. 149 | 150 | **Motivation/Benefits:** 📚 Easier onboarding, 🔎 improved discoverability and structure. 151 | 152 | **Achievements:** 153 | 154 | **Next Steps:** 155 | 156 | **Code/Pull Requests:** 157 | * https://github.com/gardener/documentation/pull/652 158 | * https://github.com/gardener/documentation/pull/653 159 | 160 |
161 | 162 | ## ℹ️ Expose EgressCIDRs in shoot-info `ConfigMap` 🏎️ 163 | 164 | **Problem Statement:** Some stakeholders need to know the egress CIDRs of a shoot cluster. Exposing this meta-level information about the shoot to the workloads running inside of it could be useful for controllers, e.g. Crossplane, that run on the shoot and need access to information about the underlying infrastructure. 165 | 166 | **Motivation/Benefits:** 🏷️ Enable better integration for shoot workloads, 📤 improved transparency. 167 | 168 | **Achievements:** 169 | 170 | **Next Steps:** 171 | 172 | **Code/Pull Requests:** https://github.com/gardener/gardener/pull/12252 173 | 174 |
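Workloads in the shoot could then read the egress CIDRs like any other `shoot-info` field. Below is a rough client-go sketch, assuming the new data key is called `egressCIDRs` and holds a comma-separated list; the actual key name and format are defined in the pull request above.

```go
package main

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Runs inside the shoot, e.g. as part of a controller like Crossplane.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// The shoot-info ConfigMap in kube-system already exposes meta information about the
	// shoot; the egress CIDRs would be just another data key in it.
	cm, err := clientset.CoreV1().ConfigMaps("kube-system").Get(context.Background(), "shoot-info", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// "egressCIDRs" is an assumed key name for this sketch.
	for _, cidr := range strings.Split(cm.Data["egressCIDRs"], ",") {
		fmt.Println("egress CIDR:", strings.TrimSpace(cidr))
	}
}
```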
175 | 176 | ## 📈 Overcome Maximum of 450 `Node`s on Azure 177 | 178 | **Problem Statement:** Extensions that do not rely on overlay networking for their in-cluster networking usually rely on other mechanisms such as route tables to establish pod-to-pod (p2p) traffic; Azure is one of them. We currently face scaling difficulties as clusters approach the maximum route table size set by the provider, and we need a new network architecture to overcome this limitation. 179 | 180 | **Motivation/Benefits:** 🚀 Enable larger clusters, 🏗️ reference for other providers. 181 | 182 | **Achievements:** 183 | 184 | **Next Steps:** 185 | 186 | **Code/Pull Requests:** 187 | 188 |
189 | 190 | ## 🦜 Multiple Parallel Versions in a Gardener Landscape (fka. Canary Deployments) 191 | 192 | **Problem Statement:** 193 | 194 | **Motivation/Benefits:** 🦜 Enable canary/parallel versioning, 🔄 safer rollouts. 195 | 196 | **Achievements:** 197 | 198 | **Next Steps:** 199 | 200 | **Code/Pull Requests:** 201 | 202 |
203 | 204 | ## ♻️ GEP-32 – Version Classification Lifecycles 🏎️ 205 | 206 | **Problem Statement:** 207 | 208 | **Motivation/Benefits:** ♻️ Automated version lifecycle management, ⏳ reduced manual effort. 209 | 210 | **Achievements:** 211 | 212 | **Next Steps:** 213 | 214 | **Code/Pull Requests:** https://github.com/metal-stack/gardener/pull/9 215 | 216 |
217 | 218 | ## 🧑‍🔧 Worker Group Node Roll-out 🏎️ 219 | 220 | **Problem Statement:** 221 | 222 | **Motivation/Benefits:** 🧑‍🔧 Improved node management, 🚀 streamlined rollouts. 223 | 224 | **Achievements:** 225 | 226 | **Next Steps:** 227 | 228 | **Code/Pull Requests:** https://github.com/rrhubenov/gardener/tree/worker-pool-rollout 229 | 230 |
231 | 232 | ## 👀 Instance Scheduled Events Watcher 233 | 234 | **Problem Statement:** 235 | 236 | **Motivation/Benefits:** 237 | 238 | **Achievements:** 239 | 240 | **Next Steps:** 241 | 242 | **Code/Pull Requests:** 243 | 244 |
245 | 246 | ![ApeiroRA](https://apeirora.eu/assets/img/BMWK-EU.png) 247 | -------------------------------------------------------------------------------- /2025-06_Schelklingen/cluster-network-observability/prometheus-traffic-chart.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gardener-community/hackathon/385653c42ca25a25256678418cda2b66f10561be/2025-06_Schelklingen/cluster-network-observability/prometheus-traffic-chart.png -------------------------------------------------------------------------------- /2025-06_Schelklingen/otel-transport-shoot-metrics/otel-flow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gardener-community/hackathon/385653c42ca25a25256678418cda2b66f10561be/2025-06_Schelklingen/otel-transport-shoot-metrics/otel-flow.png -------------------------------------------------------------------------------- /2025-06_Schelklingen/otel-transport-shoot-metrics/prometheus-chart.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gardener-community/hackathon/385653c42ca25a25256678418cda2b66f10561be/2025-06_Schelklingen/otel-transport-shoot-metrics/prometheus-chart.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Gardener Hackathons 2 | 3 | Project Gardener is an essential component of [ApeiroRA](https://apeirora.eu), part of the European IPCEI-CIS [initiative](https://www.bmwk.de/Redaktion/EN/Artikel/Industry/ipcei-cis.html). 4 | The community building in open-source is actively encouraged by the EU Commission with the [Open source software strategy](https://commission.europa.eu/about-european-commission/departments-and-executive-agencies/digital-services/open-source-software-strategy_en). 5 | 6 | This repository is meant to collect information, input, output, etc. related to Gardener Hackathons. 7 | If you feel like you can contribute something, you are encouraged to file a PR or simply work directly on the `main` branch. 8 | The general idea is to create directories for past and future Hackathons and collect files with planning, review, documentation, whatnot pieces of information in there. 
9 | 10 | ## Events 11 | 12 | | Date | Location | Organizer | Wrap Up | 13 | |:------------------------|:--------------------------------------------------------------------------------|:----------|:------------------------------------------| 14 | | 02.11.2021 - 05.11.2021 | [Mesnerhof C, Steinberg am Rofan, Österreich](https://www.mesnerhof-c.at/) | x-cellent | [Summary](2021-11_Steinberg/README.md) | 15 | | 26.09.2022 - 30.09.2022 | [Württemberger Haus, Hirschegg, Österreich](https://www.wuerttembergerhaus.de/) | SAP | [Summary](2022-09_Hirschegg/README.md) | 16 | | 22.05.2023 - 26.05.2023 | [Open Sky House, Leverkusen](https://www.openskyhouse.org/) | STACKIT | [Summary](2023-05_Leverkusen/README.md) | 17 | | 06.11.2023 - 10.11.2023 | [Schlosshof Freizeitheim, Schelklingen](https://www.schlosshof-info.de/) | x-cellent | [Summary](2023-11_Schelklingen/README.md) | 18 | | 13.05.2024 - 17.05.2024 | [Schlosshof Freizeitheim, Schelklingen](https://www.schlosshof-info.de/) | x-cellent | [Summary](2024-05_Schelklingen/README.md) | 19 | | 02.12.2024 - 06.12.2024 | [Schlosshof Freizeitheim, Schelklingen](https://www.schlosshof-info.de/) | x-cellent | [Summary](2024-12_Schelklingen/README.md) | 20 | | 02.06.2025 - 06.06.2025 | [Schlosshof Freizeitheim, Schelklingen](https://www.schlosshof-info.de/) | x-cellent | [Summary](2025-06_Schelklingen/README.md) | 21 | 22 | A subsequent Hackathon is proposed to be held in late autumn 2025, but concrete dates are not decided yet. 23 | 24 | ## What to Expect 25 | 26 | The Gardener Community Hackathon is a week-long event held twice a year (spring and late autumn) where we team up with Gardener community members. As you can see [above](#events), it usually takes place in Schelklingen, Germany, and focuses on collaborative work on Gardener-related topics. A few weeks beforehand, we hold a planning meeting where everyone can pitch ideas and vote on the topics they're interested in tackling. Initial teams are formed based on these topics, aiming to mix participants from different companies for diversity. That said, teams can change, and you can switch topics dynamically during the week as desired – there are no hard rules or enforcements. 27 | 28 | You can check out past topics in the summary documents stored inside this repository – it's a broad mix, and pretty much anything Gardener-related is welcome! The week starts around 12 PM on Monday and wraps up around 10 AM on Friday. During the Hackathon, we have regular demo sessions to showcase team progress, and topics are briefly presented in a review meeting the following week. 29 | 30 | On-site, breakfast is provided daily, lunch comes via catering, and we cook dinner together (participation optional). One evening, we go out for dinner. For fun, there's table soccer, table tennis, pool, and a huge outdoor area. Some folks go running in the mornings before breakfast. Internet access is via Starlink, and we usually organize car pools to get to the location. 31 | --------------------------------------------------------------------------------