Keeping the Stack Fresh
Keeping your software libraries and stack fresh: my after-action report.
I am not a creative writer by trade or interest, so in some sections I’ve leaned on AI to get my points across in a semi-coherent way. However, my experiences are real.
It’s hard to track where an issue stems from when everything is neglected.
What the Industry Says
During my research on this topic, I came across a two-year-old Elastic article discussing how regularly updating their Elasticsearch instances not only improved performance but also saved them money in the long run. Financial benefits aside, security should always be part of our processes, but some teams struggle with how to implement good security practices. This JFrog article from 2023 highlights the risks of outdated dependencies, including security vulnerabilities, lack of support, and overall stability concerns. Who would have thought keeping software up-to-date was a good security practice?
These articles reflect the insights of industry leaders and organizations we often emulate for their best practices and ideas. I don’t see myself as an influential figure, but I wanted to back up my points with examples from these more influential corners of the tech world. My experiences might be more relatable than groundbreaking, but I believe they reflect challenges that are surprisingly common across the industry. I’d even argue that teams at Elastic and JFrog still battle with the very topic of maintaining software dependencies.
___
Personal Experiences
I’ve shared some insights into what larger organizations have achieved by maintaining updated dependencies in their code bases. Like some of you, I learn best through hands-on experience and examples. It’s hard to truly understand the consequences of neglecting updates until you’ve faced them firsthand. Here are three real-world examples where failing to keep dependencies and codebases up-to-date nearly caused major headaches for me and the teams I was part of.
1. Helm Charts
Hopefully most of us lean on community-maintained charts, or charts built by the application maintainers themselves. At my first job using Helm and Kubernetes, this practice was the exception, not the rule. Charts for Grafana, ExternalDNS, cert-manager, and Prometheus were all custom-made from the ground up.
I walked into a collection of roughly 20 applications, all with custom-built Helm charts and a variety of ways to configure everything from autoscaling to deployment resources. There was no consistency across the charts for any given resource; it was as if 20 individuals had worked on them in isolation. The headache was made worse by the fact that they were all built against v1beta1 Kubernetes APIs, which were being deprecated around the time I arrived. Google Kubernetes Engine was automatically upgrading away from these deprecated APIs, creating a hard requirement and timeline if we wanted to keep deploying these applications through Helm.
That entire team left the company together. Contractors were brought in and a new in-house team was built. We had just three months before Google forced our Kubernetes Engine upgrade, and no institutional knowledge to tap into as to why certain choices had been made. In total, the business had four full-time employees converting Helm charts to community-maintained versions over those three months.
We crushed it in the end! All our base charts were migrated to community-maintained versions, and we learned a lot about the ecosystem we had inherited along the way. Finally, we were able to point Renovate at the community-maintained charts, letting us review feature changes and any deprecations within the pull request process. Updating a chart became a simple approval, or at most a minor refactor when one was required. Working this way also kept us up-to-date with feature releases for each of our applications.
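For anyone who hasn’t wired this up before, here is a minimal sketch of a repository-level Renovate config that watches community-maintained charts. The chart names, group name, and schedule are illustrative, not a copy of our setup.

```json
{
  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
  "extends": ["config:recommended"],
  "packageRules": [
    {
      "description": "Illustrative: group monitoring chart bumps so related upgrades land in one PR",
      "matchManagers": ["helmv3"],
      "matchPackageNames": ["kube-prometheus-stack", "grafana"],
      "groupName": "monitoring charts"
    },
    {
      "description": "Illustrative: batch minor and patch chart bumps into a weekly window",
      "matchManagers": ["helmv3"],
      "matchUpdateTypes": ["minor", "patch"],
      "schedule": ["before 6am on monday"]
    }
  ]
}
```

With something like this in place, each chart update arrives as a pull request with the version diff and release notes attached, which is what turned updating a chart into a quick approval or, at most, a small refactor.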
2. Terraform Providers
At another organization, we relied heavily on Cloudflare for application ingress and edge networking. Their Web Application Firewall (WAF) allowed us to enforce access control rules, including A/B testing, geolocation blocking, and bot detection.
The problem? Cloudflare announced a complete overhaul of their WAF. New API endpoints meant new Terraform resources in the provider, which in turn meant a total overhaul of the infrastructure-as-code setup around our WAF rules. If we did nothing, Cloudflare would automatically migrate our rules, creating state inconsistencies and forcing manual intervention in our otherwise fully automated Terraform setup. In fact, Terraform would fail to run at all if we kept the no-longer-usable resources in the project.
Our best option was to wait for the maintainers of the Terraform provider to support the new WAF resources, then build out the new resources, migrate the rules through the Cloudflare UI, and run Terraform imports. Unfortunately, as the months passed, we kept procrastinating on the migration. Eventually, we had no choice: Cloudflare was a month away from enforcing the changes and migrating our rules for us.
Thanks to an incredible engineer on our team, the task was completed with zero production impact. He even managed to consolidate and remove old rules, and he uncovered issues with rules we were actively relying on, like our bot-detection scoring. While it worked out in the end, the experience was a stark reminder of why proactive updates are critical. Dealing with deprecations sooner rather than later became central to how we worked from then on.
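To make the shape of that migration a little more concrete, here is a rough sketch using the provider’s cloudflare_ruleset resource: a legacy WAF rule re-expressed under the new model and then pulled into Terraform state. The names, the expression, the rule itself, and the import ID format are all illustrative and depend on your provider version; treat it as a sketch of the technique, not our actual code.

```hcl
# Sketch only: a single custom WAF ruleset replacing a pile of legacy rule resources.
# Exact field names and the import ID format vary with the Cloudflare provider version.
resource "cloudflare_ruleset" "waf_custom" {
  zone_id = var.zone_id
  name    = "custom-waf-rules"
  kind    = "zone"
  phase   = "http_request_firewall_custom"

  rules {
    action      = "block"
    expression  = "(cf.bot_management.score lt 30)"
    description = "Block likely bots (illustrative threshold)"
    enabled     = true
  }
}

# After recreating the rules, import the ruleset Cloudflare generated during its own
# migration so Terraform state matches reality, along the lines of:
#   terraform import cloudflare_ruleset.waf_custom zone/<zone_id>/<ruleset_id>
```

Getting the imports done before the vendor’s deadline is what keeps an otherwise automated Terraform setup from drifting out of sync with what is actually deployed.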
3. Making Updates Easy
Cultural buy-in is important for creating an attitude where continuous maintenance of dependencies is just part of the daily work. Equally important is making the entire process simple and efficient. My next example shows how simplicity gave our security team a bit of peace of mind, not only in the infrastructure itself but in the team managing it (us).
Our security team was highly vigilant about vulnerabilities in our container library and the underlying software and libraries used in those base images. If you remember incidents like Log4j or the high-profile SSH CVEs from 2023 and 2024, you’ll understand their concerns. Thankfully, our platform team had decoupled the update and testing processes for container images from the actual build and testing of application code.
Using Renovate, we pulled in the latest image versions from external registries, built and tested each container, and shipped those images to our registry. Our cloud vendor’s tooling allowed us to assess every image for potential vulnerabilities, dramatically shortening the feedback loop for applying security patches or retiring images flagged as too risky by the security team.
For development teams, this meant deploying updated images was as simple as changing the image tag in their build configurations. (As a bit of a digression: each Jenkins build for each microservice took an input parameter for the image tag.) Developers could also build and test their code against multiple image versions in parallel, enabling quicker validation of updates within our microservice architecture.
To take this a step further, every codebase used our registry as the source of truth for images. We configured Renovate to automatically generate pull requests for these codebases whenever there was a major version update to any image in our library. This ensured that every up-to-date image we shipped triggered the necessary changes in the developers’ repositories. Developers no longer had to manually check the registry; they only needed to review and test the automatically generated PRs.
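As a rough illustration of that wiring, here is what a Renovate config along those lines can look like; the label and the exact update policy are made up for the example, and your own rules would differ.

```json
{
  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
  "extends": ["config:recommended"],
  "packageRules": [
    {
      "description": "Illustrative: raise labelled PRs for major base-image bumps so teams review and test them",
      "matchDatasources": ["docker"],
      "matchUpdateTypes": ["major"],
      "labels": ["base-image-update"],
      "automerge": false
    },
    {
      "description": "Illustrative: let patch-level image bumps auto-merge once CI passes",
      "matchDatasources": ["docker"],
      "matchUpdateTypes": ["patch"],
      "automerge": true
    }
  ]
}
```

In a setup like this, a patch-level rebuild of a base image flows straight through CI, while a major bump shows up as a labelled PR the owning team has to look at: exactly the review-and-test flow described above.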
Lastly, we tried to reduce base-image vendor diversity as much as possible. Maintaining the same or similar pipelines for Alpine, Ubuntu, and Debian meant too many moving parts and created too many ‘vectors of attack’ for the security team. Tracking down errors that showed up on one vendor’s images but not another’s had become a sore point, and this standardization removed it.
___
In Conclusion
We learn the most from our failures, and from the failures of others. Think of this as an after-action report on the consequences of neglecting updates or overcomplicating things, and on how you can improve based on these experiences.
Humans, like water, take the path of least resistance. If a process is too complex or time-consuming, people will find excuses to avoid it. This avoidance creates a cascade of issues: third-party bugs go unfixed, security holes remain open, and the overall cost of maintainability skyrockets. Engineers grow frustrated as they repeatedly dedicate large blocks of time to fixing problems that could have been avoided through smaller, isolated iterations.
The solution lies in minimizing friction: decoupling dependencies, simplifying processes, and automating where possible. Make updating your stack a routine task, a five-to-ten-minute chore rather than a daunting project. With the right practices in place, you can ensure security, stability, and scalability while keeping your team focused on innovation rather than firefighting.