Death by Automatic Updates

How Heroku suffered a preventable outage after Datadog did it first. Why automatic updates are irresponsible, and what to do instead.

July 24, 2025
cloud architecture

Automatic updates. Are they a blessing, or are they a curse disguised as a blessing? In a world of rapid development, relentless innovation, and ongoing security threats (and a bunch of social media induced FOMO on top), one would be inclined to believe that automatic updates to software are not only good – they are necessary.

I think it’s a little bit more nuanced than that, and in today’s blog I’m going to discuss some high-profile outages that quite literally would not have happened without automatic updates, the implications of both having and not having automatic updates, and what I think companies should be doing instead.

Not all automatic updates are bad

Should you disable automatic updates everywhere? Maybe not. Of course there are cases to be made for quickly rolling out patches for critical security issues that are actively being exploited. And in a world of Continuous Delivery, it only makes sense that your most recently released version is also the one actually running in production – if not, why is it even released?

If you’re actually building the software then, by all means, use automatic updates. Rapid feedback and the ability to address issues that arise in-house are essential to keeping an edge in competitive markets, and they generally help engineering teams work better. Yay for automatic updates.

But some automatic updates are

The problems emerge once you start blindly trusting software from third parties. It doesn’t matter if it’s your operating system, application software, or software libraries. None of it was tested against your specific use case and architecture before it was released. That’s your responsibility, and one you can’t simply shift away.

Yet so many of us do exactly that. Not just individuals, but also large, reputable tech companies, who have millions if not billions to lose. Because in so many cases, nothing ever really goes wrong.

Until it does.

Death by Automatic Updates: Heroku

On June 10th, 2025, Heroku suffered what was arguably the biggest outage in the history of the company. You can read more about it here. They went down seemingly out of nowhere for almost a full day, and all of their infrastructure was affected. So what happened, and why did it take so long?

Heroku uses Ubuntu Linux pretty much everywhere; in this case they were running the 22.04 LTS release, which in itself isn’t anything unusual. Although Heroku have kept a lot of technical details to themselves, it seems most likely that an automated update to systemd caused a chain reaction (further supported by the fact that a new version of the systemd package for Ubuntu LTS was indeed released shortly before the outage). For those unaware, systemd is the “system manager” component used in most modern Linux distributions. It runs as process ID 1, the first process started after the kernel. If you update systemd on a live system, systemd itself is restarted (re-executed), and its own services, such as systemd-networkd, are typically restarted along with it.
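Heroku hasn’t said exactly which update mechanism was involved, but if you want to at least see a change like this coming, a dry run against your package manager is enough to flag sensitive packages before they’re applied. Below is a minimal sketch of that idea in Python, assuming a Debian/Ubuntu host with apt available; the list of ‘sensitive’ packages is made up for illustration, and apt-get -s only simulates, it changes nothing.

    # check_pending_upgrades.py - sketch: warn when sensitive packages are about to be upgraded.
    # Assumes a Debian/Ubuntu host; "apt-get -s upgrade" is a simulation and changes nothing.
    import subprocess

    SENSITIVE = {"systemd", "udev", "libc6", "openssl"}  # illustrative list, tune to your stack

    def pending_upgrades() -> list[str]:
        out = subprocess.run(
            ["apt-get", "-s", "upgrade"],  # -s = simulate / dry run
            capture_output=True, text=True, check=True,
        ).stdout
        # Upgrade lines look roughly like: "Inst systemd [249.11-...] (249.11-... Ubuntu:22.04 [amd64])"
        return [line.split()[1] for line in out.splitlines() if line.startswith("Inst ")]

    if __name__ == "__main__":
        risky = sorted(set(pending_upgrades()) & SENSITIVE)
        if risky:
            print("Heads up, sensitive packages are pending upgrade:", ", ".join(risky))
        else:
            print("No sensitive packages in the pending upgrade set.")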

This is where the problems really began.

A restart of systemd triggered a restart of its network services, and by default that restart flushes all network routes that aren’t explicitly managed through systemd. This can be a desirable default from a security perspective. However, if you are running any sort of software-defined or overlay networking solution, perhaps because you’re running Kubernetes, you will have tons of routing rules that aren’t managed by systemd. You really don’t want those rules to get flushed. In Heroku’s case, it caused their dynos (the instances that run customer applications) to lose the ability to make outbound connections.
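To make that failure mode concrete: a simple diff of the routing table is enough to spot this class of problem during a maintenance window. The sketch below (Python, assuming the ip tool from iproute2 is present) snapshots the routes before a change and reports anything that disappeared afterwards; it’s a generic aid, not how Heroku or Datadog actually detected the issue. Depending on your systemd version, systemd-networkd also has configuration options that change how it treats routes it doesn’t own, so check the networkd documentation for your release rather than trusting the defaults.

    # route_diff.py - sketch: detect routes that vanish across an update or service restart.
    # Assumes iproute2 ("ip route show") is available on the host.
    import subprocess

    def current_routes() -> set[str]:
        out = subprocess.run(["ip", "route", "show"],
                             capture_output=True, text=True, check=True).stdout
        return {line.strip() for line in out.splitlines() if line.strip()}

    if __name__ == "__main__":
        before = current_routes()
        input(f"Snapshotted {len(before)} routes. Perform the update/restart, then press Enter...")
        lost = before - current_routes()
        if lost:
            print(f"{len(lost)} route(s) disappeared:")
            for route in sorted(lost):
                print("  -", route)
        else:
            print("No routes were lost.")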

According to their own published incident timeline, it took Heroku over 3 hours to identify that they were missing routes, and another hour or two to work out exactly what was wrong and why it had happened. That’s 5 hours of completely avoidable downtime, and they hadn’t even begun fixing it. I’m not writing this to throw shade on Heroku’s engineers, but it does show that debugging such a widespread ‘surprise outage’ can take much longer than you’d expect. Especially if your internal tooling (e.g. monitoring, maybe even systems for remote access) is affected as well.

Could they have foreseen this? Maybe, yes.

Death by Automatic Updates: Datadog

Let me start by saying it’s impossible to keep track of every blog or RCA from every company on the planet. But when reading about the Heroku incident I had a proper feeling of déjà vu. In March of 2023, a little observability company called Datadog suffered a widespread outage, affecting multiple regions across multiple cloud providers, in which their systems lost connectivity. That outage lasted 2 days.

The root cause: an automatic unattended upgrade on their Linux systems (coincidentally, also Ubuntu 22.04) pulled in a new version of systemd, triggering a restart of the systemd-networkd service, which deleted all network routes that weren’t managed by systemd. Datadog had a lot of those, since they use the Cilium CNI for communication between containers. Losing those routes made their container networking fall apart in disastrous ways.

Sound familiar?

But this wasn’t systemd’s fault

So lightning apparently struck twice. Two high-profile outages caused by an update to systemd, triggering a network restart and the flushing of ‘unmanaged routes’, which in both cases was undesirable behaviour. Could Heroku have taken learnings from Datadog’s prior exposure to this failure mode? Maybe, especially given the amount of online outrage and coverage. Whenever I read the post-mortems of high-profile outages I do wonder if their particular issue ‘could happen to me’. Sometimes this results in improvements, sometimes a note in design documentation.

But it’s not just about ‘could this bug/issue affect us?’. Both Datadog and Heroku experienced widespread outages, not because of a bug, but because of how they treated updates from their OS vendor, and how they treated the updating process.

  • They hadn’t tested this new version of systemd (but it wouldn’t have mattered)
  • They hadn’t tested what happens when systemd is updated on a live system (or they would have learned of the issues with lost routes)
  • They had automatic unattended updates enabled, so they would automatically get updated to whatever Canonical released (shifting QA responsibility to a 3rd party that knows nothing about you)
  • They had no tiered/staggered rollout approach for those updates, so everything was more or less updated at the same time (so when it did go wrong, everything caught fire at the same time)

Let’s break this down some more.

Mitigation: Trust, but verify

Neither Heroku nor Datadog were running some wildly experimental version of Linux. They were running the Long Term Support release, so it’s not completely unrealistic to put some trust in that. But in general, the trustworthiness of an LTS version has more to do with stability and the lack of breaking changes. There’s no guarantee that a stock configuration (e.g. flushing unmanaged network routes on restart) won’t break your setup. There’s also no guarantee that an update to your LTS Linux distro won’t break whatever 3rd party software you run on top of it.

You need to test your changes. And not doing so is quite frankly irresponsible.

And depending on your company, it might even get you sued. If your customers lose millions because you side-stepped your Change Management process (which, according to your ISO 27001 certification, does exist), they might seek compensation.

There are many ways of doing this, depending on your architecture and your risk appetite. It could be as simple as having regular updates (from upstream) applied to a small set of systems that are ‘less important’. If no issues are found within X time, the updates are applied to everything. It’s not airtight, but it also adds very little overhead.
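As a sketch of what that could look like, assuming you already have some tooling to push updates to a group of hosts and some health signal to judge the result (apply_updates and healthy below are hypothetical placeholders for exactly that, and the soak time is arbitrary):

    # canary_updates.py - sketch: update a small group of 'less important' hosts first, then the rest.
    # apply_updates() and healthy() are hypothetical hooks into your own patching and monitoring tools.
    import time

    CANARY_HOSTS = ["canary-01", "canary-02"]        # the small, less important set
    FLEET_HOSTS = ["web-01", "web-02", "worker-01"]  # everything else
    SOAK_SECONDS = 24 * 60 * 60                      # the 'X time' from above; pick your own

    def apply_updates(hosts: list[str]) -> None:
        print(f"applying pending updates to {hosts}")  # call Ansible, SSM, your own agent, ...

    def healthy(hosts: list[str]) -> bool:
        print(f"checking health of {hosts}")           # query your monitoring: error rates, connectivity, ...
        return True

    if __name__ == "__main__":
        apply_updates(CANARY_HOSTS)
        time.sleep(SOAK_SECONDS)                       # let the canaries soak
        if not healthy(CANARY_HOSTS):
            raise SystemExit("Canaries degraded after update; not touching the rest of the fleet.")
        apply_updates(FLEET_HOSTS)                     # only now update everything else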

Mitigation: Immutable Infrastructure

If you’ve been following my ramblings you know I’m a big fan of immutable infrastructure. Treating systems as Lego bricks. You don’t change the system, you don’t reboot the system, you replace the system. This has several benefits from a security perspective, but also in the context of this particular outage:

You’re never applying updates to a live system.

The failure mode of “unmanaged network routes being deleted after systemd restart” would never happen. Instead of updating dozens or even thousands of machines, you create a new base image. You can then create new systems based on that image, perform testing, and eventually start rolling out new systems to replace old systems. The upside is that your rollback strategy also doesn’t have to rely on backups or snapshots to ‘return to a previous system state’ – you simply run a new system based on the old image.

However, these benefits come at a price: increased engineering complexity (in exchange for reduced operational complexity) and architectural constraints. You can’t ‘switch to immutable’ overnight if you’re used to running long-lived systems with years of uptime. But you can introduce this pattern for new systems, and gradually get rid of long-lived infrastructure.
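If you do go down this path, the overall shape is roughly: bake a new image, test instances built from it, and only then replace the old ones. The sketch below is a deliberately abstract outline of that loop; build_image, launch, passes_tests and retire are hypothetical stand-ins for your image pipeline (Packer, cloud image builders, etc.) and your orchestration, not any particular product’s API.

    # immutable_rollout.py - sketch: replace systems instead of updating them in place.
    # Every helper here is a hypothetical stand-in for your own image pipeline and orchestration.

    def build_image(base: str) -> str:
        """Bake a new machine image containing the latest (reviewed) package versions."""
        print(f"building new image from {base}...")
        return "img-2025-07-24"                      # placeholder image ID

    def launch(image_id: str, count: int) -> list[str]:
        print(f"launching {count} instance(s) from {image_id}")
        return [f"{image_id}-node-{i}" for i in range(count)]

    def passes_tests(instances: list[str]) -> bool:
        print(f"running smoke tests against {instances}")
        return True                                  # replace with real verification

    def retire(instances: list[str]) -> None:
        print(f"draining and terminating {instances}")

    def roll(old_instances: list[str], base_image: str, batch: int = 2) -> None:
        image = build_image(base_image)
        candidates = launch(image, count=batch)
        if not passes_tests(candidates):
            retire(candidates)                       # rollback = keep running the old image
            raise SystemExit("New image failed verification; old systems stay in place.")
        retire(old_instances[:batch])                # replace old systems, batch by batch

    if __name__ == "__main__":
        roll(["legacy-node-1", "legacy-node-2"], base_image="ubuntu-22.04-lts")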

Mitigation: Reduce blast radius

While the previous points were about trying to minimize the chance of issues, there will always be a chance that things break. So what’s the next best thing? Making sure not everything breaks at the same time, and making sure that you can fix issues relatively easily.

Tiered/staggered rollouts are key. Whether it’s updates to live systems or rolling out new ‘immutable’ systems, start small. Roll out a few, test, accept, move on. Rinse and repeat. This way you can also ensure that any infrastructure that you need for operational work isn’t breaking while everything else is.
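Continuing the earlier canary sketch, a staggered rollout is the same gate applied wave by wave, with the systems you need for operational work deliberately placed in a late wave. The wave contents, bake time and hooks below are again made up:

    # staggered_rollout.py - sketch: roll a change out in waves, halting at the first unhealthy one.
    # apply_updates() and healthy() are hypothetical hooks, as in the canary sketch above.
    import time

    WAVES = [
        ["canary-01"],                               # wave 0: a single canary
        ["web-01", "web-02"],                        # wave 1: a small slice of production
        ["web-03", "web-04", "worker-01"],           # wave 2: the bulk of the fleet
        ["monitoring-01", "bastion-01"],             # last: the tooling you need to fix everything else
    ]
    BAKE_SECONDS = 3600                              # time between waves; tune to your risk appetite

    def apply_updates(hosts: list[str]) -> None:
        print(f"updating {hosts}")                   # call your patching / image-replacement tooling

    def healthy(hosts: list[str]) -> bool:
        print(f"checking health of {hosts}")         # query your monitoring
        return True

    if __name__ == "__main__":
        for i, wave in enumerate(WAVES):
            apply_updates(wave)
            time.sleep(BAKE_SECONDS)                 # let the wave bake before judging it
            if not healthy(wave):
                raise SystemExit(f"Wave {i} unhealthy; halting before the remaining waves.")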

Get in control. Turn off automatic updates.

Heroku wrote the following in their Incident Report:

The root cause of this outage was an unexpected change to our running environment. We disabled the automated upgrade service during the incident (June 10), with permanent controls coming early next week.

Heroku enabled the automatic update service willingly. The consequence: they were no longer in control of their systems’ state. They could only configure when the service would run these updates, which package sources it could use, whether it was allowed to reboot, and where to send notifications. Activating automatic updates means you’re giving up control over what exactly gets changed, and you’re agreeing to have those changes applied without any testing or validation on your end. You’re setting yourself up for failure.
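We don’t know exactly which mechanism Heroku had enabled, but on a stock Ubuntu system these knobs usually live in apt’s configuration (the unattended-upgrades package and the APT::Periodic settings). As a starting point for an audit, here’s a small sketch that reads apt’s merged configuration via apt-config and reports whether unattended upgrades appear to be active; treat the key names as the common Ubuntu defaults rather than gospel for every image.

    # check_auto_updates.py - sketch: report whether apt's unattended upgrades look enabled on this host.
    # Uses "apt-config dump" (apt's merged configuration); assumes a Debian/Ubuntu system.
    import subprocess

    KEYS = ("APT::Periodic::Update-Package-Lists", "APT::Periodic::Unattended-Upgrade")

    def apt_config() -> dict[str, str]:
        out = subprocess.run(["apt-config", "dump"],
                             capture_output=True, text=True, check=True).stdout
        cfg = {}
        for line in out.splitlines():
            # Lines look like: APT::Periodic::Unattended-Upgrade "1";
            key, _, value = line.partition(" ")
            cfg[key] = value.strip().rstrip(";").strip('"')
        return cfg

    if __name__ == "__main__":
        cfg = apt_config()
        for key in KEYS:
            print(f"{key} = {cfg.get(key, '<unset>')}")
        enabled = all(cfg.get(key, "0") not in ("", "0") for key in KEYS)
        print("Automatic unattended upgrades appear to be", "ENABLED" if enabled else "disabled")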

Get in control. Turn off automatic updates.