Field Notes

When us-east-1 sneezes, Global catches a cold

Notes on the recent AWS outage and what “Global” actually means

So, in case you missed the familiar “half the Internet is down” messages everyone in IT was sharing: AWS had an outage. It happened in the morning and caused a lot of confusion. Here’s a concise, non-nerd version of what went wrong and why an outage in one region can feel global.

What happened (high level)

AWS published a full incident write-up, but the short version is: there was a race condition in us-east-1’s DynamoDB DNS system that resulted in DNS records being deleted. DNS is like a phone book for the internet; without it, systems don’t know how to find one another, so server-to-server communication breaks down.

A race condition happens when two or more processes work on the same data at the same time and the outcome depends on which one finishes first; if they run in an unexpected order, they can undo or overwrite each other’s work. Because DynamoDB is heavily used inside AWS, requests backed up while the system was recovering; once things were resolved, the backlog flooded services and caused knock-on effects across other AWS offerings.
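If that sounds abstract, here’s a toy sketch in Python (purely illustrative, and nothing like AWS’s actual DNS automation): two workers touch the same record without coordinating, and depending on who finishes last, a perfectly good record ends up deleted.

    import threading
    import time

    # A toy "DNS table" (name -> endpoint). Purely illustrative; this is not how
    # DynamoDB's DNS automation actually works.
    dns_records = {"db.internal": "10.0.0.1"}

    def writer():
        # Worker A: publishes a fresh, valid record a moment later.
        time.sleep(0.1)
        dns_records["db.internal"] = "10.0.0.2"

    def cleaner():
        # Worker B: decides the old record (10.0.0.1) should be removed, but by
        # the time the delete actually runs, Worker A has already replaced it.
        old = dns_records["db.internal"]   # sees the stale 10.0.0.1 value
        time.sleep(0.2)
        if "db.internal" in dns_records:
            del dns_records["db.internal"]   # deletes the new record as well

    a, b = threading.Thread(target=writer), threading.Thread(target=cleaner)
    a.start(); b.start(); a.join(); b.join()
    print(dns_records)   # {} -- the fresh record is gone and lookups now fail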

Why did an outage in us-east-1 affect services worldwide?

A lot of people asked: if this happened in us-east-1, why were UK services like HMRC affected?

AWS divides its infrastructure into geographical regions, for example, us-east-1 (Northern Virginia), eu-west-2 (London), and ap-northeast-1 (Tokyo). Regions help reduce latency, address data-sovereignty needs, and provide some level of redundancy. But many AWS features need to run across regions for convenience and manageability. For example, IAM (identity and access management) is presented as a single, account-wide service so you don’t have to copy IAM policies into every region or re-authenticate every time you switch regions. To support that, AWS provides a “Global” layer.
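To make the regional/Global split concrete, here’s a small boto3 sketch (assuming the SDK is installed and credentials plus a default region are configured; it illustrates the shape of the split rather than anything from AWS’s incident report):

    import boto3

    # Regional service: you choose a region, and the resources you see belong to
    # that region only.
    ec2_london = boto3.client("ec2", region_name="eu-west-2")
    london = ec2_london.describe_instances()

    ec2_tokyo = boto3.client("ec2", region_name="ap-northeast-1")
    tokyo = ec2_tokyo.describe_instances()   # a completely separate set of instances

    # "Global" service: IAM users and policies are account-wide, so the same call
    # gives the same answer no matter which region your session points at.
    iam = boto3.client("iam")
    print([u["UserName"] for u in iam.list_users()["Users"]])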

The catch is that “Global” isn’t magic. Global services are typically split into two surfaces:

  • Data plane: the regional part, close to your workload (what actually runs where your users are).
  • Control plane: the centralized surface that manages and configures the service, and which often lives in a single region (commonly us-east-1).

If the control plane for a Global service is unavailable, for example because us-east-1 is having issues, parts of that service can break in all regions, even if the workloads and data are regional. That’s why an outage in us-east-1 can ripple out and interrupt systems across the globe.
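As a hypothetical illustration (not a claim about which exact API calls failed during this incident), the split looks something like this: a regional, data-plane style call can keep answering while an account-wide, control-plane style call fails.

    import boto3
    from botocore.exceptions import ClientError, EndpointConnectionError

    # Data-plane-ish call: validating the credentials you already hold. STS has
    # regional endpoints, so this can keep working while us-east-1 is unhappy.
    sts = boto3.client("sts", region_name="eu-west-2")
    print(sts.get_caller_identity()["Account"])

    # Control-plane call: creating an IAM user is a management operation on the
    # account-wide IAM service. If the centralized control plane is impaired,
    # this can fail even though nothing is wrong in eu-west-2.
    iam = boto3.client("iam")
    try:
        iam.create_user(UserName="outage-test-user")   # hypothetical user name
    except (ClientError, EndpointConnectionError) as err:
        print("Control-plane operation failed:", err)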

It’s also worth noting that AWS uses its own services heavily. If DynamoDB is down, services that depend on a fast database (including AWS’s own control plane components) can be impaired. Because many control planes live in us-east-1, a failure there can have wide impact.

The race for solutions: single-region, multi-region, multi-cloud

Outages like this kick off lots of comments: “You should’ve prevented this with multi-region or multi-cloud.” That’s a fair reaction, but it’s worth calling out the tradeoffs people often overlook.

Cost

Running in multiple regions, and especially across multiple clouds, multiplies costs. One comment I saw proposed running everything in at least two regions in at least two clouds. That’s roughly a 4× cloud bill before you even count the ingress/egress charges to keep state synchronized, the complexity of cross-cloud networking, and the engineering and operational overhead of running it all. This is “Big Budget” energy, as a colleague once put it.

Benefit

  • Multi-region (single cloud) can be very effective and is usually the more cost-efficient way to increase regional resilience. If eu-west-2 won’t let you create EC2 instances, failing over to eu-west-1 can be seamless for many workloads (a minimal sketch of that failover follows this list).
  • Multi-cloud gives you an extra level of isolation: if AWS as a whole is down, you can fail over to GCP or Azure. In practice, true seamless failover across clouds is expensive and operationally challenging, and therefore generally sensible only for very critical services.
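Here’s the eu-west-2 → eu-west-1 failover idea from the first bullet as a minimal boto3 sketch. The AMI IDs are placeholders (image IDs are region-specific), and real failover usually also involves data replication, DNS and health checks; this only shows the “try region A, fall back to region B” shape.

    import boto3
    from botocore.exceptions import ClientError

    # Placeholder AMI IDs -- replace with real, region-specific image IDs.
    REGIONS = {
        "eu-west-2": "ami-00000000000000000",
        "eu-west-1": "ami-11111111111111111",
    }

    def launch_with_failover(instance_type="t3.micro"):
        last_error = None
        for region, ami in REGIONS.items():
            ec2 = boto3.client("ec2", region_name=region)
            try:
                resp = ec2.run_instances(
                    ImageId=ami, InstanceType=instance_type, MinCount=1, MaxCount=1
                )
                return region, resp["Instances"][0]["InstanceId"]
            except ClientError as err:
                last_error = err   # e.g. capacity or API errors in this region
        raise RuntimeError(f"all regions failed: {last_error}")

    # region, instance_id = launch_with_failover()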

The Global-service problem

For services with a Global control plane that is centralized in one region, multi-region within the same cloud buys you little extra resilience: your regional copies still depend on the same control plane. In those cases, the difference between single-region and multi-region is smaller than people assume.

Practical takeaways

There’s no single answer that fits every company. The right approach depends on the business impact and the risk profile of each resource. A sensible program looks like this:

  1. Inventory your key production resources and identify which ones are critical to business operations.

  2. Classify those resources by failure mode (regional vs. global control plane, data-sovereignty needs, latency constraints).

  3. Assess risk and cost for each resource, estimate the business impact of a regional outage and weigh that against the cost of increased resilience.

  4. Apply controls appropriate for the resource: multi-region replication, active-active configurations, managed failover, or multi-cloud only where it makes economic sense.

  5. Test your failover mechanisms regularly; resilience without rehearsal is guesswork.
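To make steps 1–3 less abstract, here’s a rough sketch of what such an inventory and rule of thumb could look like. Every name, cost figure, and threshold below is invented purely for illustration; the point is the shape of the exercise, not the numbers.

    # Hypothetical inventory entries -- names and figures are made up.
    resources = [
        {"name": "payments-api",  "control_plane": "regional", "hourly_outage_cost": 50_000},
        {"name": "internal-wiki", "control_plane": "regional", "hourly_outage_cost": 200},
        {"name": "public-dns",    "control_plane": "global",   "hourly_outage_cost": 30_000},
    ]

    def recommend(res, multi_region_monthly_cost=5_000, expected_outage_hours_per_year=8):
        # Crude rule of thumb: weigh the cost of a regional outage against the
        # cost of running the extra region.
        if res["control_plane"] == "global":
            # Multi-region in the same cloud won't help if the control plane is
            # centralized -- the choice is multi-cloud or accepting the risk.
            return "multi-cloud (or accept the risk)"
        yearly_impact = res["hourly_outage_cost"] * expected_outage_hours_per_year
        if yearly_impact > multi_region_monthly_cost * 12:
            return "multi-region"
        return "single-region"

    for res in resources:
        print(res["name"], "->", recommend(res))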

Summary

When us-east-1 sneezes, Global can catch a cold. The recent outage highlighted two key realities:

  1. Many “Global” services have centralized control planes that introduce single points of failure.

  2. Multi-region or multi-cloud strategies carry real costs and complexity.

The answer to “What is the right course of action?” is: it depends. Do the work to classify your resources, estimate business impact, and apply the right controls for the right systems.

Written by Phil Chambers