Yeah, this Amazon outage was wild. It’s kind of crazy how a single DNS manager in one region managed to take out what felt like half the internet. You’d think by now, AWS would have every possible safety net in place. But this shows even the biggest players can still get tripped up by a single point of failure.
According to Amazon’s own engineers, it all started with a race condition inside DynamoDB’s DNS management system. Basically, two pieces of automation were trying to update DNS plans at the same time, and one ended up overwriting the other with bad data. When that happened, the system deleted the active DNS plan entirely. This left AWS with no valid IP addresses for a major endpoint. Everything that depended on it, which as we saw is A LOT, just went dark.
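To make that failure mode concrete, here’s a minimal sketch in Python (not AWS’s actual code; every name here is invented for illustration) of the kind of safeguard they’re now talking about: a version check so an older plan can never overwrite a newer one.

```python
import threading

class StalePlanError(Exception):
    """Raised when an enactor tries to apply a plan older than the active one."""

class DnsPlanStore:
    """Hypothetical store for the 'active' DNS plan, guarded by a version check."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active_version = 0
        self.active_records = {}

    def apply_plan(self, version: int, records: dict) -> None:
        # The guard: refuse to enact a plan whose version is not newer than the
        # one currently active. Without this check, a slow automation run still
        # holding an old plan can clobber a newer plan that another run already
        # applied -- the race described in the outage write-up.
        with self._lock:
            if version <= self.active_version:
                raise StalePlanError(
                    f"plan v{version} is not newer than active v{self.active_version}"
                )
            self.active_version = version
            self.active_records = records

store = DnsPlanStore()
store.apply_plan(2, {"dynamodb.us-east-1.example": ["10.0.0.1", "10.0.0.2"]})

try:
    # A delayed run shows up late with an older plan; it gets rejected instead
    # of silently overwriting (and later deleting) the newer one.
    store.apply_plan(1, {"dynamodb.us-east-1.example": ["10.0.0.9"]})
except StalePlanError as err:
    print("rejected:", err)
```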
The outage lasted over 15 hours. Ookla said they saw over 17 million outage reports across more than 3,000 organizations. The US, UK, and Germany got hit hardest, with big services like Snapchat, Roblox, and AWS itself all down. It’s kind of funny and not funny at the same time that the old saying “it’s always DNS” turned out to be right again.
Clear timeline of events:
There were three impact windows in us-east-1:
- 11:48 PM–2:40 AM PDT: DynamoDB API errors.
- 2:25 AM–10:36 AM PDT: EC2 launches struggled, with a network state backlog.
- 5:30 AM–2:09 PM PDT: NLB connection errors from bad health checks.
Even after they fixed DynamoDB, things didn’t just come back online right away. EC2 in us-east-1 was backed up with network state propagation issues. So, while you could technically launch new instances, they couldn’t connect to anything for a while.
Exact DynamoDB fallout
- Anything that needed the public DynamoDB endpoint in us-east-1 could not connect (see the client-side failover sketch after this list).
- Global tables kept working in other regions, but replication to us-east-1 lagged until 2:32 AM PDT.
- DNS was fully restored at 2:25 AM PDT, and customers recovered as caches expired by 2:40 AM.
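Since global tables stayed healthy in other regions, one client-side mitigation is to fail reads over to a replica region when the us-east-1 endpoint won’t resolve. A rough boto3 sketch of the idea, assuming a global table replicated to us-west-2 (the table name, the region list, and whether failover is acceptable for your consistency needs are all assumptions):

```python
import boto3
from botocore.exceptions import EndpointConnectionError, ConnectTimeoutError

# Hypothetical global table replicated to both regions.
TABLE_NAME = "sessions"
REGIONS = ["us-east-1", "us-west-2"]  # primary first, then fallback

def get_item_with_failover(key: dict) -> dict | None:
    """Try each region in order; move on if the endpoint is unreachable."""
    last_error = None
    for region in REGIONS:
        try:
            table = boto3.resource("dynamodb", region_name=region).Table(TABLE_NAME)
            response = table.get_item(Key=key)
            return response.get("Item")
        except (EndpointConnectionError, ConnectTimeoutError) as err:
            # DNS failure or unreachable endpoint in this region: try the next one.
            last_error = err
    raise last_error

# Example: item = get_item_with_failover({"session_id": "abc123"})
```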
Fresh EC2 details that were not in the media summaries
- EC2 relies on DWFM, which manages the physical hosts they call droplets.
- DWFM leases with droplets timed out during the DNS break.
- When DNS came back, DWFM tried to re-establish leases at scale and hit congestive collapse (see the backoff sketch after this list).
- Engineers throttled requests and did selective DWFM restarts at 4:14 AM to recover.
- Launches began to work at 5:28 AM, but many still hit throttles until 1:50 PM.
- Network Manager had a big backlog of network state to push, so some new instances had no connectivity until 10:36 AM.
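That congestive collapse is the classic thundering-herd problem: every host tried to renegotiate its lease at the same moment. A common defense, sketched below with made-up names like `reestablish_lease`, is capped exponential backoff with jitter plus a concurrency limit. This is the general technique, not necessarily what DWFM does internally.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 50   # cap how many lease handshakes run at once
BASE_DELAY = 0.5      # seconds
MAX_DELAY = 30.0

def reestablish_lease(droplet_id: str) -> bool:
    """Placeholder for the real lease handshake; returns True on success."""
    return random.random() > 0.2  # pretend ~80% of attempts succeed

def reestablish_with_backoff(droplet_id: str) -> None:
    attempt = 0
    while True:
        if reestablish_lease(droplet_id):
            return
        # Capped exponential backoff with full jitter: retries spread out
        # instead of every host hammering the service in lockstep.
        delay = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
        time.sleep(random.uniform(0, delay))
        attempt += 1

def recover_all(droplet_ids: list[str]) -> None:
    # The bounded pool is the concurrency throttle: only MAX_CONCURRENT
    # handshakes are in flight, so recovery ramps up instead of collapsing.
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
        pool.map(reestablish_with_backoff, droplet_ids)

# recover_all([f"droplet-{i}" for i in range(10_000)])
```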
Amazon says they’ve now disabled the automated parts of the DNS Planner and Enactor until they fix the underlying bug. They’re also adding safeguards to prevent older DNS plans from overwriting newer ones. Still, it sounds like they had to do a lot of manual cleanup to get things running again. Turns out we still need actual people to fix problems after all!
Concrete fixes AWS says they are doing
- DNS Planner and Enactor automation is disabled worldwide until the race condition is fixed and extra safeguards are added.
- NLB will get a velocity control so one load balancer cannot yank too much capacity during failover (see the rate-limiter sketch after this list).
- EC2 will add a DWFM recovery scale test suite and smarter throttling that keys off queue depth.
- They will review cross-service paths to reduce time to recovery next time.
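The NLB “velocity control” is essentially a rate limiter on capacity removal: no matter what health checks report, don’t pull more than some fraction of targets out of service per window. A toy sketch of that idea, with invented thresholds:

```python
import time

class CapacityVelocityControl:
    """Refuse to remove more than a set fraction of capacity per time window.

    Invented numbers: at most 20% of targets may be withdrawn in any
    60-second window, even if health checks mark more of them unhealthy.
    """

    def __init__(self, total_targets: int, max_fraction: float = 0.20,
                 window_seconds: float = 60.0):
        self.max_removals = max(1, int(total_targets * max_fraction))
        self.window_seconds = window_seconds
        self.removal_times: list[float] = []

    def allow_removal(self) -> bool:
        now = time.monotonic()
        # Drop removals that have aged out of the window.
        self.removal_times = [t for t in self.removal_times
                              if now - t < self.window_seconds]
        if len(self.removal_times) >= self.max_removals:
            return False  # velocity limit hit: keep the target in rotation
        self.removal_times.append(now)
        return True

control = CapacityVelocityControl(total_targets=100)
removed = sum(1 for _ in range(40) if control.allow_removal())
print(f"allowed {removed} removals this window")  # capped at 20, not 40
```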
What’s really interesting is what Ookla pointed out about us-east-1 being a single choke point for so many services. Even apps that are “global” often depend on that region for authentication or some kind of metadata. So when it goes down, everything built on top of it starts failing too. That’s a big red flag for anyone designing supposedly “resilient” systems.
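One practical way to act on that is to make hidden single-region dependencies visible: probe each critical endpoint your “global” app calls and flag anything pinned to us-east-1. A toy sketch (the hostnames are placeholders; substitute your own dependencies):

```python
import socket
import urllib.request

# Placeholder endpoints; substitute the auth/metadata services your app calls.
CRITICAL_DEPENDENCIES = {
    "auth": "auth.example.com",
    "metadata": "metadata.us-east-1.example.com",  # region baked into the name
}

def probe(host: str) -> str:
    try:
        socket.getaddrinfo(host, 443)  # does DNS even resolve?
    except socket.gaierror:
        return "DNS FAILED"
    try:
        urllib.request.urlopen(f"https://{host}/", timeout=5)  # does it answer?
        return "OK"
    except Exception as err:
        return f"UNREACHABLE ({type(err).__name__})"

for name, host in CRITICAL_DEPENDENCIES.items():
    pinned = "  <- pinned to us-east-1" if "us-east-1" in host else ""
    print(f"{name:10s} {host:40s} {probe(host)}{pinned}")
```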
The big takeaway here isn’t just about bugs or DNS. It’s about design. If one region or service can take down your entire operation, you’re setting yourself up for this kind of domino effect. AWS will recover, of course, but it’s a reminder that “cloud” doesn’t automatically mean “bulletproof.”
What do you all think? Is this just a rare slip-up for Amazon, or does it show how fragile centralized cloud architecture really is? Anyone else rethinking their reliance on a single region or provider after this one? What about repatriation?
Read the official AWS blog post on the outage here: Summary of the Amazon DynamoDB Service Disruption in the Northern Virginia (US-EAST-1) Region
