Postmortem: The Day Our Cache Stampeded
A popular key expired at peak traffic and 4,000 requests hit the database at once. Here is the timeline and the fix.
Author
Site Reliability Engineer. On call so you don't have to be. Loves a good runbook.
4 posts by Tom Becker.
A popular key expired at peak traffic and 4,000 requests hit the database at once. Here is the timeline and the fix.
Rolling deploys, health checks that actually check health, and the database migration rule that saved us.
RSS climbed 40MB an hour until the pod OOM'd. A heap snapshot and one closure later, we found it.
Logs told us what happened. Traces told us where the time went. The difference was a 3am incident solved in ten minutes.