How Meta runs blameless Post-mortems at scale - Efficient incident reviews w/o the Politics for a Product org of 30,000

Date: 2024-03-04 | create | tech | business | productivity | software-engineering | creation-cycle |

I worked at Instagram for 3.5 years. In that time I ran over 50 incidents (called SEVs) from inception to post-mortem.

Since then I've gone on to several other roles - from startup SWE to solo founder to scaleup SWE (where I am now). I've observed how eng processes work from planning to building to incident response. It's made me appreciate Meta's approach to post-mortems even more for its simplicity and effectiveness.

In this post I'm going to share the post-mortem process we used to effectively respond to incidents and prevent them from happening again - without the politics and drama.

Blameless Post-Mortems

First off - blameless post-mortems are table stakes for effective problem solving. If you are in a blame-first culture, your org is not (and cannot be) effective.

This is simply because problem solving requires taking risks. It requires the ability to create, share, and test ideas that may at first seem illogical.

If you cannot do that then you will be stuck in the status quo. And guess what? The status quo is what got you to this incident in the first place.

Not changing the status quo is akin to asking for the incident to happen again.

So blameless post-mortems are good. They allow people to bring their whole self (and their whole brain) to the table to solve the problem at hand.

Meta's Post-Mortem process

Meta is a planetscale company. Most orgs are not planetscale. So many of their processes are built to scale well beyond what most orgs need.

But I still think it's a useful case study of a Simple Scalable System so here I want to start with a high-level view of the process before getting into specific steps.

Meta's Incident process:

  • Incident (SEV) happens
  • Create Incident in SEV tool - This is custom software that acts as a single source of truth for the SEV and aids in status updates, notifications, escalation calls, timelines, reports, discussion, etc
  • Tackle incident - whatever that requires
  • Update incident tool with report on what happened (root cause and impact) and timeline of events from start -> finish

The highest-impact SEVs are then picked up in recurring meetings called SEV Reviews. There are many different versions of these - usually a company-wide one for the worst SEVs and smaller, more focused ones for particular orgs (like Instagram or Facebook Ads or Privacy).

These SEV Reviews are meant to serve as additional eyes / perspectives to share knowledge of what happened and provide input on how to improve - very similar to code reviews. They typically include:

  • SEV Owner - The Point of Contact for the SEV, typically someone who owns the particular domain that caused it or that was particularly affected by the issue
  • SEV Experts - Experts on Incident response and overall domain that run the meeting and provide feedback on tools / techniques from across the org
  • Relevant Stakeholders - May be owners of relevant tools, similar domains, or downstream products that were affected by this issue and are invested in its fix

This serves as a relatively lightweight way to ensure that:

  • The Incident is thoroughly understood
  • Followups are created / executed to prevent it from happening again
  • Knowledge is shared so that other teams can learn from this and stop it from happening in their own domain

Now let's talk about what actually happens in a Post-Mortem.

Meta's Incident Post-Mortem Steps

I think the most important part of Meta's Post-mortem process is the set of steps / categories they go through when reviewing a SEV. These matter because they force people to think through the entire process of incident response, find and eliminate bottlenecks, and so prevent similar incidents more effectively in the future.

Plus these are steps that you can easily bring with you to your own org to improve incident response.

To start, we share the most important details of the SEV so everyone's on the same page:

  • Overview - What happened
  • Impact - What was the impact on customers / the business
  • Root Cause - Why did this happen -> be specific with code pointers!

Then the Post-Mortem steps follow the timeline of the Incident from start -> fixed. This is useful because you can then easily spot bottlenecks in the process and come up with followups to improve in the future.

  • Incident Start - When the issue started and what caused it
  • Detection - When we detected it and how
  • Escalation - When the issue arrived at the appropriate team and how it got there
  • Mitigation - When the issue was first mitigated and how (i.e. we stopped the bleeding)
  • Remediation - When we were able to fix the issue for all affected parties (i.e. we healed the wound)
  • Close - When all followups were completed

This may seem pedantic but I can assure you from experience that this extra bit of writing / thinking often leads to ideas for simple changes that could vastly improve time to fix.

An example might be:

  • Incident start -> Mitigation took 2d and 12h - this is too long so we need to mitigate faster

But what if we look at the breakdown of events and it's like:

  • Detection - 10 mins after start
  • Escalation - 2 days of bouncing around to find the appropriate team
  • Mitigation - 12h to mitigate

Here mitigation is an issue but actually Escalation is the biggest bottleneck. Perhaps ownership / alerts need to be tweaked to get to the right team faster.
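To make that breakdown concrete, here's a minimal Python sketch. The timestamps are hypothetical values chosen to match the example above (not real incident data), and `phase_durations` / `biggest_bottleneck` are made-up names for illustration - this is not how Meta's SEV tool works internally.

```python
from datetime import datetime, timedelta

# Hypothetical timeline matching the example above (assumed values, not real data).
start = datetime(2024, 1, 1, 9, 0)
detected = start + timedelta(minutes=10)
escalated = detected + timedelta(days=2)
mitigated = escalated + timedelta(hours=12)

timeline = [
    ("Incident Start", start),
    ("Detection", detected),
    ("Escalation", escalated),
    ("Mitigation", mitigated),
]


def phase_durations(events):
    """Return (phase, time spent getting to that phase) for each consecutive pair."""
    return [
        (name, ts - prev_ts)
        for (_, prev_ts), (name, ts) in zip(events, events[1:])
    ]


def biggest_bottleneck(events):
    """Return the (phase, duration) pair that took the longest to reach."""
    return max(phase_durations(events), key=lambda pair: pair[1])


durations = phase_durations(timeline)
bottleneck = biggest_bottleneck(timeline)

for name, duration in durations:
    print(f"{name}: {duration}")
print("Biggest bottleneck:", bottleneck[0])
```

Looking only at the total (start -> Mitigation) hides which phase ate the time. Splitting it per phase is what points the followups at Escalation instead of Mitigation.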

For each of these steps, those present at the SEV review walk through:

  • What happened?
  • Why did it take so long?
  • How can we improve it in the future, pulling from our combined experience / perspectives?

Then the SEV owner is tasked with prioritizing and delegating the followups before closing out the SEV.
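If you don't have a dedicated SEV tool, followups can be tracked as plain data. Below is a minimal Python sketch under stated assumptions: the `Followup` and `SevReview` names, their fields, and the priority scale are all invented for illustration and are not Meta's tooling. The one rule it encodes comes from the steps above - a SEV only closes once all followups are completed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Followup:
    description: str
    phase: str                   # phase it improves, e.g. "Escalation"
    priority: int                # assumed scale: 1 = most urgent
    owner: Optional[str] = None  # set when the SEV owner delegates it
    done: bool = False


@dataclass
class SevReview:
    title: str
    followups: List[Followup] = field(default_factory=list)

    def prioritized(self) -> List[Followup]:
        """Followups ordered from most to least urgent."""
        return sorted(self.followups, key=lambda f: f.priority)

    def unassigned(self) -> List[Followup]:
        """Followups that still need to be delegated."""
        return [f for f in self.followups if f.owner is None]

    def can_close(self) -> bool:
        """True only when every followup has an owner and is done.

        Note: a review with no followups at all is considered closable.
        """
        return all(f.owner is not None and f.done for f in self.followups)


# Hypothetical usage.
review = SevReview("Example SEV")
review.followups.append(
    Followup("Document rollback steps", "Mitigation", priority=2, owner="alice")
)
review.followups.append(
    Followup("Route alerts to the owning team", "Escalation", priority=1)
)

print([f.description for f in review.prioritized()])
print("Can close:", review.can_close())
```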

Next

I believe Reflection is a crucial part of any effective cycle - and it's a core part of my personal OS, the Creation Cycle. If you don't learn from what you did, you can't improve. Improving a little bit every cycle is the Simple Scalable System for continuous improvement.

SEV Reviews (like Code Reviews) are a relatively lightweight process that provides unbounded upside. As such they are 3S Approved in my book and I happily recommend them to any engineering org.
