Date: 2024.03.04
I worked at Instagram for 3.5 years. In that time I ran over 50 incidents (called SEVs) from inception to post-mortem.
Since then I've gone on to several other roles - from startup SWE to solo founder to scaleup SWE (where I am now). I've observed how eng processes work from planning to building to incident response. It's made me appreciate Meta's approach to post-mortems even more for its simplicity and effectiveness.
In this post I'm going to share the post-mortem process we used to effectively respond to incidents and prevent them from happening again - without the politics and drama.
First off - blameless post-mortems are table stakes for effective problem solving. If you are in a blame-first culture, your org is not (and cannot be) effective.
This is simply because problem solving requires taking risks. It requires the ability to create, share, and test ideas that may at first seem illogical.
If you cannot do that then you will be stuck in the status quo. And guess what? The status quo is what got you to this incident in the first place.
Not changing the status quo is akin to asking for the incident to happen again.
So blameless post-mortems are good. They allow people to bring their whole self (and their whole brain) to the table to solve the problem at hand.
Meta is a planet-scale company. Most orgs are not planet-scale. As a result, many of Meta's processes are built to scale well beyond what most orgs need.
But I still think it's a useful case study of a Simple Scalable System, so I want to start with a high-level view of the process before getting into specific steps.
Meta's Incident process:
The highest-impact SEVs are then picked up in recurring meetings called SEV Reviews. There are many different versions of these - usually a company-wide one for the worst SEVs and smaller, more focused ones for particular orgs (like Instagram or Facebook Ads or Privacy).
These SEV Reviews are meant to serve as additional eyes / perspectives to share knowledge of what happened and provide input on how to improve - very similar to code reviews. They typically include:
This serves as a relatively lightweight way to ensure that:
Now let's talk about what actually happens in a Post-Mortem.
I think the most important thing about Meta's Post-Mortem process is the set of steps / categories they go through when reviewing a SEV. These matter because they force people to think through the entire incident response, from start to fix, so they can find and eliminate the bottlenecks and prevent similar incidents more effectively in the future.
Plus these are steps that you can easily bring with you to your own org to improve incident response.
To start, we share the most important details of the SEV so everyone's on the same page:
Then the Post-Mortem steps follow the timeline of the Incident from start to fix. This is useful because you can then easily spot bottlenecks in the process and come up with followups to improve in the future.
This may seem pedantic, but I can assure you from experience that this extra bit of writing / thinking often leads to ideas for simple changes that could vastly improve time to fix.
An example might be:
But what if we look at the breakdown of events and it's like:
Here mitigation is an issue but actually Escalation is the biggest bottleneck. Perhaps ownership / alerts need to be tweaked to get to the right team faster.
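To make the bottleneck hunt concrete, here is a minimal Python sketch of the idea: take the timestamps of each step in an incident and compute how long was spent between them. The step names (`started`, `detected`, `escalated`, `mitigated`), the timestamps, and the helper functions are all hypothetical and made up for illustration - this is not Meta's tooling or data from a real SEV.

```python
from datetime import datetime

# Hypothetical incident timeline. Step names and timestamps are illustrative
# assumptions, not data from a real SEV.
timeline = [
    ("started", datetime(2024, 3, 4, 10, 0)),
    ("detected", datetime(2024, 3, 4, 10, 5)),
    ("escalated", datetime(2024, 3, 4, 10, 50)),
    ("mitigated", datetime(2024, 3, 4, 11, 10)),
]


def phase_durations(events):
    """Return the minutes spent between each pair of consecutive events."""
    durations = {}
    for (prev_name, prev_time), (name, time) in zip(events, events[1:]):
        minutes = (time - prev_time).total_seconds() / 60
        durations[f"{prev_name} -> {name}"] = minutes
    return durations


def biggest_bottleneck(events):
    """Return the label of the phase that took the longest."""
    durations = phase_durations(events)
    return max(durations, key=durations.get)


durations = phase_durations(timeline)
bottleneck = biggest_bottleneck(timeline)

for phase, minutes in durations.items():
    print(f"{phase}: {minutes:.0f} min")
print(f"Biggest bottleneck: {bottleneck}")
```

In this made-up timeline, mitigation takes 20 minutes, but the 45 minutes between detection and escalation is the larger chunk, so that's where followups would likely pay off most.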
For each of these steps, those present at the SEV review walk through:
Then the SEV owner is tasked with prioritizing and delegating the followups before closing out the SEV.
I believe Reflection is a crucial part of any effective cycle - and it's a core part of my personal OS, the Creation Cycle. If you don't learn from what you did, you can't improve. Improving a little bit every cycle is the Simple Scalable System for continuous improvement.
SEV Reviews (like Code Reviews) are a relatively lightweight process that provides unbounded upside. As such they are 3S Approved in my book and I happily recommend them to any engineering org.