Essay - Published: 2025.12.12 | create | iamhamy | incident | post-mortem |
On 2025.11.22 hamy.xyz went down, returning 404s to users. Post-mortems are a useful tool for reflecting on incidents to better understand what happened and prevent them from recurring.
hamy.xyz is the public face of my brand so it's not great if it's down. Here we'll do a light post-mortem.
On 2025.11.22t~1100 ET a new note was created with a slug that collided with that of an existing blog post. This broke the server's slug-to-post mapping, but the revision was still pushed through CI.
hamy.xyz went down, returning 404s to users.
It was detected at 11.23t0600 ET (18h later) and resolved by 11.23t0700 (19h).
Users: No users could reach the site for 19h - they just saw 404s.
Company: This drops web traffic by ~1/30 for the month. It blocked many forward links to various money-making projects. It may have an unknown long-term impact on SEO / AIEO.
Notes are a new part of the website that were added after my migration from F# to C# as a way to build the site into more of a digital garden.
For simplicity, I had separated the folders that the markdown files went into:
In order to achieve bidirectional links (when one post links to another, a backlink is created), I build a map of slugs to the posts they link to and then reverse it. There's a guard in there for duplicate slugs that breaks the build as that would lead to overwritten data.
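To make the mapping concrete, here is a minimal Python sketch of the idea (the site itself is C#, so this is not the actual implementation). The names `build_backlinks` and `DuplicateSlugError` are hypothetical, and the input shape, a list of (slug, outgoing links) pairs, is an assumption made for illustration.

```python
from collections import defaultdict


class DuplicateSlugError(Exception):
    """Raised when two documents claim the same slug (hypothetical name)."""


def build_backlinks(docs):
    """docs: iterable of (slug, outgoing_slugs) pairs.

    Returns a dict mapping each linked-to slug to the sorted slugs that link to it.
    """
    outgoing = {}
    for slug, links in docs:
        # Guard: a repeated slug would silently overwrite the earlier entry.
        if slug in outgoing:
            raise DuplicateSlugError(f"duplicate slug: {slug}")
        outgoing[slug] = list(links)

    # Reverse the map: "source links to target" becomes "target is linked from source".
    backlinks = defaultdict(set)
    for source, targets in outgoing.items():
        for target in targets:
            backlinks[target].add(source)
    return {slug: sorted(sources) for slug, sources in backlinks.items()}
```

Passing two entries with the slug `iamhamy` raises `DuplicateSlugError` instead of letting one entry overwrite the other, which is the same trade-off as the guard described above: a loud failure rather than quietly wrong backlinks.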
In this case I was trying to create a note for my website, which I call iamhamy. So I wanted a note with slug iamhamy.
But a post with the slug iamhamy already existed. So when I published that new note by moving it into the AllNotes folder, it broke my site's post mapping and the server failed to spin up.
This issue was not detected for 18h. It was detected by me checking my website to review a different note and seeing an unexpected 404.
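An automated probe could have caught this much sooner than a manual visit. Below is a rough Python sketch of such a check, using only the standard library. The function names are hypothetical, and treating only 2xx responses as healthy is an assumption; scheduling and alerting are left out.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the status code is on the exception.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None


def is_healthy(status):
    """Assumption: only 2xx responses count as healthy."""
    return status is not None and 200 <= status < 300


def check_site(url, fetch=fetch_status):
    """Probe url once and report whether it looks healthy."""
    status = fetch(url)
    return {"url": url, "status": status, "healthy": is_healthy(status)}
```

Calling `check_site("https://hamy.xyz")` on a schedule and alerting whenever `healthy` is false would turn an 18h detection gap into roughly one polling interval. The `fetch` parameter exists so the check can be exercised without touching the network.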
Failures in detection:
Followups:
Escalated immediately (after all I'm a one-man show).
Mitigation took ~20 minutes from when I got to my computer to when I'd found root cause and fixed it.
This is pretty good, but it could've been improved with better detection coverage (the failing logs could have pointed to the root cause automatically) and maybe runbooks for debugging my apps.
Followups:
The bad commit was pushed and made its way through the full CI pipeline.
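One way to stop a commit like this before it ships is a CI step that scans the content folders for colliding slugs and fails the pipeline if it finds any. The Python sketch below is illustrative only: it assumes a slug is the markdown file name without its extension, which may not match how the site derives slugs, and `find_slug_collisions` and `main` are hypothetical names.

```python
import sys
from collections import defaultdict
from pathlib import Path


def find_slug_collisions(content_dirs):
    """Return {slug: [paths]} for every slug claimed by more than one markdown file.

    Assumption: the slug is the file name without its extension.
    """
    seen = defaultdict(list)
    for content_dir in content_dirs:
        for path in sorted(Path(content_dir).rglob("*.md")):
            seen[path.stem].append(str(path))
    return {slug: paths for slug, paths in seen.items() if len(paths) > 1}


def main(argv):
    """Check the folders given in argv[1:]; return 1 if any slugs collide, else 0."""
    collisions = find_slug_collisions(argv[1:])
    for slug, paths in sorted(collisions.items()):
        print(f"slug collision: {slug} -> {', '.join(paths)}", file=sys.stderr)
    # A non-zero return value, passed to sys.exit, fails the CI step.
    return 1 if collisions else 0
```

A CI job would pass the content folders as arguments (AllNotes plus wherever the posts live) and hand the result of `main(sys.argv)` to `sys.exit`. A fuller version of this check would simply start the server in CI and confirm it serves a page, which covers failure modes beyond slug collisions.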
Followups:
Nothing really to remediate. For SEO / AIEO, best I can do is keep the site up more so it doesn't get flagged as unreliable.
I'll take on a few of these to improve my robustness to these kinds of issues in the future.
Short-term followups:
Long-term followups: