Post-Mortem: hamy.xyz returning 404s (2025.11)

Essay - Published: 2025.12.12 | create | iamhamy | incident | post-mortem |

DISCLOSURE: If you buy through affiliate links, I may earn a small commission. (disclosures)

On 2025.11.22 hamy.xyz went down, returning 404s to users. Post-mortems are a useful tool for reflecting on incidents to better understand what happened and prevent them from recurring.

hamy.xyz is the public face of my brand so it's not great if it's down. Here we'll do a light post-mortem.

Overview

On 2025.11.22t~1100 ET a new note was created with a slug that collided with the slug of an existing blog post. This broke the server's slug-to-post mapping, but the revision was still pushed through CI and promoted to prod.

As a result, hamy.xyz went down, returning 404s to users.

It was detected at 2025.11.23t0600 ET (~18h later) and resolved by 2025.11.23t0700 ET (~19h).

Impact

Users: No users could reach the site for 19h - they just saw 404s.

Company: This dropped web traffic for the month by ~1/30. It blocked many forward links to various money-making projects. It may also have an unknown long-term impact on SEO / AIEO.

Root Cause

Notes are a new part of the website, added after my migration from F# to C# as a way to build the site into more of a digital garden.

For simplicity, I had separated the folders that the markdown files went into:

  • AllPosts - Existing folder with all my blog posts
  • AllNotes - New folder for notes

To achieve bidirectional links (when one post links to another, a backlink is created), I build a map of slugs to the posts they link to and then reverse it. There's a guard in there for duplicate slugs that fails the map build, since a duplicate would lead to overwritten data. That guard only runs when the server starts up, not in CI.
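A minimal sketch of that idea, written in Python for illustration rather than the site's actual C#. The function names and the `(slug, outgoing_links)` input shape are assumptions, not the real code:

```python
from collections import defaultdict


def build_slug_map(posts):
    """Map each slug to the slugs it links to, failing on duplicate slugs.

    `posts` is a hypothetical iterable of (slug, outgoing_links) pairs.
    """
    links_by_slug = {}
    for slug, outgoing_links in posts:
        if slug in links_by_slug:
            # The guard: a second entry would silently overwrite the first.
            raise ValueError(f"duplicate slug: {slug}")
        links_by_slug[slug] = list(outgoing_links)
    return links_by_slug


def build_backlinks(links_by_slug):
    """Reverse the forward map: target slug -> slugs that link to it."""
    backlinks = defaultdict(list)
    for source, targets in links_by_slug.items():
        for target in targets:
            backlinks[target].append(source)
    return dict(backlinks)
```

With a guard like this, two entries sharing the slug iamhamy raise an error as soon as the map is built. If that only happens at server startup, the failure shows up in prod instead of in the build.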

In this case I was trying to create a note for my website, which I call iamhamy. So I wanted a note with slug iamhamy.

  • New note - iamhamy
  • Existing post - iamhamy (post)

So when I published the new note by moving it into the AllNotes folder, the duplicate slug broke my site's post mapping and the server failed to spin up.

Detection

This issue was not detected for 18h. I only caught it when I checked my website to review a different note and saw an unexpected 404.

Failures in detection:

  • No local testing was performed -> so the bad commit was pushed to GH
  • No CI test / build gate existed -> so the image was built and pushed to the image repo
  • No pre-deploy healthcheck existed -> so the new image was promoted to prod
  • No ongoing status check was implemented -> so no auto alerts
  • I rarely go on social media -> so I didn't see the comments telling me my site was down

Followups:

  • Prefer local test before push (though this slows down my writing so not worth 100% enforcement)
  • Add healthcheck gates for image publish and deploy promotion to detect bad revisions early
  • Add ongoing status checks to notify when things go down
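An ongoing status check can be very small. The sketch below uses only Python's standard library; the function names are hypothetical, and the alerting step (email, chat webhook, etc.) is left out because the post doesn't specify one:

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code for url, or None if the request failed outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx / 5xx responses land here, e.g. the 404s from this incident.
        return error.code
    except OSError:
        # DNS failures, refused connections, and timeouts (URLError is an OSError).
        return None


def check_sites(urls, fetch=fetch_status):
    """Return (url, status) for every site that did not answer with a 2xx."""
    failures = []
    for url in urls:
        status = fetch(url)
        if status is None or not 200 <= status < 300:
            failures.append((url, status))
    return failures
```

Calling `check_sites(["https://hamy.xyz"])` from a cron job or scheduled CI run, and alerting whenever the returned list is non-empty, would have flagged the 404s within one polling interval instead of 18h. The same function could serve as a pre-deploy gate by pointing it at the new revision before promoting it.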

Escalation

Escalated immediately (after all I'm a one-man show).

Mitigation

Mitigation took ~20 minutes from when I got to my computer to when I'd found root cause and fixed it.

This is pretty good but could've been improved with better detection coverage (could've root caused automatically by looking at failing logs) and maybe runbooks for debugging my apps.

Followups:

  • Add detection gates
  • Add runbooks for debugging my deploy stack - builds, deploys, run logs

Prevention

The bad commit was pushed and made its way through the full CI pipeline.

Followups:

  • Change how posts are stored in my repo to get file-level uniqueness guarantees (aka hold them in the same folder)
  • Add healthcheck gates at GH build and deploy time so bad revisions don't make it to prod
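Until the folders are merged, a build-time check could catch the same collision. This is a sketch under two assumptions: that each slug is derived from the markdown file's name, and that content lives in `.md` files under folders like AllPosts and AllNotes. The function name is hypothetical:

```python
from collections import defaultdict
from pathlib import Path


def find_slug_collisions(content_dirs):
    """Return {slug: [paths]} for every slug that appears in more than one file.

    Assumes the slug is the lowercased file name without its extension.
    """
    paths_by_slug = defaultdict(list)
    for directory in content_dirs:
        for path in sorted(Path(directory).rglob("*.md")):
            paths_by_slug[path.stem.lower()].append(path)
    return {slug: paths for slug, paths in paths_by_slug.items() if len(paths) > 1}
```

Running `find_slug_collisions(["AllPosts", "AllNotes"])` in CI and failing the build on a non-empty result would stop a duplicate like iamhamy before an image is ever published.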

Remediation

Nothing really to remediate. For SEO / AIEO, best I can do is keep the site up more so it doesn't get flagged as unreliable.

Next

I'll take on a few of these to improve my robustness to these kinds of issues in the future.

Short-term followups:

  • Prevention: Move all posts to live in same directory so file system enforces uniqueness
  • Detection: Add ongoing healthchecks and alerts to all sites

Long-term followups:

  • Prevention: Add healthchecks before build publish
  • Prevention: Add healthchecks before deploy
  • Prevention: Consider moving off of raw Ansible to something with better blue/green deploys built in - maybe k8s, Nomad, or something else
