Out of Cards Incident Report - 2020-10-17

Submitted 3 years, 6 months ago by

Hey all,

We had some unintended downtime way too early this morning. Because this kind of stuff always interests me, and transparency can be fun, let's see what went wrong!


Event Log

All times in Eastern.

2:43 AM - Out of Cards goes offline.

2:49 AM - A service unreachable email gets fired off.

This email did not trigger the alerts that it should have. When we did a site migration a couple of months ago, the tech behind those alerts changed, and it was never tested afterward. Pfft, downtime. That'll never happen!

3:01 AM - The messages begin to hit me on Discord and Twitter.

See, you don't need proper downtime checks!

3:03 AM - Investigation Begins

3:19 AM - Problem Discovered

Our container service decided it wanted to do an automatic update. That really should not have been an issue, but the containers running our site were not set to restart automatically, so once the update stopped them, the site stayed down.

3:22 AM - Site comes back online

Updates for the service have been disabled - we'd rather manage these ourselves anyway.
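
For anyone who likes the numbers, the gaps between the log entries above can be worked out with a few lines of Python. This is only arithmetic on the timestamps already listed; the event labels and the `minutes_between` helper are illustrative names, not part of any tooling described here.

```python
from datetime import datetime

# Timestamps copied from the event log (all Eastern, same morning).
# The dictionary keys are illustrative labels.
EVENTS = {
    "offline": "2:43 AM",
    "email_sent": "2:49 AM",
    "first_reports": "3:01 AM",
    "investigation": "3:03 AM",
    "cause_found": "3:19 AM",
    "online": "3:22 AM",
}

TIME_FORMAT = "%I:%M %p"


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from one logged event to another."""
    started = datetime.strptime(EVENTS[start], TIME_FORMAT)
    ended = datetime.strptime(EVENTS[end], TIME_FORMAT)
    return int((ended - started).total_seconds() // 60)


if __name__ == "__main__":
    print("Total downtime:", minutes_between("offline", "online"), "min")
    print("Until a human noticed:", minutes_between("offline", "first_reports"), "min")
    print("Investigation:", minutes_between("investigation", "cause_found"), "min")
    print("Fix after diagnosis:", minutes_between("cause_found", "online"), "min")
```

That works out to 39 minutes of total downtime, 18 minutes before the reports came in, 16 minutes of investigation, and 3 minutes from diagnosis to recovery.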


What We're Changing

Obviously, our big issue is that our app containers didn't restart themselves. That's shitty and a huge oversight. A configuration change will go out in the morning that will resolve this.
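
To show the idea behind an automatic restart policy, here is a minimal sketch in plain Python. It is not the actual configuration change (that is a setting in the container service, not code), and `supervise`, `max_restarts`, and `backoff_seconds` are hypothetical names made up for illustration. The loop simply reruns a command whenever it exits with a failure code, up to a limit.

```python
import subprocess
import sys
import time
from typing import Callable, Sequence


def supervise(
    command: Sequence[str],
    max_restarts: int = 3,
    backoff_seconds: float = 1.0,
    run: Callable[[Sequence[str]], int] = lambda cmd: subprocess.call(cmd),
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `command`, restarting it whenever it exits with a non-zero code.

    Returns the number of restarts performed. Stops when the command
    exits cleanly (code 0) or when `max_restarts` has been reached.
    `run` and `sleep` are injectable so the loop can be tested without
    spawning real processes.
    """
    restarts = 0
    while True:
        exit_code = run(command)
        if exit_code == 0:
            return restarts  # clean exit: nothing left to restart
        if restarts >= max_restarts:
            return restarts  # give up rather than loop forever
        restarts += 1
        sleep(backoff_seconds * restarts)  # wait a little longer each time


if __name__ == "__main__":
    # Hypothetical demo: a child process that always fails.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print("restarts:", supervise(failing, max_restarts=2, backoff_seconds=0.1))
```

This sketch only restarts on failure. A policy that restarts after any stop, including one caused by the service itself, would drop the `exit_code == 0` check, which is closer to what an update-triggered shutdown calls for.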

The secondary issue is garbage monitoring of the site. There should be some serious alarms going off when stuff goes wrong. I'll make sure we have better processes in place going forward to notify us of downtime, and that the alerts actually get tested.
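
A bare-bones version of that kind of check might look like the sketch below. Everything in it is an assumption for illustration, not a description of the monitoring actually used: `check_site`, `DowntimeMonitor`, the threshold of three failed checks, and the `notify` callback are all hypothetical. It alerts once after several consecutive failures, reports recovery, and has a `send_test_alert` method so the alert path can be exercised on purpose instead of being discovered broken during an outage.

```python
import http.client
import urllib.request
from typing import Callable


def check_site(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers successfully, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError (4xx/5xx) and timeouts are all OSError subclasses.
        return False


class DowntimeMonitor:
    """Alert once after N consecutive failed checks, and once on recovery."""

    def __init__(
        self,
        probe: Callable[[], bool],
        notify: Callable[[str], None],
        failure_threshold: int = 3,
    ) -> None:
        self.probe = probe
        self.notify = notify
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.alerting = False

    def tick(self) -> None:
        """Run one check and send a notification if the state changed."""
        if self.probe():
            if self.alerting:
                self.notify("RECOVERED: site is reachable again")
            self.consecutive_failures = 0
            self.alerting = False
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold and not self.alerting:
            self.alerting = True
            self.notify(f"DOWN: {self.consecutive_failures} consecutive failed checks")

    def send_test_alert(self) -> None:
        """Push a message through the same notify path as a drill."""
        self.notify("TEST: this is a drill of the alert path")
```

In practice `tick` would run on a timer with something like `probe=lambda: check_site("https://example.com")` (a placeholder URL), and `notify` would be whatever actually wakes someone up.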


Hopefully there isn't a next time!

  • Fluxflashor's Avatar
    CEO 2005 3070 Posts Joined 10/19/2018
    Posted 3 years, 6 months ago

    Founder, Out of Games
