Outage: AWS outage impacting JumpRope's web app

Incident Report for JumpRope

Postmortem

Dear Everyone,

I'm reaching out to follow up on the recent 1-hour, 45-minute service disruption that began on Tuesday afternoon at approximately 3pm ET. This incident was due to an issue at Amazon Web Services (AWS), our hosting platform.

In simple terms, error rates spiked to 100% in a service we use to host important parts of our application. We tried to restore access using various strategies but were only successful after AWS fixed the underlying issue. Once the problem was resolved, our systems recovered within a few minutes. For more technical details, you can visit AWS’s global status page here: https://health.aws.amazon.com/health/status

I want to reassure you that our customer data (your scores, attendance, standards, etc.) was not affected and there was no data loss. Although the services that provide access to our platform were unavailable, all your data remained secure.

During the outage, we prioritized ensuring that our database and its backups were accessible. We made efforts to work around the issue, but it was a "region-wide" problem, meaning our existing multi-availability-zone setup was not enough to prevent disruption.

Throughout the incident, we updated our status page (status.jumpro.pe) roughly every 15-30 minutes, and I encourage everyone to check that page for the latest information during any future incident. Within 30 minutes of the start of the outage, we were able to display a real-time notification on our login page to inform users of the ongoing issue.

In response to this incident, we're exploring ways to enhance our infrastructure to prevent such issues from recurring. One possible solution is to introduce "multi-region" redundancy so that our web application traffic can bypass any region experiencing an outage.
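
For the technically curious, the idea behind multi-region redundancy can be sketched in a few lines. This is a minimal illustration only, not a description of our actual setup: the function `pick_region`, the health-check callback, and the secondary region name are all hypothetical. Traffic goes to the first region in a priority list that passes a health check.

```python
from typing import Callable, Optional, Sequence


def pick_region(
    regions: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy region in priority order, or None.

    `regions` is ordered from most to least preferred. `is_healthy` is a
    hypothetical callback; a real system would probe an endpoint in each
    region rather than consult a static table.
    """
    for region in regions:
        if is_healthy(region):
            return region
    return None


# ASSUMPTION: "us-west-2" is an illustrative secondary region only.
priority = ["us-east-1", "us-west-2"]

# Simulated health state during a region-wide outage in the primary region.
simulated_health = {"us-east-1": False, "us-west-2": True}

chosen = pick_region(priority, lambda r: simulated_health.get(r, False))
print(chosen)
```

With a single region in the list, the function has nowhere to send traffic when that region fails, which is the situation a multi-availability-zone setup inside one region cannot avoid.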

We've decided to invest in this additional redundancy despite the increase in operational costs because we believe it will significantly enhance the reliability of our application. We are committed to maintaining near-perfect uptime and providing the best possible service.

To conclude, I apologize for the inconvenience caused by the outage, especially to those who were actively entering grades at the time. We are working diligently to improve our operational architecture to prevent similar issues in the future. Despite the outage, we have maintained a 99.945% uptime for the 2022-2023 school year. We will continue to work towards achieving 100% uptime.
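
For anyone wondering how an uptime percentage relates to minutes of downtime, here is a rough sketch of the arithmetic. The 365-day measurement window is an assumption made purely for illustration; the exact window behind the 99.945% figure is not stated here, so treat the numbers as approximate.

```python
MINUTES_PER_DAY = 24 * 60


def uptime_percent(window_minutes: float, downtime_minutes: float) -> float:
    """Uptime as a percentage of the measurement window."""
    return 100.0 * (1.0 - downtime_minutes / window_minutes)


def downtime_budget_minutes(window_minutes: float, uptime_pct: float) -> float:
    """Total downtime (in minutes) implied by a given uptime percentage."""
    return window_minutes * (1.0 - uptime_pct / 100.0)


# ASSUMPTION: a 365-day window, chosen only to make the arithmetic concrete.
window = 365 * MINUTES_PER_DAY

# The outage described above lasted 1 hour 45 minutes.
outage_minutes = 105

print(round(uptime_percent(window, outage_minutes), 3))
print(round(downtime_budget_minutes(window, 99.945), 1))
```

Under that assumed window, a single 105-minute outage corresponds to roughly 99.98% uptime, and 99.945% uptime corresponds to about 289 minutes of total downtime.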

Thank you for your patience. If you have any concerns or queries, please don't hesitate to contact our customer support team.

Best,

Jesse

Chief Technology Officer

JumpRope, Inc.

Posted Jun 16, 2023 - 13:06 EDT

Resolved

All systems are a "go" and you should be able to use the system as usual. No customer data was impacted while server infrastructure was unavailable, so you can pick up right where you left off. We will continue to keep a close eye on things and follow up with a post-mortem in the coming days. Please send any questions or concerns to support@jumpro.pe, and thanks for your patience while we worked to resolve this issue.

Jesse

Posted Jun 13, 2023 - 17:07 EDT

Update

We are continuing to monitor for any further issues.

Posted Jun 13, 2023 - 16:50 EDT

Update

We've confirmed that all servers are operational and are double-checking individual services manually to ensure overall application health.

Posted Jun 13, 2023 - 16:48 EDT

Monitoring

Alright folks, we've received word from AWS that they've addressed the root cause of the outage, and our checks show that our servers have begun the automatic process of "rebooting", which should mean availability is coming back online. We will continue to monitor the situation closely and will give the all clear when we're again confident that performance and availability are restored. In the meantime, it IS safe to enter data if you're able to log in successfully.

Again, we apologize for the trouble and appreciate your patience. We'll add relevant details as they become available and will issue a postmortem once we're able to complete an investigation and determine potential steps to avoid this in the future.

Posted Jun 13, 2023 - 16:45 EDT

Update

We've received a small update from the AWS site reliability team indicating that they're making progress addressing the issue, but providing us with few other details. We have taken steps to ensure that all customer data is backed up and accessible to us in the event that we need to move our servers to another region or provider, but it's very likely that AWS will clear up the issue on their end before those efforts would pay off.

Posted Jun 13, 2023 - 16:19 EDT

Update

Quick update: AWS has acknowledged the issue and assured us that they're doing everything possible to fix it. We are doing the same on our end, but with limited options as they complete their work. We're well aware that grades are due for many schools & districts this week - rest assured that we'll have things up and running as soon as we possibly can. We're sorry for the trouble and appreciate your patience.

Posted Jun 13, 2023 - 15:44 EDT

Identified

The issue is related to an underlying outage in the US East-1 region of Amazon Web Services, where the majority of our servers are located. We can confirm that all customer data is safe and backed up to an external location. We are looking into mitigations on our end to bring the application online while Amazon Web Services technicians do the same for the underlying infrastructure. We're sorry for the trouble - stay tuned for updates!

You can also follow the regional/global status on Amazon's availability page here: https://health.aws.amazon.com/health/status

Jesse

Posted Jun 13, 2023 - 15:16 EDT

Investigating

We are currently investigating this issue.

Posted Jun 13, 2023 - 15:09 EDT

This incident affected: JumpRope Authentication, JumpRope Application (front-end), JumpRope Services & Database, and Mastery Calculations.