Dear Everyone,
I'm reaching out to follow up on the recent service disruption, which lasted 1 hour and 45 minutes and began on Tuesday afternoon at approximately 3 p.m. ET. This incident was due to an issue at Amazon Web Services (AWS), our hosting platform.
In simple terms, error rates spiked to 100% in a service we use to host important parts of our application. We tried several strategies to restore access, but were only successful after AWS fixed the underlying issue. Once the problem was resolved, our systems recovered within a few minutes. For more technical details, you can visit AWS’s global status page here: https://health.aws.amazon.com/health/status
I want to reassure you that our customer data (your scores, attendance, standards, etc.) was not affected and there was no data loss. Although the services that provide access to our platform were unavailable, all your data remained secure.
During the outage, we prioritized ensuring that our database and its backups were accessible. We made efforts to work around the issue, but it was a "region-wide" problem, meaning our existing multi-availability-zone setup was not enough to prevent disruption.
Throughout the incident, we updated our status page (status.jumpro.pe) roughly every 15 to 30 minutes, and I encourage everyone to check this page for the latest information during any future incident. Within 30 minutes of the start of the outage, we were able to display a real-time notification on our login page to inform users of the ongoing issue.
In response to this incident, we're exploring ways to enhance our infrastructure to prevent such issues from recurring. One possible solution is to introduce "multi-region" redundancy, so that our web application traffic can be routed around any region experiencing an outage.
We've decided to invest in this additional redundancy despite the increase in operational costs because we believe it will significantly enhance the reliability of our application. We are committed to maintaining near-perfect uptime and providing the best possible service.
To conclude, I apologize for the inconvenience caused by the outage, especially to those who were actively entering grades at the time. We are working diligently to improve our operational architecture to prevent similar issues in the future. Despite the outage, we have maintained a 99.945% uptime for the 2022-2023 school year. We will continue to work towards achieving 100% uptime.
Thank you for your patience. If you have any questions or concerns, please don't hesitate to contact our customer support team.
Best,
Jesse
Chief Technology Officer
JumpRope, Inc.