Production Database Outage
Incident Report for JumpRope
Postmortem

After further investigation, we have determined that today’s downtime was initially caused by faulty memory in our primary database instance, which stores and serves most of the data used in our application. As expected, this triggered an automatic failover to our standby database instance. We test this procedure periodically, and it also occurs occasionally in production (once or twice per year).

Unfortunately, the issue was compounded by a secondary problem in our failover process that caused the standby instance to incorrectly act as a read-only database instead of a read/write database. As a result, the database cluster “hung” in an unreachable state until we could completely recover the original database instance on new hardware. Once that recovery completed, the application was briefly available, but shortly thereafter the earlier failover process attempted to complete. Correcting for this added another ~10 minutes to the downtime that users experienced, which lasted overall from ~9:33am until 10:21am ET (roughly 48 minutes). After that time, the database was healthy and other pieces of our infrastructure recovered over the next few minutes.

No data was lost or corrupted in this process, and multiple offsite backups were available to us had they been required. Recovery from this incident did not involve restoring the data itself, but rather the compute hardware (CPU and memory) and networking infrastructure (pointing our application servers to the recovered database), both of which are decoupled from the storage.

We apologize for the trouble and inconvenience that this caused our customers. We continue to investigate the details of this incident and plan to take multiple steps to prevent similar incidents in the future.

Thanks very much,

Jesse

Posted Oct 27, 2022 - 12:49 EDT

Resolved
UPDATE: Resolved at 10:21am ET. The application should become available to users the next time they log in.

JumpRope's production database encountered a memory allocation error that prompted a reboot. Attempts to fail over to our standby instance were delayed for reasons we continue to investigate. During this time, users would see errors or blank pages when they attempted to log in or use the application, across all devices and customers.
Posted Oct 27, 2022 - 09:33 EDT