After further investigation, we have determined that today's downtime was initially caused by faulty memory in our primary database instance, which stores and serves most of the data used in our application. As expected, this triggered an automatic failover to our standby database instance, a procedure that we test periodically and that occurs in production occasionally (once or twice per year).
Unfortunately, the issue was compounded by a secondary problem in our failover process that caused the standby instance to incorrectly come up as a read-only database rather than a read/write one. As a result, the database cluster "hung" in an unreachable state until we could fully recover the original database instance on new hardware. At that point the application was briefly available, but shortly thereafter the interrupted failover process attempted to complete, and correcting for this added another ~10 minutes to the downtime. In total, users experienced downtime from ~9:33am until 10:21am ET. After that, the database was healthy, and the rest of our infrastructure recovered over the next few minutes.
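For readers curious about the read-only failure mode: a standby that comes up read-only can look healthy to basic connectivity checks and only fail once a write arrives. One general way to catch this is a post-failover write probe that attempts a trivial write and rolls it back. The sketch below is illustrative only; it uses SQLite (not our actual database stack), and the `is_writable` helper is a hypothetical name, not part of our tooling.

```python
import os
import sqlite3
import tempfile

def is_writable(conn):
    """Attempt a trivial write inside a transaction, then roll it back.

    Returns False if the database rejects writes (i.e. it is read-only),
    True otherwise. The rollback ensures the probe leaves no trace.
    """
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS _failover_probe (id INTEGER)")
        conn.execute("INSERT INTO _failover_probe (id) VALUES (1)")
        conn.rollback()  # discard the probe table and row
        return True
    except sqlite3.OperationalError:
        return False

# Demonstration: the same database file opened read/write vs. read-only.
path = os.path.join(tempfile.mkdtemp(), "standby.db")
sqlite3.connect(path).close()  # create the file

rw = sqlite3.connect(path)
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
print(is_writable(rw), is_writable(ro))  # prints: True False
```

A probe like this, run automatically after promotion, would distinguish a genuinely writable primary from a standby stuck in read-only mode before traffic is routed to it.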
No data was lost or corrupted in this process, and multiple offsite backups were available to us in the event that they were required. Recovery from this incident did not involve recovering the data itself, but rather the compute hardware (CPU and memory) and networking configuration (repointing our application servers to the recovered database), both of which are decoupled from the storage.
We apologize for the trouble and inconvenience that this caused our customers. We continue to investigate the details of this incident and plan to take multiple steps to prevent similar incidents in the future.
Thanks very much,