
Addressing Our Current Outage: Steps Forward

by Mark on March 26, 2024

Dear Community,

We’ve always believed in transparency and accountability, especially when things don’t go as planned. Today, we find ourselves in one of those challenging moments: a significant service outage is affecting our operations in the EU and US regions. This post explains the situation, our ongoing efforts to resolve the issue, and the steps we are taking to prevent future occurrences.

Background

The journey to this point began in February, when we noticed unusual behavior in our database servers that led to instability across several applications. Our investigation pinpointed the cause: a change in data type usage in some applications, where developers had altered column types to or from JSON. This finding led us to update our database systems to the latest stable versions of Percona XtraDB Cluster and MySQL, which contained a fix for the issue. Following thorough testing in our development and staging environments, we confidently rolled out the update to our US and EU production clusters overnight from March 24th to March 25th.
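
For context, the class of schema change that triggered the original instability looked roughly like the following. This is a minimal, hypothetical sketch using the PyMySQL client; the table, column, and connection details are placeholders, not an actual application schema.

    # Hypothetical example of the kind of column type change described above.
    # Table, column, and connection details are placeholders.
    import pymysql

    conn = pymysql.connect(
        host="db.example.internal",  # placeholder host
        user="app_user",
        password="***",
        database="app_data",
    )
    try:
        with conn.cursor() as cur:
            # Altering a TEXT column to JSON (or back) is the type of change
            # that interacted badly with the previous database version.
            cur.execute("ALTER TABLE customer_profiles MODIFY COLUMN preferences JSON")
        conn.commit()
    finally:
        conn.close()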

The Challenge

However, shortly after the update we encountered a new, unforeseen issue. Our database clusters began breaking apart, resulting in service downtime in both production environments. Despite several attempts to restore service, the problem persisted: nodes in the database cluster accumulated an unsustainable number of connections, and this bottleneck would eventually lead the cluster software to terminate the node. The process then repeated with other nodes, leading to widespread service unavailability.
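
To make the symptom concrete, this failure pattern can be observed by watching per-node connection counts alongside the Galera cluster status. The sketch below is purely illustrative; the node hostnames, credentials, and alert threshold are hypothetical and do not reflect our actual monitoring setup.

    # Illustrative check of the symptom described above: connections piling up
    # on a node until the cluster software removes it. Hostnames, credentials,
    # and the threshold are hypothetical placeholders.
    import pymysql

    NODES = ["pxc-node-1.internal", "pxc-node-2.internal", "pxc-node-3.internal"]
    CONNECTION_ALERT_THRESHOLD = 2000  # hypothetical limit

    def node_status(host):
        conn = pymysql.connect(host=host, user="monitor", password="***")
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
                connections = int(cur.fetchone()[1])
                cur.execute("SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status'")
                cluster_status = cur.fetchone()[1]  # 'Primary' on a healthy node
        finally:
            conn.close()
        return connections, cluster_status

    for host in NODES:
        connections, cluster_status = node_status(host)
        print(f"{host}: {connections} connections, cluster status {cluster_status}")
        if connections > CONNECTION_ALERT_THRESHOLD or cluster_status != "Primary":
            print(f"  -> {host} shows the failure pattern described above")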

Upon further investigation, we discovered a bug in the newly installed version of our database software that was contributing to this critical situation.

Our Recovery Plan

As we navigate the service outage, it’s clear that neither maintaining the current software version nor downgrading presents a feasible solution. In response, our team has developed several strategies to overcome this challenge. Among these, we have identified a particularly promising recovery plan outlined below:

  1. Database System Restoration: Our initial step involves restoring a snapshot of the database system from the evening of Sunday, March 24th. This snapshot predates the database software upgrade, providing a reliable and stable baseline for our recovery efforts.
  2. Recreation of the Database Cluster: Following the restoration, we will rebuild the database cluster from the secured backup. This crucial phase ensures that our foundational systems are reset to a known, stable state, free from the complications introduced by the recent update.
  3. Data Recovery Process for Affected Applications: We recognize that some applications may have experienced data loss for data written on or after March 25th. If your application has suffered data loss, please reach out to us through the support forum, providing your application ID. Our dedicated team will then perform a targeted data recovery operation for your application, ensuring minimal disruption and loss. A simplified sketch of how such a targeted recovery could work is shown after this list.
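
For step 3, the targeted recovery is conceptually a backfill: rows written after the snapshot cut-off are copied from a dump of the post-upgrade cluster into the restored cluster for the affected application. The sketch below is a simplified illustration of that idea under assumed table names, hosts, and timestamps; it is not the exact tooling our team uses.

    # Simplified illustration of a targeted per-application backfill (step 3).
    # Hosts, credentials, table/column names, and timestamps are hypothetical.
    import pymysql

    SNAPSHOT_CUTOFF = "2024-03-24 23:00:00"  # assumed snapshot timestamp
    APP_ID = "YOUR-APPLICATION-ID"           # supplied via the support forum

    source = pymysql.connect(host="old-cluster-dump.internal", user="recovery",
                             password="***", database="app_data")
    target = pymysql.connect(host="restored-cluster.internal", user="recovery",
                             password="***", database="app_data")
    try:
        with source.cursor() as src, target.cursor() as dst:
            # Pull rows that the restored snapshot does not contain.
            src.execute(
                "SELECT object_id, app_id, payload, updated_at FROM app_objects "
                "WHERE app_id = %s AND updated_at > %s",
                (APP_ID, SNAPSHOT_CUTOFF),
            )
            for object_id, app_id, payload, updated_at in src.fetchall():
                # REPLACE keeps re-runs idempotent on the primary key.
                dst.execute(
                    "REPLACE INTO app_objects (object_id, app_id, payload, updated_at) "
                    "VALUES (%s, %s, %s, %s)",
                    (object_id, app_id, payload, updated_at),
                )
        target.commit()
    finally:
        source.close()
        target.close()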

We acknowledge that this recovery path is not without its complexities and challenges. However, we believe it offers a feasible route to restoring our services.

Looking Ahead

Though undoubtedly difficult, this experience provides valuable lessons on the importance of robust system testing and the need for contingency plans that can adapt to unexpected technical challenges. We are committed to conducting a thorough review of our systems and processes to prevent such occurrences in the future.

We understand the impact this outage has on our users and partners, and we sincerely apologize. Rest assured, our team is working around the clock to restore services as swiftly and safely as possible. We appreciate your patience and support as we navigate this period.

Thank you for your understanding and continued trust in us.

Sincerely,

Mark

1 Comment

No matter how detailed and “robust” testing is performed, some bugs will always get through… doesn’t seem like Backendless did anything wrong (Percona released software with a bug) and did or is doing everything humanly possible to fix/recover/avoid-future problems… much appreciated!
