Dear Community,
We’ve always believed in transparency and accountability, especially when things don’t go as planned. Today, we find ourselves in one of those challenging moments: a significant service outage has affected our operations in the EU and US regions. This blog post aims to shed light on the situation, our ongoing efforts to resolve the issue, and the steps we are taking to prevent future occurrences.
The journey to this point began in February, when we noticed unusual behavior in our database servers that led to instability across several applications. Our investigation pinpointed the cause: a change in data type usage in some applications, where developers had altered column types to or from JSON. This finding led us to update our database systems to the latest stable versions of Percona XtraDB Cluster and MySQL, which contained a fix for this issue. Following thorough testing in our development and staging environments, we confidently rolled out the update to our production environments overnight from March 24th to March 25th across the US and EU clusters.
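To make this concrete, here is a purely illustrative sketch, in Python with the MySQL Connector, of the kind of column type change to or from JSON that was implicated. The table name, column name, and connection details are placeholders, not our actual schema.

```python
# Purely illustrative: the table, column, and connection details below are
# placeholders, not our actual schema. This shows the kind of column type
# change to or from JSON that, before the upgrade, was associated with the
# instability described above.
import mysql.connector

conn = mysql.connector.connect(
    host="db.example.internal",   # placeholder host
    user="app_user",              # placeholder credentials
    password="app_password",
    database="app_data",
)
cur = conn.cursor()

# Converting an existing TEXT column to JSON rewrites the column definition.
cur.execute("ALTER TABLE user_profiles MODIFY COLUMN preferences JSON")

# The reverse direction (JSON back to TEXT) falls into the same category.
cur.execute("ALTER TABLE user_profiles MODIFY COLUMN preferences TEXT")

cur.close()
conn.close()
```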
However, we encountered a new, unforeseen issue shortly after the update. Our database clusters began breaking up, resulting in service downtime in both production environments. Despite several attempts to restore service, the problem persisted; the main symptom was nodes in the database cluster accumulating an unsustainable number of connections. This bottleneck would eventually lead the cluster software to terminate the affected node, a process that then repeated on other nodes, causing widespread service unavailability.
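To illustrate the failure pattern, here is a rough sketch (not our production tooling) of how one might poll each Percona XtraDB Cluster node for the connection pile-up and its Galera state. The host names, credentials, and threshold are placeholders.

```python
# Rough monitoring sketch (not our production tooling): poll each node for
# its number of open connections and its Galera state, and flag the
# connection pile-up pattern described above. Hosts, credentials, and the
# threshold are placeholders.
import time
import mysql.connector

NODES = ["pxc-node-1.internal", "pxc-node-2.internal", "pxc-node-3.internal"]
CONNECTION_THRESHOLD = 2000  # arbitrary example value


def node_status(host):
    """Return (Threads_connected, wsrep_local_state_comment) for one node."""
    conn = mysql.connector.connect(host=host, user="monitor", password="secret")
    cur = conn.cursor()

    cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
    threads = int(cur.fetchone()[1])

    cur.execute("SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'")
    row = cur.fetchone()
    state = row[1] if row else "unknown"

    cur.close()
    conn.close()
    return threads, state


while True:
    for host in NODES:
        try:
            threads, state = node_status(host)
        except mysql.connector.Error as exc:
            # A node that has been evicted from the cluster typically stops
            # accepting connections altogether.
            print(f"{host}: unreachable ({exc})")
            continue
        flag = " <-- connection pile-up" if threads > CONNECTION_THRESHOLD else ""
        print(f"{host}: {threads} connections, wsrep state: {state}{flag}")
    time.sleep(30)
```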
Upon further investigation, we discovered a bug in the newly installed version of our database software that is contributing to this critical situation.
As we navigate the service outage, it’s clear that neither maintaining the current software version nor downgrading presents a feasible solution. In response, our team has developed several strategies to overcome this challenge. Among these, we have identified a particularly promising recovery plan outlined below:
We acknowledge that this recovery path is not without its complexities and challenges. However, we believe it offers a feasible route to restoring our services.
Though undoubtedly difficult, this experience provides valuable lessons on the importance of robust system testing and the need for contingency plans that can adapt to unexpected technical challenges. We are committed to conducting a thorough review of our systems and processes to prevent such occurrences in the future.
We understand this outage’s impact on our users and partners and sincerely apologize. Rest assured, our team is working around the clock to restore services as swiftly and safely as possible. We appreciate your patience and support as we navigate this period.
I appreciate your understanding and continued trust in us.
Sincerely,
Mark