by Mark Piller, September 13, 2022
In a business where one of your value propositions is uptime, outages are particularly painful. It is hard to describe the stress of an unplanned downtime, when you work against time and feel the customers’ pain with every “I am getting 502 errors” message.
I know the stress is just as bad on the customer side, when you know your app is not operational and there is nothing you can do about it. It is a pain I would not wish on anyone, not even our competition.
In this post, I will describe what happened yesterday, why it happened, and what we will do to prevent it in the future.
First, let’s dive into our infrastructure. Backendless Cloud runs in two different data centers, US and EU. The US cluster is located in Dallas, TX, and runs in a Tier 4 data center, which provides the highest level of redundancy across the board: dual power, dual internet connectivity blended across four separate providers, an enormous cooling facility, and world-class data and access security. We own our own hardware, and the architecture provides complete redundancy for each layer of the platform: gateways, routers, switches, web servers, app servers, code runners, database, file storage, and backup systems.
The database layer requires special attention here because that’s what failed yesterday.
To provide high availability for our database, we use a system called Percona. The data tier consists of three nodes, which helps distribute the load and implement various read/write strategies. Failure of any one node is not a critical event for the cluster; the remaining two nodes can handle all database operations just fine. However, a simultaneous failure (or unavailability) of any two nodes in a three-node cluster is fatal: the database goes into self-preservation mode to avoid a split-brain scenario and stops accepting incoming requests. Let’s remember that point.
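To make that quorum rule concrete, here is a minimal sketch in Python. It assumes a Galera-based cluster (such as Percona XtraDB Cluster) with the default, unweighted voting; real deployments can use weighted votes, so treat this as an illustration rather than a reference.

```python
# Simplified illustration of the quorum rule a Galera-based cluster applies.
# Real Galera also supports weighted votes; this assumes the default weights.

CLUSTER_SIZE = 3


def has_quorum(reachable_nodes: int, cluster_size: int = CLUSTER_SIZE) -> bool:
    """A partition keeps serving traffic only while it holds a strict
    majority of the last known cluster membership."""
    return reachable_nodes > cluster_size / 2


for reachable in (3, 2, 1):
    state = "Primary: accepts reads and writes" if has_quorum(reachable) \
        else "non-Primary: refuses requests to protect the data"
    print(f"{reachable}/{CLUSTER_SIZE} nodes reachable -> {state}")

# 3/3 nodes reachable -> Primary: accepts reads and writes
# 2/3 nodes reachable -> Primary: accepts reads and writes
# 1/3 nodes reachable -> non-Primary: refuses requests to protect the data
```

With three nodes, losing one still leaves a majority of two; losing two leaves a single node that cannot prove it holds the latest consistent state, so it stops serving requests.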
The sequence of unfortunate events yesterday started with a plan to upgrade the memory in one of the database servers. The procedure goes like this: remove the server node from the cluster, disconnect and open the server, add memory, put it back in, turn it on, and add the server back to the cluster. This is the kind of operation we had done many times in the past. It does not require any downtime, since the cluster can operate just fine with the two remaining nodes. As a result, there were no public announcements about the upcoming maintenance. Yes, the procedure carries a risk that one of the remaining nodes becomes unavailable; however, the odds of that are very low.
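In hindsight, a pre-flight check on the remaining nodes is a cheap extra safeguard before detaching anything. Below is a minimal sketch of what such a check could look like; the hostnames and credentials are placeholders, and it assumes a Galera-based Percona XtraDB Cluster exposing the standard wsrep_* status variables.

```python
# Minimal pre-flight check sketch; hostnames and credentials are placeholders.
# Assumes a Galera-based Percona XtraDB Cluster exposing the wsrep_* status variables.
import pymysql

REMAINING_NODES = ["db-node-2.internal", "db-node-3.internal"]  # hypothetical names


def node_status(host: str) -> dict:
    conn = pymysql.connect(host=host, user="monitor", password="***", connect_timeout=5)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SHOW GLOBAL STATUS WHERE Variable_name IN "
                "('wsrep_cluster_size', 'wsrep_cluster_status', 'wsrep_local_state_comment')"
            )
            return dict(cur.fetchall())
    finally:
        conn.close()


safe = True
for host in REMAINING_NODES:
    status = node_status(host)
    healthy = (
        status.get("wsrep_cluster_status") == "Primary"
        and status.get("wsrep_local_state_comment") == "Synced"
    )
    print(host, status, "OK" if healthy else "NOT HEALTHY")
    safe = safe and healthy

print("Safe to detach the node." if safe else "Abort: fix cluster health first.")
```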
A technician was dispatched to the data center…
12:30 pm – We began the procedure; our DevOps engineers remotely removed the server from the cluster and turned it off.
12:36 pm – The onsite technician started physical removal of the server from the rack.
12:38 pm – Our monitoring reported that the entire database cluster was unavailable.
12:41 pm – A user reported on our Slack channel that they could not access the service. Indeed, everyone accessing the service at that point was getting the 502 page.
What happened in those first 8 minutes that caused a 3-hour outage? There were several omissions in the actual protocol, and I will dig into them, but the root cause was plain, accidental human carelessness. While disconnecting one server, the network cable of another server (which happened to be another database node) was accidentally pulled, disconnecting that server as well. As a result, we ended up with a one-node database cluster, which, as you know from the description above, is pretty much useless.
Let me digress for a moment and talk about network cables. These are not the colorful red, yellow, or blue Ethernet cables with plastic plugs. Even those are not that easy to pull out of a computer. We’re talking about enterprise-grade, 10-gigabit Ethernet cables:
A loose cable… with about a 2-inch metal plug that goes into a socket and gets locked in there.
How is this possible, I asked?
It is.
One thing I learned in this adventure is that in the server administration world, everything is possible. You have to be ready for everything, even loose cables.
Back from the digression. Database cluster recovery is not a fast process. Keep in mind that at the time we learned the database was unavailable, we still did not know exactly what had happened. The monitoring log was filling up with a few hundred new lines per second.
So, in situations when you don’t know what’s going on, you go back to the basics: check network connectivity for every physical machine, then for the virtualized environments running on them (the latter is simpler, since our orchestration software shows us the status of every VM).
The network check showed that one of the servers was disconnected from the network. Bringing it back online was quick; however, we now faced the task of rebuilding the cluster. Rebuilding the cluster is a multi-stage process: the first node must be started in a special bootstrap mode, then the second node is connected and you wait until the two are synchronized. After that, the process is repeated with the third node. Unfortunately, it takes time, and that is what made the outage last as long as it did.
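To give a sense of why this stage dominated the outage, here is a rough sketch of the rebuild sequence, assuming a Galera-based Percona XtraDB Cluster running under systemd. The hostnames, service names, and remote-execution mechanism are illustrative, not our actual runbook; the point is that each joining node has to copy state from a donor before it reports “Synced,” and that transfer takes time.

```python
# Rough sketch of the rebuild sequence; hosts, credentials, and service names
# are illustrative and vary by installation.
import subprocess
import time

import pymysql

NODES = ["db-node-1.internal", "db-node-2.internal", "db-node-3.internal"]  # hypothetical names


def run_on(host: str, command: str) -> None:
    # Illustrative remote execution; a real setup would use proper orchestration tooling.
    subprocess.run(["ssh", host, command], check=True)


def wait_until_synced(host: str) -> None:
    # A joining node copies state from a donor before it reports 'Synced'.
    # With a large dataset, this state transfer is what takes most of the time.
    while True:
        try:
            conn = pymysql.connect(host=host, user="monitor", password="***", connect_timeout=5)
        except pymysql.err.OperationalError:
            time.sleep(10)  # node is still starting up or transferring state
            continue
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'")
                _, state = cur.fetchone()
        finally:
            conn.close()
        if state == "Synced":
            return
        time.sleep(10)


# Step 1: bootstrap the first node in a special mode so it forms a new primary component.
run_on(NODES[0], "systemctl start mysql@bootstrap.service")
wait_until_synced(NODES[0])

# Steps 2 and 3: join the remaining nodes one at a time, waiting for each to finish syncing.
for host in NODES[1:]:
    run_on(host, "systemctl start mysql")
    wait_until_synced(host)
```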
So, what are the lessons we learned from this painful incident? I have put them in the list below:
- Before starting any procedure that involves physically removing a server, check all cables and make sure they are firmly plugged in.
- When going through a procedure coordinated between onsite and offsite staff, acknowledge each step done by either party. Do not proceed to the next step until it is confirmed that the previous one did not result in any kind of failure.
- Add network interface redundancy to each server, so that when one cable is pulled/cut/shredded/chewed/burned/etc., another one still provides connectivity (this has already been scheduled).
- Plan physical server maintenance outside of main business hours.
Finally, if you were impacted by our outage yesterday, please accept my apology. We do not take these incidents lightly. I know an incident like this damages your trust in us, and I hope you will give us a chance to rebuild that trust.