As some of you already know, on Monday, May 2nd, the infrastructure running the Aleph Zero blockchain experienced a partial service outage, resulting in slower block production and temporary delays in transaction finalization. This article presents a detailed report of the incident and the way it was mitigated.
Currently, in the first phase of our journey toward decentralization, the Aleph Zero blockchain is maintained by a committee of ten validator nodes controlled by the Aleph Zero Foundation. These nodes are deployed as virtual machines hosted on a state-of-the-art cloud computing service provider (AWS) and spread out over five different geographical regions. On top of that, our infrastructure consists of several non-validator nodes, optimized for handling external traffic, and additional layers responsible for load balancing, storage backups, logs and metrics aggregation, and many more.
Every element of our setup followed the best DevOps practices and was thoroughly reviewed and tested before being deployed. Unfortunately, sometimes mistakes happen and bugs manage to slip through even the most rigorous sieve of tests and reviews. We recently introduced a small maintenance script dedicated to cleaning up old and unused storage. The script turned out to contain a bug that caused it to clean up much more than we intended…
The Nature of the Emergency
On the 2nd of May, at 08:00 AM (CET), the scheduled cleaning script started and, after executing all its intended actions, proceeded to delete the active storage of four validator nodes, effectively disabling them and causing them to crash. The problem was immediately picked up by our monitoring system and the on-call developer was alerted. Within a few minutes, he assessed the situation and made the correct decision to escalate the issue. Shortly after, an emergency response team of senior developers and DevOps engineers was formed and a thorough investigation was started. Shortly before 09:00 AM, the cause was diagnosed, the problematic cleaning script was disabled, and a remediation plan was set in motion.
It’s worth noting here that the whole problem was not caused by any bug or misbehavior of Aleph Node — the code responsible for running each node of our blockchain. It was our misconfigured infrastructure that decided to simulate a full-blown massive-scale attack on the network by killing four out of ten validators. What Aleph Node itself did in this dire situation is actually quite remarkable.
Byzantine Fault Tolerance and Misbehaving Nodes
To fully understand the above, let’s take a quick dive into some blockchain theory. The best guarantees of Byzantine Fault Tolerant consensus protocols (such as AlephBFT) state that the theoretical limit of resilience against misbehaving nodes is one-third. In other words, no distributed system can guarantee to work properly if a third or more of its participants become dead, faulty, or adversarial. In our case of a ten-node committee, that means at most three nodes can fail, so the minimal number of nodes needed to keep the system running is seven.
That Monday morning our network was left with only six working nodes. And it kept going. The average block production time increased a bit (from 1 to 1.7 seconds) and the block finalization was stalled, as the network requires a minimum of seven committee members to proceed. But besides that, the Aleph Zero blockchain simply kept on working — responding to all external queries, accepting and applying transactions, and producing new blocks. All of that despite the fact that 40% of validator nodes were completely offline.
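The arithmetic behind these numbers can be sketched in a few lines. This is only an illustration of the standard BFT bound (a committee of n nodes tolerates at most f faulty nodes, where n ≥ 3f + 1), not code taken from Aleph Node. The function names are made up for this example, and the exact finalization threshold used by AlephBFT may be defined differently for other committee sizes.

```python
def max_faulty(n: int) -> int:
    """Largest f that satisfies n >= 3f + 1 (the classic BFT bound)."""
    return (n - 1) // 3


def min_required(n: int) -> int:
    """Nodes that must stay online when at most max_faulty(n) may fail."""
    return n - max_faulty(n)


def can_finalize(n: int, alive: int) -> bool:
    """True if enough committee members are online to keep finalizing."""
    return alive >= min_required(n)


committee = 10  # committee size described in this article
print(max_faulty(committee))       # 3
print(min_required(committee))     # 7
print(can_finalize(committee, 6))  # False: the situation on Monday morning
print(can_finalize(committee, 7))  # True: after one node was restored
```

With six nodes alive, the check fails by exactly one node, which is why restoring a single validator was enough to resume finalization.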
The Aleph Zero Team’s Emergency Response
Having diagnosed and understood the root cause, the emergency response team quickly realized that the correct way to fix the issue was to restore the lost nodes’ storage using backed-up data. To skip the lengthy process of a freshly booted node downloading the whole blockchain, the response team decided to copy the storage of one of the six live nodes to one of the dead nodes. Thanks to that, the minimum of seven nodes required for finalization was back up and running around 01:30 PM, and finalization caught up with the head of the blockchain around 02:45 PM. The remaining three dead validator nodes were restarted with a fresh database and synced naturally with the other nodes. Around 07:00 PM they caught up with the latest block and the system was back to its full capacity.
The situation described above was the first emergency since the launch of the Aleph Zero mainnet. Never before had the resilience of the whole network, or our emergency response procedures, been put to the test by a real-life, non-simulated incident. After a thorough internal investigation, we can safely say that we are quite proud of how both our technology and our people passed that first test. Despite the severe scale of the infrastructure disruption, our network kept operating with acceptable performance and the response team was able to quickly and safely diagnose and mitigate the problem.
Re-designing Internal Processes and Expediting Decentralization
The lessons we have learned from this story are of two kinds. The first kind involves our internal processes. We decided to introduce a stricter code review pipeline for all changes applied to the infrastructure. We also added a policy requiring the schedules of all maintenance actions to be more spread out in time, to minimize the possibility of such a multiple-node failure in the future. Our emergency response runbooks have been updated with the priceless experience we gathered during the incident.
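To illustrate the scheduling policy, here is a minimal sketch of how maintenance start times could be staggered so that no two validators run a job at the same moment. The node names, the start date, and the one-hour gap are hypothetical values chosen for the example. It is an illustration only and does not reflect the actual tooling.

```python
from datetime import datetime, timedelta


def staggered_schedule(nodes, first_start, gap):
    """Give each node its own start time, spaced `gap` apart."""
    return {node: first_start + i * gap for i, node in enumerate(nodes)}


# Hypothetical node names and an arbitrary start date, for illustration only.
nodes = [f"validator-{i}" for i in range(10)]
schedule = staggered_schedule(
    nodes,
    first_start=datetime(2030, 1, 1, 8, 0),
    gap=timedelta(hours=1),
)

for node, start in schedule.items():
    print(node, start.isoformat())
```

If a faulty job damages one node, a gap like this leaves time to notice the problem and pause the schedule before the next node is touched.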
The second, more universal take-home message is that one can never overestimate the importance of decentralization. This is something we all felt was true before, but now it has become even more apparent. The whole situation strengthened our conviction that moving toward full decentralization should and will be our main goal for the upcoming months.
Resetting the Testnet
The testnet is our test environment where we roll out new features and improvements. Each new version of Aleph Node needs to run on the testnet without any problems for at least a month before it gets deployed to the mainnet. For that reason, the testnet infrastructure is configured as an exact copy of the mainnet one. Together with all maintenance scripts. Including the infamous overeager storage cleaning script.
As a result, the same kind of validator storage wipeout happened on the testnet on that unfortunate Monday morning. Sadly, the consequences here turned out to be a bit more serious.
One of the main goals we have been working on recently is opening our network to external validators and introducing a mechanism for rotating the validator committee. Very recently, we have been performing tests that involved generating new sets of private keys for testnet validators. Because the feature was at an early stage, the backup scripts on the testnet had not yet been adapted to cover those new keys when the wipeout happened. As a consequence, some of the keys were lost forever.
Sadly, it is impossible to restore the testnet in its current state without implementing a dedicated ad-hoc recovery mechanism. After careful consideration, we decided not to do that and to reset the testnet instead. We will shut down the old testnet and start a new one this week.