2025-02-12 Aleph Zero Mainnet Outage Post-Mortem
Feb 20, 2025
Incident Summary
On February 12, 2025, finalization stalled on the Aleph Zero Mainnet twice, from 10:22 to 12:02 UTC and from 12:14 to 13:10 UTC, for a total outage of approximately 2 hours and 36 minutes.
Impact:
- Transactions on Mainnet were temporarily halted.
- Trading on exchanges remained functional, but withdrawals and deposits were unavailable.
- All chain indexers stopped indexing new data.
Root Cause:
The outage was caused by human error: an incorrect sudo call was executed on the Aleph Zero Mainnet. [Transaction link]
Response & Recovery:
- The Aleph Zero Team took immediate action, publishing a fixed aleph-node binary at 11:15 UTC, 53 minutes after the outage began.
- Thanks to the swift response from community validators, the required upgrades were completed, restoring finalization within 3 hours.
- External communication was timely and consistent, with updates provided across all general and Discord channels from the very beginning of the incident.
Lead-Up to the Incident
Each aleph-node binary supports two compatible finality versions—one current and one legacy. Validators can only participate in consensus if the on-chain finality version matches one of these values.
At the time of the incident:
- The Mainnet was running r-14.0.0, supporting finality versions 3 (legacy) and 4 (current).
- The upcoming Testnet release candidate (r-15.0.0-rc1) supported versions 4 (legacy) and 5 (current).
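The compatibility rule above can be expressed as a small check. The following Python sketch is illustrative only: the `Release` class and its field names are hypothetical and are not taken from the aleph-node codebase. Only the release names and version numbers come from this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """Hypothetical model of an aleph-node release and its two finality versions."""
    name: str
    legacy: int
    current: int

    def supports(self, on_chain_version: int) -> bool:
        # A validator can take part in consensus only if the on-chain
        # finality version equals its binary's legacy or current version.
        return on_chain_version in (self.legacy, self.current)


MAINNET = Release("r-14.0.0", legacy=3, current=4)
TESTNET_RC = Release("r-15.0.0-rc1", legacy=4, current=5)

for version in (3, 4, 5):
    print(version, MAINNET.supports(version), TESTNET_RC.supports(version))
```

Under this model, version 4 is the only value both releases accept, and version 5 is rejected by r-14.0.0.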
Where Things Went Wrong
- Changing the finality version on-chain requires a manual sudo call—a process managed by a small, trusted group with access to the sudo account.
- Each sudo call requires two approvals before it is signed and submitted on-chain.
- Due to human error and miscommunication, the wrong finality version (5) was set on Mainnet instead of 4, leading to the outage.
- The Mainnet 14 upgrade documentation did not explicitly state the correct finality version, contributing to the confusion.
- With no automated safeguards in place, the error was not detected until it caused the outage.
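An automated safeguard of the kind that was missing might look like the sketch below. It is an assumption about how such a check could work, not a description of existing Aleph Zero tooling: the function names and the `SUPPORTED` table are hypothetical, and in practice the list of deployed releases would have to come from real network telemetry.

```python
# Finality versions supported by each release, as stated in this report.
SUPPORTED = {
    "r-14.0.0": {3, 4},
    "r-15.0.0-rc1": {4, 5},
}


def unsupported_releases(proposed, deployed):
    """Return the deployed releases that do not support the proposed version."""
    return sorted(r for r in deployed if proposed not in SUPPORTED[r])


def preflight(proposed, deployed):
    """Refuse a finality version change that any deployed release cannot handle."""
    blockers = unsupported_releases(proposed, deployed)
    if blockers:
        raise ValueError(
            f"finality version {proposed} is not supported by: {', '.join(blockers)}"
        )


preflight(4, ["r-14.0.0"])  # passes silently
try:
    preflight(5, ["r-14.0.0"])
except ValueError as err:
    print("blocked:", err)
```

Run before signing, a check like this would have rejected version 5 for a network running only r-14.0.0.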
Fault & Outage Timeline
At 10:22 UTC, when session #113915 began, all validators switched to the on-chain finality version 5. No Mainnet node was running a release newer than r-14.0.0, which supported only finality versions 3 and 4, so every validator crashed immediately and finalization halted.
- Validators produced 20 blocks before stopping completely, as expected when finalization fails.
- The first recovery attempt succeeded at 12:02, when enough validators had upgraded their nodes.
- However, finalization broke again at 12:14, when the next session (#113916) began, since not all reserved nodes had updated their binaries.
- The outage was fully resolved at 13:10, when enough validators had updated their nodes.
Impact Analysis
Effects on Users
- Transaction Processing Halted: Users were unable to execute transactions or interact with smart contracts.
- Funds Remained Secure: Despite the outage, all on-chain assets remained safe, as the blockchain was fully preserved.
Effects on Exchanges
- Trading Continued: Users could continue buying and selling tokens on exchanges.
- Withdrawals & Deposits Suspended: Users were temporarily unable to withdraw or deposit tokens. Most exchanges marked this period as “Maintenance” to manage expectations.
Effects on Infrastructure
- Chain Indexers Stalled: Tools like Subscan could not display new blocks during the outage.
- RPC Nodes Unaffected: Since finality version mismatches only affect validators, RPC endpoints (e.g., azero.dev) continued to function normally.
Detection & Initial Response
The issue was detected almost immediately. A routine sync call had started at 10:15 UTC, and when finalization stalled at 10:22 UTC in session #113915, a team member on the call noticed it.
Response Actions
- A war room was established within minutes.
- By 10:30 UTC, the Marketing team joined the war room to coordinate external communication.
- At 10:45 UTC, a unified message was posted across all channels, informing the community about the issue and ongoing resolution efforts.
Community Involvement
- The community quickly noticed the outage, with validators and exchange partners reaching out for updates.
- By 11:15 UTC, direct outreach to validators ensured they upgraded their nodes promptly.
Recovery & Resolution
Solution Implemented
- Engineers initially considered a chainspec override, but a simpler solution was proposed:
  - Releasing a new aleph-node binary (14.1.0) that recognized finality version 5 as valid.
  - Signing a sudo call to revert the finality version back to 3, ensuring long-term stability.
- The fixed binary was published at 11:15 UTC, and validators began upgrading.
Final Resolution
- By 12:02 UTC, enough validators in session #113915 had upgraded, restoring finalization.
- However, in session #113916 (12:14 UTC), finalization stalled again due to a few remaining outdated nodes.
- By 13:10 UTC, enough validators had upgraded, and finalization was fully restored.
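The second stall can be illustrated with a simple quorum model. This report only says that "enough" validators had to upgrade, so the sketch below rests on two assumptions: that finalization needs strictly more than two thirds of a session's committee, as is typical for BFT protocols, and that the committee differs from one session to the next. The validator names and committee sizes are invented for illustration.

```python
def can_finalize(committee, upgraded):
    """True if strictly more than 2/3 of the committee run a compatible binary.

    The 2/3 threshold is an assumption (typical BFT quorum), not a figure
    taken from this report.
    """
    compatible = sum(1 for validator in committee if validator in upgraded)
    return 3 * compatible > 2 * len(committee)


# Hypothetical committees for two consecutive sessions.
session_113915 = ["v1", "v2", "v3", "v4", "v5", "v6"]
session_113916 = ["v3", "v4", "v5", "v6", "v7", "v8"]

upgraded = {"v1", "v2", "v3", "v4", "v5"}
print(can_finalize(session_113915, upgraded))  # 5 of 6 upgraded
print(can_finalize(session_113916, upgraded))  # only 3 of 6 upgraded

upgraded |= {"v6", "v7"}
print(can_finalize(session_113916, upgraded))  # 5 of 6 upgraded
```

In this model the same set of upgraded nodes is sufficient for one session and insufficient for the next, which matches the pattern of a recovery at 12:02 followed by a new stall at 12:14.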
Preventative Measures & Process Improvements
To prevent similar incidents in the future, the sudo call process will be formalized with additional safeguards:
Pre-Execution Documentation:
- Every sudo call will be documented in a dedicated Notion page, detailing the date, time, and exact parameters.
Approval Process:
- Before execution, the sudo call must be reviewed by at least two people:
  - The second signatory.
  - A trusted reviewer outside the sudo holders group.
- Only after both approvals are recorded will the sudo call be executed.
Verification Step:
- Both signers must manually verify parameters before execution.
By implementing these changes, we aim to reduce the risk of human error and improve operational resilience.
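The documentation and approval steps above can be sketched as a small record type. This is a hypothetical illustration, not the team's actual tooling: the class, its fields, the call name `set_finality_version`, and the people named in the example are all invented.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SudoCallRecord:
    """Hypothetical record of a planned sudo call and its approvals."""
    call: str
    parameters: dict
    scheduled_for: datetime
    initiator: str
    sudo_holders: frozenset
    approvals: set = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        self.approvals.add(reviewer)

    def ready_to_execute(self) -> bool:
        # Requires a second signatory (a sudo holder other than the initiator)
        # and a reviewer from outside the sudo holders group.
        second_signatory = any(
            a in self.sudo_holders and a != self.initiator for a in self.approvals
        )
        external_reviewer = any(a not in self.sudo_holders for a in self.approvals)
        return second_signatory and external_reviewer


record = SudoCallRecord(
    call="set_finality_version",  # hypothetical call name
    parameters={"version": 4},
    scheduled_for=datetime(2025, 2, 12, 10, 4, tzinfo=timezone.utc),
    initiator="alice",
    sudo_holders=frozenset({"alice", "bob"}),
)
record.approve("bob")
print(record.ready_to_execute())  # external review still missing
record.approve("carol")
print(record.ready_to_execute())  # both approvals recorded
```

Recording the exact parameters in the same object that tracks approvals means reviewers sign off on the values that will actually be submitted.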
Timeline of Events
All times in UTC.
| Time | Event |
| --- | --- |
| 10:04 | Incorrect sudo call executed (finality version set to 5). |
| 10:22 | Session #113915 begins—validators crash due to incompatible finality version. |
| 10:22 | War room established (sync call in progress). |
| 10:30 | Marketing team joins response efforts. |
| 10:45 | External communications posted to general channels. |
| 11:15 | Fixed aleph-node binary (14.1.0) published; validators begin updating. |
| 12:02 | Finalization restored for session #113915. |
| 12:14 | Session #113916 begins—finalization halts again. |
| 13:10 | Enough validators upgrade, finalization restored. |
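The durations implied by the table can be checked with a few lines of Python. The sketch uses only the minute-level timestamps listed above; second-level timestamps, which this report does not give, could shift the totals by a minute or two. The helper name `minutes_between` is hypothetical.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps (UTC)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# The two finalization stalls listed in the timeline.
STALLS = [("10:22", "12:02"), ("12:14", "13:10")]

stall_minutes = [minutes_between(start, end) for start, end in STALLS]
total_hours, total_minutes = divmod(sum(stall_minutes), 60)

print(stall_minutes)                       # [100, 56]
print(total_hours, total_minutes)          # 2 36
print(minutes_between("10:04", "10:22"))   # 18: sudo call to first stall
print(minutes_between("10:22", "11:15"))   # 53: first stall to fixed binary
```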
Conclusion
This incident highlighted the need for improved safeguards around sudo calls. Despite the human error, the quick response from the Aleph Zero team and validator community ensured a rapid recovery. We appreciate the cooperation of our validators and community members during this process. Moving forward, we are committed to implementing stronger safeguards to enhance network reliability.