2025-02-12 Aleph Zero Mainnet Outage Post-Mortem
Feb 20, 2025

Incident Summary
On February 12, 2025, finalization stalled on the Aleph Zero Mainnet during two windows, 10:22–12:02 UTC and 12:14–13:10 UTC, for a total outage of about 2 hours and 36 minutes.
Impact:
- Transactions on Mainnet were temporarily halted.
- Trading on exchanges remained functional, but withdrawals and deposits were unavailable.
- All chain indexers stopped indexing new data.
Root Cause:
The outage was caused by human error: an incorrect sudo call was executed on the Aleph Zero Mainnet. [Transaction link]
Response & Recovery:
- The Aleph Zero Team took immediate action, publishing a fixed aleph-node binary at 11:15 UTC, 53 minutes after the outage began.
- Thanks to the swift response from community validators, the required upgrades were completed, restoring finalization within 3 hours.
- External communication was timely and consistent, with updates provided across all general and Discord channels from the very beginning of the incident.
Lead-Up to the Incident
Each aleph-node binary supports two compatible finality versions—one current and one legacy. Validators can only participate in consensus if the on-chain finality version matches one of these values.
At the time of the incident:
- The Mainnet was running r-14.0.0, supporting finality versions 3 (legacy) and 4 (current).
- The upcoming Testnet release candidate (r-15.0.0-rc1) supported versions 4 (legacy) and 5 (current).
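The compatibility rule above can be sketched in a few lines. This is a minimal illustration, not aleph-node's actual code: the `SUPPORTED` table and the `can_participate` helper are hypothetical names, and only the release names and version numbers come from this report.

```python
# Illustrative model of the finality-version rule; not aleph-node's real code.
# Each release maps to the (legacy, current) finality versions it supports.
SUPPORTED = {
    "r-14.0.0": (3, 4),
    "r-15.0.0-rc1": (4, 5),
}


def can_participate(release: str, on_chain_version: int) -> bool:
    """Return True if a validator running `release` can join consensus."""
    return on_chain_version in SUPPORTED[release]


# Check each on-chain value against the Mainnet release.
for version in (3, 4, 5):
    print(version, can_participate("r-14.0.0", version))
```

Under this model, an on-chain value of 5 leaves every r-14.0.0 validator unable to participate, while r-15.0.0-rc1 nodes would accept it.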
Where Things Went Wrong
- Changing the finality version on-chain requires a manual sudo call—a process managed by a small, trusted group with access to the sudo account.
- Executing a sudo call requires two approvals before being signed on-chain.
- Due to human error and miscommunication, the wrong finality version (5) was set on Mainnet instead of 4, leading to the outage.
- The Mainnet 14 upgrade documentation did not explicitly state the correct finality version, contributing to the confusion.
- With no automated safeguards in place, the error was not detected until it caused the outage.
Fault & Outage Timeline
At 10:22 UTC, when session #113915 began, all validators switched to the on-chain finality version 5. Because every Mainnet node was running an aleph-node release no newer than r-14.0.0, which supports only finality versions 3 and 4, every validator crashed immediately, halting finalization.
- Validators produced 20 blocks before stopping completely, as expected when finalization fails.
- The first recovery attempt succeeded at 12:02, when enough validators had upgraded their nodes.
- However, finalization broke again at 12:14, when the next session (#113916) began, since not all reserved nodes had updated their binaries.
- The outage was fully resolved at 13:10, when enough validators had updated their nodes.
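One way to read "enough validators" is as a per-session quorum check. The sketch below assumes the textbook BFT bound, where a committee of n members tolerates at most floor((n-1)/3) unavailable members. This report does not state the threshold aleph-node actually uses, and the committee members and upgrade sets are made up for illustration.

```python
# Hypothetical quorum model; the threshold is the textbook BFT bound,
# not a value confirmed by this report.
def quorum(committee_size: int) -> int:
    """Minimum number of working members needed to finalize."""
    tolerated_faults = (committee_size - 1) // 3
    return committee_size - tolerated_faults


def can_finalize(committee: set[str], upgraded: set[str]) -> bool:
    """True if enough committee members run a compatible binary."""
    return len(committee & upgraded) >= quorum(len(committee))


# Made-up data: the committee changes between sessions, so the same set
# of upgraded nodes can satisfy one session's quorum but not the next.
upgraded = {"v1", "v2", "v3", "v4", "v5", "v6", "v7"}
session_a = {"v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10"}
session_b = {"v1", "v2", "v3", "v4", "v5", "v6", "r1", "r2", "r3", "r4"}

print(can_finalize(session_a, upgraded))  # 7 of 10 members upgraded
print(can_finalize(session_b, upgraded))  # 6 of 10 members upgraded
```

With these illustrative numbers the first committee reaches quorum and the second does not, which mirrors how finalization could resume in one session and stall again in the next.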
Impact Analysis
Effects on Users
- Transaction Processing Halted: Users were unable to execute transactions or interact with smart contracts.
- Funds Remained Secure: Despite the outage, all on-chain assets remained safe, as the blockchain was fully preserved.
Effects on Exchanges
- Trading Continued: Users could continue buying and selling tokens on exchanges.
- Withdrawals & Deposits Suspended: Users were temporarily unable to withdraw or deposit tokens. Most exchanges marked this period as “Maintenance” to manage expectations.
Effects on Infrastructure
- Chain Indexers Stalled: Tools like Subscan could not display new blocks during the outage.
- RPC Nodes Unaffected: Since finality version mismatches only affect validators, RPC endpoints (e.g., azero.dev) continued to function normally.
Detection & Initial Response
The issue was detected immediately. A routine sync call that started at 10:15 UTC was still in progress when session #113915 began at 10:22 UTC, and a team member on the call noticed that finalization had stalled.
Response Actions
- A war room was established within minutes.
- By 10:30 UTC, the Marketing team joined the war room to coordinate external communication.
- At 10:45 UTC, a unified message was posted across all channels, informing the community about the issue and ongoing resolution efforts.
Community Involvement
- The community quickly noticed the outage, with validators and exchange partners reaching out for updates.
- By 11:15 UTC, direct outreach to validators ensured they upgraded their nodes promptly.
Recovery & Resolution
Solution Implemented
- Engineers initially considered a chainspec override, but a simpler solution was proposed:
  - Releasing a new aleph-node binary (14.1.0) that recognized finality version 5 as valid.
  - Signing a sudo call to revert the finality version back to 3, ensuring long-term stability.
- The fixed binary was published at 11:15 UTC, and validators began upgrading.
Final Resolution
- By 12:02 UTC, enough validators in session #113915 had upgraded, restoring finalization.
- However, in session #113916 (12:14 UTC), finalization stalled again due to a few remaining outdated nodes.
- By 13:10 UTC, enough validators had upgraded, and finalization was fully restored.
Preventative Measures & Process Improvements
To prevent similar incidents in the future, the sudo call process will be formalized with additional safeguards:
Pre-Execution Documentation:
- Every sudo call will be documented in a dedicated Notion page, detailing the date, time, and exact parameters.
Approval Process:
- Before execution, the sudo call must be reviewed by at least two people:
  - The second signatory.
  - A trusted reviewer outside the sudo holders group.
- Only after both approvals are recorded will the sudo call be executed.
Verification Step:
- Both signers must manually verify parameters before execution.
By implementing these changes, we aim to reduce the risk of human error and improve operational resilience.
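The approval rules above could be encoded as a simple gate. This is a minimal sketch, assuming a hypothetical `SudoCallRecord` type. The field names, people, date, and parameter values are illustrative and do not describe the team's actual tooling; the report mentions only a Notion page.

```python
from dataclasses import dataclass, field


@dataclass
class SudoCallRecord:
    """Hypothetical pre-execution record for one sudo call."""
    description: str
    scheduled_utc: str
    parameters: dict
    proposer: str
    sudo_holders: frozenset
    approvals: set = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        """Record an approval; the proposer cannot approve their own call."""
        if reviewer == self.proposer:
            raise ValueError("the proposer cannot approve their own call")
        self.approvals.add(reviewer)

    def ready_to_execute(self) -> bool:
        """Require a second signatory and a reviewer outside the sudo group."""
        second_signatory = any(r in self.sudo_holders for r in self.approvals)
        outside_reviewer = any(r not in self.sudo_holders for r in self.approvals)
        return second_signatory and outside_reviewer


# Illustrative usage with made-up names and values.
record = SudoCallRecord(
    description="Set finality version",
    scheduled_utc="2025-03-01 10:00",
    parameters={"finality_version": 4},
    proposer="alice",
    sudo_holders=frozenset({"alice", "bob"}),
)
record.approve("bob")    # second signatory
print(record.ready_to_execute())
record.approve("carol")  # reviewer outside the sudo holders group
print(record.ready_to_execute())
```

The gate only checks that both kinds of approval exist. It does not validate the parameters themselves, so the manual verification step still matters.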
Timeline of Events
All times in UTC.
| Time | Event |
| --- | --- |
| 10:04 | Incorrect sudo call executed (finality version set to 5). |
| 10:22 | Session #113915 begins; validators crash due to incompatible finality version. |
| 10:22 | War room established (sync call in progress). |
| 10:30 | Marketing team joins response efforts. |
| 10:45 | External communications posted to general channels. |
| 11:15 | Fixed aleph-node binary (14.1.0) published; validators begin updating. |
| 12:02 | Finalization restored for session #113915. |
| 12:14 | Session #113916 begins; finalization halts again. |
| 13:10 | Enough validators upgrade, finalization restored. |
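The length of each stall can be derived from the timestamps in the timeline. The sketch below uses only the minute-granularity times listed above, so the result is approximate; the `WINDOWS` list and `minutes_between` helper are names introduced here for illustration.

```python
from datetime import datetime

# Stall windows (UTC) taken from the timeline, at minute granularity.
WINDOWS = [("10:22", "12:02"), ("12:14", "13:10")]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


durations = [minutes_between(start, end) for start, end in WINDOWS]
total = sum(durations)
hours, minutes = divmod(total, 60)
print(durations, f"{hours} h {minutes} min")
```

The two windows come to 100 and 56 minutes, or 2 hours and 36 minutes in total at this precision.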
Conclusion
This incident highlighted the need for improved safeguards around sudo calls. Despite the human error, the quick response from the Aleph Zero team and validator community ensured a rapid recovery. We appreciate the cooperation of our validators and community members during this process. Moving forward, we are committed to implementing stronger safeguards to enhance network reliability.