Aleph Zero Blog

2025-02-12 Aleph Zero Mainnet Outage Post-Mortem

Feb 20, 2025

Incident Summary

On February 12, 2025, finalization stalled on the Aleph Zero Mainnet from 10:22 to 12:02 UTC and again from 12:14 to 13:10 UTC, leading to a total outage of 2 hours and 33 minutes.

Impact:

  • Transactions on Mainnet were temporarily halted.
  • Trading on exchanges remained functional, but withdrawals and deposits were unavailable.
  • All chain indexers stopped indexing new data.

Root Cause:

The outage was caused by human error: an incorrect sudo call was executed on the Aleph Zero Mainnet. [Transaction link]

Response & Recovery:

  • The Aleph Zero Team took immediate action, publishing a fixed aleph-node binary at 11:15 UTC, under an hour after the outage began.
  • Thanks to the swift response from community validators, the required upgrades were completed, restoring finalization within 3 hours.
  • External communication was timely and consistent, with updates provided across all general and Discord channels from the very beginning of the incident.

Lead-Up to the Incident

Each aleph-node binary supports two compatible finality versions—one current and one legacy. Validators can only participate in consensus if the on-chain finality version matches one of these values.

At the time of the incident:

  • The Mainnet was running r-14.0.0, supporting finality versions 3 (legacy) and 4 (current).
  • The upcoming Testnet release candidate (r-15.0.0-rc1) supported versions 4 (legacy) and 5 (current).
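
The compatibility rule above can be sketched as a small lookup. This is an illustrative model rather than aleph-node code: the `SUPPORTED` table and the `can_participate` helper are hypothetical names, and only the release names and version numbers come from this post.

```python
# Finality versions each release supports, as (legacy, current).
# The numbers are taken from this post; the table itself is a hypothetical model.
SUPPORTED = {
    "r-14.0.0": (3, 4),
    "r-15.0.0-rc1": (4, 5),
}


def can_participate(release: str, on_chain_version: int) -> bool:
    """Return True if a validator running `release` can join consensus."""
    return on_chain_version in SUPPORTED[release]


for version in (3, 4, 5):
    ok = can_participate("r-14.0.0", version)
    print(f"r-14.0.0 with on-chain finality version {version}: {ok}")
```

Under this model, r-14.0.0 is compatible with on-chain versions 3 and 4 but not with 5, while r-15.0.0-rc1 is compatible with 4 and 5.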

Where Things Went Wrong

  • Changing the finality version on-chain requires a manual sudo call—a process managed by a small, trusted group with access to the sudo account.
  • A sudo call requires two approvals before it can be signed and executed on-chain.
  • Due to human error and miscommunication, the wrong finality version (5) was set on Mainnet instead of 4, leading to the outage.
  • The Mainnet 14 upgrade documentation did not explicitly state the correct finality version, contributing to the confusion.
  • With no automated safeguards in place, the error was not detected until it caused the outage.
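
The post does not describe what an automated safeguard would look like. One possible shape, shown here purely as a sketch, is a pre-flight check that refuses a finality version the deployed binary does not support. The function name `check_finality_version_change` and its parameters are assumptions made for illustration.

```python
def check_finality_version_change(proposed: int, deployed_supported: set[int]) -> None:
    """Raise ValueError if the proposed version would strand deployed validators."""
    if proposed not in deployed_supported:
        raise ValueError(
            f"finality version {proposed} is not supported by the deployed "
            f"binary (supported: {sorted(deployed_supported)})"
        )


# Mainnet ran r-14.0.0 at the time, which supported versions 3 and 4.
mainnet_supported = {3, 4}

check_finality_version_change(4, mainnet_supported)  # intended value: passes silently

try:
    check_finality_version_change(5, mainnet_supported)  # value that was actually set
except ValueError as err:
    print(f"rejected: {err}")
```

A check of this kind would have to run before the call is signed, since the mismatch only takes effect when the next session begins.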

Fault & Outage Timeline

At 10:22 UTC, when session #113915 began, all validators switched to the on-chain finality version 5. Since all Mainnet nodes were running release 14 or earlier, which did not support version 5 (r-14.0.0 supported only versions 3 and 4), every validator crashed immediately, halting finalization.

  • Validators produced 20 blocks before stopping completely, as expected when finalization fails.
  • The first recovery attempt succeeded at 12:02, when enough validators had upgraded their nodes.
  • However, finalization broke again at 12:14, when the next session (#113916) began, since not all reserved nodes had updated their binaries.
  • The outage was fully resolved at 13:10, when enough validators had updated their nodes.
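
As a quick cross-check of the timeline above, the length of each stall window can be computed from the published times. This is a minimal sketch using minute-resolution timestamps only; the helper names are illustrative.

```python
from datetime import datetime, timedelta

# Stall windows in UTC on 2025-02-12, as (start, end), from the timeline above.
STALLS = [("10:22", "12:02"), ("12:14", "13:10")]


def window_length(start: str, end: str) -> timedelta:
    """Length of a same-day window given HH:MM strings."""
    fmt = "%H:%M"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


def total_minutes(windows) -> int:
    """Sum of all window lengths, in whole minutes."""
    total = sum((window_length(s, e) for s, e in windows), timedelta())
    return int(total.total_seconds() // 60)


for start, end in STALLS:
    minutes = int(window_length(start, end).total_seconds() // 60)
    print(f"{start}-{end}: {minutes} min")
print(f"total: {total_minutes(STALLS)} min")
```

At minute resolution the two windows come to 100 and 56 minutes, or 156 minutes (2 hours and 36 minutes) in total. That is slightly above the 2 hours and 33 minutes reported in the summary, presumably because the published timestamps are rounded.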

Impact Analysis

Effects on Users

  • Transaction Processing Halted: Users were unable to execute transactions or interact with smart contracts.
  • Funds Remained Secure: Despite the outage, all on-chain assets remained safe, as the blockchain was fully preserved.

Effects on Exchanges

  • Trading Continued: Users could continue buying and selling tokens on exchanges.
  • Withdrawals & Deposits Suspended: Users were temporarily unable to withdraw or deposit tokens. Most exchanges marked this period as “Maintenance” to manage expectations.

Effects on Infrastructure

  • Chain Indexers Stalled: Tools like Subscan could not display new blocks during the outage.
  • RPC Nodes Unaffected: Since finality version mismatches only affect validators, RPC endpoints (e.g., azero.dev) continued to function normally.

Detection & Initial Response

The issue was detected immediately: a routine sync call (started at 10:15 UTC) was in progress when session #113915 began, and a team member noticed the stalled finalization.
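
In this incident the stall was spotted by a person. For illustration, the same signal could be checked mechanically by comparing successive samples of the best and finalized block heights. The sketch below is an assumption about how such a monitor might work, not a description of Aleph Zero tooling; how the heights are obtained (for example by polling an RPC endpoint) is left out, and the heights used are made up.

```python
def finalization_stalled(samples: list[tuple[int, int]], min_samples: int = 3) -> bool:
    """Detect a stall from (best_block, finalized_block) height samples.

    A stall is flagged when the finalized height has not moved at all
    across the last `min_samples` samples.
    """
    if len(samples) < min_samples:
        return False
    recent = samples[-min_samples:]
    finalized_heights = {finalized for _, finalized in recent}
    return len(finalized_heights) == 1


# Made-up heights: finalization keeps pace with block production.
healthy = [(1000, 998), (1001, 999), (1002, 1000)]
# Made-up heights: the best block runs 20 ahead, then everything stops.
stalled = [(1010, 1000), (1020, 1000), (1020, 1000)]

print(finalization_stalled(healthy))
print(finalization_stalled(stalled))
```

The sampling interval and `min_samples` would need tuning so that ordinary finalization delay does not trigger an alert.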

Response Actions

  • A war room was established within minutes.
  • By 10:30 UTC, the Marketing team joined the war room to coordinate external communication.
  • At 10:45 UTC, a unified message was posted across all channels, informing the community about the issue and ongoing resolution efforts.

Community Involvement

  • The community quickly noticed the outage, with validators and exchange partners reaching out for updates.
  • By 11:15 UTC, direct outreach to validators ensured they upgraded their nodes promptly.

Recovery & Resolution

Solution Implemented

  • Engineers initially considered a chainspec override, but a simpler solution was proposed:
    • Releasing a new aleph-node binary (14.1.0) that recognized finality version 5 as valid.
    • Signing a sudo call to revert the finality version back to 3, ensuring long-term stability.
  • The fixed binary was published at 11:15 UTC, and validators began upgrading.

Final Resolution

  • By 12:02 UTC, enough validators in session #113915 had upgraded, restoring finalization.
  • However, in session #113916 (12:14 UTC), finalization stalled again due to a few remaining outdated nodes.
  • By 13:10 UTC, enough validators had upgraded, and finalization was fully restored.
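
The post does not say how many validators count as "enough". The sketch below assumes the usual BFT rule that strictly more than two thirds of a session's committee must be able to take part, and it assumes that the committee changes between sessions. The node names, committees and upgrade sets are invented for illustration.

```python
def has_quorum(committee: list[str], upgraded: set[str]) -> bool:
    """True if strictly more than two thirds of the committee is upgraded.

    The two-thirds threshold is an assumption, not a figure from this post.
    """
    ready = sum(1 for node in committee if node in upgraded)
    return 3 * ready > 2 * len(committee)


# Hypothetical committees for two consecutive sessions.
session_a = ["r1", "r2", "r3", "n1", "n2", "n3"]
session_b = ["r1", "r2", "r3", "n4", "n5", "n6"]

# Hypothetical sets of upgraded nodes at two points in time.
upgraded_first = {"r1", "r2", "n1", "n2", "n3", "n4", "n5"}
upgraded_later = upgraded_first | {"r3", "n6"}

print(has_quorum(session_a, upgraded_first))  # 5 of 6 upgraded
print(has_quorum(session_b, upgraded_first))  # 4 of 6 upgraded
print(has_quorum(session_b, upgraded_later))  # 6 of 6 upgraded
```

In this toy scenario the first session reaches quorum, the next one does not until more of its members upgrade, which mirrors the second stall described above.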

Preventative Measures & Process Improvements

To prevent similar incidents in the future, the sudo call process will be formalized with additional safeguards:

Pre-Execution Documentation:

  • Every sudo call will be documented in a dedicated Notion page, detailing the date, time, and exact parameters.

Approval Process:

  • Before execution, the sudo call must be reviewed by at least two people:
    • The second signatory.
    • A trusted reviewer outside the sudo holders group.
  • Only after both approvals are recorded will the sudo call be executed.

Verification Step:

  • Both signers must manually verify parameters before execution.
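
The three steps above could be captured in a simple record that must be complete before a call is executed. This is a hypothetical sketch: the `SudoCallRecord` class, its fields and the example names are assumptions and do not describe the team's actual Notion template or tooling.

```python
from dataclasses import dataclass, field


@dataclass
class SudoCallRecord:
    """Hypothetical pre-execution record for one sudo call."""

    call: str                # human-readable description of the call
    parameters: dict         # exact parameters to be submitted
    scheduled_utc: str       # planned date and time
    signers: tuple           # (first signatory, second signatory)
    sudo_holders: frozenset  # everyone with access to the sudo account
    approvals: set = field(default_factory=set)
    verified_by: set = field(default_factory=set)

    def ready_to_execute(self) -> bool:
        """Check approvals and parameter verification as described above."""
        second_signatory_ok = self.signers[1] in self.approvals
        external_ok = any(p not in self.sudo_holders for p in self.approvals)
        both_verified = set(self.signers) <= self.verified_by
        return second_signatory_ok and external_ok and both_verified


record = SudoCallRecord(
    call="set finality version",       # placeholder description
    parameters={"version": 4},
    scheduled_utc="2025-01-01 12:00",  # example value only
    signers=("alice", "bob"),
    sudo_holders=frozenset({"alice", "bob", "carol"}),
)
print(record.ready_to_execute())  # nothing approved or verified yet

record.approvals.update({"bob", "dave"})  # second signatory and an outside reviewer
record.verified_by.update({"alice", "bob"})
print(record.ready_to_execute())
```

Keeping the exact parameters in the record gives both reviewers the same values to compare against what is eventually signed.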

By implementing these changes, we aim to reduce the risk of human error and enhance operational resilience.

Timeline of Events

All times in UTC.

Time   Event
10:04  Incorrect sudo call executed (finality version set to 5).
10:22  Session #113915 begins; validators crash due to incompatible finality version.
10:22  War room established (sync call in progress).
10:30  Marketing team joins response efforts.
10:45  External communications posted to general channels.
11:15  Fixed aleph-node binary (14.1.0) published; validators begin updating.
12:02  Finalization restored for session #113915.
12:14  Session #113916 begins; finalization halts again.
13:10  Enough validators upgrade, finalization restored.

Conclusion

This incident highlighted the need for improved safeguards around sudo calls. Despite the human error, the quick response from the Aleph Zero team and validator community ensured a rapid recovery.

We appreciate the cooperation of our validators and community members during this process. Moving forward, we are committed to implementing stronger safeguards to enhance network reliability.
