
Recycling the Agreement: Distributed Consensus on DAGs

Aug 20, 2019

In recent years, we have witnessed a growing number of new-wave DAG-based cryptocurrencies. It started (most likely) with DAGCoin¹ in 2015, which was followed by many other projects, such as IOTA, Byteball, Hashgraph, Chainspace, PARSEC, Fantom, and, of course, Aleph.

In this article, we will first share our view of why DAGs are receiving increased attention in the space and then explain what we, Aleph, are bringing to the table. Our primary focus will be the core consensus layer, and we will skip the crypto-economic considerations. Usually, these protocols are deployed in such a way that the core consensus is established among a chosen committee (except for IOTA), and any user who wishes to perform a transaction must contact a committee member (or members), who include the transaction for consideration in the consensus on behalf of the user. The committee can be partially predefined, as in solutions such as Hedera Hashgraph, or fully elected and subsequently reelected by the community, as in the case of Aleph or, for example, Lisk.

Since our view on the whole DAG-space is heavily rooted in the academic setting of classical consensus protocols, we will start with a very brief historical introduction.

A Bit of History

Back in the ’90s, a new family of consensus protocols called PAXOS was introduced. PAXOS allows multiple servers to agree on a common decision and enables the implementation of State Machine Replication (SMR) systems, i.e., fault-tolerant services that coordinate client interactions with the service. The main idea is to coordinate the storage of data on multiple machines in such a way that, from the user's perspective, an interaction with the system is indistinguishable from communication with a single server. In other words, a user has his/her “drive” online and is not even aware that this “drive” is hosted on multiple servers because the inner workings are hidden under a friendly user interface.

How does this connect with blockchain? The way we see it, the primary goal of all distributed ledger technologies (DLT) is to reimplement services that traditionally rely on a trusted third party (usually a corporation or government) in a distributed and trustless manner. Sounds like SMR with a twist, right?

So why are there no 25-year-old PAXOS-based cryptocurrencies around? One important technical reason comes to mind. Constructing a reliable system in a permissionless open-world scenario — where users can actively try to disrupt the entire system — is entirely different than constructing a system in a permissioned setting where you own all the servers, and the worst thing that can happen is a simple server failure.

The Name of the Game

Using traditional terminology, what is needed to construct a modern cryptocurrency-like system is an atomic broadcast (ABC) or a total order broadcast, a scheme in which all the processes agree on a common ordering of all messages/transactions. Note that a properly implemented atomic broadcast is enough to stop double-spending attacks. Consider a case where a user tries to commit multiple conflicting transactions. In such a scenario, the transactions are accounted for according to the established order, ensuring the same funds will not be spent twice. Often this problem is phrased in terms of constructing a common clock showing the time of every transaction based on the internal clocks of all processes (which can be skewed for faulty processes).
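To make this concrete, here is a minimal sketch of why a common total order rules out double-spending: once every process applies transactions in the same agreed order, only the first of two conflicting transactions can succeed. The account names and balances are made up purely for illustration and are not part of any protocol.

```python
# Minimal sketch: every process applies transactions in the same agreed order,
# so only the first of two conflicting transactions can succeed.
# Names and balances are illustrative only.

def apply_in_order(balances, ordered_txs):
    """Apply transactions in the agreed total order, skipping invalid ones."""
    for sender, receiver, amount in ordered_txs:
        if balances.get(sender, 0) >= amount:   # sufficient funds?
            balances[sender] -= amount
            balances[receiver] = balances.get(receiver, 0) + amount
        # otherwise the transaction is rejected by every process alike

balances = {"alice": 10}
conflicting = [("alice", "bob", 10), ("alice", "carol", 10)]  # a double-spend attempt
apply_in_order(balances, conflicting)
print(balances)  # {'alice': 0, 'bob': 10} on every process; carol never gets paid
```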

The “chain” in blockchain represents precisely such an ordering. While the Nakamoto consensus offers a revolutionary permissionless solution to this problem, we can now see that it suffers from inherent problems of speed and scalability².

Ordering all transactions is actually overkill when trying to solve the problem of double-spending. Someone buying an iced tea in the United States should not need to wait for this transaction to be ordered relative to a transaction of someone buying a coffee in China, even if the two happen simultaneously. These two transactions can be processed independently and validated faster by using a data structure that keeps in mind their independence.

An easier problem to solve in a distributed system designed for currencies is a reliable broadcast (RBC). This allows all processes to agree on the validity of a given transaction in a finite time (without ordering all transactions) and is usually significantly faster than a total order broadcast. The application for cryptocurrency is pretty straightforward: the transactions of a given user are indexed with consecutive natural numbers, and any transaction is processed as soon as it and all of its predecessors are deemed valid by the consensus. Finally, all of this user’s transactions are accounted for in an order that corresponds to their indices, thus safeguarding against any conflicting transactions.
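As a rough illustration of this bookkeeping, the sketch below assumes a broadcast layer that eventually declares each indexed transaction valid; the class and method names are illustrative, not part of any actual protocol. A transaction is executed only once it and all of its lower-indexed predecessors have been validated.

```python
# Minimal sketch: a user's transactions are indexed 0, 1, 2, ... and a
# transaction is executed only once it and all of its predecessors have been
# deemed valid by the (reliable) broadcast. Illustrative only.

class UserLedger:
    def __init__(self):
        self.validated = {}      # index -> transaction payload
        self.next_to_apply = 0   # lowest index not yet executed

    def on_validated(self, index, tx):
        """Called when the broadcast layer declares transaction `index` valid."""
        self.validated[index] = tx
        applied = []
        # execute as long as the next consecutive index is available
        while self.next_to_apply in self.validated:
            applied.append(self.validated.pop(self.next_to_apply))
            self.next_to_apply += 1
        return applied

ledger = UserLedger()
print(ledger.on_validated(1, "tx-1"))  # []  (tx-0 is still missing)
print(ledger.on_validated(0, "tx-0"))  # ['tx-0', 'tx-1']
```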

While RBC cannot replace ABC in all applications, it is a worthy add-on since it provides a secure way of performing accelerated transactions. Most significantly, with a little trick it is possible to attain RBC while implementing ABC without any additional communication between processes.

What Happened Before

In 1978, Leslie Lamport defined a notion that greatly simplifies thinking about the ordering of events in distributed systems, namely, the happened-before relation³. We will paraphrase the definition to better fit our illustrative purpose:

The happened-before relation is the partial order on events such that:

1. events occurring on the same process are totally ordered with respect to their time signatures, and

2. events on different processes are ordered with respect to the synchronization between processes. If process A exchanges information with process B at time t, all events in process A occurring before t are before all events occurring in process B after t, and, conversely, all events in process B before t are before all events in A after t.

This relation guarantees that the clocks of different processes are synchronized with respect to communication that occurs between them. This is a rather weak relation, but in some applications it is sufficient, whereas in others it can be used as a foundation for a clock with stronger properties.
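A classical example of a clock built on this foundation is Lamport's logical clock, which assigns counters consistent with the happened-before relation: if event a happened-before event b, then a's counter is smaller. The sketch below is a textbook version; the Process class is purely illustrative.

```python
# Minimal sketch of Lamport's logical clocks: counters consistent with the
# happened-before relation (a happened-before b implies clock(a) < clock(b)).
# The Process class is purely illustrative.

class Process:
    def __init__(self, name):
        self.name = name
        self.clock = 0

    def local_event(self):
        self.clock += 1
        return self.clock

    def send(self):
        self.clock += 1
        return self.clock          # the timestamp travels with the message

    def receive(self, msg_clock):
        # jump ahead of the sender's clock, then tick for the receive event
        self.clock = max(self.clock, msg_clock) + 1
        return self.clock

a, b = Process("A"), Process("B")
a.local_event()                    # A: 1
t = a.send()                       # A: 2, message carries timestamp 2
b.local_event()                    # B: 1
b.receive(t)                       # B: max(1, 2) + 1 = 3
```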

In a real-life scenario, processes can cheat and forge their timestamps. This makes timestamps alone unsuitable as a deciding factor for constructing a common total ordering. Usually, a complex algorithm is needed for this, but one observation is worth noting: if events are totally ordered in a “reasonable” way, then the ordering of events executed within a single process, as well as of the corresponding send/receive events, should agree with the happened-before relation.

Back to DAG

Let’s make one point clear: DAGs and partial orders (posets) are two different ways of describing (essentially) the same object. Every DAG can be transformed into a poset by adding transitivity (i.e., finding the transitive closure), and every poset can be interpreted as a DAG by focusing only on elements that are directly before/after each other. Slightly abusing notation, we will treat such a DAG structure as a poset, since we will never be interested in whether one transaction is directly above another; the notion of a set partially ordered by some relation > is the more natural description.

The common idea of new-wave protocols is the use of posets as an intermediate step in the process of finding the total order. Usually, new transactions (or transaction containers) are appended to the structure by referring to the hashes of some prior transactions, similar to Merkle trees. For the purpose of this article, we’ll refer to such structures as Merkle posets⁴. Various protocols then use different sets of rules for transforming this intermediate structure into the chosen final ordering. How this connects with the happened-before relation is shown by the following remark:

Any Merkle poset constructed by a consensus protocol needs to be a subposet of a happened-before relation, i.e., if T_1<T_2 in the Merkle poset, then T_1 happened-before T_2.

Proof: Since T_2>T_1, between their creation, the issuer of T_1 (say, process A) needed to pass the information about it (directly or indirectly) to the issuer of T_2 (process B). Hence, by definition, T_1 happened-before T_2.
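To make the structure concrete, here is a rough sketch of how such a Merkle poset can be built: each new unit commits to the hashes of some previously seen units, so it can only sit above them in the induced order. The field names below are illustrative and not Aleph's actual unit format.

```python
# Minimal sketch of a "Merkle poset": each element commits to the hashes of
# some previously seen elements, much as blocks in a Merkle tree commit to
# their children. Field names are illustrative only.

import hashlib, json

def make_unit(creator, parent_hashes, payload):
    body = {"creator": creator, "parents": sorted(parent_hashes), "payload": payload}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {"hash": digest, **body}

genesis = make_unit("A", [], "genesis")
u1 = make_unit("B", [genesis["hash"]], "tx-1")
u2 = make_unit("C", [genesis["hash"], u1["hash"]], "tx-2")
# u2 > u1 > genesis in the induced partial order, because a unit cannot
# reference a hash it has not yet seen.
```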

Layers of Consensus

Therefore, every Merkle poset constructed by one of the new-wave protocols needs to be a subposet (not necessarily an induced one) of the happened-before poset. Then, each of the processes keeps its version of the Merkle poset and regularly synchronizes it with the others. The rationale behind using this data structure is that it allows the protocols to separate the two layers in which they can operate:

  • In the first layer, processes add elements (for example, transactions) to the Merkle posets. Then a simple mechanism, for example a gossip protocol, is used to synchronize their posets. This layer is pretty straightforward: processes randomly connect and exchange information about new transactions. The trick is that the transactions are structured as “Merkle posets,” so they retain some information about communication between processes on this layer (a minimal sketch of this synchronization step follows the list). No safeguards against a double-spending attack are needed on this level, and transaction spamming is easily countered by small fees.
  • The second layer is where the protocol-unique magic happens. Once we have a locally stored Merkle poset, processes can compute the total ordering. The rules of this computation differ from one protocol to another, as do the assumptions under which protocols can safely operate. One common feature is that all computations on this layer are done locally, so processes do not need to exchange any additional information.
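Here is the promised sketch of the first layer's synchronization step, assuming the poset is stored as a dictionary keyed by unit hash: two processes simply send each other whatever the other is missing. Real gossip protocols are more economical than this, so treat it only as an illustration.

```python
# Minimal sketch of the first layer: two processes compare their local copies
# of the poset (keyed by unit hash) and exchange whatever the other is missing.
# Real gossip protocols are more refined; this is illustrative only.

def synchronize(poset_a, poset_b):
    """poset_a, poset_b: dicts mapping unit hash -> unit."""
    missing_in_b = {h: u for h, u in poset_a.items() if h not in poset_b}
    missing_in_a = {h: u for h, u in poset_b.items() if h not in poset_a}
    poset_b.update(missing_in_b)
    poset_a.update(missing_in_a)

# After repeated random pairwise synchronizations, every process ends up with
# the same Merkle poset and can run the second (purely local) layer on it.
```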

While this division may not be surprising for many crypto-enthusiasts since it is already fairly common in the field, it is not that common among classical consensus algorithms. Even though it feels so natural in crypto, it went somewhat unnoticed until Hashgraph started to market this concept as their “gossip-about-gossip”⁵ technology.

Consensus Is in the Eye of the Beholder

Why is this important? Mainly, it reduces communication costs so that each transaction needs to be sent only once to each process. Consensus can then be achieved using only structural poset information, with the idea that new transactions can be used to validate and order the old ones.

In traditional protocols, when processes agree on the validity of a particular transaction, multiple rounds of communication are necessary (depending on the particular assumptions of the underlying network), during which they inform each other of their opinions about the transaction in question alone. The key innovation of Merkle posets is that the information possessed by a process at the time of creating a transaction can be inferred from the position of the transaction in the poset. Having this information, a process does not have to be asked about its decision, since we can predict the answer. Hence, it can be said that the actual consensus is “simulated” locally in the Merkle poset, or “consensus is in the eye of the beholder,” as poetically stated by Danezis and Hrycyszyn⁶.
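The primitive behind this local simulation is checking whether one unit lies below another in the poset: if it does, the creator of the higher unit must already have known about the lower one when creating it, so its opinion can be inferred without asking. Below is a minimal reachability check over the illustrative unit format sketched earlier.

```python
# Minimal sketch of "simulating" a vote locally: if unit `low` is below unit
# `high` in the poset, the creator of `high` already knew about `low` when
# creating it, so its opinion can be inferred without further communication.
# Uses the illustrative parent-pointer representation from the earlier sketch.

def is_below(low, high, poset):
    """Is `low` reachable from `high` by following parent hashes?
    (poset: dict mapping unit hash -> unit; a unit counts as below itself.)"""
    stack, seen = [high], set()
    while stack:
        unit = stack.pop()
        if unit["hash"] == low["hash"]:
            return True
        for parent_hash in unit["parents"]:
            if parent_hash not in seen:
                seen.add(parent_hash)
                stack.append(poset[parent_hash])
    return False
```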

Reduce

In Aleph, we reduce the atomic broadcast (total order broadcast) problem to a binary agreement (BA) with a little twist. Binary agreement, as the name suggests, is a fairly straightforward notion: processes simply need to agree on a single binary value. But how do we reduce ABC to BA?

First, our Merkle poset is built from units that are containers for transactions and possibly other data. Units are divided into consecutive batches; each batch is ordered separately in a deterministic manner, and the batches themselves are totally ordered by their indices. The rationale behind this is that after agreeing on the composition of a specific batch, it is very easy for all processes to compute the same ordering, for example by basing the order on a function involving the hashes of the units⁷.
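As a rough sketch: once all processes agree on the composition of a batch, sorting its units by any deterministic function of their content (here, naively, their hashes) yields the same order everywhere. As footnote 7 notes, the tiebreaker actually used in Aleph is more involved than this.

```python
# Minimal sketch of deterministic in-batch ordering: every process holds the
# same set of units, so sorting by a function of their content (naively, the
# hash) gives the same result everywhere. Aleph's real tiebreaker is more
# involved (see footnote 7).

def order_batch(batch_units):
    return sorted(batch_units, key=lambda unit: unit["hash"])
```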

To divide all units into batches, our protocol chooses a few of them as timing units to be used as universal timestamps. After choosing the timing units, the construction of batches proceeds in the following way: for a given unit U, its batch index equals the lowest index k such that U is under the k-th timing unit. This definition has the benefit of finiteness in the sense that once all timing units with indices up to k are chosen, no new unit can be added to batches with indices up to k. This stems from the fact that new units can be added only ABOVE the old ones and not below.
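A minimal sketch of this batch assignment, assuming a predicate is_under(u, v) that answers whether u is below v in the poset (for instance, the reachability check sketched earlier):

```python
# Minimal sketch of assigning batch indices: the batch index of a unit is the
# lowest k such that the unit is under the k-th timing unit. Because new units
# can only appear above existing ones, old batches can never grow.

def batch_index(unit, timing_units, is_under):
    """is_under(u, v): True if u is below v in the poset (see the earlier sketch)."""
    for k, timing_unit in enumerate(timing_units):
        if is_under(unit, timing_unit):
            return k
    return None   # not yet assigned; a later timing unit will cover it
```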

How do we choose the timing units? When the poset is constructed, we define a level of a unit to be roughly its distance from the genesis unit (technical details omitted) and call the first unit of every process produced on each level a prime unit. Then, for every process, we run a binary decision based on this question: “Is the prime unit produced by this process on level k commonly known?” Again, we omit some technical details, but the BA is performed in a way that guarantees a positive outcome for at least two-thirds of all prime units. When choosing one of the prime units out of all possible candidates that have a positive outcome from BA, we use the precomputed ordering of all processes, a different one for every level, and choose the first one according to this ordering.
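A minimal sketch of this final choice, assuming the binary agreement outcomes and the per-level permutation of processes are already given (both are produced by machinery omitted here):

```python
# Minimal sketch of picking the timing unit for a level: walk the precomputed
# (level-dependent) ordering of processes and take the first prime unit whose
# binary agreement ended with a positive outcome.

def choose_timing_unit(level, prime_units, ba_outcome, process_order):
    """prime_units: process -> its prime unit on `level`;
    ba_outcome: process -> True/False ("is its prime unit commonly known?");
    process_order: the precomputed permutation of processes for `level`."""
    for process in process_order:
        if ba_outcome.get(process, False):
            return prime_units[process]
    return None   # cannot happen: BA guarantees a positive outcome for >= 2/3 of prime units
```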

Wary readers may now be asking themselves: do we really need all this trouble? Why not simply use poset levels to divide units into batches? The reasoning is simple: levels are never final, because any process can add a unit to an existing level even if there are already higher levels in the poset.

To sum up, our reduction consists of three steps:

  1. Reduce the atomic broadcast problem to a “common batch division.”
  2. Further, reduce it to a common subset problem (choice of timing units).
  3. Reduce it to a simple binary agreement.

Reuse

An important feature of these reductions is that they do not require any additional communication besides the synchronization of Merkle posets, which is integral to the protocol anyway, since synchronizing these structures is how transactions propagate among processes. The last step is the design of an appropriate BA algorithm. We use a modified and adapted version of ABBA (Asynchronous Byzantine Binary Agreement)⁸ due to the ease of incorporating it into the poset structure and its optimal properties in terms of network assumptions (asynchronicity and resilience to Byzantine faults among fewer than one-third of the processes). Since the reductions are done on the Merkle poset without any additional assumptions, we simply inherit these properties, thus automatically placing us among a small number of protocols that can make this claim and justify it with proofs.

Recycle

In conclusion, by using several reductions and locally simulating the ABBA protocol, we obtain a fully asynchronous and BFT-secure protocol for atomic broadcast. This could be accomplished without the whole Merkle posets trick since there are protocols such as Honey Badger BFT already doing precisely this. Why bother with DAGs and all this poset nonsense?

The key advantage of Merkle posets is their ability to simulate consensus algorithms locally: arbitrarily many instances of such an algorithm can be run without the need for any additional data, by “recycling the agreement.”

This alone is a huge bandwidth saver, and it allows us to deploy SNAP (the Simplified Non-arranging Agreement Protocol), our homebrew protocol for reliable broadcast, which allows for quick and secure token transactions.

There is also an unforeseen benefit. In traditional asynchronous protocols, a chance always exists that a given legitimate transaction is broadcast too slowly, and the agreement terminates with a negative decision, i.e., it is not processed. In such cases, this transaction and all the associated communication costs are essentially wasted. If the transaction is to be validated, it needs to be resent, and the underlying protocol for reaching agreement needs to be run again. In essence, this duplication uses up costly resources and reduces efficiency.

In contrast, when using Merkle posets, if a transaction is added to the poset, it will never need to be added again. Even if the first instance of a given protocol fails to validate it (i.e., in our example, it fails to be included in a given batch), the agreement can be run again later without any bandwidth cost. Hence, for protocols using Merkle posets, nothing is ever lost, and everything is recycled.

Concluding Remarks

It is evident that the throughput of traditional blockchains is not enough to sustain a large number of future applications that will be built on top of some DLT systems. Hence, it is expected that, in the upcoming years, we will see even more solutions focusing on overcoming this deficiency. Current research in the crypto space is in developmental stages, and we believe the only way to avoid reinventing the wheel is to ground it in classical works. Moreover, we believe that the quest for a full understanding of the implications of using classical protocols on Merkle posets has only just begun.


Sources:

  1. https://bitslog.files.wordpress.com/2015/09/dagcoin-v41.pdf
  2. For example, in https://eprint.iacr.org/2016/454.pdf, Pass, et al. prove that there is an intrinsic trade-off between validation time and security for all blockchain-based protocols.
  3. Lamport, L. (1978). Time, Clocks, and the Ordering of Events in a Distributed System. Commun. ACM. 21. 558–565.
  4. Examples of Merkle posets are IOTA’s Tangle, Fantom’s Opera, Hashgraph, and Blockmania’s “block DAG.”
  5. https://dev.hashgraph.com/docs/hg101/gossip-about-gossip/. The actual difference between Hashgraph’s technology and all the others in the field is that in hashgraph the sole fact of two nodes connecting with each other is recorded as a separate ‘event’, but this difference is subtle and not really exploited.
  6. Danezis, G. & Hrycyszyn, D. Blockmania: from Block DAGs to Consensus. https://arxiv.org/pdf/1809.01620.pdf
  7. Such a solution would have an obvious problem of ordering transactions with low hashes earlier, which could be exploited by some processes, so actually, we use a bit more complex tiebreaker, but it’s just a technical detail.
  8. Cachin, C., Kursawe, K., Shoup, V. (2000). Random Oracles in Constantinople: Practical Asynchronous Byzantine Agreement Using Cryptography. Journal of Cryptology. 18.

