Mastering Blockchain

Algorithms

In this section, we will discuss the key algorithms in detail. We'll be looking at the two main types of fault-tolerant algorithms, CFT and BFT.

CFT algorithms

We'll begin by looking at some algorithms that solve the consensus problem with crash fault tolerance. One of the most fundamental algorithms in this space is Paxos.

Paxos

Leslie Lamport developed Paxos. It is the most fundamental distributed consensus algorithm, allowing consensus over a value under unreliable communications. In other words, Paxos is used to build a reliable system that works correctly, even in the presence of faults.

Paxos was proposed first in 1989 and then later, more formally, in 1998 in the following paper:

Lamport, L., 1998. The part-time parliament. ACM Transactions on Computer Systems (TOCS), 16(2), pp.133-169.

The paper is available here:

https://lamport.azurewebsites.net/pubs/lamport-paxos.pdf.

Note that Paxos works under an asynchronous network model and supports the handling of only benign failures. This is not a Byzantine fault-tolerant protocol. However, later, a variant of Paxos was developed that provides Byzantine fault tolerance. The original paper in which a Byzantine fault-tolerant version of Paxos is described is available here:

Lamport, L., 2011, September. Byzantizing Paxos by refinement. International Symposium on Distributed Computing (pp. 211-224). Springer, Berlin, Heidelberg.

A link to the paper is available here:

http://lamport.azurewebsites.net/pubs/web-byzpaxos.pdf

Paxos makes use of 2F + 1 processes to tolerate F faults in a network where processes can crash, that is, experience benign failures. A benign failure means that either a message is lost or a process stops. In other words, Paxos can tolerate one crash failure in a three-node network.

Paxos is a two-phase protocol. The first phase is called the prepare phase, and the second phase is called the accept phase. Paxos has proposers and acceptors as participants, where proposers are the replicas or nodes that propose values and acceptors are the nodes that accept them.

How Paxos works

The Paxos protocol assumes an asynchronous message-passing network in which fewer than half of the processes may experience crash faults. As usual, the critical properties of the Paxos consensus algorithm are safety and liveness. Under safety, we have:

  • Agreement, which specifies that no two different values are agreed on. In other words, no two different learners learn different values.
  • Validity, which means that only the proposed values are decided. In other words, the values chosen or learned must have been proposed by a processor.

Under liveness, we have:

  • Termination, which means that, eventually, the protocol is able to decide and terminate. In other words, if a value has been chosen, then eventually learners will learn it.

Processes can assume different roles, which are listed as follows:

  • Proposers, elected leader(s) that can propose a new value to be decided.
  • Acceptors, which participate in the protocol as a means to provide a majority decision.
  • Learners, which are nodes that just observe the decision process and value.

A single process in a Paxos network can assume all three roles.

The key idea behind Paxos is that the proposer node proposes a value, which is considered final only if a majority of the acceptor nodes accept it. The learner nodes also learn this final decision.

Paxos can be seen as a protocol that is quite similar to the two-phase commit protocol. Two-phase commit (2PC) is a standard atomic commitment protocol to ensure that transactions are committed in distributed databases only if all participants agree to commit. Even if a single node cannot agree to commit the transaction, it is fully rolled back.

Similarly, in Paxos, in the first phase, the proposer sends a proposal to the acceptors; if and when they accept the proposal, the proposer broadcasts a request to commit to the acceptors. Once the acceptors commit and report back to the proposer, the proposal is considered final, and the protocol concludes. In contrast with the two-phase commit, Paxos introduces ordering (sequencing to achieve total order) of the proposals and majority-based acceptance of the proposals instead of expecting all nodes to agree (to allow progress even if some nodes fail). Both of these improvements contribute toward ensuring the safety and liveness of the Paxos algorithm.

An excellent explanation of the two-phase commit is available here:

https://en.wikipedia.org/wiki/Two-phase_commit_protocol

We'll now describe how the Paxos protocol works step by step:

  1. The proposer proposes a value by broadcasting a message, <prepare(n)>, to all acceptors.
  2. Acceptors respond with an acknowledgment message if proposal n is the highest that the acceptor has responded to so far. The acknowledgment message <ack(n, v, s)> consists of three variables where n is the proposal number, v is the proposal value of the highest numbered proposal the acceptor has accepted so far, and s is the sequence number of the highest proposal accepted by the acceptor so far. This is where acceptors agree to commit the proposed value. The proposer now waits to receive acknowledgment messages from the majority of the acceptors indicating the chosen value.
  3. If the majority is received, the proposer sends out the "accept" message <accept(n, v)> to the acceptors.
  4. If the majority of the acceptors accept the proposed value (now the "accept" message), then it is decided: that is, agreement is achieved.
  5. Finally, in the learning phase, acceptors broadcast the "accepted" message <accepted(n, v)> to the proposer. This phase is necessary to disseminate which proposal has been finally accepted. The proposer then informs all other learners of the decided value. Alternatively, learners can learn the decided value via a message that contains the accepted value (decision value) multicast by acceptors.
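To make these steps concrete, the following is a minimal, illustrative Python sketch of single-decree Paxos. The class and function names (Acceptor, propose) and the direct method calls standing in for network messages are assumptions made purely for illustration; real implementations add proposal-number management, retries, and learner dissemination.

```python
# Minimal single-decree Paxos sketch (illustrative only, not production code).
# Direct method calls stand in for network messages between proposer and acceptors.

class Acceptor:
    def __init__(self):
        self.promised_n = -1     # highest proposal number promised so far
        self.accepted_n = -1     # highest proposal number accepted so far
        self.accepted_v = None   # value of the highest-numbered accepted proposal

    def prepare(self, n):
        """Phase 1b: promise not to accept proposals numbered lower than n."""
        if n > self.promised_n:
            self.promised_n = n
            return ("ack", self.accepted_n, self.accepted_v)
        return None  # ignore proposals with lower numbers

    def accept(self, n, v):
        """Phase 2b: accept the proposal unless a higher promise was made."""
        if n >= self.promised_n:
            self.promised_n, self.accepted_n, self.accepted_v = n, n, v
            return ("accepted", n, v)
        return None


def propose(acceptors, n, value):
    """Phases 1a and 2a: one attempt by a proposer to get a value chosen."""
    majority = len(acceptors) // 2 + 1

    # Phase 1: prepare / promise
    acks = [m for m in (a.prepare(n) for a in acceptors) if m is not None]
    if len(acks) < majority:
        return None  # no majority of promises; retry later with a higher n

    # If any acceptor has already accepted a value, that value must be proposed.
    highest = max(acks, key=lambda m: m[1])
    if highest[2] is not None:
        value = highest[2]

    # Phase 2: accept
    accepted = [m for m in (a.accept(n, value) for a in acceptors) if m is not None]
    if len(accepted) >= majority:
        return value  # chosen; learners can now learn this value
    return None


acceptors = [Acceptor() for _ in range(3)]   # three nodes tolerate one crash fault
print(propose(acceptors, n=1, value="X"))    # -> X
```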

We can visualize this process in the following diagram:

Figure 5.1: How Paxos works

Let's move on to see how Paxos achieves the much-desired properties of safety and liveness.

How Paxos achieves safety and liveness

A natural question arises about how Paxos ensures its safety and liveness guarantees. Paxos, at its core, is quite simple, yet it achieves all these properties efficiently. The actual proofs for the correctness of Paxos are quite in-depth and are not the subject of this chapter. However, the intuition behind each property is presented as follows:

  • Agreement is ensured by enforcing that only one proposal can win votes from a majority of the acceptors.
  • Validity is ensured by enforcing that only the genuine proposals are decided. In other words, no value is committed unless it is proposed in the proposal message first.
  • Liveness or termination is guaranteed by ensuring that at some point during the protocol execution, eventually there is a period during which there is only one fault-free proposer.

For detailed proofs and correctness analyses, refer to the following papers:

Lamport, L., 1998. The part-time parliament. ACM Transactions on Computer Systems (TOCS), 16(2), pp.133-169.

https://lamport.azurewebsites.net/pubs/lamport-paxos.pdf

Lamport, L., 2006. Fast Paxos. Distributed Computing, 19(2), pp.79-103.

https://www.microsoft.com/en-us/research/publication/fast-paxos/

In summary, the key points to remember about Paxos are that:

  • First, a proposer suggests a value with the aim that acceptors achieve agreement on it.
  • The decided value is a result of majority consensus among the acceptors and is finally learned by the learners.

There are many other protocols that have emerged from basic Paxos, such as Multi-Paxos, Fast Paxos, and Cheap Paxos.

Even though the Paxos algorithm is quite simple at its core, it is seen as challenging to understand, and many academic papers have been written to explain it. This slight problem, however, has not prevented it from being implemented in many production networks, such as Google's Spanner, as it has proven to be the most efficient protocol to solve the consensus problem.

Nevertheless, there have been attempts to create alternative easy-to-understand algorithms. Raft is such an attempt to create an easy-to-understand CFT algorithm.

Raft

The Raft protocol is a CFT consensus mechanism developed by Diego Ongaro and John Ousterhout at Stanford University. In Raft, the leader is always assumed to be honest.

At a conceptual level, it is a replicated log for a replicated state machine (RSM) where a unique leader is elected every "term" (time division) whose log is replicated to all follower nodes.

Raft is composed of three sub-problems:

  • Leader election (a new leader election in case the existing one fails)
  • Log replication (leader-to-follower log synchronization)
  • Safety (no conflicting log entries at the same index across servers)

The Raft protocol ensures election safety, leader append only, log matching, leader completeness, and state machine safety.

Each server in Raft can be in one of three states: follower, candidate, or leader.

The protocol ensures election safety (that is, only one winner each election term) and liveness (that is, some candidate must eventually win).

How Raft works

The following steps will describe how the Raft protocol functions. At a fundamental level, the protocol is quite simple and can be described simply by the following sequence:

Node starts up –> Leader election –> Log replication

  1. First, the node starts up.
  2. After this, the leader election process starts. Once a node is elected as leader, all changes go through that leader.
  3. Each change is entered into the node's log.
  4. A log entry remains uncommitted until it has been replicated to the follower nodes and the leader has received write confirmation votes from a majority of the nodes; it is then committed locally.
  5. The leader notifies the followers regarding the committed entry.
  6. Once this process ends, agreement is achieved.

The state transition of the Raft algorithm can be visualized in the following diagram:

Figure 5.2: Raft state transition

We saw earlier that data is eventually replicated across all nodes in a consensus mechanism. In Raft, the log (data) is eventually replicated across all nodes. We describe this process of log replication next.

Log replication

Log replication logic can be visualized in the following diagram. The aim of log replication is to synchronize nodes with each other.

Figure 5.3: Log replication mechanism

Log replication is a simple mechanism. As shown in the preceding diagram, the leader is responsible for log replication. Once the leader has a new entry in its log, it sends out the requests to replicate to the follower nodes. When the leader receives enough confirmation votes back from the follower nodes indicating that the replicate request has been accepted and processed by the followers, the leader commits that entry to its local state machine. At this stage, the entry is considered committed.
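The commit rule just described can be sketched in a few lines of Python. The names used here (Leader, Follower, replicate) are hypothetical simplifications and do not correspond to any particular Raft implementation; terms, heartbeats, and log consistency checks are omitted.

```python
# Simplified Raft-style log replication sketch (illustrative only).
# Terms, heartbeats, and log consistency checks are omitted.

class Follower:
    def __init__(self):
        self.log = []

    def append_entries(self, entry):
        """Follower appends the entry and acknowledges the leader."""
        self.log.append(entry)
        return True


class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.log = []
        self.commit_index = -1

    def replicate(self, entry):
        """Append locally, replicate to followers, and commit on a majority ack."""
        self.log.append(entry)
        acks = 1  # the leader counts its own copy
        for follower in self.followers:
            if follower.append_entries(entry):
                acks += 1
        cluster_size = len(self.followers) + 1
        if acks > cluster_size // 2:
            self.commit_index = len(self.log) - 1  # the entry is now committed
            return True
        return False


leader = Leader([Follower(), Follower()])
print(leader.replicate({"op": "set", "key": "x", "value": 1}))  # -> True
```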

With this, our discussion on CFT algorithms is complete. Now we'll introduce Byzantine fault-tolerant algorithms, which have been a research area for many years in distributed computing.

BFT algorithms

We described the formulation of the Byzantine generals problem at the start of this chapter. In this section, we'll introduce the mechanisms that were developed to solve the Byzantine generals (consensus in the presence of faults) problem.

Practical Byzantine Fault Tolerance

Practical Byzantine Fault Tolerance (PBFT) was developed in 1999 by Miguel Castro and Barbara Liskov. PBFT, as the name suggests, is a protocol developed to provide consensus in the presence of Byzantine faults. Before PBFT, Byzantine fault tolerance was considered impractical. With PBFT, it was demonstrated for the first time that practical Byzantine fault tolerance is possible.

PBFT comprises three sub-protocols called normal operation, view change, and checkpointing.

The normal operation sub-protocol refers to a scheme that is executed when everything is running normally and there are no errors in the system. View change is a sub-protocol that runs when a faulty leader node is detected in the system. Checkpointing is another sub-protocol, which is used to discard old data from the system.

The PBFT protocol comprises three phases or steps. These phases run in a sequence to achieve consensus. These phases are pre-prepare, prepare, and commit, which we will cover in detail shortly.

The protocol runs in rounds where, in each round, an elected leader node, called the primary node, handles the communication with the client. In each round, the protocol progresses through the three previously mentioned phases. The participants in the PBFT protocol are called replicas, where one of the replicas becomes the primary as a leader in each round, and the rest of the nodes act as backups. PBFT is based on the SMR protocol introduced earlier. Here, each node maintains a local log, and the logs are kept in sync with each other via the consensus protocol: that is, PBFT.

As we saw earlier, in order to tolerate Byzantine faults, the minimum number of nodes required is N = 3F + 1, where N is the total number of nodes and F is the maximum number of faulty nodes that can be tolerated. PBFT ensures Byzantine fault tolerance as long as the number of nodes in the system satisfies N >= 3F + 1.
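This sizing rule can be expressed as a small helper, sketched below, that computes how many Byzantine nodes a given network size can tolerate and the corresponding 2F + 1 quorum size used throughout the rest of this section.

```python
# Byzantine fault tolerance sizing: a network of N nodes tolerates F faults
# only if N >= 3F + 1.

def max_faulty(n_nodes: int) -> int:
    """Maximum number of Byzantine nodes a network of n_nodes can tolerate."""
    return (n_nodes - 1) // 3

def quorum(n_nodes: int) -> int:
    """Quorum size (2F + 1) used for prepared and committed certificates."""
    return 2 * max_faulty(n_nodes) + 1

print(max_faulty(4), quorum(4))   # -> 1 3 (four nodes tolerate one faulty node)
print(max_faulty(7), quorum(7))   # -> 2 5
```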

We will now look at how the PBFT protocol works.

In summary, when a client sends a request to a primary, the protocol initiates a sequence of operations between replicas, which eventually leads to consensus and a reply back to the client. These sequences of operations are divided into different phases:

  • Pre-prepare
  • Prepare
  • Commit

In addition, each replica maintains a local state comprising three main elements:

  • Service state
  • A message log
  • A number representing that replica's current view

Now we'll discuss the phases mentioned above one by one, starting with pre-prepare.

Pre-prepare:

This is the first phase in the protocol, where the primary node, or primary, receives a request from the client. The primary node assigns a sequence number to the request. It then sends the pre-prepare message with the request to all backup replicas.

When the pre-prepare message is received by a backup replica, it checks a number of things to ensure the validity of the message:

  • First, whether the digital signature is valid.
  • After this, whether the current view number is valid.
  • Then, that the sequence number of the operation's request message is valid.
  • Finally, whether the digest/hash of the operation's request message is valid.

If all of these elements are valid, then the backup replica accepts the message. After accepting the message, it updates its local state and progresses toward the prepare phase.

Prepare:

A prepare message is sent by each backup to all other replicas in the system. Each backup waits for at least 2F + 1 prepare messages to be received from other replicas. They also check whether the prepare message contains the same view number, sequence number, and message digest values. If all these checks pass, then the replica updates its local state and progresses toward the commit phase.

Commit:

In the commit phase, each replica sends a commit message to all other replicas in the network. As in the prepare phase, replicas wait for 2F + 1 commit messages to arrive from other replicas. The replicas also check the view number, sequence number, and message digest values. If these are valid for 2F + 1 commit messages received from other replicas, then the replica executes the request, produces a result, and finally, updates its state to reflect a commit. If there are already some messages queued up, the replica will execute those requests first before processing the latest sequence numbers. Finally, the replica sends the result to the client in a reply message.

The client accepts the result only after receiving 2F + 1 reply messages containing the same result.

Now, let's move on and look at how PBFT works in some more detail.

How PBFT works

At a high level, the protocol works as follows:

  1. A client sends a request to invoke a service operation to the primary.
  2. The primary multicasts the request to the backups.
  3. Replicas execute the request and send a reply to the client.
  4. The client waits for replies from different replicas with the same result; this is the result of the operation.

Now we'll describe each phase of the protocol (pre-prepare, prepare, and commit) in more formal terms.

The pre-prepare sub-protocol algorithm:

  1. Accepts a request from the client.
  2. Assigns the next sequence number.
  3. Sends the pre-prepare message to all backup replicas.

The prepare sub-protocol algorithm:

  1. Accepts the pre-prepare message, provided the backup has not already accepted a pre-prepare message for the same view and sequence number.
  2. Sends the prepare message to all replicas.

The commit sub-protocol algorithm:

  1. The replica waits for 2F prepare messages with the same view, sequence, and request.
  2. Sends a commit message to all replicas.
  3. Waits until 2F + 1 valid commit messages have arrived and been accepted.
  4. Executes the received request.
  5. Sends a reply containing the execution result to the client.
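The following Python sketch ties these sub-protocols together by showing how a single replica might count matching prepare and commit messages before executing a request. It is a deliberately simplified illustration with hypothetical handler names; signature checks, digest verification, view changes, and checkpointing are all omitted.

```python
# Simplified PBFT replica phase tracking (illustrative only).
# Signature checks, digest verification, view changes, and checkpointing are omitted.

from collections import defaultdict

class Replica:
    def __init__(self, n_nodes):
        self.f = (n_nodes - 1) // 3
        self.prepares = defaultdict(set)   # (view, seq, digest) -> senders
        self.commits = defaultdict(set)
        self.executed = []

    def on_pre_prepare(self, view, seq, digest):
        """Accept the primary's ordering, then broadcast our own PREPARE."""
        return ("PREPARE", view, seq, digest)

    def on_prepare(self, view, seq, digest, sender):
        """Collect matching PREPAREs; 2F of them form a prepared certificate."""
        self.prepares[(view, seq, digest)].add(sender)
        if len(self.prepares[(view, seq, digest)]) >= 2 * self.f:
            return ("COMMIT", view, seq, digest)
        return None

    def on_commit(self, view, seq, digest, sender):
        """Collect matching COMMITs; 2F + 1 of them allow execution and a reply."""
        self.commits[(view, seq, digest)].add(sender)
        if len(self.commits[(view, seq, digest)]) >= 2 * self.f + 1:
            self.executed.append((seq, digest))
            return ("REPLY", seq, digest)
        return None
```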

In summary, each phase is responsible for a critical part of the consensus mechanism, and passing through all three phases in sequence is what eventually achieves consensus.

One key point to remember about each phase is listed as follows:

Pre-prepare: This phase assigns a unique sequence number to the request. We can think of it as an orderer.

Prepare: This phase ensures that honest replicas/nodes in the network agree on the total order of requests within a view. Note that the pre-prepare and prepare phases together provide the total order to the messages.

Commit: This phase ensures that honest replicas/nodes in the network agree on the total order of requests across views.

The normal-case operation of the PBFT protocol can be visualized as follows:

Figure 5.4: PBFT protocol

During the execution of the protocol, the integrity of the messages and protocol operations must be maintained to provide an adequate level of security and assurance. This is maintained by the use of digital signatures. In addition, certificates are used to ensure the adequate majority of participants (nodes).

Note that these certificates are not the usual digital certificates commonly used in PKI and IT infrastructures to secure assets such as servers.

We'll describe the concept of certificates next.

Certificates in PBFT:

Certificates in the PBFT protocol are used to demonstrate that at least 2F + 1 nodes have stored the required information. In other words, a collection of 2F + 1 messages of a particular type is considered a certificate. For example, if a node has collected 2F + 1 prepare messages, then combining them with the corresponding pre-prepare message with the same view, sequence, and request represents a certificate, called a prepared certificate. Similarly, a collection of 2F + 1 commit messages is called a commit certificate.

There are also a number of variables that the PBFT protocol maintains in order to execute the algorithm. These variables and their meanings are listed as follows:

We can now look at the types of messages and their formats, which become quite easy to understand if we refer to the preceding table of variables.

Types of messages:

The PBFT protocol works by exchanging several messages. A list of these messages is presented as follows with their format and direction.

The following table contains message types and relevant details:

Let's look at some specific message types that are exchanged during the PBFT protocol.

View-change:

View-change occurs when a primary is suspected faulty. This phase is required to ensure protocol progress. With the view change sub-protocol, a new primary is selected, which then starts normal mode operation again. The new primary is selected in a round-robin fashion.

When a backup replica receives a request, it tries to execute it after validating the message; if, for some reason, it does not execute it for a while, the replica times out and initiates the view-change sub-protocol.

In the view change protocol, the replica stops accepting messages related to the current view and updates its state to view-change. The only messages it can receive in this state are checkpoint messages, view-change messages, and new-view messages. After that, it sends a view-change message with the next view number to all replicas.

When this message arrives at the new primary, the primary waits for at least 2F view-change messages for the next view. If at least 2F view-change messages are received, it broadcasts a new-view message to all replicas and progresses toward running normal operation mode once again.

When other replicas receive a new-view message, they update their local state accordingly and start normal operation mode.

The algorithm for the view-change protocol is shown as follows:

  1. Stop accepting pre-prepare, prepare, and commit messages for the current view.
  2. Create a set of all the certificates prepared so far.
  3. Broadcast a view-change message with the next view number and a set of all the prepared certificates to all replicas.
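A condensed sketch of the view-change trigger and the new primary's 2F threshold is shown below. The class and field names are hypothetical, and the round-robin primary selection mirrors the description above; certificate handling and the new-view message contents are simplified away.

```python
# Simplified PBFT view-change sketch (illustrative only).

class ViewChangeState:
    def __init__(self, n_nodes, replica_id):
        self.f = (n_nodes - 1) // 3
        self.n_nodes = n_nodes
        self.replica_id = replica_id
        self.view = 0
        self.view_change_msgs = {}   # next_view -> set of sender ids

    def on_timeout(self, prepared_certificates):
        """Primary suspected faulty: stop the current view and request the next one."""
        next_view = self.view + 1
        return ("VIEW-CHANGE", next_view, prepared_certificates, self.replica_id)

    def on_view_change(self, next_view, sender):
        """The new primary collects view-change messages for the next view."""
        senders = self.view_change_msgs.setdefault(next_view, set())
        senders.add(sender)
        new_primary = next_view % self.n_nodes   # round-robin primary selection
        if new_primary == self.replica_id and len(senders) >= 2 * self.f:
            self.view = next_view
            return ("NEW-VIEW", next_view)       # resume normal operation
        return None
```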

The view change protocol can be visualized in the following diagram:

Figure 5.5: View-change sub-protocol

The view-change sub-protocol is a mechanism to achieve liveness. Three smart techniques are used in this sub-protocol to ensure that, eventually, there is a time when the requested operation executes:

  1. A replica that has broadcast the view-change message waits for 2F+1 view-change messages and then starts its timer. If the timer expires before the node receives a new-view message for the next view, the node will start the view change for the next sequence but will increase its timeout value. This will also occur if the replica times out before executing the new unique request in the new view.
  2. As soon as the replica receives F+1 view-change messages for a view number greater than its current view, the replica will send the view-change message for the smallest view it knows of in the set so that the next view change does not occur too late. This is also the case even if the timer has not expired; it will still send the view change for the smallest view.
  3. As the view change will only occur if at least F+1 replicas have sent the view-change message, this mechanism ensures that a faulty primary cannot indefinitely stop progress by successively requesting view changes.

Next, let's look in more detail at the checkpoint sub-protocol, another important PBFT process.

The checkpoint sub-protocol:

Checkpointing is a crucial sub-protocol. It is used to discard old messages in the log of all replicas. With this, the replicas agree on a stable checkpoint that provides a snapshot of the global state at a certain point in time. This is a periodic process carried out by each replica after executing the request and marking that as a checkpoint in its log. A variable called low watermark (in PBFT terminology) is used to record the sequence number of the last stable checkpoint. This checkpoint is then broadcast to other nodes. As soon as a replica has at least 2F+1 checkpoint messages, it saves these messages as proof of a stable checkpoint. It discards all previous pre-prepare, prepare, and commit messages from its logs.
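The essence of checkpointing, collecting 2F + 1 matching checkpoint messages and then discarding log entries at or below the new low watermark, can be sketched as follows. The names and data layout are illustrative assumptions, not the actual PBFT implementation.

```python
# Simplified PBFT checkpointing sketch (illustrative only).

class CheckpointLog:
    def __init__(self, n_nodes):
        self.f = (n_nodes - 1) // 3
        self.low_watermark = 0       # sequence number of the last stable checkpoint
        self.messages = []           # (seq, message) protocol entries kept in the log
        self.checkpoint_votes = {}   # seq -> set of replicas that checkpointed it

    def on_checkpoint(self, seq, sender):
        """Collect checkpoint messages; 2F + 1 of them make the checkpoint stable."""
        votes = self.checkpoint_votes.setdefault(seq, set())
        votes.add(sender)
        if len(votes) >= 2 * self.f + 1 and seq > self.low_watermark:
            self.low_watermark = seq
            # Discard pre-prepare, prepare, and commit messages up to the checkpoint.
            self.messages = [(s, m) for (s, m) in self.messages if s > seq]
```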

PBFT advantages and disadvantages:

PBFT is indeed a revolutionary protocol that has opened up a new research field of practical Byzantine fault-tolerant protocols. The original PBFT has some notable strengths, but it also has some limitations. We'll discuss the most commonly cited strengths and limitations in the following section.

Strengths:

PBFT provides immediate and deterministic transaction finality. This is in contrast with the PoW protocol, where a number of confirmations are required to finalize a transaction with high probability.

PBFT is also energy efficient as compared to PoW, which consumes a tremendous amount of electricity.

Weaknesses:

PBFT is not very scalable. This is why it is more suitable for consortium networks than for public blockchains. It is, however, much faster than PoW protocols.

Sybil attacks can be carried out on a PBFT network, where a single entity controls many identities in order to influence the voting and, subsequently, the decision. However, the fix is trivial; in fact, such an attack is not very practical in consortium networks, where all identities on the network are known. The problem can also be mitigated simply by increasing the number of nodes in the network.

PBFT in blockchain:

In the traditional client-server model, PBFT works well; however, in the case of blockchain, directly implementing PBFT in its original state may not work correctly. This is because PBFT's original design was not developed for blockchain. In the following section, we present a few changes that are required to create a blockchain version of PBFT. This is not an exhaustive list; however, it does provide a baseline of requirements.

This research resulted in IBFT and the PBFT implementation in Sawtooth, among other blockchains. In all these scenarios, some changes have been made to the core protocol to ensure compatibility with the blockchain environment.

PBFT has been implemented in Hyperledger Sawtooth. More details on this implementation can be found here: https://github.com/hyperledger/sawtooth-rfcs/blob/master/text/0019-pbft-consensus.md

We will now introduce an algorithm called IBFT, which was inspired by PBFT. Another algorithm that has been inspired by PBFT and DLS is Tendermint, which we will also present shortly.

Istanbul Byzantine Fault Tolerance

IBFT was developed by AMIS Technologies as a variant of PBFT suitable for blockchain networks. It was presented in EIP 650 for the Ethereum blockchain.

The differences between PBFT and IBFT

Let's first discuss the primary differences between the PBFT and IBFT protocols. They are as follows:

  • There is no distinctive concept of a client in IBFT. Instead, the proposer can be seen as a client, and in fact, all validators can be considered clients.
  • There is a concept of dynamic validators, which is in contrast with the original PBFT, where the nodes are static. However, in IBFT, the validators can be voted in and out as required.
  • There are two types of nodes in an IBFT network: regular nodes and validators. Regular nodes are synchronized with the blockchain without participating in the IBFT consensus process. In contrast, validators are the nodes that participate in the IBFT consensus process.
  • IBFT relies on a more straightforward structure of view-change (round change) messages as compared to PBFT.
  • In contrast with PBFT, in IBFT there is no concrete concept of checkpoints. However, each block can be considered an indicator of the progress so far (the chain height).
  • There is no concept of garbage collection in IBFT.

Now that we've covered the main points of comparison between the two BFT algorithms, we'll examine how the IBFT protocol runs, and its various phases.

How IBFT works

IBFT assumes a network model under which it is supposed to run. The model is composed of at least 3F+1 processes (standard BFT assumption), a partially synchronous message-passing network, and sound cryptography. By sound cryptography, it is assumed that digital signatures and relevant cryptographic protocols such as cryptographic hash functions are secure.

The IBFT protocol runs in rounds. It has three phases: pre-prepare, prepare, and commit. In each round, usually, a new leader is elected based on a round-robin mechanism. The following flowchart shows how the IBFT protocol works:

Figure 5.6: IBFT flow chart

In the preceding diagram, the flow of IBFT is shown. We'll discuss this process step by step in the following section:

  1. The protocol starts with a new round. In the new round, the selected proposer broadcasts a proposal (block) as a pre-prepare message.
  2. The nodes that receive this pre-prepare message validate the message and accept it if it is a valid message. The nodes also then set their state to pre-prepared.
  3. At this stage, if a timeout occurs, or a proposal is seen as invalid by the nodes, they will initiate a round change. The normal process then begins again with a proposer proposing a block.
  4. Nodes then broadcast the prepare message and wait for 2F+1 prepare messages to be received from other nodes. After receiving 2F+1 prepare messages from other nodes, they set their state to prepared. If the nodes do not receive 2F+1 messages in time, they time out, and the round change process starts.
  5. Finally, the nodes broadcast a commit message and wait for 2F+1 messages to arrive from other nodes. If they are received, then the state is set to committed, otherwise, timeout occurs and the round change process starts.
  6. Once committed, block insertion is tried. If it succeeds, the protocol proceeds to the final committed state and, eventually, a new round starts. If insertion fails for some reason, the round change process triggers. Again, nodes wait for 2F+1 round change messages, and if the threshold of the messages is received, then round change occurs.
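This round flow can be condensed into a small per-validator state machine sketch in Python. The helper callables validate_block and insert_block are placeholders for real chain logic, and round-change handling is reduced to returning a "ROUND_CHANGE" marker; none of the names below come from an actual IBFT implementation.

```python
# Simplified IBFT round sketch for a single validator (illustrative only).
# validate_block and insert_block are placeholder callables for real chain logic.

def quorum(n_validators):
    f = (n_validators - 1) // 3
    return 2 * f + 1

class IBFTValidator:
    def __init__(self, n_validators):
        self.n = n_validators
        self.state = "NEW_ROUND"
        self.prepares = set()
        self.commits = set()

    def on_pre_prepare(self, block, validate_block):
        """Validate the proposer's block; broadcast PREPARE or ask for a round change."""
        if self.state == "NEW_ROUND" and validate_block(block):
            self.state = "PRE-PREPARED"
            return "PREPARE"
        return "ROUND_CHANGE"

    def on_prepare(self, sender):
        """Collect PREPAREs; 2F + 1 of them move the validator to the prepared state."""
        self.prepares.add(sender)
        if self.state == "PRE-PREPARED" and len(self.prepares) >= quorum(self.n):
            self.state = "PREPARED"
            return "COMMIT"
        return None

    def on_commit(self, sender, block, insert_block):
        """Collect COMMITs; 2F + 1 of them allow the block to be inserted."""
        self.commits.add(sender)
        if self.state == "PREPARED" and len(self.commits) >= quorum(self.n):
            self.state = "COMMITTED"
            if insert_block(block):
                self.state = "FINAL_COMMITTED"
                return "NEW_ROUND"       # start the next round
            return "ROUND_CHANGE"        # insertion failed, trigger a round change
        return None
```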

Now that we've understood the flow of IBFT, let's further explore which states it maintains and how.

Consensus states

IBFT is an SMR algorithm. Each validator maintains a state machine replica in order to reach block consensus, that is, agreement. These states are listed as follows with an explanation:

  • New round: In this state, a new round of the consensus mechanism starts, and the selected proposer sends a new block proposal to other validators. In this state, all other validators wait for the PRE-PREPARE message.
  • Pre-prepared: A validator transitions to this state when it has received a PRE-PREPARE message and broadcasts a PREPARE message to other validators. The validator then waits for 2F + 1 PREPARE or COMMIT messages.
  • Prepared: This state is achieved by a validator when it has received 2F+1 prepare messages and has broadcast the commit messages. The validator then awaits 2F+1 commit messages to arrive from other validators.
  • Committed: The state indicates that a validator has received 2F+1 COMMIT messages. The validator at this stage can insert the proposed block into the blockchain.
  • Final committed: This state is achieved by a validator when the newly committed block is inserted successfully into the blockchain. At this state, the validator is also ready for the next round of consensus.
  • Round change: This state indicates that the validators are waiting for 2F+1 round change messages to arrive for the newly proposed round number.

The IBFT protocol can be visualized using a diagram, similar to PBFT, as follows:

Figure 5.7: IBFT, PBFT-like flow

An additional mechanism that makes IBFT quite appealing is its validator management mechanism. By using this mechanism, validators can be added or removed by voting between members of the network. This is quite a useful feature and provides the right level of flexibility when it comes to managing validators efficiently, instead of manually adding or removing validators from the validator set.

IBFT has been implemented in several blockchains. Sample implementations include Quorum and Celo.

A Quorum implementation is available at the following link:

https://github.com/jpmorganchase/quorum

IBFT, with some variations, has also been implemented in the Celo blockchain, which is available at:

https://github.com/celo-org/celo-blockchain

Several other algorithms have been inspired by PBFT and have emerged as a result of deep interest in blockchain research. One such algorithm is Tendermint.

Tendermint

Tendermint is another variant of PBFT. It was inspired by both the DLS and PBFT protocols. Tendermint also makes use of the SMR approach to providing consensus. As we saw before, state machine replication is a mechanism that allows synchronization between replicas/nodes of the network.

Traditionally, consensus mechanisms ran with a small number of participants, and thus performance and scalability were not big concerns. However, with the advent of blockchain, there is a need to develop algorithms that can work on wide area networks and in asynchronous environments. Research into these areas of distributed computing is not new, but due to the rise of cryptocurrencies and blockchain, interest in these research topics has grown significantly in the last few years.

More details on the DLS algorithm can be found in the original paper:

https://groups.csail.mit.edu/tds/papers/Lynch/jacm88.pdf

Dwork, C., Lynch, N. and Stockmeyer, L., 1988. Consensus in the presence of partial synchrony. Journal of the ACM (JACM), 35(2), pp.288-323

The preceding paper proposes a consensus mechanism named after its authors (DLS: Dwork, Lynch, Stockmeyer). This protocol consists of two initial rounds. This process of rounds with appropriate message exchange ensures that agreement is eventually achieved on a proposed or default value. As long as the number of nodes in the system is more than 3F, this protocol achieves consensus.

The Tendermint protocol also works by running rounds. In each round, a leader is elected, which proposes the next block. Also note that in Tendermint, the round change or view-change process is part of the normal operation, as opposed to PBFT, where view-change only occurs in the event of errors, that is, a suspected faulty leader. Tendermint works similarly to PBFT, where three phases are required to achieve a decision. Once a round is complete, a new round starts with three phases and terminates when a decision is reached. A key innovation in Tendermint is the design of a new termination mechanism. As opposed to other PBFT-like protocols, Tendermint has developed a more straightforward mechanism, which is similar to PBFT-style normal operation. Instead of having two sub-protocols for normal mode and view-change mode (recovery in event of errors), Tendermint terminates without any additional communication costs.

The model under which Tendermint is supposed to run will now be presented. We saw earlier, in the introduction, that each consensus model is studied and developed under a system model with some assumptions about the system. Tendermint is also designed with a system model in mind. Now we'll define and introduce each element of the system model.

  • Processes: A process is the fundamental key participant of the protocol. It is also called a replica (in PBFT traditional literature), a node, or merely a process. Processes can be correct or honest. Processes can also be faulty or Byzantine. Each process possesses some voting power. Also, note that processes are not necessarily connected directly; they are only required to connect loosely or just with their immediate subset of processes/nodes. Processes have a local timer that they use to measure timeout.
  • Network model: The network model is a set of processes that communicate using message passing. In particular, the gossip protocol is used for communication between processes. The standard BFT assumption of N >= 3F + 1 is also taken into consideration. This means that the protocol operates correctly as long as the number of nodes in the network is more than 3F, where F is the number of faulty nodes. This implies that, at a minimum, there have to be four nodes in a network to tolerate Byzantine faults.
  • Timing assumptions: Under the network model, Tendermint assumes a partially synchronous network. This means that there is an unknown bound on the communication delay, but it only applies after an unknown instance of time called global stabilization time or GST.
  • Security and cryptography: It is assumed that the public key cryptography used in the system is secure and the impersonation or spoofing of accounts/identities is not possible. The messages on the network are authenticated and verified via digital signatures. The protocol ignores any messages with an invalid digital signature.
  • State machine replication: To achieve replication among the nodes, the standard SMR mechanism is used. One key observation that is fundamental to the protocol is that SMR ensures that all replicas on the network receive and process the same sequence of requests. As noted in the Tendermint paper, agreement and order are the two properties that ensure this: agreement ensures that all requests are received by all replicas, and order ensures that the sequence in which the replicas receive the requests is the same. Both of these requirements ensure total order in the system. Also, Tendermint ensures that the requests themselves are valid and have been proposed by the clients. In other words, only valid transactions are accepted and executed on the network.

There are three fundamental consensus properties that Tendermint satisfies. As we discussed earlier in the chapter, a consensus problem generally has safety and liveness properties that are required to be met. In Tendermint, these safety and liveness properties consist of agreement, termination, and validity.

These properties are defined as follows:

  • Agreement: No two correct processes decide on different values.
  • Termination: All correct processes eventually decide on a value.
  • Validity: A decided upon value is valid, that is, it satisfies the predefined predicate denoted valid().

Now we'll describe how the algorithm works.

State transition in Tendermint is dependent on the messages received and timeouts. In other words, the state is changed in response to messages received by a processor or in the event of timeouts. The timeout mechanism ensures liveness and prevents endless waiting. It is assumed that, eventually, after a period of asynchrony, there will be a round or communication period during which all processes can communicate in a timely fashion, which will ensure that processes eventually decide on a value.

The following diagram depicts the Tendermint protocol at a high level:

Figure 5.8: Tendermint high-level overview

Types of messages in Tendermint

There are three types of messages in Tendermint. We'll discuss each type individually here:

  1. Proposal: As the name suggests, this is used by the leader of the current round to propose a value or block.
  2. Pre-vote: This message is used to vote on a proposed value.
  3. Pre-commit: This message is also used to vote on a proposed value.

These messages can be considered somewhat equivalent to PBFT's PRE-PREPARE, PREPARE, and COMMIT messages.

Note that in Tendermint, only the proposal message carries the original value and the other two messages, pre-vote and pre-commit, operate on a value identifier, representing the original proposal.

All of the aforementioned messages also have a corresponding timeout mechanism, which ensures that processes do not end up waiting indefinitely for some condition to be met. If a processor cannot decide in the expected amount of time, it will time out and trigger a round change.

Each type of message has an associated timeout. As such, there are three timeouts in Tendermint, corresponding to each message type:

  1. Timeout-propose
  2. Timeout-prevote
  3. Timeout-precommit

These timeout mechanisms prevent the algorithm from waiting infinitely for a condition to be met. They also ensure that processes progress through the rounds. A clever mechanism to increase timeout with every new round ensures that after reaching GST, eventually the communication between correct processes becomes reliable and a decision can be reached.
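A common way to express this growing timeout is as a linear function of the round number. The following sketch and its constant values are assumptions for illustration only and are not taken from the Tendermint implementation.

```python
# Illustrative Tendermint-style timeout schedule (the constants are arbitrary).

TIMEOUT_PROPOSE_BASE = 3.0    # seconds
TIMEOUT_PREVOTE_BASE = 1.0
TIMEOUT_PRECOMMIT_BASE = 1.0
TIMEOUT_DELTA = 0.5           # added for every further round

def timeout_propose(round_number: int) -> float:
    return TIMEOUT_PROPOSE_BASE + TIMEOUT_DELTA * round_number

def timeout_prevote(round_number: int) -> float:
    return TIMEOUT_PREVOTE_BASE + TIMEOUT_DELTA * round_number

def timeout_precommit(round_number: int) -> float:
    return TIMEOUT_PRECOMMIT_BASE + TIMEOUT_DELTA * round_number

# Timeouts grow with the round number, so after GST correct processes eventually
# have enough time to exchange messages and reach a decision.
print([timeout_propose(r) for r in range(4)])   # -> [3.0, 3.5, 4.0, 4.5]
```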

State variables

All processes in Tendermint maintain a set of variables, which helps with the execution of the algorithm. Each of these variables holds critical values, which ensure the correct execution of the algorithm.

These variables are listed and discussed as follows:

  • Step: The step variable holds information about the current state of the algorithm, that is, the current state of the Tendermint state machine in the current round.
  • lockedValue: The lockedValue variable stores the most recent value (with respect to a round number) for which a pre-commit message has been sent.
  • lockedRound: The lockedRound variable contains information about the last round in which the process sent a non-nil pre-commit message. This is the round in which a possible decision value has been locked: if a proposal message and the corresponding 2F + 1 prevote messages have been received for a value in a round, then that value is a possible decision value.
  • validValue: The role of the validValue variable is to store the most recent possible decision value.
  • validRound: The validRound variable is the last round in which validValue was updated.

    lockedValue, lockedRound, validValue, and validRound are reset to their initial values every time a decision is reached.

Apart from the preceding variables, a process also stores the current consensus instance (called height in Tendermint), and the current round number. These variables are attached to every message. A process also stores an array of decisions. Tendermint assumes a sequence of consensus instances, one for each height.

How Tendermint works

Tendermint works in rounds and each round comprises phases: propose, pre-vote, pre-commit, and commit. In this section, we'll explore how Tendermint works:

  1. Every round starts with a proposal value being proposed by a proposer. The proposer can propose any new value at the start of the first round for each height.
  2. After the first round, any subsequent round will have a proposer that proposes a new value only if there is no valid value present, that is, validValue is nil. Otherwise, validValue, that is, the possible decision value that has already been locked in a previous round, is proposed. The proposal message also includes the validRound value, which denotes the last round in which validValue was updated.
  3. The proposal is accepted by a correct process only if:
    1. The proposed value is valid, and
    2. The process has not locked on a value in an earlier round, or
    3. The process has locked on that same proposed value
  4. If the preceding conditions are met, then the correct process will accept the proposal and send a pre-vote message.
  5. If the preceding conditions are not met, then the process will send a pre-vote message with a nil value.
  6. In addition, there is also a timeout mechanism associated with the proposal phase, which initiates timeout if a process has not sent a pre-vote message in the current round or the timer expires in the proposal stage.
  7. If a correct process receives a proposal message with a value and 2F + 1 pre-vote messages, then it sends the pre-commit message.
  8. Otherwise, it sends out a nil pre-commit.
  9. A timeout mechanism associated with the pre-commit will initialize if the associated timer expires or if the process has not sent a pre-commit message after receiving a proposal message and 2F + 1 pre-commit messages.
  10. A correct process decides on a value if it has received the proposal message in some round and 2F + 1 pre-commit messages for the ID of the proposed value.
  11. There is also an associated timeout mechanism with this step, which ensures that the processor does not wait indefinitely to receive 2F + 1 messages. If the timer expires before the processor can decide, the processor starts the next round.
  12. When a processor eventually decides, it triggers the next consensus instance for the next block proposal and the entire process of proposal, pre-vote, and pre-commit starts again.

The process is as simple as the flow shown here:

Proposal —> Pre-vote —> Pre-commit
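The following rough sketch focuses on how the validValue/validRound and lockedValue/lockedRound variables interact with the propose, pre-vote, and pre-commit steps described above. It is a simplification under assumed handler names; heights, timeouts, nil votes, and message gossip are omitted, so it should be read as an illustration rather than the actual Tendermint rules.

```python
# Simplified sketch of Tendermint's propose step and value locking
# (illustrative only; heights, timeouts, and nil votes are omitted).

class TendermintProcess:
    def __init__(self, n_processes):
        self.f = (n_processes - 1) // 3
        self.quorum = 2 * self.f + 1
        self.locked_value, self.locked_round = None, -1
        self.valid_value, self.valid_round = None, -1

    def get_proposal(self, round_number, new_value):
        """Proposer rule: re-propose validValue if one exists, else a new value."""
        if self.valid_value is not None:
            return (round_number, self.valid_value, self.valid_round)
        return (round_number, new_value, -1)

    def on_quorum_prevotes(self, round_number, value):
        """2F + 1 prevotes for the proposed value: lock it and send a pre-commit."""
        self.locked_value, self.locked_round = value, round_number
        self.valid_value, self.valid_round = value, round_number
        return ("PRECOMMIT", round_number, value)

    def on_quorum_precommits(self, value):
        """2F + 1 pre-commits for the value: decide and reset for the next height."""
        decision = value
        self.locked_value, self.locked_round = None, -1
        self.valid_value, self.valid_round = None, -1
        return decision
```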

We can visualize the protocol flow with the diagram shown here:

Figure 5.9: Tendermint flowchart

Now we'll discuss how the termination mechanism works in Tendermint. A new termination mechanism has been introduced in Tendermint. For this purpose, there are two variables, namely validValue and validRound, which are used by the proposal message. Both of these variables are updated by a correct process when the process receives a valid proposal message and subsequent/corresponding 2F + 1 pre-vote messages.

This process works by utilizing the gossip protocol, which ensures that if a correct process has locked a value in a round, all correct processes will update their validValue and validRound variables with the locked values by the end of the round in which they were locked. The key idea is that once these values have been locked by a correct processor, they will be propagated to other nodes within the same round, so each processor will know the locked value and round, that is, the valid values. When the next proposal is made, the proposer will pick up the values that have already been locked as a result of a valid proposal and the corresponding 2F + 1 pre-vote messages. This way, it can be ensured that the value that processes eventually decide upon is acceptable, as specified by the validity conditions described above.

Tendermint implementation

Tendermint is developed in Go. It can be used in various scenarios. The source code is available at https://github.com/tendermint/tendermint.

It is released as Tendermint Core, which is a language-agnostic programming middleware that takes a state transition machine and replicates it on many machines.

Tendermint Core is available here: https://tendermint.com/core/

The algorithms discussed so far are variants of traditional Byzantine fault-tolerant algorithms. Now, we'll introduce the protocols that were developed specifically for blockchains. A prime example is, of course, PoW, which was first introduced with Bitcoin.

Nakamoto consensus

Nakamoto consensus, or PoW, was first introduced with Bitcoin in 2009. Since then, it has stood the test of time, and Bitcoin remains the longest-running blockchain network. This longevity is a testament to the efficacy of the PoW consensus mechanism. Now, we will explore how PoW works.

At a fundamental level, the PoW mechanism is designed to mitigate Sybil attacks, which facilitates consensus and the security of the network.

A Sybil attack is a type of attack that aims to gain majority influence over the network in order to control it. Once a network is under the control of an adversary, any malicious activity could occur. A Sybil attack is usually conducted by a node generating and using multiple identities on the network. If an entity holds enough identities, it can influence the network by skewing majority-based network decisions, because the majority is then held by the adversary.

It is quite easy to obtain multiple identities and try to influence the network. However, in Bitcoin, due to the hashing power requirements, this attack is mitigated.

How PoW works

In a nutshell, PoW works as follows:

  • PoW makes use of hash puzzles.
  • A node proposing a block has to find a nonce such that H(nonce || previous block hash || Tx || Tx || ... || Tx) < threshold value.

The process can be summarized as follows:

  • New transactions are broadcast to all nodes on the network.
  • Each node collects the transactions into a candidate block.
  • Miners propose new candidate blocks.
  • Miners concatenate a nonce with the previous block's hash and the candidate block's transactions, and hash the result.
  • The resultant hash is checked against the target value, that is, the network difficulty target value.
  • If the resultant hash is less than the threshold value, then PoW is solved, otherwise, the nonce is incremented and the node tries again. This process continues until a resultant hash is found that is less than the threshold value.
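This hash puzzle can be illustrated with a toy mining loop in Python. The difficulty target below is deliberately very easy so the loop finishes quickly; real networks use far stricter targets and a specific block header layout, so this is an illustration only.

```python
# Toy proof-of-work loop (illustrative only).
import hashlib

def mine(previous_hash: str, transactions: str, target: int) -> int:
    """Increment the nonce until the hash of the block data falls below the target."""
    nonce = 0
    while True:
        data = f"{nonce}{previous_hash}{transactions}".encode()
        digest = int.from_bytes(hashlib.sha256(data).digest(), "big")
        if digest < target:
            return nonce          # PoW solved
        nonce += 1

# A very easy target (roughly the first 16 bits must be zero) for demonstration.
easy_target = 1 << (256 - 16)
print(mine("00ab...previous-block-hash", "tx1|tx2|tx3", easy_target))
```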

We can visualize how PoW works with the diagram shown here:

Figure 5.10: PoW diagram

PoW as a solution to the Byzantine generals problem

The original posting on this by Satoshi Nakamoto can be found here:

https://satoshi.nakamotoinstitute.org/emails/cryptography/11/

The key idea behind PoW as a solution to the Byzantine generals problem is that all honest generals (miners in the Bitcoin world) achieve agreement on the same state (decision value). As long as honest participants control the majority of the network, PoW solves the Byzantine generals problem. Note that this is a probabilistic solution and not deterministic.

Nakamoto versus traditional consensus

Nakamoto consensus was the first of its kind. It solved the consensus problem, which had been traditionally solved using pre-Bitcoin protocols like PBFT. A natural question that arises is, are there any similarities between the Nakamoto world and traditional distributed consensus? The short answer is yes, because we can map the properties of PoW consensus to traditional Byzantine consensus. This mapping is useful to understand what properties of traditional consensus can be applied to the Nakamoto world and vice versa.

In traditional consensus algorithms, we have the agreement, validity, and liveness properties, which can be mapped to the Nakamoto-specific properties of common prefix, chain quality, and chain growth, respectively.

The common prefix property means that the blockchain hosted by honest nodes will share the same large common prefix. If that is not the case, then the agreement property of the protocol cannot be guaranteed, meaning that the processors will not be able to decide and agree on the same value.

The chain quality property means that the blockchain contains a certain required level of correct blocks created by honest nodes (miners). If chain quality is compromised, then the validity property of the protocol cannot be guaranteed. This means that there is a possibility that a value will be decided that is not proposed by a correct process, resulting in safety violation.

The chain growth property simply means that new correct blocks are continuously added to the blockchain. If chain growth is impacted, then the liveness property of the protocol cannot be guaranteed. This means that the system can deadlock or fail to decide on a value.

PoW chain quality and common prefix properties are introduced and discussed in the following paper:

Garay, J., Kiayias, A. and Leonardos, N., 2015, April. The bitcoin backbone protocol: Analysis and applications. In Annual International Conference on the Theory and Applications of Cryptographic Techniques (pp. 281-310). Springer, Berlin, Heidelberg.

https://eprint.iacr.org/2014/765.pdf

Variants of PoW

There are two main variants of PoW algorithms, based on the type of hardware used for their processing.

CPU-bound PoW

CPU-bound PoW refers to a type of PoW where the processing required to find the solution to the cryptographic hash puzzle is directly proportional to the calculation speed of the CPU or hardware such as ASICs. Because ASICs have dominated the Bitcoin PoW and provide somewhat undue advantage to the miners who can afford to use ASICs, this CPU-bound PoW is seen as shifting toward centralization. Moreover, mining pools with extraordinary hashing power can shift the balance of power towards them. Therefore, memory-bound PoW algorithms have been introduced, which are ASIC-resistant and are based on memory-oriented design instead of CPU.

Memory-bound PoW

Memory-bound PoW algorithms rely on system RAM to provide PoW. Here, the performance is bound by the access speed of the memory or the size of the memory. This reliance on memory also makes these PoW algorithms ASIC-resistant. Equihash is one of the most prominent memory-bound PoW algorithms.

PoW consumes tremendous amounts of energy; as such, there are several alternatives suggested by researchers. One of the first alternatives proposed is PoS.

Proof of stake (PoS)

PoS is an energy-efficient alternative to the PoW algorithm, which consumes an enormous amount of energy. PoS was first used in Peercoin, and now, prominent blockchains such as EOS, NxT, Steem, and Tezos are using PoS algorithms. Ethereum, with its Serenity release, will soon transition to a PoS-based consensus mechanism.

The stake represents the amount of coins (money) that a blockchain participant has locked up in the consensus protocol. The key idea is that if someone has a stake in the system, then they will not try to sabotage the system. Generally speaking, the chance of proposing the next block is directly proportional to the value staked by the participant. However, there are a few intricate variations, which we will discuss shortly. There are different variations of PoS, such as chain-based PoS, committee-based PoS, BFT-based PoS, delegated PoS, leased PoS, and master node PoS.

In PoS systems, there is no concept of mining in the traditional Nakamoto consensus sense. However, the process related to earning revenue is sometimes called virtual mining. A PoS miner is called either a validator, minter, or stakeholder.

The right to win the next proposer role is usually assigned randomly. Proposers are rewarded either with transaction fees or block rewards. Similar to PoW, control over the majority of the network in the form of the control of a large portion of the stake is required to attack and control the network.

PoS mechanisms generally select a stakeholder and grant appropriate rights to it based on their staked assets. The stake calculation is application-specific, but generally, is based on balance, deposit value, or voting among the validators. Once the stake is calculated, and a stakeholder is selected to propose a block, the block proposed by the proposer is readily accepted. The probability of selection increases with a higher stake. In other words, the higher the stake, the better the chances of winning the right to propose the next block.

How PoS works

Figure 5.11: PoS diagram

In the preceding diagram, a generic PoS mechanism is shown where a stake calculator function is used to calculate the amount of staked funds and, based on that, select a new proposer.
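One simple way to model "the higher the stake, the better the chances" is stake-weighted random selection, sketched below. Actual PoS protocols use more elaborate, verifiable mechanisms (such as the VRFs discussed shortly), so this snippet is an illustration only and its names and numbers are assumptions.

```python
# Stake-weighted random proposer selection (illustrative only).
import random

def select_proposer(stakes: dict) -> str:
    """Pick a validator with probability proportional to its stake."""
    validators = list(stakes.keys())
    weights = list(stakes.values())
    return random.choices(validators, weights=weights, k=1)[0]

stakes = {"alice": 600, "bob": 300, "carol": 100}
# alice is selected roughly 60% of the time, bob 30%, and carol 10%.
print(select_proposer(stakes))
```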

Types of PoS

There are several types of PoS algorithms.

Chain-based PoS

This mechanism is very similar to PoW. The only change from the PoW mechanism is the block generation method. A block is generated in two steps by following a simple protocol:

  • Transactions are picked up from the memory pool and a candidate block is created.
  • A clock is set up with a constant tick interval. At each clock tick, the node checks whether the hash of the block header concatenated with the clock time is less than the product of the target value and the stake value. This process can be expressed as a simple formula:

Hash (B || clock time) < target x stake value

The stake value is dependent on the way the algorithm is designed. In some systems, it is directly proportional to the amount of stake, and in others, it is based on the amount of time the stake has been held by the participant (also called coinage). The target is the mining difficulty per unit of the value of stake.

This mechanism still uses hashing puzzles, as in PoW. But, instead of competing to solve the hashing puzzle by consuming a high amount of electricity and specialized hardware, the hashing puzzle in PoS is solved at regular intervals based on the clock tick. A hashing puzzle becomes easier to solve if the stake value of the minter is high.
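The per-tick check can be sketched as follows; the hash construction and the constants are assumptions for illustration and do not reflect any specific chain's parameters.

```python
# Chain-based PoS tick check (illustrative only).
import hashlib
import time

def pos_tick_check(block_header: bytes, clock_time: int,
                   base_target: int, stake_value: int) -> bool:
    """Return True if this minter wins the right to create the block at this tick."""
    data = block_header + str(clock_time).encode()
    digest = int.from_bytes(hashlib.sha256(data).digest(), "big")
    # A larger stake_value scales the threshold up, making success more likely.
    return digest < base_target * stake_value

header = b"example-block-header"
print(pos_tick_check(header, int(time.time()), base_target=1 << 236, stake_value=1000))
```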

Peercoin was the first blockchain to implement PoS.

Nxt (https://www.jelurida.com/nxt) and Peercoin (https://www.peercoin.net) are two examples of blockchains where chain-based PoS is implemented.

Committee-based PoS

In this scheme, a group of stakeholders is chosen randomly, usually by using a verifiable random function (VRF). This VRF, once invoked, produces a random set of stakeholders based on their stake and the current state of the blockchain. The chosen group of stakeholders becomes responsible for proposing blocks in sequential order.

This mechanism is used in the Ouroboros PoS consensus mechanism, which is used in Cardano. More details on this are available here:

https://www.cardano.org/en/ouroboros/

VRFs were introduced in Chapter 4, Public Key Cryptography. More information on VRF can be found here:

https://tools.ietf.org/id/draft-goldbe-vrf-01.html.

Delegated PoS

DPoS is very similar to committee-based PoS, but with one crucial difference. Instead of using a random function to derive the group of stakeholders, the group is chosen by stake delegation. The group selected is a fixed number of minters that create blocks in a round-robin fashion. Delegates are chosen via voting by network users. Votes are proportional to the amount of the stake that participants have on the network. This technique is used in Lisk, Cosmos, and EOS. DPoS is not decentralized as a small number of known users are made responsible for proposing and generating blocks.

HotStuff

HotStuff is the latest class of BFT protocols and introduces a number of optimizations. Several changes make HotStuff a different and, in some ways, better protocol than traditional PBFT. HotStuff was introduced by VMware Research in 2018 and was later presented at the ACM Symposium on Principles of Distributed Computing.

The paper reference is as follows:

Yin, M., Malkhi, D., Reiter, M.K., Gueta, G.G. and Abraham, I., 2019, July. HotStuff: BFT consensus with linearity and responsiveness. In Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing (pp. 347-356). ACM.

The paper can be downloaded from: https://dl.acm.org/ft_gateway.cfm?id=3331591&type=pdf

There are three key properties that HotStuff has addressed. These properties are listed as follows:

Linear view change

Linear view change reduces communication complexity. After GST is reached, a correct designated leader sends only O(n) authenticators (either partial signatures or signatures) to reach consensus. In the worst case, where leaders fail successively, the communication cost is O(n²), that is, quadratic. In simpler words, quadratic complexity means that the cost of the algorithm grows in proportion to the square of the number of nodes.

Optimistic responsiveness

Optimistic responsiveness ensures that, after GST is reached, any correct leader requires only the first N - F responses to ensure progress.

Chain quality

This property ensures fairness and liveness in the system by allowing fast and frequent leader rotation.

All these properties together have been addressed for the first time in the HotStuff protocol.

Another innovation in HotStuff is the separation of safety and liveness mechanisms. The separation of concerns allows better modularity, cleaner architecture, and control over the development of these features independently. Safety is ensured through voting and commit rules for participant nodes in the network. Liveness, on the other hand, is the responsibility of a separate module, called Pacemaker, which ensures a new, correct, and unique leader is elected.

In comparison with traditional PBFT, HotStuff has introduced several changes, which result in improved performance:

  • PBFT-style protocols work using a mesh communication topology, where each message is required to be broadcast to other nodes on the network. In HotStuff, the communication has been changed to the star topology, which means that nodes do not communicate with each other directly, but all consensus messages are collected by a leader and then broadcast to other nodes. This immediately results in reduced communication complexity.

A question arises here: what happens if the leader is corrupt or compromised? This issue is addressed by the usual BFT tolerance rules: if a leader proposes a malicious block, it will be rejected by the honest validators and a new leader will be chosen. This scenario can slow down the network for a limited time (until a new honest leader is chosen), but eventually, as long as a majority of the network is honest, an honest leader will be chosen and will propose a valid block. For further protection, the leader role is usually rotated frequently (often with each block) between validators, which helps to neutralize attacks that target a specific leader. This property ensures fairness, which helps to achieve the chain quality property introduced previously.

However, note that there is one problem in this scheme where, if the leader becomes too overloaded, then the processing may become slow, impacting the whole network.

Mesh and star topologies can be visualized in the following diagram:

Figure 5.12: Mesh topology

Figure 5.13: Star topology

  • PBFT has two main sub-protocols, namely normal mode and view-change mode. View-change mode is triggered when a leader is suspected of being faulty. This approach provides a liveness guarantee for PBFT but increases communication complexity significantly. HotStuff addresses this by merging the view-change process with normal mode: nodes can switch to a new view directly, without running a separate sub-protocol. In PBFT, nodes wait for 2F + 1 view-change messages before the view change can occur; in HotStuff, checking that the threshold of messages has been reached becomes part of the normal view.

Just like PBFT, HotStuff also solves the SMR problem. Now we'll describe how this protocol works. As we saw earlier, consensus algorithms are described and work under a system model. The model and relevant assumptions under which HotStuff works are introduced in the next section.

Model

The system is based on the standard BFT assumption of N = 3F + 1 nodes, where F is the number of faulty nodes that can be tolerated and N is the total number of nodes in the network. For example, a network of four nodes can tolerate one Byzantine node. Nodes communicate with each other via point-to-point message passing over reliable and authenticated communication links. The network is assumed to be partially synchronous.

If you are unfamiliar with this terminology, we discussed these topics in the Analysis and design section of this chapter.

HotStuff makes use of threshold signatures where a single public key is used by all nodes, but a unique private key is used by each replica. The use of threshold signatures results in the decreased communication complexity of the protocol. In addition, cryptographic hash functions are used to provide unique identifiers for messages.

We discussed threshold signatures in Chapter 4, Public Key Cryptography. You can refer to the details there if required.

How HotStuff works

HotStuff works in phases, namely the prepare phase, pre-commit phase, commit phase, and decide phase.

We'll now introduce these phases:

A common term that we will see in the following section is quorum certificate. It is a data structure that represents a collection of signatures produced by N - F replicas to demonstrate that the required threshold of messages has been achieved. In other words, it is simply a set of votes from N - F nodes.
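
To make this concrete, the following is a minimal, hypothetical Python representation of a quorum certificate; a real implementation would combine the collected partial signatures into a single threshold signature:

from dataclasses import dataclass, field

N = 4           # total replicas (N = 3F + 1)
F = 1           # tolerated faulty replicas
QUORUM = N - F  # votes needed to form a quorum certificate

@dataclass
class QuorumCertificate:
    # A set of votes from N - F replicas for the same message type, view,
    # and proposed node (block).
    msg_type: str
    view: int
    node_hash: bytes
    signatures: dict = field(default_factory=dict)  # replica ID -> partial signature

    def add_vote(self, replica_id, signature):
        self.signatures[replica_id] = signature

    def is_complete(self):
        return len(self.signatures) >= QUORUM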

Prepare:

The protocol for a new leader starts once it has collected new-view messages from N - F nodes. The leader processes these messages to find the latest branch in which the highest quorum certificate of prepare messages was formed, extends that branch with a new proposal, and broadcasts the proposal in a PREPARE message. Replicas that accept the proposal respond with a prepare vote.

Pre-commit:

As soon as a leader receives N - F prepare votes, it creates a quorum certificate called "prepare quorum certificate." This "prepare quorum certificate" is broadcast to other nodes as a PRE-COMMIT message. When a replica receives the PRE-COMMIT message, it responds with a pre-commit vote. The quorum certificate is the indication that the required threshold of nodes has confirmed the request.

Commit:

When the leader receives N - F pre-commit votes, it creates a PRE-COMMIT quorum certificate and broadcasts it to other nodes as the COMMIT message. When replicas receive this COMMIT message, they respond with their commit vote. At this stage, replicas lock the PRE-COMMIT quorum certificate to ensure the safety of the algorithm even if a view change occurs.

Decide:

When the leader receives N - F commit votes, it creates a COMMIT quorum certificate and broadcasts it to the other nodes in the DECIDE message. When replicas receive this DECIDE message, they execute the request, because the message carries an already committed certificate/value. Once a replica has processed the DECIDE message and the resulting state transition has occurred, the new view starts.
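
Putting the four phases together, the leader's side of a single view can be outlined as follows. This is an illustrative sketch rather than the actual HotStuff implementation: broadcast() and collect_votes() are assumed networking helpers, QUORUM is N - F as in the earlier sketch, and replica-side checks and signatures are omitted.

def leader_round(view, proposal, new_view_msgs, broadcast, collect_votes):
    # PREPARE: pick the branch with the highest prepare quorum certificate
    # reported in the new-view messages and propose on top of it.
    high_qc = max(new_view_msgs, key=lambda m: m.justify.view).justify
    broadcast("PREPARE", view, proposal, justify=high_qc)
    prepare_qc = collect_votes("PREPARE", view, threshold=QUORUM)

    # PRE-COMMIT: relay the prepare QC; replicas answer with pre-commit votes.
    broadcast("PRE-COMMIT", view, justify=prepare_qc)
    precommit_qc = collect_votes("PRE-COMMIT", view, threshold=QUORUM)

    # COMMIT: relay the pre-commit QC; replicas lock on it and send commit votes.
    broadcast("COMMIT", view, justify=precommit_qc)
    commit_qc = collect_votes("COMMIT", view, threshold=QUORUM)

    # DECIDE: relay the commit QC; replicas execute the request and start
    # the next view.
    broadcast("DECIDE", view, justify=commit_qc)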

This algorithm can be visualized in the following diagram:

Figure 5.14: HotStuff protocol

HotStuff, with some variations, is also used in LibraBFT, the consensus protocol of Libra, a distributed database designed to support a global currency proposed by Facebook.

More details on this protocol are available here:

https://libra.org/

HotStuff guarantees liveness (progress) by using Pacemaker, which ensures progress after GST within a bounded time interval. This component has two elements:

  • There are time intervals during which all replicas stay at the same height for sufficiently long. This can be achieved by progressively increasing the time allowed for a view until progress is made (a decision is reached).
  • A unique and correct leader is elected for the height. New leaders can be elected deterministically by using a rotating leader scheme or pseudo-random functions. Both elements are sketched after this list.
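
A minimal sketch of both elements, with illustrative timeout values and replica names, could look like this:

BASE_TIMEOUT = 1.0  # seconds; illustrative value

def view_timeout(consecutive_failed_views):
    # Exponentially increase the time allowed for a view so that, after GST,
    # all correct replicas eventually overlap in the same view long enough
    # to reach a decision.
    return BASE_TIMEOUT * (2 ** consecutive_failed_views)

def leader_for_view(view, replicas):
    # Deterministic round-robin leader rotation.
    return replicas[view % len(replicas)]

# With three consecutive failed views, the timeout for view 3 grows to
# 8 seconds; view 3 is led by replica "r3".
replicas = ["r0", "r1", "r2", "r3"]
print(view_timeout(3), leader_for_view(3, replicas))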

Safety in HotStuff is guaranteed by voting and relevant commit rules. Liveness and safety are separate mechanisms that allow independent development, modularity, and separation of concerns.

HotStuff is a simple yet powerful protocol that provides the linearity and responsiveness properties. It allows consensus without any additional latency, at the actual speed of the network. Moreover, it manifests linear communication complexity, thus reducing the cost of communication compared to PBFT-style protocols. It is also a framework in which other protocols, such as DLS, PBFT, and Tendermint, can be expressed.

In this section, we have explored various consensus algorithms in detail. We discussed pre-Bitcoin distributed consensus algorithms and post-Bitcoin, Proof of Work (PoW)-style algorithms. Now, a question arises naturally: with the availability of all these different algorithms, which algorithm is best, and which one should you choose for a particular use case? We'll try to answer this question in the next section.