Achieving consensus

It seems as if I start every piece here with an apology for taking so long to continue writing. That’s the nature of my life right now, it seems. I’m sorry.

So let’s dive right in. Last time, I was writing about how different channels interact with each other, and what that means about messaging in fairly abstract terms. I also wrote before how I’m not a big fan of negotiating protocol features, and prefer to instead iterate to a new and improved protocol faster. I’ll touch on that in this post again. Also, I wrote that packet headers will have to include some reserved bits, which I will also refer to here.


Ok, so we left things off with establishing consensus for channel identifiers. In particular, I mentioned that since at this level we’re talking about establishing a connection between two machines, we can skip over the complexities of many distributed consensus protocols, since they tend to assume that three or more peers are involved in the decision making.

In the abstract, the upshot of a two peer consensus protocol is for one party to propose a value, and the other party to either agree (simple case), or disagree. If they disagree, this could be solved with either the initiating peer making a new proposal, or the responding peer to make a counter proposal. In either case, we would have to bounce back and forth some messages until an agreement is reached.

Should we model things like that? No. Well, not quite.

First, let’s see how other protocols do it, starting with TCP.


TCP has a three-way handshake, meaning it exchanges three messages to establish a connection.

But hang on, what does that have to do with consensus?

Well, two things. On the one hand, the two peers establish consensus on sequence numbering. Also, in the more broader sense, the handshake allows both sides to agree that the connection is established.

  1. First the initiating peer sends a SYN message, containing a sequence number X.

  2. The reply from the second peer is a SYN-ACK message — it sends its own sequence number Y (consider this the SYN part as above), and acknowledges X by sending X + 1 (the ACK part).

  3. Lastly, the initiating peer acknowledges Y with an ACK message containing Y + 1.

X and Y are chosen randomly. Either side can detect spurious a ACK by comparing the number sent there to what they sent themselves; if it’s not an increment of their own sequence number by one, it must belong to a different session.

We’ll get back to TCP later, let’s look at something else.


The Paxos protocol attempts to establish consensus amongst more peers with Byzantine Fault Tolerance. This is not the issue we’re trying to solve here, but it is helpful to understand roughly how it achieves this.

The main difference to a TCP handshake is actually that the initiating peer does not start proposing something immediately. Instead, it first prepares peers for the next round of proposals. To do this, each proposal has an ID, and IDs are strictly incrementing. That allows peers to reject any proposal ID that is outdated.

  1. The initiating peer generates a new proposal ID and sends it in a PREPARE message.

  2. The recipients respond with a PROMISE message the same ID. If the initiating peer receives a sufficient number of promises, it knows that it can now proceed with a proposal.

  3. The initiating peer sends a PROPOSE message with the proposal ID, and the proposed value.

  4. The recipients respond with an ACCEPT message with the proposal ID and the accepted value.


Compare Paxos to TCP here:

  • TCP has no proposal ID. But the sequence number in the SYN message helps disambiguate messages in much the same way.

  • The PREPARE/PROMISE and PROPOSE/ACCEPT messages both establish an agreement on a value. The first establishes agreement on the proposal ID, and the second agreement on the proposed value.

  • Similarly in TCP, each SYN/ACK pair establishes agreement on one of the sequence IDs.

  • The proposal ID in Paxos exists in a sense only because multiple peers need to agree; in TCP, the sequence number works because agreement is binary, either the peer does acknowledge the proposal or it does not. In Paxos, it is majority based.

From this, we can draw some conclusions:

  1. We need a pair of messages — a kind of proposal and a kind of acknowledgement — to establish agreement on a single value.

  2. We also need to agree what we agree on. The first message, therefore, needs to set this context.


Let’s revisit reliability for a moment: we need to establish two channel capabilities that in other protocols are baked in, and both relate to reliability: one is strict ordering of packets, and the other is resending of lost packets.

So let’s look at the two bits — resending and ordering — separately:

  1. !resend && !ordered — this corresponds to UDP.

  2. resend && ordered — this corresponds to TCP.

  3. resend && !ordered — this corresponds to SCTP’s mode introduced to allow for out-of-order reliability versus TCP.

  4. !resend && ordered — this is new.

What does it mean for packet ordering to be preserved, but not having resends?

Well, suppose we get a packet with sequence number N, and then with another with a sequence number N + 2. Should we deliver the latter packet to the application or hold it back?

In TCP, the answer is obviously to hold it back indefinitely — that is, until either packet N + 1 is received, whether by resend or a simple out-of-order delivery. And if a timeout expires before we receive N + 1, TCP just gives up and closes the connection.

We can do different here: we can keep the buffer of received packets, and use a shorter timeout for delivering packets to the application. The timeout expires, and we deliver packet N to the application, then shift the buffer window to start with N + 1.

When the timeout expires again, either N + 1 has arrived out-of-order or it has not. If it has arrived, we deliver it to the application and shift the window to N + 2. If it has not arrived, we deliver N + 2 to the application, and shift the window to N + 3.

This fairly simple algorithm implements a lack of resend combined with strict ordering of a channel.

Is this useful? Yes, absolutely.

You see, this kind of algorithm is perfect for live streaming audio or video. In terms of perceived quality, losing bits of the stream is less impacting the user experience than stalling the stream, leading to stutter or buffering. We should absolutely support this mode.

Note that there is also space for a hybrid mode: one where we attempt resends, but shift the window instead of closing the connection. This would allow for some amount of recovery, but would still prefer moving on over stalling.

So we have at least three capability bits for the channel. We need some kind of matrix for that now…

Ok, so the table lists the bits. We’ve marked the combinations that behave like UDP and TCP, as well as the aforementioned SCTP mode respectively. The streaming modes we discussed here are listed as Streaming 1 and Streaming 2.

The full matrix contains a few other modes that may or may not be useful:

  • mode X does not resend or order, but closes immediately on loss. This could be useful for point-to-point connections over a reliable medium such as a cable. The mode would effectively detect when the plug is pulled.

  • mode Y would be a hybrid of Streaming 1 and mode X. It may similarly be useful for reliable media.

  • mode Z would be a version of the SCTP mode that does not really care about packet losses too much. It’s a kind of best-effort version of a resend. It’s doubtful it will make a whole lot of sense, but there is no reason not to support it.

Now we could discuss all those different modes for a while, but let’s skip this for the time being. There’s another post coming up on those things anyway.

For now, it’s important that any combination of these flags except the UDP-style case where every flag is zero requires our packets to contain a sequence number. And that matters when…

Putting Things Together

The TCP handshake exchanges sequence numbers, because that’s what TCP needs. It appears we need to do the same. We also want to agree on a channel ID.

Looking at TCP and Paxos, it appears we can’t really cut our channel handshake down to fewer than three messages — though, remember, we’re talking about messages here, not packets. It is entirely feasible for two peers to establish multiple channels in parallel in three packets, containing multiple messages each.

We’ll resort to a little trick here with the channel ID. Remember how TCP establishes two sequence numbers in three messages? Well, one could conceivably view the tuple of the two sequence numbers as a single value.

Or, to rephrase this, we can split our channel ID from the N bits it should be into two N/2 parts. And then we can use three messages to reach agreement on both parts.

The rationale for doing so is simple: we looked at Paxos as an exchange of four messages, but really that is the ideal case. If for whatever reason insufficient peers send an ACCEPT message, Paxos starts all over again with a new proposal round. That’s something we want to avoid.

So instead of having one peer propose a single value that may or may not be rejected, let’s give both peers full control over half of the value. That way they can each select values that do not create local conflicts, and therefore must be accepted.

Oh, you like that idea? Thank you, I stole it. This is how port numbers work in TCP and UDP. But I’ll take the credit anyway.

So the channel ID is split into two N/2 parts, we’ll call them c1 and c2. And we need sequence numbers, s1 and s2. Our channel establishment then looks like this in full:

  1. The initiator chooses c1 and s1, and sends both in a MSG_CHANNEL_NEW message. The message furthermore contains the capability bits above.

  2. The responder generates c2. It sends (c1||c2) as the channel ID. Responding with c2 gives the initiator context, much like with the proposal ID in Paxos. It also generates and sends s2. Finally, it sends s1 + 1 as an acknowledgment of s1 — though admittedly that is somewhat unnecessary. This message is MSG_CHANNEL_ACKNOWLEDGE.

  3. The initiator responds with the full channel ID (c1||c2) as well as s2 + 1 as an acknowledgement in MSG_CHANNEL_FINALIZE.


This protocol above serves our needs, but it is not a particular improvement over a TCP style handshake. Are there ways in which we can optimize things?

Let’s assume for a moment we did not need sequence IDs. This would reduce the payload of the above messages a little, true. But it also allows for some optimization.

Suppose we already have application data pending on both sides for the channel. In that case, sending MSG_CHANNEL_ACKNOWLEDGE and MSG_CHANNEL_FINALIZE in the default channel, and then the data message in the agreed-upon channel would mean sending two packets in both directions.

However, once the initiator has chosen c1, any packet for a channel ID that begins with c1 is effectively an acknowledgement of c1, and sends c2 in the remaining bits. Similarly, MSG_CHANNEL_FINALIZE is not really required if data can be sent on the channel instead, as both peers already have the full channel ID established.

Of course, if no data is pending in either direction, sending the appropriate message is still required. We can’t optimize that away entirely. In particular, MSG_CHANNEL_FINALIZE serves to tell the responder that their proposal of c2 was acknowledged!

The pesky sequence IDs are what makes this optimization impossible.


Unless, of course, there was a way to derive the sequence numbers from public information in a way that cannot be spoofed. Hmm.

Flood Attacks

Speaking of spoofing. TCP is known to be vulnerable to a number of flooding attacks, the most common being a SYN flood. Here, a number of SYN packets from spoofed IP addresses (or otherwise) are sent to a host. Since the host has to maintain state for each SYN packet until the entire three-way handshake expires, this can lead to resource exhaustion, turning this into a Denial of Service situation.

That means, in our protocol here, we should learn from this and try not to allocate state on the recipient until the handshake has completed.

To date, the most common techniques for mitigating SYN flood attacks are:

  1. A SYN cache. Without going into detail, this effectively limits the amount of information a responder keeps for partially completed handshakes.

  2. SYN cookies. In this technique, the responder does not keep state at all, but generates a cookie it sends back to the initiator. The initiator has to return that cookie, which the server can validate.

Cookies are obviously the better choice here. The less information we need to keep on the responder’s side, the better. Cookies are also the weapon of choice in TLS1.3 to defend against DoS attacks.

So let’s use cookies.

The simple breakdown of a cookie is that it encodes information that identifies the sender in some way, and combines it with a secret only the receiver knows. Typically, this is done with a cryptographic hash function, in some kind of HMAC construct. Without knowing the secret, attackers cannot generate valid cookies, which means responders can easily discard spoofed subsequent packets without having to keep state on incomplete handshakes.

Unlike TLS, though, we’ll use cookies on either side of the connection since we live in a P2P world where there is no clear distinction between client and server.

So what information should we put into the HMAC?

Since we don’t want to rely on being implemented on top of any particular protocol, we arrived at peer identifiers for initiator and responder. Both are public information in the packet header, and available to either side for reconstruction. The other obvious candidate is the channel ID. In MSG_CHANNEL_NEW, this is incomplete, but we can hash the bits that we know.

TLS also specifies a “transcript hash” function, which essentially works like a block chain. Each received message is hashed, and added to the HMAC input. For an initiator with peer identifier pid1 and secret secret1 and a responder with peer identifier pid2 and secret2, the messages would contain the following cookies:

  1. MSG_CHANNEL_NEW: there is no previous message to hash, so we’ll send cookie1 = HMAC(secret1, pid1||pid2||c1) — the secret aside, this is all information we’re sending in plain text as well.
    The responder cannot validate this, and that is not the point. The point is only to return it in the response. No state needs to be kept to do this.

  2. MSG_CHANNEL_ACKNOWLEDGE: cookie2 = HMAC(secret2, pid1||pid2||c1||c2) — of course we’ll also have to return cookie1 in the plain text section for authentication.
    The initiator receiving this message can see cookie1 again, and knows it has received a response to its attempt at establishing a channel ID, because the cookie cannot be spoofed.
    It needs no state to validate cookie1. While for a variety of reasons it does not make sense for an initiator to not remember cookie1, it could in fact be possible to reconstruct this from the secret1 and the contents of this message.

  3. MSG_CHANNEL_FINALIZE: we have to return cookie2 to signal to the responder that we did, in fact, receive this.
    The same thing applies as in the previous message: the responder has all the information available to validate cookie2, and therefore knows a channel has been established with the given ID.

Third parties can’t generate valid cookies without obtaining the respective secrets, and neither peer needs to keep state to validate the cookies. This fulfils some of the requirements of a protection against DoS flood attacks.

The TLS cookie works because returning the cookie proves some reachability at the network source address from which the cookie originated. That is, a MSG_CHANNEL_NEW from a spoofed source IP address would require that IP address to respond to MSG_CHANNEL_ACKNOWLEDGE. This is relatively easily mitigated, as this message needs only be accepted for channels currently undergoing a handshake.

Ok, that’s great.

Unfortunately, having to include cookies into our handshake messages means that optimizing the handshakes as alluded to above becomes even less likely.

Optimization, Revisited

Let’s look at this again: what we really wanted to achieve is to use the channel ID immediately in packets encapsulating application data, to avoid having to send multiple packets: one in the default channel with a handshake message, and one with application data.

Then we concluded that because we have to exchange sequence numbers, too, that this optimization is not really achievable.

The reason we included sequence numbers in the first place is so that we could implement resend and ordering features. That’s fine, we don’t really have to exchange them in the handshake. Naively at least, we can assume that the first packet on a channel we receive contains the sequence number we want to start with.

Of course, if the peer sent other packets on that channel before, but those get lost, we wouldn’t know. We can only start ordering from the first received packet. TCP puts those sequence numbers into the handshake packets to ensure that initial data is not lost, and that is why we included them in our handshake above as well.

But there is an alternative.

Now that we have cookies in one form or another, the presence of a cookie in a message already validates that we’re talking to the right peer. Hang on, I meant in a packet.

Let’s assume the responder did not send a MSG_CHANNEL_ACKNOWLEDGE message in the default channel, but instead wanted to send application data immediately. The initiator does not yet know the c2 part of the channel ID. As in the previous optimization attempt, we can assume that the c2 we’re receiving is valid, but with cookies in the mix, we’d like to also see cookie1 at the same time.

So let’s introduce a new message, MSG_CHANNEL_COOKIE. Unlike the other handshake messages, this one can be sent in a packet already belonging to the channel. Sent from responder to initiator, it would include cookie1, and cookie2 if sent in the other direction.

Between the peer identifiers and channel ID in the packet header, and the secret each peer keeps, we can validate this cookie and process whatever other messages are contained within the packet.

The only problem is that we cannot really process messages in the channel unless this cookie was received. That means the first packet sent on the channel must include MSG_CHANNEL_COOKIE, and this message must be processed before all others. So far, so good.

If that first packet happens to be lost?

Well, then the channel is not established. The usual handshake timeouts apply, and a new attempt at establishing a channel needs to be made. Of course, if the lost packet also included MSG_DATA, this is a message we’d have to retransmit.

I suppose the conclusion is that if — and that’s not a given — the application has data to transmit immediately, then the protocol implementation must be prepared to resend this data until the cookie the peer sent has also been received, whether via a handshake message or a MSG_CHANNEL_COOKIE. That seems like a reasonable enough thing to ask.

A defensive responder would not typically allow the application to do that in the first place. And an initiator wishing to send early data can hereby do so.

In either case, we do not need to send s1 and s2 or any of their derived values in the handshake after all. To summarize:

  1. MSG_CHANNEL_NEW: The initiator sends c1 and cookie1, as well as channel capability bits.

  2. MSG_CHANNEL_ACKNOWLEDGE: The responder sends (c1||c2) as the channel ID, as well as cookie2.

  3. MSG_CHANNEL_FINALIZE: The initiator responds with (c1||c2) as well as cookie2.

  4. Either of the previous two messages can be substituted with a MSG_CHANNEL_COOKIE containing the respective cookie in a packet already belonging to the channel.

Feb 25th Addendum:

If the responder is supposed not to keep state, the capability bits actually have to be sent in one of the latter messages — i.e. MSG_CHANNEL_FINALIZE, or the initiator’s MSG_CHANNEL_COOKIE. The easiest thing is to define both as always containing the space for capability bits, but when the responder sends MSG_CHANNEL_COOKIE, they are to be unset. MSG_CHANNEL_NEW then does not require capability bits.

Encryption Note

In all this talk about hashes and cookies, I should note again that this handshake is oftentimes already encrypted. The encryption layer should already authenticate peers sufficiently to make many of these precautions unnecessary. But they do not hurt overmuch — the overhead in message payload is not high. And they then permit unencrypted uses of the protocol layer.

We have discussed now how to establish channels, and could implement the protocol. What we haven’t discussed in too much detail yet is how exactly channel capabilities are to be handled, so any implementation at this point would not provide reliability. There are also other capabilities we have not yet discussed.

Beyond that, we have only hinted so far at how exactly to include encryption in this mixture.

Both topics deserve exploration in future. Stay tuned!

A quick note on subscriptions: I’m trying to keep doing this work, and have set up subscriptions so that people can help me pay my bills. That means some articles are effectively paywalled — if you can’t pay, fear not: there is a special launch promotion open.

Get 100% off forever

You’ll get free lifetime access, and I can keep the paywall up for folks coming from the outside. And if you are in the right space to be extra awesome and pay for a subscription, that’s all the more appreciated!