Ok, so in the last post, I was getting into some channel capability flags, and how they impact handling of lost packets. The main conclusion for the purposes of that post was that we need some kind of sequence numbering in every packet, unless we just so happen to switch all reliability flags off.
Saving bits is a noble goal to be sure, especially when you have WiFi-induced low MTU on the path. But I’m not sure the extra branch you have to take for one of the currently nine cases is worth saving them. You know what, let’s leave this decision for later. Personally, I am leaning towards always sending sequence numbers, which can sometimes be ignored.
This post is all about how the capability bits I introduced last time affect packet handling, in particular with respect to send and receive windows.
To start off, let’s remind ourselves of the capability flags again:
Resend — should lost packets be sent again?
Ordering — should received packets be delivered to the application in the same order as they were sent?
Close-on-Loss — if we cannot resend a packet, should we close the connection or just continue?
And here they are, put together into a handy matrix.
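For reference, here is a minimal sketch of the three flags as a bitfield. Everything here is my own illustration: the names (`ChannelCaps` and its members) and the bit positions are assumptions, not part of the protocol.

```python
from enum import IntFlag

class ChannelCaps(IntFlag):
    """Hypothetical encoding of the three capability flags (names assumed)."""
    NONE = 0
    RESEND = 1 << 0         # lost packets are sent again
    ORDERING = 1 << 1       # deliver packets in send order
    CLOSE_ON_LOSS = 1 << 2  # close the channel when a loss becomes permanent

# A TCP-like channel would combine all three:
TCP_LIKE = ChannelCaps.RESEND | ChannelCaps.ORDERING | ChannelCaps.CLOSE_ON_LOSS
```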
Remember, kids, these are channel specific flags. You can have different channels with different sets of flags running in parallel. And that means the majority of this post is actually related to what happens per channel, and on a packet level.
To start with, understand that we will always need a send and receive buffer. Even UDP uses them, if only to provide a temporary place for the application to hand over packets to the protocol implementation, which often lives in-kernel.
With sockets, send and receive buffers can be resized, usually with a setsockopt(3) call. Assume we’ll provide something similar to the application here.
The real question is how to handle the packets in this buffer with these flags present or absent. And of course, in the simplest UDP case, at least send handling is simple: once a packet has been sent to the wire, we can discard it from the buffer. And if the buffer is full? The application can’t write to it. Done.
On the receive side, UDP is a tad more complicated. In principle, packets should stay in the buffer until the application reads them. But what happens when the application reads more slowly than the buffer fills up? Well, strictly speaking this is implementation dependent, but in practice, implementations will often drop packets until there is space again in the buffer. That is, an overfull receive buffer is identical to a lost packet.
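That drop-when-full behaviour can be sketched in a few lines. This is a toy model, not an implementation; the class and method names are mine.

```python
from collections import deque

class ReceiveBuffer:
    """Toy UDP-style receive buffer: packets that arrive while the
    buffer is full are simply dropped, just like a lost packet."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.packets = deque()
        self.dropped = 0

    def on_arrival(self, packet):
        if len(self.packets) >= self.capacity:
            self.dropped += 1  # an overfull buffer equals packet loss
            return False
        self.packets.append(packet)
        return True

    def read(self):
        # The application consumes packets in arrival order.
        return self.packets.popleft() if self.packets else None
```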
To be fair, in TCP we see much the same behaviour. The difference is that in TCP, ZeroWindow packets are returned to the sender to indicate that packets are being dropped at the moment. In fact, every TCP header contains the window size, i.e. how much data the sender of the header is prepared to accept at the moment. Since this amount can fluctuate based on traffic load, advertising it serves as an effective flow control mechanism.
Ok, so now we have a basic understanding of send and receive buffers, we can move on.
When ordering is expected, applications cannot read a packet with sequence number N until they have read the packet with sequence number N-1.
That’s really all that ordering is.
Ordering is important when you want to present your network traffic as contiguous byte streams. Because individual packets can take different paths through the network, they can arrive in a different order than they were sent. The above restriction then means that a packet may have arrived, and sit in the recipient’s receive buffer, but cannot be delivered to the application just yet.
So the difference between an ordered and an unordered receive buffer is that, conceptually at least, an unordered receive buffer is filled in the order that packets arrive, while the ordered receive buffer is filled in sequence number order, leaving gaps for packets that haven’t arrived yet.
In practice, that’s unlikely to be the best way to implement it. But stick with the mental image.
When the application reads from the receive buffer, then, in the unordered buffer it just advances packet by packet. In the ordered receive buffer, it has to stop reading at the gaps. We can imagine an implementation having a read pointer advancing packet by packet, and being stalled at the missing packet.
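The gap-and-read-pointer mental image looks roughly like this in code. Again a sketch under assumed names, and, as said above, not how you would actually implement it.

```python
class OrderedReceiveBuffer:
    """Toy model of an ordered receive buffer: packets are slotted by
    sequence number, and reads stall at the first missing packet."""
    def __init__(self):
        self.slots = {}    # seq -> payload, with gaps where packets are missing
        self.read_ptr = 0  # next sequence number the application may read

    def on_arrival(self, seq, payload):
        self.slots[seq] = payload  # filled in sequence number order, not arrival order

    def read(self):
        if self.read_ptr in self.slots:
            payload = self.slots.pop(self.read_ptr)
            self.read_ptr += 1
            return payload
        return None  # stalled at a gap
```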
Ordering is closely related to the other capability flags.
And the easiest candidate to explain is the close-on-loss flag.
In unordered receive buffers, this flag is a channel killer. If any read attempt was made, and the current packet to read does not have a sequence number one greater than the last actually read packet, we need to close the channel. This could happen for any number of temporary reasons, such as a packet taking a slightly slower path.
Personally, I strongly doubt that this flag makes much sense for unordered channels. My best guess, as I explained in the last post, is that it may find usage as a fast detection method for when a plug was pulled. That would require the protocol to run on a cabled connection that is strictly ordered under normal circumstances. (Got a real use case? Drop me a line!)
In ordered buffers, what it means is that the channel should be closed when a gap is encountered. Again, the reaction seems pretty extreme. But here, the extremity does not come from an inherent lack of order, but from the lack of resends that TCP would offer. TCP, after all, is a protocol with ordering and close-on-loss behaviour.
The reason TCP gets away with close-on-loss behaviour is because it doesn’t close immediately — it first attempts to resend the packet a number of times before accepting the loss as permanent.
But this behaviour is independent of ordering.
In unordered receive buffers, we can still detect whether a packet coming in has a sequence number of one more than the last packet to arrive. If not, we can consider the packets between them either late or lost. In fact, we can assume this whether the receive buffer is ordered or not.
What close-on-loss does when resends are disabled is aggressive. When resends are enabled, it simply provides a decision for what to do when the loss is accepted after several resend attempts.
When it’s on? Behave like TCP and close the channel. And when it’s off? Well, then we’ll try returning the next packet according to our buffer ordering policy.
For unordered buffers, resends and reading can progress in parallel: here, a missing sequence number merely triggers the resend process. If resent, the packet will arrive when it arrives and get appended to the buffer. Eventually, the application can read it.
With ordered buffers that stall reading, the resend mechanism will effectively determine for how long the reading stalls. With close-on-loss off, we will then just skip ahead to the next packet.
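Putting the resend and close-on-loss flags together, the decision at a detected gap can be sketched as a tiny function. This is a toy model; the parameter names and the string results are mine, not protocol constants.

```python
def on_gap(resend_enabled, close_on_loss, attempts, max_attempts):
    """Decide what a channel does when a gap in sequence numbers is found.
    `attempts` counts resend requests already made for the missing packet."""
    if resend_enabled and attempts < max_attempts:
        return "request-resend"  # keep trying; ordered readers stall meanwhile
    if close_on_loss:
        return "close-channel"   # the loss is permanent: behave like TCP
    return "skip-ahead"          # accept the loss, deliver the next packet
```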
So how do we want to encourage a peer to send packets we didn’t receive? Well, TCP’s method is to send acknowledgements in ACK packets, or their more modern counterpart, SACK. The upshot of both of them is the same, however: in TCP, a peer reports the sequence numbers of packets it received. The sender then can remove them from their send buffer.
There are a number of benefits to that mechanism.
First of all, the sending peer does not send packets unnecessarily. Unless explicitly instructed to resend packets (by not having them listed in the acknowledgement), senders can stay quiet. This avoids unnecessary usage of the link, which is great.
Second, the mechanism allows for requesting the same packet over and over if all of the previous attempts to send it failed. It is up to the recipient to decide when enough is enough, and close the connection. Let’s keep that as well.
Third, having packets acknowledged is an immediate instruction that they can be cleared from the send buffer. The alternative is to retire them after some timeout.
There is a downside, though, and it manifests when link quality is bad. Acknowledgement packets can get lost too, and have to be sent time and again until they arrive. Until then, the sender is stalled: it can’t clear its send buffer without receiving acknowledgements.
ACK and SACK packets have slightly different payloads. It’s worth understanding these briefly:
ACK contains a single sequence number, and that is the last sequence number the recipient received in a contiguous stream. If three packets (1, 2, and 3) were sent, but only 1 and 3 arrived, the recipient would send ACK with 1 as the payload, expecting the sender to re-send 2.
Each received packet results in an ACK, though, so the sender would get two ACK(1) in return. It can then deduce that packet 2 is missing, while 3 has arrived.
SACK instead contains blocks of acknowledgements, where each block consists of the beginning and end sequence numbers of a contiguous block received. A single SACK is sufficient for communicating precisely when several packets in a stream are missing.
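Computing SACK-style blocks from the set of received sequence numbers is straightforward; here is one way to do it (function name assumed, for illustration only):

```python
def sack_blocks(received):
    """Condense a set of received sequence numbers into SACK-style
    (first, last) blocks of contiguous ranges."""
    blocks = []
    for seq in sorted(received):
        if blocks and seq == blocks[-1][1] + 1:
            # Extends the current contiguous block.
            blocks[-1] = (blocks[-1][0], seq)
        else:
            # A gap: start a new block.
            blocks.append((seq, seq))
    return blocks
```

A single call captures several gaps at once, which is exactly why one SACK can say what many plain ACKs would have to.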
SACK is definitely an improvement over ACK. The downside of the SACK mechanism manifests when a SACK packet is lost, which can happen with unreliable links.
Consider the following scenario: the sender sends a large number of packets, which we’ll conceptually subdivide into two sequences. Individual packets are lost in either or both sequences, it doesn’t matter much.
The reason for having two sequences is that SACK packets are bounded in length. If a sufficient number of packets has been lost, a single SACK packet may not be enough to acknowledge all received packets from both sequences together. We’ll have to address one sequence in one SACK packet, and the other in another.
Now consider what happens when the first of these SACK packets is lost. The sender received acknowledgements for some packets in the second sequence, and can resend the missing ones just fine. But what about the first sequence? It cannot guess whether the entire sequence was lost, or just some parts. It has to wait for the first SACK to be resent.
Now in TCP where connections strictly deliver byte streams without gaps, this is a fine tradeoff to make. What if the packet loss rate is high — say every second packet is lost? We’ll have a hard time getting the right amount of SACKs through to resolve the situation.
But when you want resends without ordering? Worse, when you are happy with some losses, and don’t want to close the connection aggressively? Not so much.
So we’ll learn a lesson from both of these mechanisms, but we’ll be a bit more frugal, too. We’ll define a MSG_DATA_PROGRESS message with the following payload:
A sequence number much as in ACK packets. Unlike ACK, though, we don’t acknowledge at all that all packets before the given one have arrived. All we’re saying is that we do not care about any earlier packets any longer. This might be because we’ve given up on resends, or because we received them. The exact reason does not have to matter to the sender. This field also helps clear the send buffer.
We’ll also have sequence numbers, similar to those in SACK. However, these are explicit requests for resends rather than acknowledgements.
The rationale for the first field is that it allows some progress messages to be lost. In a TCP-like situation, the field won’t change from message to message until the first missing packet has arrived after all. But in more loss-tolerant settings, we can be more aggressive. Consider a streaming case: any packet that has been consumed by the application will move this field forward, because there is no point in resending prior lost packets any longer.
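To make the two fields concrete, here is a sketch of the message and its effect on the sender. The field names (`progress`, `resend`) and the in-memory representation are my own assumptions; the wire encoding is a separate question discussed below.

```python
from dataclasses import dataclass, field

@dataclass
class MsgDataProgress:
    """Sketch of the MSG_DATA_PROGRESS payload (field names assumed)."""
    progress: int  # we no longer care about this packet or any earlier one
    resend: list = field(default_factory=list)  # explicit resend requests, all > progress

def clear_send_buffer(send_buffer, msg):
    """Sender side: everything up to and including `progress` can be
    dropped (resend requests are always strictly larger)."""
    return {seq: pkt for seq, pkt in send_buffer.items() if seq > msg.progress}
```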
The change in meaning in the second part of the payload comes down to a not-so-often discussed feature of networking equipment: it is fundamentally packet oriented.
When you look at CPUs for embedded routers, a fair few of them contain special instruction sets that allow you to compare multiple memory locations, and do something with each pointer based on the value at their memory location. The reason here lies in packet switching: examining a packet’s destination and determining what to do with it is the fundamental bottleneck of router CPUs, so adding hardware support for performing several such operations in parallel yields a huge performance gain.
What follows from that is that potential network throughput is fundamentally limited by packet processing capacity: CPUs can only process a fixed number of packets per time unit. Any superfluous packet sent on the network thus reduces the number of useful packets that can be sent in the same time.
When the throughput is relatively low, it barely matters whether you acknowledge every packet received or not. There is bandwidth to spare.
But when throughput is high, you don’t really want to keep sending messages that boil down to saying “yep, all is good, continue”. In particular, you don’t want to send multiple messages as in the above SACK example with relatively high packet loss, when one would be enough.
There’s probably a region for TCP-like connections where the SACK mechanism is better. That’s going to be something one can theoretically model, and use mathematical analysis to figure out. I’ll leave that as an exercise for the reader (really, do it, write something up, drop me a line). But for other combinations of capability flags, it’s not going to be as good as this proposed MSG_DATA_PROGRESS.
For what it’s worth, there are a few obvious optimizations in encoding these values, which could yield some advantages. Note, though, that they will disable the use of such CPU features discussed above, so their actual usefulness would have to be determined.
With the first field interpreted as above, all sequence numbers in the second field must be strictly larger than in the first. We can then model the second field as offsets from the first.
Sequence number blocks are wasteful for individual lost packets, and individual resend requests are wasteful when a block of packets is missing. It’ll be possible to encode individual numbers vs. blocks with a single bit prefix, and allow for either to be sent.
Similarly, it would be possible with a single bit prefix to signal whether the sequence number or block is an acknowledgement or a request for resend, allowing us to optimize the message payload for all kinds of combinations of capabilities.
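To make the offset idea concrete, here is a toy encoder for the resend field. It assumes 7-bit offsets from `progress`, a one-bit prefix distinguishing a single sequence number from a block, and a one-byte block length; the actual wire format is undecided, so treat every detail here as a placeholder.

```python
def encode_requests(progress, singles, blocks):
    """Toy encoding of resend requests as offsets from `progress`.
    Bit 7 clear: a single sequence number. Bit 7 set: the start of a
    block, followed by one byte giving the block's length."""
    out = bytearray()
    for seq in singles:
        out.append(seq - progress)           # bit 7 clear: single packet
    for first, last in blocks:
        out.append(0x80 | (first - progress))  # bit 7 set: block start...
        out.append(last - first)               # ...followed by its length
    return bytes(out)
```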
To summarize: we can shift progress forward and communicate with the sending peer precisely what we still want to receive. Since this MSG_DATA_PROGRESS message is channel oriented, it should go without saying that it’ll be sent in the channel it relates to. It does not need its own channel ID payload.
As mentioned earlier, in TCP the recipient manages the throughput from the sender by advertising the size of its receive window — i.e. the unused part of its receive buffer. As you will recall, this can be used as a flow control mechanism: the recipient tells the sender not to empty its send buffer, which also means the application will know that it must stop writing data for the moment.
We’ll revisit congestion control in a later post. For the moment, let’s just agree that advertising the size of the receive buffer is a good thing. There are just a couple of things we want to do differently from TCP.
Firstly, and I know it’s been repeated to death, TCP is Byte stream oriented. That is, writing a packet from the network into the receive buffer and reading from the receive buffer into the application are entirely unrelated with regard to how much data is written and read. It is common enough for applications to, in some situations, read single Bytes from a handle. Sending single Bytes in individual packets, though, would be very wasteful. So you have a fundamental disconnect here.
We do, however, support non-TCP like modes of operation, and those support only full packet reads. I think it’s not too far of a stretch to say we’ll do the same. If an application wants entirely TCP like behaviour, it must then manage a packet-sized buffer itself. That’s easy enough to hide behind an abstraction function; so easy, in fact, that we can provide it in the API. But from a receive buffer perspective, the application layer reads entire packets.
Packet size, as discussed before, is going to be largely static. That is, it is dependent on MTU, and in the absence of path MTU discovery, which assumes such a thing as a managed network path in the first place, we’ll just have to settle on an MTU that is fixed to less than the MTU of the most widely deployed link layer technology. We arrived at ca. 1200 Bytes.
It’s fair, then, to define window size not in terms of Bytes, but in terms of free packet slots in the buffer. Besides being a smaller number to encode, one of the advantages here is that it directly relates to the sender’s ability to send packets or not. In fact, the relationship is so direct that the sender can predict the recipient’s receive window size based on a number of values:
A previously advertised receive window size.
The number of packets sent in the meantime.
The payload of received MSG_DATA_PROGRESS messages.
Such a prediction should be reset with each new window advertisement. But because the prediction is so easy to make, it means we do not have to advertise the receive window very often. We can either do it periodically to reset the peer’s prediction, or can do something clever and only send it when the peer appears to push more packets than we can handle. That is really something to figure out in future.
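The prediction itself is trivial, which is the whole point. A sketch, with all inputs counted in packet slots and the function name assumed:

```python
def predict_receive_window(advertised, sent_since, progress_delta):
    """Predict the peer's current receive window from the last advertised
    window, the packets we sent since that advertisement, and how far the
    peer's MSG_DATA_PROGRESS field has moved forward (i.e. slots freed)."""
    return max(0, advertised - sent_since + progress_delta)
```

A sender would reset `sent_since` and `progress_delta` to zero whenever a fresh window advertisement arrives.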
For now, let’s assume we occasionally send a MSG_DATA_RECEIVE_WINDOW with a single value specifying the number of packets a recipient can receive. Peers receiving this window must, in the absence of smarter things to do, ensure not to send more packets than this window specifies.
In this post, we’ve looked at channel capabilities, and how those translate to send and receive window management. We’ve also discussed two new messages, MSG_DATA_PROGRESS and MSG_DATA_RECEIVE_WINDOW, which help with active channel management, both from a reliability and traffic shaping point of view.
One thing we’ve skipped over is how different channels interact in a connection. Packet loss in one channel can be handled with resends for sure. But packet loss may also be a function of how much traffic the overall connection sees at the moment. It may be necessary to tune the receive window of all channels to respect an overall throughput estimate for the link.
In some sense, these are optimizations and implementation details. A first implementation should not be overly concerned with these, but it certainly will become the topic of future research.
In the meantime, we’re now at the point where we can implement a functioning multi-channel protocol with varying combinations of reliability capabilities.
Next up is how encryption plays into this.
A quick note on subscriptions: I’m trying to keep doing this work, and have set up subscriptions so that people can help me pay my bills. That means some articles are effectively paywalled — if you can’t pay, fear not: there is a special launch promotion open.
You’ll get free lifetime access, and I can keep the paywall up for folks coming from the outside. And if you are in the right space to be extra awesome and pay for a subscription, that’s all the more appreciated!