
HyStart implementation Linux vs. picoquic #1694

Open
joergdeutschmann-i7 opened this issue May 17, 2024 · 7 comments
@joergdeutschmann-i7

We just compared the Linux and picoquic HyStart implementations:
https://github.com/torvalds/linux/blob/ea5f6ad9ad9645733b72ab53a98e719b460d36a6/net/ipv4/tcp_cubic.c#L427-L445

```c
int picoquic_hystart_test(picoquic_min_max_rtt_t* rtt_track, uint64_t rtt_measurement,
    uint64_t packet_time, uint64_t current_time, int is_one_way_delay_enabled)
```

We wondered to what extent the picoquic implementation is aligned with the Linux code.

Two specific questions:

  1. In picoquic there is:

     ```c
     if (rtt_track->rtt_filtered_min > rtt_track->sample_max) {
         rtt_track->rtt_filtered_min = rtt_track->sample_max;
     ```

     Why is rtt_track->rtt_filtered_min compared/set to rtt_track->sample_max and not rtt_track->sample_min?

  2. In Linux, the delay threshold is ca->delay_min >> 3, while in picoquic there is delta_max = rtt_track->rtt_filtered_min / 4. We wondered why these divisors are different.
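For question 2, here is a side-by-side paraphrase of the two exit conditions as we read them (variable and function names are ours, not from either code base; the 4–16 ms clamp is what recent tcp_cubic sources appear to use and may vary across kernel versions):

```c
#include <stdint.h>

static uint64_t clamp_u64(uint64_t v, uint64_t lo, uint64_t hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Linux tcp_cubic style: threshold is delay_min/8, clamped to [4 ms, 16 ms]. */
int linux_style_delay_exit(uint64_t curr_rtt_us, uint64_t delay_min_us)
{
    uint64_t eta = clamp_u64(delay_min_us >> 3, 4000, 16000);
    return curr_rtt_us > delay_min_us + eta;
}

/* picoquic style: threshold is rtt_filtered_min/4, with no clamp. */
int picoquic_style_delay_exit(uint64_t rtt_us, uint64_t rtt_filtered_min_us)
{
    uint64_t delta_max = rtt_filtered_min_us / 4;
    return rtt_us > rtt_filtered_min_us + delta_max;
}
```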

PS: Sorry for posting three issues in a row :-)


huitema commented May 17, 2024

No need to be sorry for posting issues!

About the 'filtered min': the idea there is to remove the delay jitter common on wireless links, see https://www.privateoctopus.com/2019/11/11/implementing-cubic-congestion-control-in-quic/. The filtered min is updated from the recent sample max rather than the recent sample min, which minimizes the effect of downward jitter spikes.

The threshold is a bit arbitrary; neither 1/4 nor 1/8 is based on science. Too low and you get false positives and exit too early; too high and you wait too long and cause queue overflow. I guess what is really needed is to implement Hystart++.
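To illustrate the filtering idea: take the maximum within each small batch of samples, then the minimum across batches, so a single jitter-induced low sample cannot move the baseline. A minimal sketch with a made-up batch size (the real bookkeeping in picoquic_min_max_rtt_t differs in detail):

```c
#include <stdint.h>

#define BATCH_SIZE 4  /* illustrative only; picoquic's batch size differs */

typedef struct {
    uint64_t sample_max;        /* max RTT seen in the current batch */
    int      sample_count;
    uint64_t rtt_filtered_min;  /* min over batches of the per-batch max;
                                   initialize to UINT64_MAX */
} filtered_min_t;

void filtered_min_update(filtered_min_t *f, uint64_t rtt)
{
    if (rtt > f->sample_max) {
        f->sample_max = rtt;
    }
    if (++f->sample_count >= BATCH_SIZE) {
        /* The batch max discards low jitter outliers; the min across
         * batches still converges toward the base RTT. This is why
         * rtt_filtered_min is set from sample_max, not sample_min. */
        if (f->rtt_filtered_min > f->sample_max) {
            f->rtt_filtered_min = f->sample_max;
        }
        f->sample_max = 0;
        f->sample_count = 0;
    }
}
```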


huitema commented May 17, 2024

I have been thinking about this "hystart++" problem for a while. Of course, the first step would be to implement RFC 9406. But if the goal is "proceed safely and find the limit", there may be a bit more research to do. My main concern is that when the node probes for a 25% larger window, it will only see the results after 2 RTTs, by which time another 25% increase will have been started. During that second RTT, we might get a window 50% larger than ideal. That will probably not cause packet losses if the buffers are ready to absorb at least 1 BDP, but it will create queues. Maybe we could lift some ideas from BBR and prevent that.


hfstco commented Jun 6, 2024

> I have been thinking about this "hystart++" problem for a while. Of course, the first step would be to implement RFC 9406.

I have already implemented RFC 9406 in one of my branches (https://github.com/hfstco/picoquic/tree/hystart%2B%2B). I want to compare HyStart and HyStart++ for high-BDP connections.
But I'm a bit busy at the moment, so there is no time to finish, refactor, and test it right now. (Currently it only works for cubic and newreno.) I might be able to finish it in the next few weeks.


huitema commented Aug 5, 2024

@hfstco Any progress? Do you need help?


hfstco commented Aug 6, 2024

"Short" update from my side:

RFC 9406 is currently implemented and works in CUBIC and NEWRENO.
https://github.com/hfstco/picoquic/tree/hystart%2B%2B

But I'm not entirely satisfied with the results.

Short explanation of my code:

cc_common.c:
The cwnd increase function of HyStart++; it returns the cwnd growth depending on SS/CSS.
https://github.com/hfstco/picoquic/blob/96b9691215baf87c5156a460ec9231583f2ab1d3/picoquic/cc_common.c#L227C10-L227C53

A check of whether a transition from CSS to SS, or vice versa, is necessary.
https://github.com/hfstco/picoquic/blob/96b9691215baf87c5156a460ec9231583f2ab1d3/picoquic/cc_common.c#L257
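For reference, the growth rule that RFC 9406 specifies for this function is roughly the following (a sketch; the names are not the ones in the branch):

```c
#include <stdint.h>

#define CSS_GROWTH_DIVISOR 4  /* from RFC 9406 */

/* Per-ACK cwnd growth: full slow start while in SS, divided by
 * CSS_GROWTH_DIVISOR while in Conservative Slow Start. The RFC
 * additionally caps bytes_acked at L * SMSS per ACK. */
uint64_t hystartpp_cwnd_increase(uint64_t bytes_acked, int in_css)
{
    return in_css ? bytes_acked / CSS_GROWTH_DIVISOR : bytes_acked;
}
```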

cubic.c/newreno.c:
I have implemented the whole Hystart++ code in the slow start ACK notification of the CC notification API, because you mentioned in your BBR code that the RTT measurement notification can be removed once the other CC algorithms are updated.

The procedure for every ACK is the following:

  1. Increase cwnd and keep track of the min RTT.
  2. Transition from SS to CSS, or vice versa, if necessary.
  3. Check whether the end of the round has been reached and start a new round.
    https://github.com/hfstco/picoquic/blob/96b9691215baf87c5156a460ec9231583f2ab1d3/picoquic/cubic.c#L250
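For context, the RFC 9406 transition logic behind steps 2 and 3 boils down to a per-round check roughly like this (a simplified sketch: the RFC performs the check on each ACK and also requires at least N_RTT_SAMPLE = 8 RTT samples per round; all names here are made up):

```c
#include <stdint.h>

#define MIN_RTT_THRESH_US   4000  /* 4 ms,  RFC 9406 */
#define MAX_RTT_THRESH_US  16000  /* 16 ms, RFC 9406 */
#define MIN_RTT_DIVISOR        8
#define CSS_ROUNDS             5

typedef struct {
    uint64_t last_round_min_rtt;   /* min RTT of the previous round */
    uint64_t curr_round_min_rtt;   /* min RTT of the current round */
    uint64_t css_baseline_min_rtt; /* min RTT when CSS was entered */
    int      in_css;
    int      css_round_count;
} hpp_round_t;

/* Called at the end of each round. Returns 1 when slow start is over
 * and congestion avoidance should take over. */
int hpp_end_of_round(hpp_round_t *h)
{
    uint64_t thresh = h->last_round_min_rtt / MIN_RTT_DIVISOR;
    if (thresh < MIN_RTT_THRESH_US) thresh = MIN_RTT_THRESH_US;
    if (thresh > MAX_RTT_THRESH_US) thresh = MAX_RTT_THRESH_US;

    if (!h->in_css) {
        /* SS -> CSS: the round's min RTT rose by more than the threshold. */
        if (h->curr_round_min_rtt >= h->last_round_min_rtt + thresh) {
            h->in_css = 1;
            h->css_round_count = 0;
            h->css_baseline_min_rtt = h->curr_round_min_rtt;
        }
    } else if (h->curr_round_min_rtt < h->css_baseline_min_rtt) {
        h->in_css = 0; /* RTT dropped below baseline: spurious exit, resume SS */
    } else if (++h->css_round_count >= CSS_ROUNDS) {
        return 1;      /* five CSS rounds completed: exit to CA */
    }
    return 0;
}
```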

My observations:

These are all just individual tests and my interpretation of them. Unfortunately, I don't currently have the time to run through and evaluate several iterations.

  1. It takes 5 RTTs to go through all 5 rounds of CSS. In all my runs with different latencies, HyStart++ is interrupted by a loss/congestion event. (picoquic simulator, HyStart++, 100 MB transfer size, 50/5 Mbit/s, 300 ms latency, 600 ms buffer, 3.75 MB BDP)
    [Screenshot 2024-08-06 at 19 59 46]
    If the buffer can absorb this, it fills up over time, which naturally also affects the RTTs/CSS phases. (NOTE: The plateau in CA looks strange to me.) (picoquic simulator, HyStart++, 100 MB transfer size, 50/5 Mbit/s, 300 ms latency, 6000 ms buffer, 3.75 MB BDP)
    [Screenshot 2024-08-06 at 17 44 32]
    With HyStart, the BDP is found much faster. (picoquic simulator, HyStart, 100 MB transfer size, 50/5 Mbit/s, 300 ms latency, 6000 ms buffer, 3.75 MB BDP)
    [Screenshot 2024-08-06 at 19 59 24]

  2. Under real-world conditions, HyStart++ performs about the same as HyStart.
    (Starlink, HyStart++, 100 MB transfer size, xx/xx Mbit/s, ~40 ms latency, xx buffer, xx BDP)
    [Screenshot 2024-08-06 at 20 27 51]
    (Starlink, HyStart, 100 MB transfer size, xx/xx Mbit/s, ~40 ms latency, xx buffer, xx BDP)
    [Screenshot 2024-08-06 at 20 28 00]

  3. https://blog.cloudflare.com/cubic-and-hystart-support-in-quiche notes that we should expect fewer losses. The RFC also says:

    > HyStart++ reduces packet loss and retransmissions, and improves goodput in lab measurements and real-world deployments.

    But I cannot confirm this at the moment. Several iterations may give more reliable results.

  4. netem results:
    (netem, HyStart++, 100 MB transfer size, 50/5 Mbit/s, 300 ms latency, 600 ms buffer, 3.75 MB BDP)
    [Screenshot 2024-08-06 at 21 09 33]
    (netem, HyStart, 100 MB transfer size, 50/5 Mbit/s, 300 ms latency, 600 ms buffer, 3.75 MB BDP)
    [Screenshot 2024-08-06 at 21 09 45]

To discuss:

  1. Does it make sense to include a "packet_sent" notification in the API? Several other implementations have a callback/notification for a sent packet. In my opinion, this is not necessary and creates additional processing, but I just want to know your opinion about this. :)

  2. The ACK ratio of picoquic is greater than or equal to 2. In HyStart++ we wait for at least 8 samples in each round before switching from SS to CSS and back. Does an ACK range of 2 packets count as 2 samples? In msquic they also increase the count only once, but I don't know whether the function is called for every ACK range or for every ACK in an ACK range. https://github.com/microsoft/msquic/blob/679a96544d26a2e7401d87bbf1bb75a87a48e712/src/core/cubic.c#L467

  3. What do we expect from HyStart++? In my opinion, it should avoid exiting to CA too early on volatile delay (jitter). Is there a test case in picoquic_t to test this?
    But if it fills up the buffer on connections with a large buffer, I think that is a bad tradeoff (e.g. geostationary satellite connections).

  4. Test cases in general. Which test cases are the most suitable?

OK, enough for today. Maybe this is a good place to start the discussion.
Any feedback and help are highly appreciated!


huitema commented Aug 6, 2024

First, that's great work. Yes, the performance comparison between Hystart and Hystart++ is inconclusive, but that in itself is an interesting result. You should find a way to get this published -- at a minimum an internal report, but maybe in a venue that takes practical papers. CCA? Or one of the SIGCOMM conferences?

My goal with "a better slow start" is to find a way to reach adequate performance quickly, while also avoiding building large queues and causing packet losses. The basic Hystart does a pretty good job here, once we filter the RTT samples to take out the jitter, but all exponential schemes suffer from the lag effect:

[image: lag effect illustration]

Suppose that at time X we have a window that's "just right", but we do not know that yet. The delay measurements up to X + 1 RTT will reflect the old value of the congestion window (or pacing rate), which was less than the "just right" value. No queue is being built yet. The measurements between X+1 and X+2 will reflect the increased pressure on the bottleneck: delay increases, and maybe we get L4S-style ECN/CE marks. The sooner the sender notices that, the better, but in any case the delay increases will be observed until X+2, which probably arrives 3 min-RTTs after X because of the increased queues. At that point, the delay is 2 min-RTTs, and the CE/ECN marks may be coming in at a fast pace. If the bottleneck buffer is not larger than the BDP, packets will be lost.

What we want is to start reducing the window as soon as we detect something wrong. Of course, there is a tradeoff. Stopping soon means reacting to weaker signals than waiting for the delay to climb, so there is always a risk of stopping too soon. That's why I like the idea behind Hystart++: stop increasing quickly, but then do a safe growth phase to perform the needed corrections. What you found out is that the spec in RFC 9406 does not really deliver that.

Reading that RFC, I find it unnecessarily complicated. It uses time constants, divisors, etc. I think a simpler spec would be:

  • Do slow start per spec (CWND += 1 for each packet acknowledged) until a delay increase is noticed. The delay threshold there should probably be a function of the min RTT (what I had in my implementation).
  • On a delay increase, enter a "stabilization" phase. The delay increase means that there are too many packets in the queues. Reduce the send window or the send rate until the number in transit is low enough, then enter the controlled growth phase.
  • In controlled growth, just do a weaker version of slow start, with CWND += 0.25 for each packet acked. The smaller exponent guarantees that we can stop the growth when the queue is just 25% of the BDP, which is cool.
  • If controlled growth grew the window by more than 100%, we may re-enter slow start. Maybe.
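A minimal sketch of that proposal (all names are mine; the stabilization exit condition is only stubbed, and the "grew by more than 100%" test is my reading of the last bullet):

```c
#include <stdint.h>

typedef enum { SS, STABILIZE, CONTROLLED_GROWTH } phase_t;

typedef struct {
    phase_t  phase;
    uint64_t cwnd;          /* bytes */
    uint64_t cg_entry_cwnd; /* cwnd when controlled growth started */
} simple_ss_t;

/* Per-ACK update. The caller is assumed to detect a (filtered) delay
 * increase and to report when the queue has drained again. */
void simple_ss_on_ack(simple_ss_t *s, uint64_t bytes_acked,
                      int delay_increase, int queue_drained)
{
    switch (s->phase) {
    case SS:
        if (delay_increase) {
            s->phase = STABILIZE;      /* too many packets queued */
        } else {
            s->cwnd += bytes_acked;    /* classic slow start */
        }
        break;
    case STABILIZE:
        /* Hold or reduce the window until the number in transit is
         * low enough, then enter controlled growth. */
        if (queue_drained) {
            s->phase = CONTROLLED_GROWTH;
            s->cg_entry_cwnd = s->cwnd;
        }
        break;
    case CONTROLLED_GROWTH:
        s->cwnd += bytes_acked / 4;    /* weaker slow start: CWND += 0.25 */
        if (s->cwnd > 2 * s->cg_entry_cwnd) {
            s->phase = SS;             /* grew >100%: maybe re-enter SS */
        }
        break;
    }
}
```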

There are a couple of signals that I would like to throw in. One is bandwidth measurement (same idea as BBR). In slow start, this is particularly useful to cope with limited senders. The bandwidth measurement lets us compute a "min BDP" as: min RTT times the max bandwidth measurement. In slow start, we should cap the max CWND at 2 times minBDP. In controlled growth, cap at 1.25 times minBDP. BBR adds to that "exit if the bandwidth measurement does not increase for 3 RTTs", but that has interesting interactions with limited senders -- in practice, measuring delays provides an equivalent signal, without the limited-sender noise.

Everything gets easier if we have an idea of the underlying rate or the underlying BDP. In particular, we can exit slow start once the CWND is larger than 80 or 90% of the estimated BDP, and let controlled growth finish the job.
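In code, the capping idea could look like this (hypothetical helpers; the 2x and 1.25x factors are the ones proposed above):

```c
#include <stdint.h>

/* minBDP = min RTT times the max bandwidth measurement. */
uint64_t min_bdp_bytes(uint64_t min_rtt_us, uint64_t max_bw_bytes_per_sec)
{
    return (min_rtt_us * max_bw_bytes_per_sec) / 1000000;
}

/* Cap the window at 2 x minBDP in slow start and at 1.25 x minBDP in
 * controlled growth. */
uint64_t cap_cwnd(uint64_t cwnd, uint64_t min_bdp, int in_controlled_growth)
{
    uint64_t cap = in_controlled_growth ? min_bdp + min_bdp / 4 : 2 * min_bdp;
    return cwnd < cap ? cwnd : cap;
}
```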


huitema commented Aug 6, 2024

On your specific questions:

  1. Does it make sense to include a "packet_sent" notification in the API? I also don't think that's necessary. It is also complicated, because we use UDP GSO and send several packets at a time. The CCA code can access the state variables of the connection, so there is little need for keeping track of packets sent.

  2. The ACK ratio of picoquic is greater than or equal to 2; in HyStart++ we wait for at least 8 samples in each round before switching from SS to CSS and back. That's largely a performance issue: fewer ACKs make for fewer interrupts, etc. There is code that makes the interval smaller during slow start, but then the target becomes 8 ACKs per RTT (see the sketch after this list). But we may need to study whether there are constants in the Hystart++ spec with a baked-in assumption of one ACK every two packets. The RFC explains that they ran without pacing, which means they rely on ACK clocking, and that may explain why they need lots of ACKs.

  3. What do we expect from HyStart++? See previous post.

  4. Test cases in general: we have batteries of tests for Reno and Cubic. They will test Hystart++ if we switch it on. Or we may want to just replicate some of the Cubic tests.
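On the ACK frequency point in question 2, a hypothetical sketch of the "8 ACKs per RTT" target combined with the minimum ACK ratio of 2 (names and formula are mine, not picoquic's). Note that with one RTT sample per ACK frame, this yields about 8 samples per round, which matches HyStart++'s N_RTT_SAMPLE = 8:

```c
#include <stdint.h>

/* Choose how many packets the peer should let pass between ACKs so
 * that we get about 8 ACKs per RTT during slow start, while never
 * ACKing more often than every 2nd packet. */
uint64_t slow_start_ack_gap(uint64_t cwnd_bytes, uint64_t mss)
{
    uint64_t packets_per_rtt = cwnd_bytes / mss;
    uint64_t gap = packets_per_rtt / 8; /* target: ~8 ACKs per round trip */
    return gap < 2 ? 2 : gap;           /* floor at the ACK ratio of 2 */
}
```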
