License: arXiv.org perpetual non-exclusive license
arXiv:2310.18030v2 [cs.NI] 07 Feb 2024
Confucius: Achieving Consistent Low Latency with
Practical Queue Management for Real-Time Communications

Zili Meng, Hong Kong University of Science and Technology, zilim@ust.hk
Nirav Atre, Carnegie Mellon University, natre@cs.cmu.edu
Mingwei Xu, Tsinghua University, xumw@tsinghua.edu.cn
Justine Sherry, Carnegie Mellon University, sherry@cs.cmu.edu
Maria Apostolaki, Princeton University, apostolaki@princeton.edu

Abstract.

Real-time communication applications require consistently low latency, which is often disrupted by latency spikes caused by competing flows, especially Web traffic. We identify the root cause of disruptions in such cases as the mismatch between the abrupt bandwidth allocation adjustment of queue scheduling and gradual congestion window adjustment of congestion control. For example, when a sudden burst of new Web flows arrives, queue schedulers abruptly shift bandwidth away from the existing real-time flow(s). The real-time flow will need several RTTs to converge to the new available bandwidth, during which severe stalls occur. In this paper, we present Confucius, a practical queue management scheme designed for offering real-time traffic with consistently low latency regardless of competing flows. Confucius slows down bandwidth adjustment to match the reaction of congestion control, such that the end host can reduce the sending rate without incurring latency spikes. Importantly, Confucius does not require the collaboration of end-hosts (e.g., labels on packets), nor manual parameter tuning to achieve good performance. Extensive experiments show that Confucius outperforms existing practical queueing schemes by reducing the stall duration by more than 50%, while the competing flows also fairly enjoy on-par performance.

1. Introduction

Real-time (RT) video communications, including a range of applications from video conferencing to cloud gaming and VR/AR streaming, are becoming the dominant traffic on the Internet. These applications require low and consistent latency to maximize the user experience (sigcomm2022zhuge, ).

Significant research has been dedicated to ensuring a satisfactory user experience through minimizing and stabilizing the end-to-end latency. Indeed, congestion control algorithms (CCAs) reduce the queueing delay (ton2017webrtc, ; ray2022sqp, ; nsdi2018copa, ); forward error correction (FEC) improves the loss recovery (nsdi2023tambur, ; nsdi2024hairpin, ); multiple path transport mitigates fluctuation in wireless settings (mm23twinstar, ; sigcomm2023cellfusion, ; sigcomm2023converge, ); while co-design with the video codec (nsdi2023afr, ; nsdi2018salsify, ) and wireless routers (sigcomm2022zhuge, ; mmsys2015macadapt, ) controls the delay in these components. Unfortunately, these works mainly focus on how to mitigate the effect of network fluctuations after the fact, instead of addressing their root cause. As a result, latency fluctuations still routinely occur, causing stalls and deterioration of the performance of the real-time flow (imc2021can, ; imc2022enabling, ).

Figure 1. The scenario where the real-time flow is affected by competing flows. When Web flows join the competition with the real-time flow, the available bandwidth of the real-time flow will be immediately reduced. Note that even loading one Web page can have tens of concurrent active flows.

In this paper, we show that unpredictable flow competition in the network layer can cause drastic network fluctuation, which severely affects real-time flows (§ 2). For instance, loading a single Web page creates nine concurrent Internet connections (on average), sharply reducing the available bandwidth for the competing real-time flows and causing stalls in multiple practical settings such as home routers, as shown in Fig. 1. Congestion control alleviates the issue by reducing the real-time flow’s congestion window or sending rate after end hosts observe latency increases or packet loss, but by then it is already far too late. Indeed, it will take several RTTs for congestion control to react and converge to the new available bandwidth, while the packets sent in excess of the allocated bandwidth during the convergence period will lead to an increase in the end-to-end delay. These endpoint-based optimizations, in general, cannot fundamentally prevent such performance degradation from happening since the onset of competing traffic is unpredictable.

A natural solution to flow competition is to manage the router queue and prevent the available bandwidth of the real-time flow from reducing. There have been works trying to achieve this for several decades. In differentiated services (DiffServ) (rfc4594diffserv, ) (including L4S (rfc9330l4s, )), the router recognizes the pre-defined labels (priorities) from packet headers and schedules packets based on these labels. However, such a design is not incentive-compatible in practice: applications have the incentive to mark their packets with higher priority, which eventually leads to the Tragedy of Commons – routers will not respect the labels, and endpoints cannot count on using them. Another category of solutions on the router is active queue management (AQM), which tries to notify the sender in advance before the queue builds up (cacm2012codel, ; sigcomm1997red, ; sigcomm2022zhuge, ). We demonstrate in § 2 that these are still reactive mechanisms and cannot prevent stalls from happening.

We argue that the root cause for stalls in the real-time flow is the mismatch in reaction time between the bandwidth allocation mechanisms on routers and rate adaptation mechanisms on endpoints. In Fig. 1, when nine new flows triggered by a single website (in yellow) suddenly compete with an existing real-time flow (in blue), the available bandwidth of the real-time flow is immediately reduced to 1/10 of what it was. However, the sender’s congestion window needs several round-trip times (RTTs) to gradually adjust to match the new available bandwidth. During this adjustment period, packets that are sent in excess of the allocated bandwidth will induce congestion, resulting in bufferbloat and stalls. While the Web flows will complete within one or two seconds and relinquish their bandwidth share, the real-time flow will have already experienced significant degradation. We find that existing queue scheduling and management algorithms ignore the transient temporal behaviors during the network change, leading to the stalls. This highlights a critical need for a queue management scheme that takes into account the convergence time of the congestion control to prevent such stalls.

To this end, we designed Confucius, a practical queue management scheme that aims at providing consistently low latency for real-time flows independently of competing flows at the bottleneck. (Footnote 1: The educational philosophy of Confucius, the philosopher, is to teach students according to their essences. In this paper, we serve flows according to their essences.) Instead of abruptly changing bandwidth allocations when a burst of new flows arrives, Confucius gradually adjusts the service rates to give existing flows a few RTTs to detect the change in network conditions and adjust their congestion windows. In this way, fewer packets are sent in excess of the allocated bandwidth, and the latency of the real-time flow can be maintained.

We design Confucius to fulfill three fundamental requirements related to consistency, fairness, and incentive-compatibility (§ 3.1). First, Confucius needs to provide latency consistency to real-time flows independently of the number, rate, or congestion control of the competing flows. Confucius achieves this by offering a theoretical upper bound for the latency fluctuation experienced by real-time flows, which we also validate through experiments. Second, Confucius should eventually be fair. For instance, in Fig. 1, the performance of Web flows should not be sacrificed. To achieve this, Confucius smoothly moves service rates towards the fair allocation within a few RTTs. Finally, Confucius’s classification of real-time flows should be practical and on-router, without relying on end hosts for traffic classification. Unfortunately, the alternative (flow classification algorithms) is usually expensive and sensitive to protocols (hotnets2019inc, ). Confucius classifies flows by how aggressively they occupy the buffer at the bottleneck router, a metric that directly reflects how important low latency is to a flow.

We implement Confucius in both the NS-3 simulator and kernel modules on Linux-based routers. Note that Confucius is designed for last-mile routers (e.g., home routers in Fig. 1) where the competition can lead to congestion, and the computation is more flexible since home routers are mainly Linux-based. (Footnote 2: A measurement shows that 91% of home routers are Linux-based (HomeRouterSecurity, ).) With real-world bandwidth and Web page traces, we show that compared to FqCoDel (Linux’s default) (rfc8290fqcodel, ), Confucius simultaneously reduces the stall duration of the real-time flow by 60%-69% and the loading time of Web pages by 39%-48% for the top 1000 websites. Compared to other baselines that do not require labels from end hosts, Confucius can still effectively reduce the stall duration of the real-time flows by at least 21% (§ 7.2) with negligible computation overhead (§ 7.5). In the meantime, long-lived, bulk transfers experience no degradation at all relative to fair queueing, and the impact on the flow completion time of short Web flows is limited to at most 10% even compared with shortest job first (SJF, which strictly prioritizes the Web flows). We will release all traces and code of this paper.

2. Motivation

We start by describing recent trends that call for consistent low latency (§ 2.1). Next, we explain via an intuitive example why existing solutions fail to achieve consistent low latency under flow competition (§ 2.2, 2.3 and 2.4).

2.1. The rise of real-time traffic

While the Internet has always been shared among multiple applications, the proliferation of real-time communication applications (e.g., videoconferencing, cloud gaming, virtual reality) has made sharing of bottleneck links particularly challenging. Real-time applications require not just low latency but consistently low latency while sending at moderate to high throughputs (ranging from tens to hundreds of Mbps) (sigcomm2022zhuge, ; sigcomm2020tack, ; imc2022measurement, ). For real-time applications, latency consistency is extremely critical to user experiences. For example, a transient increase in latency to 200 ms might cause cloud gaming users to lose (qomex2015cgsteam, ). Therefore, controlling the latency fluctuation and achieving a consistent low latency for real-time applications is essential.

Setting & Scope: This work focuses on end-user access points (e.g., wireless or wired home routers), where it is well-known that congestion and latency fluctuation are frequent (sigcomm2022zhuge, ; ccr2017lastmile, ; imc2020persistent, ). Despite recent advances in wireless technologies such as 5G and WiFi 6, the last-mile access routers are still likely to be the cause of jitter, irrespective of whether the last mile is wired (nsdi2023crab, ) or wireless (sigcomm2020measure5g, ; imc2017fastack, ; sigcomm2022zhuge, ). As most such routers are Linux-based (linux-router, ; HomeRouterSecurity, ), they allow for flexible traffic management in software, which is a great opportunity for innovation. Our experiments and data involve applications used in those settings. We also include a benchmark on a Linux-based router (§ 7.5). Congestion in other settings (e.g., losses in the Internet core (sigcomm2018inferring, ) or datacenters (nsdi2015pias, )) is out of scope for this work.

2.2. Motivating example

To better illustrate the problem and the limitations of existing approaches, we revisit the example of Fig. 1 in detail. Consider a user who is on a video call, and their housemate (with whom they share the home router) decides to load a Web page. Technically, one existing real-time flow on the bottleneck will compete with the new flows from one Web page. We simulate the real-time flow’s delay of each video frame using NS-3 and present the results in Fig. 2 (details in § 6). Before considering other queue management mechanisms, let us focus on the performance of FIFO (square markers).

Figure 2. An existing real-time flow competes with flows of loading the homepage of amazon.com, as shown in Fig. 1. The real-time flow, using GCC (ton2017webrtc, ), always experiences transient stalls during the competition unless flows are pre-labeled by the end host and differentiated by the router.

Before t=0s, the sending rate of the real-time flow has converged, with the video frame delay fluctuating around 60 ms, which is much lower than the stall threshold required by the application (190 ms, the network delay recommended for video chats by the ITU (itu-ddl, )). However, when flows from loading the homepage of amazon.com join the competition on the bottleneck, the end-to-end delay for the real-time flow sharply increases. Using FIFO, the delay goes up to more than 400 ms, and stays above the threshold for almost one second, during which a stall occurs and the user experience is impaired. When using FqCoDel, the delay of the real-time flow is even worse since fair queueing shifts more bandwidth away and CoDel drops more packets. We find that the delay always spikes regardless of the underlying CCA (§ 7.2).

2.3. Root cause analysis

We argue that the delay spike is caused by (i) the burst of flows and packets from the competing Web page; (ii) the abrupt reallocation of the available bandwidth by queue management; and (iii) the gradual reaction from the congestion control. Next, we will elaborate on how these common factors result in performance degradation, and explain the limitations of existing works.

(a) Timeline for amazon.com.
(b) Distributions.
Figure 3. Number of concurrent flows recorded by NetLog (netlog, ). OPEN and IN_USE are socket states marked by Chrome, and ACTIVE means that the flow is receiving bytes in the last 10 ms.

The source of the burst: One Web page triggers multiple, concurrently-active flows. To understand the burst in Fig. 2, we measure the flows triggered by amazon.com over time. Concretely, we measure the number of sockets that are marked OPEN and IN_USE by NetLog (netlog, ) from Chrome. We also measure the number of active flows that receive bytes every 10 ms through packet captures (ACTIVE). As shown in Fig. 3(a), loading only the homepage generates up to 68 flows in total, where up to 12 flows run simultaneously. This is due to the Web design of hosting different objects (e.g., images, videos, ads, scripts) in various domains. Note that this is not due to the parallel connections in HTTP/1.1: we show in Appx. A that more than half of the flows go to different unique IPs.

This triggering of multiple flows to load one page is common across websites. We measured the homepages of the Top 1000 websites in November 2023 from the saved Alexa list and present the distribution in Fig. 3(b). We find that the median number of concurrent ACTIVE flows is 8 while the 90th percentile is 19. The highest one in the Top 200, dailymail.co.uk, has up to 50 active flows and 250 open sockets at the same time. We present the structure and list some famous websites in Appx. A. Moreover, for some websites (e.g., Wikipedia and Google), loading other pages triggers more flows compared to the almost blank home page, which will further exacerbate the degradation experienced by the real-time flow.

(a) FIFO.
(b) FQ.
(c) Weighted class (1:1).
(d) Confucius.
Figure 4. Illustration of how bandwidth shares change over time with incoming Web flows and the existing real-time (RT) flow for different schedulers. The dashed red line marks the fair share.

The cause of the delay spike: Queue schedulers sharply reallocate service rates. Queue management typically reacts to the instantaneous conditions of all flows in the queue. Revisiting our example, when the page loading starts, tens of packets of Web flows immediately arrive at the bottleneck, creating a queue. At the same time, the real-time flow only has a few packets in the queue since it always tries to keep the queue near-empty (nsdi2018copa, ; ton2017webrtc, ). We illustrate the bandwidth share of different queue management schemes in Fig. 4. For FIFO (Fig. 4(a)), the service rates for different flows are proportional to the number of bytes per flow in the queue; thus, the available bandwidth for the real-time flow will be drastically reduced. Fair queueing (FQ, Fig. 4(b)) makes matters worse and allocates even less bandwidth to the real-time flow, since the short Web flows far outnumber the real-time flow. Concretely, in the amazon.com example, 12 new flows joining the fair queueing router will directly reduce the available bandwidth of the real-time flow to 1/13.
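To make the two allocation rules concrete, the following minimal Python sketch computes the real-time flow's share under a byte-proportional approximation of FIFO and under per-flow fair queueing. The queue snapshot (6 KB queued for the real-time flow, 3 KB for each of 12 Web flows) and the function names are illustrative assumptions rather than measurements; the numbers are chosen to reproduce the qualitative case described above, and other snapshots can order the two schedulers differently.

```python
def fifo_share(queued_bytes, flow):
    """Approximate FIFO: a flow's share of the link over a backlogged
    period is its fraction of the bytes sitting in the shared queue."""
    total = sum(queued_bytes.values())
    return queued_bytes[flow] / total


def fq_share(num_backlogged_flows):
    """Per-flow fair queueing: every backlogged flow gets an equal share."""
    return 1.0 / num_backlogged_flows


# Hypothetical snapshot (assumption, not data from the paper):
# one real-time flow with 6 KB queued, 12 Web flows with 3 KB each.
queued = {"rt": 6_000}
queued.update({f"web{i}": 3_000 for i in range(12)})

rt_fifo = fifo_share(queued, "rt")   # 6000 / 42000
rt_fq = fq_share(len(queued))        # 1 / 13, as in the amazon.com example

print(f"FIFO share of RT flow: {rt_fifo:.3f}")
print(f"FQ share of RT flow:   {rt_fq:.3f}")
```

In this snapshot the real-time flow keeps about one seventh of the link under FIFO but only 1/13 under FQ, because FQ divides capacity by flow count regardless of how many bytes each flow has queued.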

Figure 5. When new competing flows join, the service rate of the real-time flow will be immediately reduced, but the CCA takes multiple RTTs to converge.

Such a sharp decrease in the available bandwidth causes a delay spike to the real-time flow. This is because the CCA needs to gradually probe and match its sending rate to the new available bandwidth, which takes several RTTs (dashed green line in Fig. 5). While the number of in-flight packets is converging to the new bandwidth-delay product, the excessive in-flight packets will cause bufferbloat and result in high end-to-end latency for the real-time flow.
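A back-of-the-envelope sketch of this effect is given below. It assumes a deliberately simplified model that is not taken from our simulations: the sender has exactly one bandwidth-delay product in flight when the bandwidth drops, it does not react at all during the first RTT, and the excess packets drain at the new, lower rate. The function name and the example numbers (10 Mbps, 40 ms base RTT, reduction to 1/13) are illustrative assumptions.

```python
def excess_queueing_delay_ms(rate_mbps, base_rtt_ms, reduction_factor):
    """Queueing delay caused by in-flight data that exceeds the new BDP.

    Simplified model (assumption): in-flight data equals the old BDP and
    the sender has not yet reduced its window.
    """
    old_rate_bps = rate_mbps * 1e6
    new_rate_bps = old_rate_bps / reduction_factor
    rtt_s = base_rtt_ms / 1e3

    inflight_bits = old_rate_bps * rtt_s      # old bandwidth-delay product
    new_bdp_bits = new_rate_bps * rtt_s       # what the path can now hold
    excess_bits = inflight_bits - new_bdp_bits

    # The excess sits in the bottleneck queue and drains at the new rate.
    return excess_bits / new_rate_bps * 1e3


delay = excess_queueing_delay_ms(rate_mbps=10, base_rtt_ms=40, reduction_factor=13)
print(f"extra queueing delay: {delay:.0f} ms")
```

Under these assumptions the extra delay simplifies to base RTT × (reduction factor − 1), so it is independent of the absolute link rate and grows linearly with how sharply the share is cut. Real CCAs react during the transient, so this is only an intuition aid, not a prediction of the measured delays.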

Active queue management (AQM) algorithms, which notify the sender about the network conditions by dropping or marking ECN on packets, cannot prevent stalls either. This is mainly because flows driven by different congestion control algorithms (CCAs) have different perceptions of congestion (e.g., delay, loss, rate). Therefore, as shown in Fig. 2, CoDel (cacm2012codel, ) leads to a latency spike even higher than that of FIFO. We observe similar limitations in other AQMs (§ 7.2).

The hard-to-change fact: Congestion control takes longer to converge. As discussed, the issue is that when competing flows join, the available bandwidth for the real-time flow drops immediately, but the end-to-end CCA cannot immediately reduce the in-flight packets to fit the new available bandwidth. End-to-end CCAs do not know how much to reduce and have to reduce step by step.

Some proposals are designed to help the CCA quickly converge to the new available bandwidth, such as XCP (sigcomm2002xcp, ), RCP (infocom2008rcp, ), Kickass (icnp2016kickass, ), and ABC (nsdi2020abc, ). However, these proposals work only when both end hosts and routers collaboratively deploy them, and offer no improvement otherwise. This poses significant barriers to deployment on the Internet (sigcomm2022zhuge, ). Moreover, during the convergence of the CCA, the excessive in-flight packets also inflate the RTT. For most CCAs that use the RTT as their update clock (e.g., adjusting the sending rate every RTT), the update period will, in turn, inflate after the first several packets. In the example in Fig. 2, before the Web flows join, the RTT for the real-time flow is around 40 ms. However, during the competition, the RTT inflates to hundreds of milliseconds. Putting all factors together, we can see that for all baselines that do not require labels, the delay spike of the real-time flow goes up to at least 400 ms.

2.4. Limitations of related works

One line of solution is DiffServ (rfc4594diffserv, ), which labels the flows of interest in advance and schedules them differently on the router using StrictPriority or weighted class as shown in Fig. 2. This also includes the recent proposal L4S (rfc9330l4s, ) which we will later evaluate in § 7.2. While this is deployable in datacenters (sigcomm2016karuna, ), it is not practical on the Internet. End hosts have the incentive to fake their labels if that could help their flows have better performance. It is also challenging to coordinate the end host and router on the Internet in the real world since they usually belong to different entities. Even with perfect labels, achieving optimal performance requires optimal allocation of bandwidth across the different classes of traffic. To understand why this is challenging, consider some canonical solutions. StrictPriority, albeit guaranteeing the latency for the real-time flow, will drastically harm the performance of competing Web flows (§ 7.2). Allocating bandwidth for different classes using pre-defined weights needs accurate estimation of the bandwidth demands from both classes, where inaccurate estimation easily leads to unfairness or latency spikes. For example, if we set the ratio between the real-time flows and Web flows to 1:1, the Web flows will suffer from degraded PLT since they cannot obtain their fair shares (Fig. 4(c)), while 1:5 will lead to the latency spike to the real-time flow as well (Fig. 2).

There are further mechanisms as below, which, unfortunately, still reactively respond to network changes. Zhuge (sigcomm2022zhuge, ) reduces the feedback loop between the router and the endpoint from one RTT to sub-RTT levels, but CCA convergence still requires multiple RTTs (§ 2.3). Using the example in Fig. 5, Zhuge tightens the turning point of the green dashed line, but the dominant contributor to delay – the time it takes for the green dashed line to converge to the blue line – persists. FEC is designed for loss recovery (nsdi2023tambur, ; nsdi2024hairpin, ) and is hardly helpful in our example since most of them have no loss at all. Multipath transport will switch to the new path (sigcomm2023converge, ; sigcomm2023cellfusion, ; nsdi2024augur, ), but this also occurs after the sender observes drastic degradation in the current path. Real-time flows still have to suffer from stalls during the adaptation period. Bandwidth estimations from the wireless link layer and below (sigcomm2022zhuge, ; sigcomm2020pbecc, ) are not effective either since the link capacity does not change in the competition.

3. Confucius Design

Our previous observations motivate Confucius, a practical queue management scheme for achieving consistent and low latency for real-time flows that is designed to work on home routers. We describe Confucius’s design requirements in § 3.1 before we give an overview of Confucius in § 3.2.

3.1. Design Requirements

R1: The performance of the real-time flow should be robust to any competing flows. Confucius stands out among queue management algorithms in that it theoretically guarantees worst-case performance, no matter what congestion control algorithms and competing flows are. This will, in consequence, fundamentally address the root cause of latency fluctuation induced by unpredictable competing traffic. It is easy to vaguely describe Confucius as ‘controlling latency fluctuations’ but it is harder to formulate this into a rigorous service model. We theoretically calculate performance bounds for a few classes of applications that might use Confucius. We demonstrate that with Confucius, real-time flows have a near-constant bound of latency degradation (around 250 ms in § 7.3), no matter how large and how many competing flows join the bottleneck.

R2: Latency consistency should not come at the cost of long-term fairness. Confucius should still follow per-flow fairness in the long run. To do so, Confucius moves rates towards a fair allocation quickly and pushes the blue solid line in Fig. 5 to match the green dashed line. In this case, the latency spike will be controlled and the bandwidth for the competing flows will be largely protected as well. Technically, Confucius adjusts the service rate of flows using exponentially weighted moving average (EWMA) (lucas1990ewma, ), as shown in Fig. 4(d). This allows the CCA to gradually react following the bandwidth share of Confucius, and also converges to the fair share in several RTTs. Note that the RTT is not inflated due to the excessive packets. Our experiments (§ 7.3) and theoretical analysis (§ 4.3) show that such a design can effectively achieve fairness and latency consistency.
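As a generic illustration of how an EWMA moves a bandwidth share smoothly toward the fair allocation, consider the sketch below. It is not the exact update rule used by Confucius (the weight computation is specified in § 4.2); the smoothing factor `alpha`, the step count, and the function names are hypothetical choices made only for this example.

```python
def ewma_step(current, target, alpha):
    """One EWMA update: move a fraction `alpha` of the way to `target`."""
    return (1 - alpha) * current + alpha * target


def share_trajectory(start, target, alpha, steps):
    """Shares after 0..steps EWMA updates, starting from `start`."""
    shares = [start]
    for _ in range(steps):
        shares.append(ewma_step(shares[-1], target, alpha))
    return shares


# Hypothetical setting: the real-time flow owns the whole link (share 1.0)
# and 12 new flows arrive, so the per-flow fair share becomes 1/13.
fair_share = 1 / 13
trajectory = share_trajectory(start=1.0, target=fair_share, alpha=0.5, steps=6)

for step, share in enumerate(trajectory):
    print(f"step {step}: RT share = {share:.3f}")
```

The gap to the fair share shrinks by a factor of (1 − alpha) per update, so the share never jumps, yet it gets within a few percent of the fair allocation after a handful of updates.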

R3: The identification of real-time flows should not rely on end hosts. A naive solution is to split the flows by their age. However, this is not practical since flows driven by different CCAs or having distinct objectives should not share the same queue either. Meanwhile, using FQ to split old flows cannot provide low latency to the bursty flows (macgregor2000deficits, ), which is usually the case for real-time video streaming. Thus, we still need to identify different types of flows. To make Confucius incentive-compatible and deployable in practice, we aim to identify the flows of interest at the router itself, without relying on end hosts. The performance improvements should be directly observed by the router vendor without going through endless coordination between end-host content providers and router vendors in IETF. In § 5, we illustrate how Confucius identifies flows based on their queue occupancy: built on the CCA evolution, real-time flows naturally occupy a small fraction of the buffer (e.g., GCC (ton2017webrtc, ; nsdi2018copa, )), while throughput-oriented flows are observed to be buffer-filling (e.g., Cubic). Confucius uses the queue occupancy to differentiate the flows in the queue.

3.2. Design Overview

Figure 6. Design overview of Confucius. $w_i$ denotes the weight for queue $i$ in the scheduling with DWRR.

At a high level, Confucius classifies flows to queues and strategically assigns a portion of the link capacity to each of them, as illustrated in Fig. 6.

To address goals R1 and R2, Confucius leverages a simple yet powerful insight from § 3.1: Upon the arrival of competitors, the reduction of the available bandwidth of existing flows is inevitable if we want to preserve long-term throughput fairness. Yet, we can gradually and cautiously control the reduction of the available bandwidth during the transient period. Thus, we can eliminate the mismatch between the sending rate of the CCA and the service rate at the bottleneck link for existing real-time flows, thereby taming the latency fluctuation. We elaborate on this insight and the EWMA reweighting mechanism in § 4.

For R3, by grouping flows with similar queue occupancy into the same queue, flows with different queue occupancies will not affect each other. Meanwhile, with a fixed number of queues to schedule between (instead of per-flow queues such as FQ), latency-sensitive flows will have a consistent latency. Thus, Confucius uses a set of queues ($Q_1, Q_2, Q_3$), each designed to accommodate old flows with different buffer occupancies, and a separate queue ($Q_{NEW}$) dedicated to new flows. It then adopts a Deficit-Weighted Round-Robin (DWRR) algorithm to schedule between these queues. Finally, Confucius periodically measures flow characteristics and reclassifies flows using a hysteresis-based mechanism to further increase robustness in practice (§ 5).
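For readers unfamiliar with DWRR, the following is a minimal textbook-style sketch of weighted deficit round-robin between two of the queues named above. It is not the Confucius implementation: the quantum, the weights, the packet sizes, and the function name `dwrr_schedule` are assumptions chosen for illustration.

```python
from collections import deque


def dwrr_schedule(queues, weights, quantum_bytes, rounds):
    """Serve packets (given as sizes in bytes) with deficit-weighted round-robin.

    Each round, a backlogged queue earns weight * quantum bytes of credit
    and sends head-of-line packets while its credit covers them.
    Returns the service order as a list of (queue_name, packet_size).
    """
    deficits = {name: 0.0 for name in queues}
    served = []
    for _ in range(rounds):
        for name, backlog in queues.items():
            if not backlog:
                deficits[name] = 0.0  # idle queues do not bank credit
                continue
            deficits[name] += weights[name] * quantum_bytes
            while backlog and backlog[0] <= deficits[name]:
                size = backlog.popleft()
                deficits[name] -= size
                served.append((name, size))
            if not backlog:
                deficits[name] = 0.0
    return served


# Hypothetical state: an old-flow queue with weight 1 and the new-flow
# queue with a (still small) weight of 0.25; all packets are 1000 bytes.
queues = {"Q1": deque([1000] * 10), "Q_NEW": deque([1000] * 10)}
weights = {"Q1": 1.0, "Q_NEW": 0.25}

order = dwrr_schedule(queues, weights, quantum_bytes=1000, rounds=4)
counts = {name: sum(1 for n, _ in order if n == name) for name in queues}
print(counts)
```

With these weights, the new-flow queue accumulates a quarter of a packet of credit per round and therefore sends one packet for every four sent by the old-flow queue. Raising its weight over time shifts the split without ever changing the scheduling loop itself.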

4. Age-aware Flow Weights Adjustment

In this section, we explain the benefits of exponential bandwidth reallocation (§ 4.1) and dive into Confucius’ weight adjustment (§ 4.2). We then analytically show that it guarantees bounded performance degradation, both for existing real-time flows and newly-arrived competing flows (§ 4.3).

4.1. Exponential bandwidth re-allocation

We first quantitatively demonstrate the advantage of gradually controlling the real-time flow’s bandwidth allocation compared to directly cutting its available bandwidth to its fair share. We measured the stall duration $y$ for the real-time flow in the scenario of a sudden reduction of available bandwidth for four low-latency CCAs (§ 4). Concretely, $y$ denotes the duration for which the end-to-end delay exceeds 190 ms. We plot $y$ as a function of the Available Bandwidth Reduction Factor (ABRF, the factor by which we reduce the available bandwidth) for different CCAs in Fig. 7(a). We find that CCAs respond poorly to sudden, large reductions in bandwidth. For instance, reducing GCC’s available bandwidth to 1/16 of its initial value (i.e., $ABRF = 16$) results in a stall of $y > 10$ seconds. The relationship between the stall duration and ABRF ($y = f_{CCA}(ABRF)$) is super-linear.

(a) Measurements
(b) Illustration
Figure 7. (a) Stall duration increases with the available-bandwidth-reduction factor (ABRF). (b) An illustration of how gently reducing available bandwidth helps reduce delay duration. Note that (a) is a log-log plot but (b) is a log-lin plot.

To avoid such stalls, Confucius gradually reduces the available bandwidth for the real-time flow. For instance, to achieve a final ABRF of 16, we can reduce the available bandwidth four times, each by half. Fig. 7(b) demonstrates, in the ideal case, the value proposition of this approach. Compared to the super-linear stall duration (solid line copied from Fig. 7(a)), exponentially reducing the sending rate will only increase the stall duration logarithmically with the ABRF (modulated by $f_{CCA}(2)$, a small constant).
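The halving argument can be written down in a few lines. The sketch below builds the sequence of bandwidth fractions for a target ABRF and evaluates the idealized stall bound $\log_2(ABRF) \cdot f_{CCA}(2)$. The value used for $f_{CCA}(2)$ (0.1 s) is a placeholder assumption, not a measured number, and the function names are hypothetical.

```python
import math


def halving_schedule(final_abrf):
    """Fractions of the original bandwidth after each gradual step.

    Every step at most halves the bandwidth; the last step lands exactly
    on 1 / final_abrf (it is smaller than a halving if final_abrf is not
    a power of two).
    """
    steps = math.ceil(math.log2(final_abrf))
    return [1 / 2 ** k for k in range(1, steps)] + [1 / final_abrf]


def ideal_gradual_stall(final_abrf, stall_per_halving_s):
    """Idealized total stall: one f_CCA(2)-sized stall per halving."""
    return math.log2(final_abrf) * stall_per_halving_s


schedule = halving_schedule(16)
total_stall = ideal_gradual_stall(16, stall_per_halving_s=0.1)  # placeholder f_CCA(2)

print(schedule)      # four steps ending at 1/16
print(total_stall)   # grows with log2(ABRF), not super-linearly
```

Doubling the final ABRF adds only one more halving step, which is why the idealized stall grows logarithmically rather than following the super-linear curve of an abrupt cut.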

Such a smooth reallocation of available bandwidth allows the CCA to learn the reduced bandwidth allocation, and is also robust to the number or size of competing flows. No matter how many flows compete with the real-time flow, the curve of the available bandwidth of the real-time flow is fixed so the delay will remain the same. Meanwhile, adjusting the bandwidth share exponentially yields fast convergence to the fair share, satisfying requirements R1 and R2 together. We prove in § 4.3, that Confucius guarantees that the long-term fairness will not be impaired, and the degradation of the performance for new flows will always be within a constant, additive factor of the FCT under a strictly fair allocation.

4.2. Adjustment Mechanism

To assign service rates to queues, Confucius uses the following process. For each flow, $f$, Confucius computes a weight, $w_f$, to represent its share of the bandwidth (service rate). Confucius groups new flows into a separate queue called $Q_{new}$ (depicted in Fig. 6). All existing flows which are mapped to other queues are assigned flow weights of $w_f = 1$, and are collectively denoted as set $\mathcal{F}_{ext}$. The flow weights of all flows in $Q_{new}$ are computed as follows:

(1) $w_f = \min\left(\frac{|\mathcal{F}_{ext}|}{|Q_{new}|} \cdot 2^{\lambda t},\ 1\right), \quad f \in Q_{new}$

Then, for a given queue $Q$, the weight is the sum of the weights of all flows in $Q$. There are several considerations in Eq. 1:

Age-aware exponential adjustment ($2^{\lambda t}$). As described in § 3.1, Confucius exponentially increases the weights of new flows; the resulting bandwidth shares are illustrated in Fig. 4(d). Here, $t$ represents the age (in milliseconds) of the new flow, and $\lambda$ is a parameter that controls the speed of the rate adjustment for new flows: their flow weights double every $\frac{1}{\lambda}$ milliseconds. A large $\lambda$ (e.g., $\lambda \to \infty$) leads to abrupt reductions in available bandwidth and causes latency spikes, while a small $\lambda$ (e.g., $\lambda = 0$) results in unfairness for new flows. Consequently, we configure $\lambda$ so that the available bandwidth for the real-time flow drops as fast as possible without outpacing the responsiveness of the underlying CCAs.

Moreover, different CCAs have different response times to congestion. For example, Copa needs 5 RTTs to reduce its sending rate, while BBR's response time is dictated by its probing interval of 6-8 RTTs. To deal with the heterogeneity of CCAs on the Internet (sigmetrics2020gordon, ), we set $\lambda$ as the inverse of the response time of the least responsive CCA among common latency-sensitive CCAs. This ensures that even the least responsive CCA can smoothly react to bandwidth changes. Recall that we measure how different CCAs respond to bandwidth reductions in Fig. 7(a), which shows BBR being the least responsive CCA: when the ABRF is 2, BBR suffers from the longest stall compared with other CCAs due to its long probing period of 6-8 RTTs. Thus, given a typical RTT of 30-50 ms for Web services (www2021wisetrans, ), we set $\lambda = 0.004$ (ms$^{-1}$) to have a doubling interval of $\frac{1}{\lambda} = 250$ ms, matching BBR's probing period. Experiments in § 7.2 demonstrate satisfactory results for not only BBR but also other CCAs.
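As a rough sanity check on this choice (our own arithmetic, not a derivation taken from the analysis), the quoted probing period of 6-8 RTTs and the typical RTT of 30-50 ms bracket the chosen doubling interval. The names $T_{\min}$ and $T_{\max}$ are introduced here for illustration only.

```latex
\begin{align*}
T_{\min} &= 6 \times 30\,\text{ms} = 180\,\text{ms}, &
T_{\max} &= 8 \times 50\,\text{ms} = 400\,\text{ms}, \\
\frac{1}{\lambda} &= \frac{1}{0.004\,\text{ms}^{-1}} = 250\,\text{ms}, &
180\,\text{ms} &\le 250\,\text{ms} \le 400\,\text{ms}.
\end{align*}
```

The text does not state how 250 ms is picked within this range, so the bracket should be read only as a consistency check.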

Initial weight ($\frac{|\mathcal{F}_{ext}|}{|Q_{new}|}$). To allocate a sufficient share to new flows, we scale the initial weight of new flows with the number of existing flows. For each new flow, we set the initial weight to $\frac{|\mathcal{F}_{ext}|}{|Q_{new}|}$, where $|\mathcal{F}_{ext}|$ and $|Q_{new}|$ are the numbers of existing and new flows, respectively. The new flows therefore jointly start with a total weight of at most $|\mathcal{F}_{ext}|$, the same as the existing flows, which keeps the bandwidth reduction for existing flows no more aggressive than a factor-of-2 reduction. In this case, the stall duration can logarithmically scale from $f_{CCA}(2)$, as shown in Fig. 7(b).

Upper bound ($\min(\ldots,\ 1)$). Confucius uses a flow weight threshold of $1$ to 'age out' new flows from the $Q_{new}$ queue. Once the weight of a flow reaches 1, the flow is no longer considered new and is moved to one of the other queues based on the output of the Flow Classifier (§ 5).
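The three considerations above can be combined into a few lines of code. The following is a minimal sketch of Eq. 1, not the kernel implementation; the function names `new_flow_weight` and `age_out_time_ms` are hypothetical and introduced only for illustration.

```python
import math

# Value chosen in the text: weights double every 1/lambda = 250 ms.
LAMBDA_PER_MS = 0.004


def new_flow_weight(n_existing: int, n_new: int, age_ms: float,
                    lam: float = LAMBDA_PER_MS) -> float:
    """Per-flow weight of a flow in Q_new, following Eq. 1."""
    if n_existing <= 0 or n_new <= 0:
        raise ValueError("flow counts must be positive")
    initial = n_existing / n_new                  # |F_ext| / |Q_new|
    grown = initial * 2.0 ** (lam * age_ms)       # age-aware exponential term
    return min(grown, 1.0)                        # upper bound of 1


def age_out_time_ms(n_existing: int, n_new: int,
                    lam: float = LAMBDA_PER_MS) -> float:
    """Age at which the weight first reaches 1 (0 if it already starts there)."""
    if n_existing <= 0 or n_new <= 0:
        raise ValueError("flow counts must be positive")
    return max(0.0, math.log2(n_new / n_existing) / lam)
```

For instance, with one existing flow and eight new flows, each new flow starts at a weight of 1/8, doubles every 250 ms, and ages out of $Q_{new}$ after about three doubling intervals.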

4.3. Theoretical Analysis

We continue with the example from § 2.2. Consider one real-time flow running by itself on a bottleneck link. At $t = 0$, $N$ new flows, each of size $B$, join the same bottleneck link and compete with the existing flow. $B_0$ is the initial congestion window for Web flows. We show that Confucius guarantees a bounded stall for the existing real-time flow while yielding FCTs for Web flows within a constant additive factor of what FQ provides. For simplicity, we summarize the results in Tab. 1 and leave the analytical details to Appx. B.

For FQ and FIFO, we observe that the stall duration ($q^{max}_P$) scales linearly with the number of new flows, $N$, and is therefore unbounded, where $N$ can exceed 100 for some Web pages (Fig. 3). This is quite straightforward: when $N$ flows start to compete with the real-time flow, the available bandwidth of the real-time flow drops to $1/N$. Intuitively, the larger $N$ is, the more the available bandwidth for the real-time flow drops, resulting in drastic delay fluctuation.

| Policy $P$ | $q^{max}_P$ | $T_P - T_{FQ}$ |
|---|---|---|
| FQ | $\approx {\color{red}N}\left(\frac{2}{3}\sqrt{\frac{2}{k}} + q_0 + \tau\right)$ | $0$ |
| FIFO | $\approx \left(\frac{{\color{red}N} B_0}{q_0 C} + 1\right)\left(\frac{2}{3}\sqrt{\frac{2}{k}} + q_0 + \tau\right)$ | $\lessapprox 0$ |
| CBQ | $\approx \frac{2}{3}\sqrt{\frac{2}{k}} + q_0 + \tau$ | $\approx \frac{({\color{red}N} - 1){\color{red}B}}{C}$ |
| Confucius | $\approx 6q_0 + 15\tau + \frac{8\lambda}{k} + \frac{(10q_0 + 15\tau)\lambda^2}{k}$ | $\approx \frac{\log_2 e}{\lambda}$ |

Table 1. Approximations for different schedulers $P$ on their maximum queueing delay ($q^{max}_P$) and FCT degradation against FQ ($T_P - T_{FQ}$). Confucius has a bounded performance degradation for all flows. In the competition, existing schedulers have either unbounded delay or unbounded FCT degradation. The terms that grow without bound as the workload changes ($N$ and $B$) are marked in red.

For class-based queues (CBQ, weighted class), pre-labeling the real-time flow enables the scheduler to allocate the real-time flow with a fixed bandwidth share, resulting in a constant stall. However, if the weights are not accurate (i.e., not matching the traffic ratio), CBQ converges unfairly, and the FCT degradation for new flows becomes unbounded (§ 2.4).

Finally, Confucius yields bounded performance degradation for both sets of flows. On one hand, Confucius ensures that the stall for real-time flows is a constant that depends only on the CCA's latency sensitivity (denoted by $q_0$), the responsiveness ($k$), the feedback loop ($\tau$), and Confucius's decay parameter ($\lambda$). (When using Copa with an RTT of 40 ms, $q^{max}_{\textsf{Confucius}}$ is $\approx$ 640 ms. As we show experimentally in § 7.2, the actual delay using Confucius is much lower.) On the other hand, Confucius also ensures that the FCT degradation for new flows is bounded by an additive constant that depends only on the decay parameter ($\lambda$) and becomes negligible relative to the FCT as flow sizes increase.
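Plugging the value of $\lambda$ configured in § 4.2 into the last row of Tab. 1 gives a feel for the size of the FCT bound. This is our own arithmetic and assumes $\lambda$ is expressed in ms$^{-1}$, so that the result is in milliseconds:

```latex
\[
T_{\textsf{Confucius}} - T_{FQ} \;\approx\; \frac{\log_2 e}{\lambda}
  \;\approx\; \frac{1.4427}{0.004\,\text{ms}^{-1}} \;\approx\; 361\,\text{ms}.
\]
```

The bound contains neither $N$ nor $B$, which is what makes it a constant additive term rather than one that grows with the workload.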

5. Occupancy-aware Flow Classification

As described in § 3.2, Confucius seeks to classify flows into groups, each with a dedicated queue based on how aggressively they consume buffer space. We find that flows implicitly demonstrate their preferences and objectives based on how they utilize the bottleneck queue. We measure the buffer occupancy of 7 CCAs (the top-5 CCAs used in websites (sigmetrics2020gordon, ) plus two recent latency-sensitive CCAs, GCC and Copa), over real-world bandwidth traces (§ 7.1). We further measure the network RTT at the sender and the queue utilization on the bottleneck router. A lower RTT indicates that this CCA is more latency-sensitive. As we can see in Fig. 9, GCC, Copa, and Vegas have a low network RTT. Such CCAs achieve low latency by trying to keep the bottleneck queue as short as they can. Real-time applications can choose these CCAs to achieve low latency. In contrast, throughput-oriented CCAs (Cubic, Yeah, and Illinois) will maximize the queue utilization for high throughput. This allows us to identify the latency sensitivity of flows by their queue occupancy: if one flow has a low queue occupancy at the bottleneck, it indicates that (i) that flow tries to not overutilize the queue; and (ii) that flow can co-exist with other flows with similar behaviors.

In this section, we present our hysteresis-based mechanism to robustly identify the flows (§ 5.1) and our implementation considerations (§ 5.2).

Figure 8. The relationship between queue utilization and delay in different CCAs. Experiments are simulated with real WiFi traces from (sigcomm2022zhuge, ).
Figure 9. Confucius’s hysteresis reclassification mechanism for flows. Only when the buffer occupancy of a flow has significantly deviated from the current class will it be moved to another class.

5.1. Hysteresis-based Adjustment

Confucius puts short flows into a separate queue $Q_{new}$ and classifies long flows with different buffer occupancy aggressiveness into $Q_1, Q_2, \cdots, Q_n$. Queue indices increase with the buffer target, i.e., $Q_1$ will be shorter than $Q_3$, as shown in Fig. 6. Each queue $Q_i$ targets a buffer occupancy of $q_0^{(i)}$. We robustly classify flows as follows:

Classification of new flows. The buffer aggressiveness of a flow may take a long time to manifest. Thus, Confucius does not characterize short flows lasting only a few RTTs (§ 2). When a new flow is ready to be moved out of the new-flow queue $Q_{new}$ to one of the old queues (its weight reaching one, as elaborated in § 4.2), we measure the buffer occupancy of that flow, $q_f$, i.e., the number of packets of this queue that belong to flow $f$. We then find the queue $i$ with the nearest target $q_0^{(i)}$ to accommodate this flow.
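The nearest-target rule can be sketched as below. This is an illustrative helper only: the name `nearest_queue`, the tie-breaking toward the lower-indexed queue, and the example targets in the usage note are assumptions, not values from the paper.

```python
from typing import Sequence


def nearest_queue(q_f: float, targets: Sequence[float]) -> int:
    """Return the 1-based index i of the queue whose target q_0^(i) is closest to q_f.

    `targets` lists q_0^(1), ..., q_0^(n) in the same unit as q_f.
    Ties are resolved toward the lower-indexed (less aggressive) queue,
    which is an assumption made for this sketch.
    """
    if not targets:
        raise ValueError("at least one queue target is required")
    best = min(range(len(targets)), key=lambda i: abs(targets[i] - q_f))
    return best + 1
```

With hypothetical targets of 10, 50, and 90 packets, a flow occupying 12 packets would be placed in $Q_1$ and a flow occupying 85 packets in $Q_3$.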

Periodic adaptation. Confucius periodically examines flows and queues and moves flows accordingly in two steps. While seemingly complex, these operations are well within the capabilities of Linux-based routers (§ 6).

Intra-queue examination identifies outstanding flows among other flows in the current queue. Confucius examines the share of the buffer each flow occupies ($\frac{q_f}{\sum_{g \in Q_i} q_g}$) and its fair share ($\frac{1}{|Q_i|}$). If the buffer occupancy of a flow is larger than its fair share:

(2) $\frac{q_f}{\sum_{g \in Q_i} q_g} \geqslant \frac{1}{|Q_i|} + \alpha$

the flow is too aggressive in the current queue, where $\alpha > 0$ is a hysteresis margin. Confucius will promote that flow from queue $Q_i$ to $Q_{i+1}$ to keep $Q_i$ near its control target. Similarly, a flow with an outstandingly lower buffer occupancy, i.e.:

(3) $\frac{q_f}{\sum_{g \in Q_i} q_g} \leqslant \frac{1}{|Q_i|} - \alpha$

will be demoted from queue $Q_i$ to $Q_{i-1}$. Here we set $\alpha$ to 10% based on our previous observations in Fig. 9. Our evaluation in § 7 shows that the performance of Confucius is not sensitive to the workloads and CCAs.
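The intra-queue check of Eq. 2 and Eq. 3 can be sketched as follows. The function name `intra_queue_moves` and the handling of the first and last queues (no demotion below $Q_1$, no promotion above $Q_n$) are assumptions for illustration; the text does not spell out these boundary cases.

```python
from typing import Dict

ALPHA = 0.10  # hysteresis margin from the text


def intra_queue_moves(occupancy: Dict[str, int], queue_index: int,
                      num_queues: int, alpha: float = ALPHA) -> Dict[str, int]:
    """Map each flow that should move to its destination queue index (1-based).

    `occupancy` holds q_f (packets in Q_i) for every flow f currently in Q_i.
    """
    total = sum(occupancy.values())
    if total == 0:
        return {}                                 # empty queue: nothing to compare
    fair_share = 1.0 / len(occupancy)             # 1 / |Q_i|
    moves: Dict[str, int] = {}
    for flow, q_f in occupancy.items():
        share = q_f / total                       # q_f / sum_g q_g
        if share >= fair_share + alpha and queue_index < num_queues:
            moves[flow] = queue_index + 1         # Eq. 2: promote to Q_{i+1}
        elif share <= fair_share - alpha and queue_index > 1:
            moves[flow] = queue_index - 1         # Eq. 3: demote to Q_{i-1}
    return moves
```

Flows whose share stays within $\alpha$ of the fair share are left where they are, which is what prevents oscillation between neighboring queues.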

Queue-level examination checks whether the length of a queue fits the queue's control target. If the length of a queue leaves the safe region between its own control target and that of a neighbor queue, Confucius moves all flows in the current queue to that neighbor queue, as shown in Fig. 9. This is needed because the intra-queue examination only focuses on the relative occupancy across flows. Thus, it cannot detect the case where the flows in the current queue are similarly aggressive to one another yet collectively more aggressive than the target of this queue. For example, assume that two Cubic flows were previously classified to $Q_1$ (the least aggressive) due to being throttled elsewhere. When these Cubic flows start to be aggressive, Confucius needs to move them to a different queue to protect incoming latency-sensitive flows.
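A possible shape of the queue-level check is sketched below. The text does not define the safe region numerically, so this sketch assumes, purely for illustration, that its boundary is the midpoint between the targets of adjacent queues and that the check is symmetric in both directions. The function name `queue_level_destination` is hypothetical as well.

```python
from typing import Optional, Sequence


def queue_level_destination(i: int, occupancy: float,
                            targets: Sequence[float]) -> Optional[int]:
    """Return the neighbor queue (1-based) that all flows of Q_i should move to, or None.

    `targets` lists q_0^(1) <= ... <= q_0^(n); `occupancy` is the measured
    occupancy of Q_i in the same unit. The midpoint boundary is an assumption.
    """
    n = len(targets)
    if not 1 <= i <= n:
        raise ValueError("queue index out of range")
    if i < n:
        upper = (targets[i - 1] + targets[i]) / 2.0      # boundary toward Q_{i+1}
        if occupancy > upper:
            return i + 1
    if i > 1:
        lower = (targets[i - 2] + targets[i - 1]) / 2.0  # boundary toward Q_{i-1}
        if occupancy < lower:
            return i - 1
    return None
```

In the Cubic example above, two flows sitting in $Q_1$ that jointly push its occupancy past the boundary would all be moved to $Q_2$ in one step, even though neither stands out from the other in the intra-queue check.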

5.2. Design Considerations

In practice, Confucius has the following two design considerations.

Number of queues to set. We observe that the CCAs are concentrated in three clusters (circles in Fig. 9). Concretely, GCC, Copa, and Vegas have a queue occupancy of less than 20%; Cubic, Illinois, and Yeah have a queue occupancy of more than 80%; and BBR's stays in between. Therefore, we set up three queues and use the average queue occupancy of these three clusters as our targets $\{q_0^{(i)}\}$. We expect other CCAs to fall into one of these three representative categories; if not, we can configure Confucius to work with more queues.

Variation of buffer aggressiveness. A flow's buffer aggressiveness can change over time. For example, a Cubic flow throttled/congested elsewhere (on a different router) will not be aggressive in buffer occupancy (although Cubic, the algorithm, would). Such a Cubic flow can share the queue with other delay-sensitive flows. However, when the bottleneck moves to the current router, this Cubic flow becomes aggressive in buffer occupancy and can no longer share the queue. Our reclassification mechanism is capable of correctly moving such flows, as evaluated in § 7.6.

6. Confucius implementation

Implementing Confucius in the Linux kernel poses some challenges. We discuss them and our solutions below.

Order preservation during reclassification. Flows can be moved to another class at runtime, so we need to preserve the packet order of a flow while Confucius reclassifies it. To this end, we adopt a virtual class design in Confucius. During the enqueue process of new packets, we bind each sk_buff to its flow. During the dequeue process, we search all flows that are bound to the selected class and dequeue the packet with the earliest enqueue time. In this way, when moving a flow to another class, we can simply rebind the pointer of the flow from the previous class to the new class.
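The virtual class idea can be illustrated in user space as follows. This is a simplified model rather than the kernel code: the class names `Flow` and `VirtualClassScheduler` are hypothetical, payloads stand in for sk_buffs, and a global sequence number stands in for the enqueue timestamp.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Optional, Tuple


@dataclass
class Flow:
    class_id: int                                   # class the flow is bound to
    packets: Deque[Tuple[int, bytes]] = field(default_factory=deque)


class VirtualClassScheduler:
    """Packets stay in per-flow FIFOs; classes are only labels on flows."""

    def __init__(self) -> None:
        self.flows: Dict[str, Flow] = {}
        self._seq = 0                               # stand-in for enqueue time

    def enqueue(self, flow_id: str, payload: bytes, default_class: int = 0) -> None:
        flow = self.flows.setdefault(flow_id, Flow(class_id=default_class))
        flow.packets.append((self._seq, payload))
        self._seq += 1

    def dequeue(self, class_id: int) -> Optional[bytes]:
        # Among flows bound to this class, serve the earliest-enqueued head packet.
        candidates = [f for f in self.flows.values()
                      if f.class_id == class_id and f.packets]
        if not candidates:
            return None
        oldest = min(candidates, key=lambda f: f.packets[0][0])
        return oldest.packets.popleft()[1]

    def move_flow(self, flow_id: str, new_class: int) -> None:
        # Reclassification only rebinds the flow; queued packets are untouched.
        self.flows[flow_id].class_id = new_class
```

Because packets never leave their per-flow FIFO during a move, a reclassified flow cannot see its packets reordered, at the cost of scanning the flows of a class on each dequeue.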

Reducing computational overhead. To run in the Linux kernel, Confucius must keep its computational overhead low. Specifically, we apply the following two optimizations:

(i) Bit-shifts for exponential operations. Confucius reweights flows based on their ages with an exponential function, yet floating-point calculation in the kernel is expensive. Therefore, we quantize the weight of new flows in units of $\frac{1}{128}$ and use bit shifts for the exponential weight updates, i.e., left-shifting the weight by one bit every $\frac{1}{\lambda}$ milliseconds.
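The fixed-point update can be sketched with integer arithmetic only. The sketch below is written in Python for readability rather than as kernel C. The rounding of the initial weight (floor, clamped to at least one unit) and the helper names are assumptions, since the text only specifies the 1/128 unit and the one-bit shift per interval.

```python
WEIGHT_ONE = 128            # fixed-point value of weight 1.0 (unit of 1/128)
DOUBLING_INTERVAL_MS = 250  # 1/lambda for lambda = 0.004 per ms
MAX_SHIFTS = 7              # 1 << 7 == 128, so more shifts can never matter


def initial_weight_q(n_existing: int, n_new: int) -> int:
    """Quantize |F_ext| / |Q_new| to units of 1/128, clamped to [1, 128]."""
    if n_existing <= 0 or n_new <= 0:
        raise ValueError("flow counts must be positive")
    q = (n_existing * WEIGHT_ONE) // n_new
    return max(1, min(q, WEIGHT_ONE))


def weight_after(initial_q: int, age_ms: int) -> int:
    """Weight after `age_ms`, doubling by one left shift per elapsed interval."""
    shifts = min(age_ms // DOUBLING_INTERVAL_MS, MAX_SHIFTS)
    return min(initial_q << shifts, WEIGHT_ONE)
```

The result is a staircase approximation of $2^{\lambda t}$: the weight is constant within each 250 ms interval and doubles at interval boundaries until it saturates at 128, i.e., a weight of 1.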

(ii) Periodic reweighting and reclassification. Reweighting and reclassification are not necessary for every packet. For reweighting, we only need to update the weight of a flow every $\frac{1}{\lambda}$ milliseconds; with $\lambda = 0.004$, this means reweighting every 250 ms. For reclassification, after moving a flow to a new class, we should observe it for at least one RTT to measure the queue utilization and the behavior of the flow in the new class. Therefore, we also reclassify flows periodically, with the reclassification interval set to 100 ms.

7. Evaluation

We first present our experimental setup (§ 7.1); then we evaluate Confucius by answering the following questions:

  • How does Confucius behave compared to baselines on real-world Web and bandwidth traces? Confucius reduces the stall duration of a real-time flow by 21% to 87% with various CCAs while maintaining comparable FCTs (§ 7.2).

  • How sensitive is Confucius to workloads? Confucius is consistently performant with different sizes and numbers of Web flows, following our theoretical analysis (§ 7.3).

  • How does Confucius scale to multiple flows with different CCAs? We demonstrate that Confucius can correctly separate coexisting flows with different CCAs and provide consistent performance to all of them (§ 7.4).

  • How does Confucius perform in the testbed prototype? We implement Confucius in the Linux kernel and show that Confucius reduces the stall duration by more than 60% with reasonable overhead over real Web traces (§ 7.5).

  • We further show that Confucius can outperform baselines when working with multiple real-time flows, bandwidth-probing CCAs, and different bottlenecks (§ 7.6).

Figure 10. The trade-off between the real-time (RT) flow (stall duration) and Web flows (page loading time) on bandwidth traces C1. The dashed line denotes the Pareto front of baselines that do not require labels from end hosts. We mark baselines in green if they rely on labels from end hosts, and in blue if not. We change the CCA that the real-time flow uses in different subfigures: (a) Copa; (b) GCC; (c) BBR.

7.1. Experiment Setup

Ns-3 setup. In § 7.2, 7.3 and 7.4, we evaluate the performance of Confucius with ns-3.34. We use the example in Fig. 1 and limit the capacity of the bottleneck link based on the bandwidth traces from (sigcomm2022zhuge, ). The dataset contains 3 sets of cellular traces (C1: mixed; C2: 4G; C3: 5G), and 2 sets of WiFi traces (W1: Office; W2: Restaurant), where the average bandwidth ranges from 22 Mbps to 375 Mbps (details in § C.1). The round-trip propagation delay is set to 40 ms based on (sigcomm2022zhuge, ). We adopt the RTC library in ns-3 from (nsdi2024hairpin, ; sigcomm2022zhuge, ). We evaluate the real-time flow with different delay-sensitive CCAs, including Copa (nsdi2018copa, ) (used by Meta Live (flive-quic, )), GCC (ton2017webrtc, ) (used by WebRTC), and BBR (queue2016bbr, ). The Web flows use the most widely deployed CCA (sigmetrics2020gordon, ), Cubic (sigops2008cubic, ).

Web traces. To compose a realistic and relevant dataset of Web traffic, we collected the Alexa Top-1000 websites (alexa-top1k, ). (Although the Alexa Top website list has been deprecated, we still use this list since it is the most well-known list of top websites.) We use selenium (selenium, ) to automatically load Web pages through Google Chrome (version 120.0.6099.218), and use NetLog (netlog, ) and browsermobproxy (browsermob, ) to record the packets and socket states. The measurement was run in November 2023, with the distribution shown in Fig. 3. The version of HTTP is negotiated with the website, where the majority is HTTP/1.1 and HTTP/3.0. We show in Appx. A that although the parallel connections of HTTP/1.1 contribute to the concurrency, the majority of concurrency still comes from the diverse objects on the Web page. We replay the Web traces to test a variety of scenarios.

Baselines. We compare Confucius with multiple schedulers, categorized and listed below. We use the default parameters in the Linux kernel 4.4.0 and ns-3.34 for baselines.

Do not require labels (note that Confucius does not require labels either):

  • (1) FIFO and (2) FQ+CoDel (rfc8290fqcodel, ), the default qdiscs in Linux (before and after systemd v217 (systemd-qdisc, )).

  • (3) FQ+FIFO is fair queueing without the AQM.

  • (4) CoDel (cacm2012codel, ) and (5) RED (ton1993red, ) drop packets before the queue overflows to notify the sender.

  • (6) SJF (shortest job first) prioritizes short flows over long flows, which is exactly opposite to what Confucius tries to do. We take the implementation from PIAS (nsdi2015pias, ).

  • (7) HHF (sigcomm2002hhf, ) (heavy-hitter filter) differentiates between small flows and heavy-hitters and schedules them separately.

Require labels:

  • (8) DualQ (rfc9332dualq, ) is a recently proposed scheduler in L4S (rfc9330l4s, ) that protects latency-sensitive flows with labels, using the DSCP bits to identify the traffic and notify the sender.

  • (9) DualQ+Prague. DualQ provides ECN signals and works best with TCP Prague (briscoe-iccrg-prague-congestion-control-03, ) in the L4S framework by design. We adopt the implementation from (prague-ns3, ).

  • (10) CBQ (1:1) and (11) CBQ (1:5) are the weighted class-based queues, which put flows into different classes based on application labels. We set the weights for two classes (RT:Web) to 1:1 and 1:5 and evaluate each setting.

  • (12) StrictPriority strictly prioritizes traffic from real-time flows if they are labeled accordingly.

Metrics. We focus on the following metrics in experiments.

  • Stall duration for video frames is the duration for which the delay of the video frame is greater than 190 ms. This reflects users’ experiences on video stalls (itu-ddl, ; sigcomm2022zhuge, ; infocom2022dams, ). We use this metric to evaluate how the RT flow is affected.

  • Page Load Time (PLT) is the time till the last HTTP request in a web page is completed. We use this metric to evaluate the performance of web traffic. PLT degradation refers to the increase of delay compared to FQ.

Besides, we also evaluate other metrics in different experiments, which we will elaborate on accordingly.
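The stall-duration metric above can be sketched as follows. The exact accounting used by our tooling is not spelled out in this section, so the sketch assumes frames are generated at a fixed interval and charges one frame interval to every frame whose delay exceeds the threshold. The function name and the `frame_interval_ms` parameter are hypothetical.

```python
from typing import Iterable

STALL_THRESHOLD_MS = 190.0  # frame-delay threshold used in the text


def stall_duration_ms(frame_delays_ms: Iterable[float],
                      frame_interval_ms: float,
                      threshold_ms: float = STALL_THRESHOLD_MS) -> float:
    """Total time covered by frames whose delay is strictly above the threshold."""
    if frame_interval_ms <= 0:
        raise ValueError("frame interval must be positive")
    stalled_frames = sum(1 for d in frame_delays_ms if d > threshold_ms)
    return stalled_frames * frame_interval_ms
```

Under this assumption, a 30 fps stream in which three frames exceed 190 ms accumulates roughly 100 ms of stall.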

7.2. Confucius under a realistic workload

Simulation scenario. We have a long-running real-time flow from the RTC module in ns-3. We then randomly select a website and replay the traces we collected to compete with the real-time flow. We set the interval of loading two websites to 53 seconds, which is the average Web page viewing time (page-stay, ). In each run, we measure the duration where the frame delay of the video flow is larger than 190 ms (stall). We also measure the Web PLT for websites. We repeat the experiment for three CCAs for the real-time flow. In this subsection, we present the results over C1 bandwidth traces and leave others to Appx. C.1. We present the average PLTs and stall durations in Fig. 10, and dive into distributions later.

Confucius strikes a balance between video and web performance that is consistent across CCAs. In Fig. 10(a), we observe that schedulers not relying on labels from the end host (marked in blue) suffer from long video stalls. For example, when the real-time flow adopts Copa, with FQ+CoDel or FIFO the real-time flow experiences a stall of 200 ms to 300 ms on average when loading different websites. AQMs such as CoDel further impair the performance of the Web flows since AQMs drop packets of those flows. In contrast, Confucius reduces the stall duration to less than 100 ms, an improvement of more than half over all other baselines not requiring labels.

Schedulers requiring labels (marked in green) protect prelabeled RT flows, but considerably degrade the PLT for the Web traffic. DualQ+Prague improves on DualQ since the CCA on the end host can react more effectively, but it still incurs a considerable penalty on Web flows since it causes packet drops for the Web flows (treated as non-scalable traffic under DualQ's design). Note that even when using StrictPriority, the real-time flow still suffers from 60 ms of degradation on average due to bandwidth fluctuation. Confucius is almost on par with schemes relying on end-host labels in terms of the delay of the real-time flows. Remember that it is unrealistic to assume that an end host will correctly label all traffic (§ 2.4).

Most importantly, Confucius does not incur too much penalty on the PLT of Web pages, and pushes the Pareto front of the schedulers not requiring labels (the dashed blue line) forward. Confucius reduces the PLT compared to CoDel, RED, FQ+CoDel, and HHF since Confucius gently adjusts the bandwidth share for the Web flows. Even compared to FQ+FIFO, Confucius only increases the average PLT by up to 8% (up to 88 ms) in three subfigures, which is much smaller than the improvement on the stall duration. The improvement when using BBR is not as significant as the other two CCAs – this is due to the suboptimal performance of BBR in controlling the delay for the real-time flow, as we also saw in Fig. 9.

Figure 11. The number of runs, out of 1000 websites, that lead to the degradation of the real-time and Web flows (lower is better): (a) number of websites that lead to non-zero stalls for the RT flow; (b) number of websites that suffer from a PLT longer than 2 seconds. Fig. 11(a) cuts Fig. 12(a) at 190 ms.

Confucius protects the real-time flow when competing with traffic from 86% of the websites. We further break down the results in Fig. 10(a) by website in Fig. 11. In Fig. 11(a), we present the number of websites that cause a stall for the real-time flow, i.e., whose traffic pushes the frame delay above 190 ms. The lower the number, the better the scheduler performs. When using Confucius, the real-time flow suffers from stalls when competing with only 136 out of 1000 websites. However, for all other baselines that do not require labels, the number ranges from 288 websites (FIFO) to 537 websites (SJF). Confucius reduces the number by 53%-75%. Even for baselines requiring labels, excluding CBQ (1:5), the number ranges from 101 to 143. This further shows that Confucius can protect the real-time flow comparably to, and sometimes even better than, label-based solutions.
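The percentages in this paragraph follow directly from the website counts quoted above; the arithmetic is spelled out here for convenience:

```latex
\begin{align*}
\text{websites without stall (Confucius)} &= 1 - \tfrac{136}{1000} = 86.4\%, \\
\text{reduction vs. FIFO} &= 1 - \tfrac{136}{288} \approx 52.8\%, \\
\text{reduction vs. SJF}  &= 1 - \tfrac{136}{537} \approx 74.7\%.
\end{align*}
```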

We also measure the number of websites suffering from a PLT longer than 2 seconds, which is the threshold for good user experience (sigcomm2019e2e, ). When using FQ+CoDel, 227 websites suffer from a long PLT, while Confucius reduces this number by about 44% to 127. Confucius's results are comparable to FQ+FIFO, demonstrating the fairness of Confucius for competing Web flows. For baselines requiring labels that behave well in Fig. 11(a), at least 198 websites suffer from long PLTs. The closest label-based baseline is CBQ (1:1), which we can also see from Fig. 10. Besides the unrealistic label requirement, we will later show in § 7.3 that CBQ (1:1) does not scale to the variation of workloads. Fig. 10 evaluates with Web flows, but if the competing flows are FTP flows instead, the performance will degrade drastically since CBQ (1:1) adopts a fixed ratio between classes.

(a) Stall duration of the RT flow.
(b) PLT degradation of Web flows, compared to FQ.
(c) The max frame delay of the RT flow when Web flows arrive.
(d) The delay distribution of all RT packets during the competition.
Figure 12. The distributions in Fig. 10(a). The legend is the same as Fig. 10. We present a part of the baselines for simplicity.
(a) The frame delay of different flows.
(b) The classification results in different time.
(c) The JFIs and delays among baselines.
Figure 13. Four flows with different CCAs (Cubic, BBR, Copa, and GCC) run in the same bottleneck router. We present the frame delay and classification results of these flows when using Confucius over time in Fig. 13(a) and 13(b). We also compare the fairness (JFI) and the delay of latency-sensitive flows (Copa and GCC) of Confucius and baselines in Fig. 13(c).

Confucius controls the delay and PLT following the theoretical analysis. Fig. 12(a) further presents the distribution of stall duration when the video flow encounters Web flows from different websites in the dataset. With FQ+CoDel or FIFO, the stall for the real-time flow lasts longer than 500 ms for 12% (FIFO) to 18% (FQ+CoDel) of websites, whereas the number for Confucius is 1%. Moreover, with Confucius, the real-time flow does not experience any stall when competing with 95% of the websites, comparable to CBQ. Importantly, in addition to the PLT measured in Fig. 11(b), Confucius does not over-penalize Web traffic: 60% of websites do not suffer any penalty relative to FQ, as shown in Fig. 12(b), which mostly corroborates our previous theoretical analysis. We further present the distribution of the maximum delay experienced by the real-time flow in Fig. 12(c). The fraction of cases with a maximum delay above 500 ms is 1% with Confucius, versus 5% and 18% with FIFO and FQ+CoDel, respectively. This further demonstrates that Confucius controls latency fluctuation not only in terms of stall duration but also directly in terms of raw delay. The breakdown of the network delay of all packets of the real-time flow in Fig. 12(d) corroborates this as well. The results when using GCC and BBR are similar.

7.3. Confucius under workload changes

(a) Stall duration of the existing real-time flow.
(b) PLT degradation of Web (new) flows against FQ.
Figure 14. Performance consistency in workloads with different numbers of Web flows, each flow with a size of 15KB.
(a) Stall duration of the existing real-time flow.
(b) PLT degradation of Web (new) flows against FQ.
Figure 15. Performance consistency in workloads with different sizes of Web flows, each experiment having 5 flows. The dashed lines are the theoretical bounds from Tab. 1.

In this subsection, we test our theoretical analysis by investigating whether Confucius can provide consistent performance in controlled workloads. We vary the workload by changing the number of flows in a Web page and the size of Web flows. We measure the stall duration in different scenarios and the degradation on the PLT against FQ.

Confucius is bounded by theoretical thresholds, confirming our analysis. We vary the number of Web flows from 5 to 100, each with a size of 15KB (the median flow size in our measurement), and summarize our results in Fig. 14(a). The stall duration for FQ and FIFO increases with the number of flows: when the number of Web flows reaches 60, the real-time flow experiences a stall of more than half a second. In contrast, Confucius maintains zero stall in this setting, similar to CBQ. We further compare the experimental results with our previous analysis in § 4.3. As the dashed line in Fig. 14 shows, the experimental results corroborate our theoretical analysis for Confucius in Tab. 1.

We further change the size of Web flows (from short flows to long flows) to see whether Confucius is capable of handling all types of competing traffic. As the flow size increases, the competing flows change from short flows (e.g., Web) to long flows (e.g., FTP). We vary the size of Web flows from 15KB to 9MB, and run 5 flows of the same size to compete with the HRT flow. When using FIFO, the real-time flow suffers from drastic stalls because FIFO fails to provide inter-CCA fairness across flows, as shown in Fig. 15(a). The real-time flow using FQ also has a long stall of hundreds of milliseconds. In contrast, Confucius still achieves negligible stall for the real-time flow and bounded PLT degradation for Web flows at the same time.

7.4. Heterogeneous Flow Classification

In this subsection, we zoom in on Confucius’s flow classification mechanism. We find that Confucius can accurately group flows of the same/similar CCA together without any prior knowledge or labels from end hosts, which in turn leads to better performance compared to the baselines.

We simultaneously run four long flows of different CCAs: one Cubic flow, one BBR flow, one GCC flow, and one Copa flow for 100 seconds. We plot the network delay for each flow over time in Fig. 13(a). We can clearly see that Copa and GCC enjoy a consistent low latency around 40-60 ms, even when they are competing with BBR and Cubic flows.

To understand Confucius’s superior performance, we look at its classification results over time and present them in Fig. 13(b). Four bars represent the classification results of Confucius for the four flows over time, while three colors indicate which queue the flow is classified into. Confucius correctly classifies flows using different CCAs into the appropriate queues: the Copa and GCC flows are stably put into the low occupancy queue ($Q_1$, blue), the BBR flow into the medium occupancy queue ($Q_2$, yellow), and the Cubic flow into the high occupancy queue ($Q_3$, green). This follows our previous observation in Fig. 9: Copa and GCC both demonstrate similarly low buffer occupancy, Cubic occupies the buffer aggressively, and BBR lies in the middle. Moreover, we notice that the Cubic flow can temporarily be in the same queue as BBR, as shown by the yellow lines in the green bar in Fig. 13(b). This is expected, as the Cubic flow has (at times) a low queue occupancy in its probing period. In addition, flows with different CCAs can co-exist in the same queue as long as they have similar buffer occupancy. In this experiment, the Copa and GCC flows are put into the same queue since they have similar buffer occupancy. As we can see in Fig. 13(a), these two flows still have consistently low latency throughout.

We also measure Jain’s fairness index (JFI) to assess fairness under different schemes. Fig. 13(c) compares the delay of the Copa and GCC flows and the JFI of all flows in the same experiment across schedulers. With Confucius, the Copa and GCC flows enjoy a reasonably fair share of the bandwidth, similar to FQ: the JFI in this experiment is 0.98, where a JFI closer to one indicates better fairness.
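For readers unfamiliar with the metric, the following minimal Python sketch computes Jain's fairness index using its standard definition, (Σx)² / (n · Σx²), over per-flow throughputs. The function name and the sample inputs are illustrative assumptions, not values from our experiments.

```python
def jain_fairness_index(throughputs):
    """Return Jain's fairness index, (sum x)^2 / (n * sum x^2).

    The index is 1.0 when all flows get an equal share and 1/n when a
    single flow takes all the bandwidth.
    """
    n = len(throughputs)
    sum_of_squares = sum(x * x for x in throughputs)
    if n == 0 or sum_of_squares == 0:
        raise ValueError("need at least one flow with non-zero throughput")
    total = sum(throughputs)
    return (total * total) / (n * sum_of_squares)


# Hypothetical per-flow throughputs (arbitrary units) for four flows.
equal_share = jain_fairness_index([5, 5, 5, 5])
single_winner = jain_fairness_index([1, 0, 0, 0])
print(equal_share, single_winner)
```

With four flows, the index ranges from 0.25 (one flow takes everything) to 1.0 (perfectly equal shares), which puts the measured 0.98 in context.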

7.5. Testbed Experiments

(a) The real-time flow’s stall vs. Web flows’ load time.
(b) Processing time for each packet. Axes are log-scaled.
Figure 16. Results over our Linux kernel-based testbed.

We implement Confucius as a queueing discipline (qdisc) kernel module for traffic control in Linux kernel 4.4.0 (1.4k LoC) and evaluate its performance on a software router with an Intel Xeon E5-2620 v4 CPU. We run the official implementation of Copa (copa-ccp, ). We find that Confucius achieves significant benefits in the kernel implementation while adding only marginal processing delay.

We stream the video frames using socket and TCP Copa, and measure the end-to-end delay for the real-time flow. We then set up an HTTP server based on Python to replay the Web traces we collected. We also measure the computational overhead of Confucius and the baselines. We record the processing time for the enqueue and dequeue operation in Linux tc using printk, where the reweight and reclassification in Confucius are both implemented.

As shown in Fig. 16(a), Confucius reduces the stall duration by more than 60% without the need for labels on each packet. This result is similar to our simulation in Fig. 10(a). Moreover, in our experiments with Confucius, the real-time flow does not suffer from any stall when competing with 86% of websites. In contrast, this number is only 56% and 30% for FIFO and FQ, respectively.

We vary the number of long-running flows to measure the overhead of Confucius. Note that the processing time of Confucius is insensitive to the number of short flows, as they all belong to the new-flow queue. As shown in Fig. 16(b), Confucius slightly increases the processing time for each packet compared to FQ. Even for 100 concurrent long-running flows, the per-packet processing time of Confucius is still 5 μs, which corresponds to a processing rate of 200 kpps and is of the same order of magnitude as FQ. Since Confucius is mainly designed for last-mile routers such as home routers, this can satisfy the daily usage of home access points or last-mile routers. We stress that the kernel implementation of Confucius can be further optimized for high-performance execution in the future. We leave further exploration of Confucius with a large number of flows (e.g., in routers in the core network) to future work.
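To make the conversion from per-packet processing time to packet rate explicit, the sketch below redoes the arithmetic in Python. The 1500-byte packet size used for the throughput estimate is purely an assumption for illustration and is not a measured value; the estimate also assumes packets are processed back to back.

```python
# Per-packet processing time reported in the text for 100 long-running flows.
PER_PACKET_NS = 5_000  # 5 microseconds, expressed in nanoseconds

# Assumption (not from the measurements): full-size 1500-byte packets.
ASSUMED_PACKET_BYTES = 1500

# Integer arithmetic avoids floating-point rounding in the rate computation.
packets_per_second = 1_000_000_000 // PER_PACKET_NS
throughput_bps = packets_per_second * ASSUMED_PACKET_BYTES * 8

print(f"{packets_per_second // 1000} kpps")
print(f"{throughput_bps / 1e9:.1f} Gbps at {ASSUMED_PACKET_BYTES}-byte packets")
```

Under these assumptions, 5 μs per packet corresponds to 200 kpps, or roughly 2.4 Gbps for full-size packets.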

7.6. Microbenchmarks

We further evaluate the performance of Confucius in a series of microbenchmarking settings. In Appx. C.3, we demonstrate that the hysteresis mechanism of Confucius (§ 5.1) can work with bandwidth-probing CCAs (e.g., BBR) and stably and correctly classify flows. We further show that Confucius does not have side effects on fairness in Appx. C.2, and investigate the case where the bottleneck is not the router where Confucius is deployed in Appx. C.4. Even if multiple real-time flows are competing simultaneously, Confucius can still handle those flows and provide significant performance improvements over baselines (Appx. C.5).

8. Limitations

We discuss some other related work besides § 2.3 in Appx. D, and outline some limitations of Confucius here.

Applications using latency-sensitive CCAs. In this paper, we assume that real-time applications will use latency-sensitive CCAs. For example, video conferencing applications will use CCAs such as GCC and Copa but not Cubic (sigcomm2022zhuge, ). This generally holds since application operators optimize towards their goals, and low latency is clearly the goal of real-time applications. If an application does not follow this assumption, its CCA itself is still problematic, and the effect of Confucius will be limited.

Web flow characteristics for mobile applications. The measurements of Web flows in § 2 and Appx. A are obtained by loading Web pages from a desktop browser (Google Chrome). We do not conduct measurements over mobile Apps due to device limitations. However, the root cause contributing to the bursts still exists in mobile Apps: one page contains diverse objects (images, videos, scripts) from different domains. We leave the investigation of Web pages on mobile Apps to future work.

Confucius scales to core routers. We mainly discuss the bottleneck at the edge routers in this paper. This is because when the network delay increases, it is more likely to happen at the edge (sigcomm2022zhuge, ). In core routers, due to the high line rate, 100 new flows is not a large number and will likely not result in drastic fluctuation of the available bandwidth. Meanwhile, tracking per-flow state, as FQ does, is already burdensome at core routers and is therefore out of the scope of Confucius. Nevertheless, encouraged by recent sophisticated AQM on high-performance switches (sigcomm2022abm, ), it might be possible to extend Confucius to core routers.

9. Conclusion

We propose Confucius, the first queue management scheme to protect real-time flows under flow competition without requiring any labels from end hosts. Confucius achieves this by grouping flows based on their latency preferences, which it infers by observing their buffer occupancy over time. Confucius gradually adjusts the service rates of flows to match the reaction of congestion control. Doing so allows Confucius to mitigate latency spikes of real-time flows. Extensive evaluation shows that Confucius protects the real-time flows from stalls when competing with 86% of websites, almost doubling over numerous baselines.

This work does not raise any ethical issues.

References

  • [1] Gsoc2020prague - nsnam. https://www.nsnam.org/wiki/GSOC2020Prague.
  • [2] lightbody/browsermob-proxy: A free utility to help web developers watch and manipulate network traffic from their ajax applications. https://github.com/lightbody/browsermob-proxy.
  • [3] Netlog: Chrome’s network logging system. https://www.chromium.org/developers/design-documents/network-stack/netlog/.
  • [4] [systemd-devel] [announce] systemd 217. https://lists.freedesktop.org/archives/systemd-devel/2014-October/024662.html#:~:text=The%20default%20sysctl.d/%20snippets%20will%20now%20set%3A, 2014.
  • [5] Alexa Top Websites >> ExpiredDomains.net. https://member.expireddomains.net/domains/researchalexamillion/, 2022.
  • [6] Selenium. https://www.selenium.dev/, 2022.
  • [7] Vamsi Addanki, Maria Apostolaki, Manya Ghobadi, Stefan Schmid, and Laurent Vanbever. Abm: Active buffer management in datacenters. In Proc. ACM SIGCOMM, 2022.
  • [8] Mohammad Alizadeh, Abdul Kabbani, Berk Atikoglu, and Balaji Prabhakar. Stability analysis of qcn: the averaging principle. In Proc. ACM SIGMETRICS, 2011.
  • [9] Mohammad Alizadeh, Shuang Yang, Milad Sharif, Sachin Katti, Nick McKeown, Balaji Prabhakar, and Scott Shenker. pfabric: Minimal near-optimal datacenter transport. In Proc. ACM SIGCOMM, 2013.
  • [10] Anurag. What os does the router use? is it linux? - quora. https://www.quora.com/What-OS-does-the-router-use-Is-it-Linux#:~:text=Yes%20most%20of%20the%20router,the%20router%20as%20pre%2Dinstalled.
  • [11] Venkat Arun. Implementation of the copa congestion control algorithm using ccp. https://github.com/venkatarun95/ccp_copa, 2020.
  • [12] Venkat Arun, Mohammad Alizadeh, and Hari Balakrishnan. Starvation in End-to-End Congestion Control. In Proc. ACM SIGCOMM, 2022.
  • [13] Venkat Arun and Hari Balakrishnan. Copa: Practical delay-based congestion control for the internet. In Proc. USENIX NSDI, 2018.
  • [14] Wei Bai, Li Chen, Kai Chen, Dongsu Han, Chen Tian, and Hao Wang. Information-agnostic flow scheduling for commodity data centers. In Proc. USENIX NSDI, 2015.
  • [15] Vaibhav Bajpai, Steffie Jacob Eravuchira, and Jürgen Schönwälder. Dissecting last-mile latency characteristics. ACM SIGCOMM Computer Communication Review, 47(5):25–34, 2017.
  • [16] Fred Baker, Jozef Babiarz, and Kwok Ho Chan. Configuration Guidelines for DiffServ Service Classes. IETF RFC 4594, 2006.
  • [17] Nimantha Baranasuriya, Vishnu Navda, Venkata N Padmanabhan, and Seth Gilbert. Qprobe: Locating the bottleneck in cellular communication. In Proc. ACM CoNEXT, pages 1–7, 2015.
  • [18] Apurv Bhartia, Bo Chen, Feng Wang, Derrick Pallas, Raluca Musaloiu-E, Ted Tsung-Te Lai, and Hao Ma. Measurement-based, practical techniques to improve 802.11 ac performance. In Proc. ACM IMC, 2017.
  • [19] Bob Briscoe, Koen De Schepper, Marcelo Bagnulo, and Greg White. Low Latency, Low Loss, and Scalable Throughput (L4S) Internet Service: Architecture. RFC 9330, January 2023.
  • [20] Neal Cardwell, Yuchung Cheng, C Stephen Gunn, Soheil Hassas Yeganeh, and Van Jacobson. Bbr: Congestion-based congestion control. ACM Queue, 2016.
  • [21] Gaetano Carlucci, Luca De Cicco, Stefan Holmer, and Saverio Mascolo. Congestion control for web real-time communication. IEEE/ACM Transactions on Networking, 2017.
  • [22] Hyunseok Chang, Matteo Varvello, Fang Hao, and Sarit Mukherjee. Can you see me now? a measurement study of zoom, webex, and meet. In Proceedings of the 21st ACM Internet Measurement Conference, pages 216–228, 2021.
  • [23] Gwyn Chatranon, Miguel A Labrador, and Sujata Banerjee. Black: detection and preferential dropping of high bandwidth unresponsive flows. In Proc. IEEE ICC, 2003.
  • [24] Li Chen, Kai Chen, Wei Bai, and Mohammad Alizadeh. Scheduling mix-flows in commodity datacenters with karuna. In Proc. ACM SIGCOMM, 2016.
  • [25] Wei Chen, Liangping Ma, and Chien-Chung Shen. Congestion-aware mac layer adaptation to improve video teleconferencing over wi-fi. In Proceedings of ACM Multimedia Systems Conference (MMSys), 2015.
  • [26] Amogh Dhamdhere, David D Clark, Alexander Gamero-Garrido, Matthew Luckie, Ricky KP Mok, Gautam Akiwate, Kabir Gogia, Vaibhav Bajpai, Alex C Snoeren, and Kc Claffy. Inferring persistent interdomain congestion. In Proc. ACM SIGCOMM, pages 1–15, 2018.
  • [27] Sandesh Dhawaskar Sathyanarayana, Kyunghan Lee, Dirk Grunwald, and Sangtae Ha. Converge: Qoe-driven multipath video conferencing over webrtc. In Proc. ACM SIGCOMM, pages 637–653, 2023.
  • [28] Mo Dong, Qingxi Li, Doron Zarchy, P Brighten Godfrey, and Michael Schapira. Pcc: Re-architecting congestion control for consistent high performance. In Proc. USENIX NSDI, 2015.
  • [29] Nandita Dukkipati, Tiziana Refice, Yuchung Cheng, Jerry Chu, Tom Herbert, Amit Agarwal, Arvind Jain, and Natalia Sutin. An argument for increasing tcp’s initial congestion window. ACM SIGCOMM Computer Communication Review, pages 26–33, 2010.
  • [30] Cristian Estan and George Varghese. New directions in traffic measurement and accounting. In Proc. ACM SIGCOMM, pages 323–336, 2002.
  • [31] Wu-chang Feng, Dilip D Kandlur, Debanjan Saha, and Kang G Shin. Stochastic fair blue: A queue management algorithm for enforcing fairness. In Proc. IEEE INFOCOM, 2001.
  • [32] Wu-chun Feng, Apu Kapadia, and Sunil Thulasidasan. Green: proactive queue management over a best-effort network. In Proc. IEEE GLOBECOM, 2002.
  • [33] Marcel Flores, Alexander Wenzel, and Aleksandar Kuzmanovic. Enabling router-assisted congestion control on the internet. In Proc. IEEE ICNP, 2016.
  • [34] Sally Floyd and Van Jacobson. Random early detection gateways for congestion avoidance. IEEE/ACM Transactions on Networking, 1993.
  • [35] Romain Fontugne, Anant Shah, and Kenjiro Cho. Persistent last-mile congestion: not so uncommon. In Proc. ACM IMC, pages 420–427, 2020.
  • [36] Sadjad Fouladi, John Emmons, Emre Orbay, Catherine Wu, Riad S Wahby, and Keith Winstein. Salsify: Low-latency network video through tighter integration between a video codec and a transport protocol. In Proc. USENIX NSDI, 2018.
  • [37] Nitin Garg. Copa congestion control for video performance - engineering at meta. https://engineering.fb.com/2019/11/17/video-engineering/copa/, 2019.
  • [38] Prateesh Goyal, Anup Agarwal, Ravi Netravali, Mohammad Alizadeh, and Hari Balakrishnan. Abc: A simple explicit congestion controller for wireless networks. In Proc. USENIX NSDI, 2020.
  • [39] Sangtae Ha, Injong Rhee, and Lisong Xu. Cubic: a new tcp-friendly high-speed tcp variant. ACM SIGOPS Operating Systems Review, 2008.
  • [40] Mario Hock, Roland Bless, and Martina Zitterbart. Experimental evaluation of bbr congestion control. In Proc. IEEE ICNP, 2017.
  • [41] Toke Høiland-Jørgensen, Paul McKenney, Dave Taht, Jim Gettys, and Eric Dumazet. The Flow Queue CoDel Packet Scheduler and Active Queue Management Algorithm. IETF RFC 8290.
  • [42] Van Jacobson. Congestion avoidance and control. In Proc. ACM SIGCOMM, 1988.
  • [43] Dina Katabi, Mark Handley, and Charlie Rohrs. Congestion control for high bandwidth-delay product networks. In Proc. ACM SIGCOMM, 2002.
  • [44] Tong Li, Kai Zheng, Ke Xu, Rahul Arvind Jadhav, Tao Xiong, Keith Winstein, and Kun Tan. Tack: Improving wireless transport performance by taming acknowledgments. In Proc. ACM SIGCOMM, 2020.
  • [45] Dong Lin and Robert Morris. Dynamics of random early detection. In Proc. ACM SIGCOMM, 1997.
  • [46] Chengnian Long, Bin Zhao, Xinping Guan, and Jun Yang. The yellow active queue management algorithm. Elsevier Computer Networks, 2005.
  • [47] James M Lucas and Michael S Saccucci. Exponentially weighted moving average control schemes: properties and enhancements. Technometrics, 32(1):1–12, 1990.
  • [48] Mike H MacGregor and Weiguang Shi. Deficits for bursty latency-critical flows: DRR++. In Proc. IEEE International Conference on Networks (ICON), pages 287–293.
  • [49] Zili Meng, Yaning Guo, Chen Sun, Bo Wang, Justine Sherry, Hongqiang Harry Liu, and Mingwei Xu. Achieving Consistent Low Latency for Wireless Real Time Communications with the Shortest Control Loop. In Proc. ACM SIGCOMM, 2022.
  • [50] Zili Meng, Xiao Kong, Jing Chen, Bo Wang, Mingwei Xu, Rui Han, Honghao Liu, Venkat Arun, Hongxin Hu, and Xue Wei. Hairpin: Rethinking packet loss recovery in edge-based interactive video streaming. In To appear at USENIX NSDI, 2024.
  • [51] Zili Meng, Tingfeng Wang, Yixin Shen, Bo Wang, Mingwei Xu, Rui Han, Honghao Liu, Venkat Arun, Hongxin Hu, and Xue Wei. Enabling high quality real-time communications with adaptive frame-rate. In Proc. USENIX NSDI, 2023.
  • [52] Oliver Michel, Satadal Sengupta, Hyojoon Kim, Ravi Netravali, and Jennifer Rexford. Enabling passive measurement of zoom performance in production networks. In Proceedings of the 22nd ACM Internet Measurement Conference, pages 244–260, 2022.
  • [53] Ayush Mishra, Xiangpeng Sun, Atishya Jain, Sameer Pande, Raj Joshi, and Ben Leong. The great internet tcp congestion control census. In Proc. ACM SIGMETRICS, 2020.
  • [54] Sándor Molnár, Balázs Sonkoly, and Tuan Anh Trinh. A comprehensive tcp fairness analysis in high speed networks. Computer Communications, 32(13-14):1460–1484, 2009.
  • [55] Yunzhe Ni, Zhilong Zheng, Xianshang Lin, Fengyu Gao, Xuan Zeng, Yirui Liu, Tao Xu, Hua Wang, Zhidong Zhang, Senlang Du, et al. Cellfusion: Multipath vehicle-to-cloud video streaming with network coding in the wild. In Proc. ACM SIGCOMM, pages 668–683, 2023.
  • [56] Kathleen Nichols and Van Jacobson. Controlling queue delay. Communications of the ACM, 2012.
  • [57] Rong Pan, Lee Breslau, Balaji Prabhakar, and Scott Shenker. Approximate fairness through differential dropping. ACM SIGCOMM Computer Communication Review, 33(2):23–39, 2003.
  • [58] Adithya Abraham Philip, Ranysha Ware, Rukshani Athapathu, Justine Sherry, and Vyas Sekar. Revisiting tcp congestion control throughput models & fairness properties at scale. In Proc. ACM IMC, pages 96–103, 2021.
  • [59] Devdeep Ray, Connor Smith, Teng Wei, David Chu, and Srinivasan Seshan. Sqp: Congestion control for low-latency interactive video streaming. arXiv preprint arXiv:2207.11857, 2022.
  • [60] ITU Recommendations. G.1070 : Opinion model for video-telephony applications. https://www.itu.int/rec/T-REC-G.1070, 2018.
  • [61] Jacqueline Renouard. The average time spent on a website: Increase visitor engagement. website/#:~:text=The%20average%20time%20spent%20on%20a%20web%20page%20ranges%20depending,industries%2C%20is%20around%2053%20seconds., 2023.
  • [62] Michael Rudow, Francis Y Yan, Abhishek Kumar, Ganesh Ananthanarayanan, Martin Ellis, and KV Rashmi. Tambur: Efficient loss recovery for videoconferencing via streaming codes. In Proc. USENIX NSDI, pages 953–971, 2023.
  • [63] Saeed Shafiee Sabet, Steven Schmidt, Saman Zadtootaghaj, Babak Naderi, Carsten Griwodz, and Sebastian Möller. A latency compensation technique based on game characteristics to mitigate the influence of delay on cloud gaming quality of experience. In Proc. ACM MMSys, 2020.
  • [64] Kanon Sasaki, Masato Hanai, Kouto Miyazawa, Aki Kobayashi, Naoki Oda, and Saneyasu Yamaguchi. Tcp fairness among modern tcp congestion control algorithms including tcp bbr. In 2018 IEEE 7th international conference on cloud networking (CloudNet), pages 1–4. IEEE, 2018.
  • [65] Koen De Schepper, Bob Briscoe, and Greg White. Dual-Queue Coupled Active Queue Management (AQM) for Low Latency, Low Loss, and Scalable Throughput (L4S). RFC 9332, January 2023.
  • [66] Koen De Schepper, Olivier Tilmans, Bob Briscoe, and Vidhi Goel. Prague Congestion Control. Internet-Draft draft-briscoe-iccrg-prague-congestion-control-03, Internet Engineering Task Force, October 2023. Work in Progress.
  • [67] Ivan Slivar, Mirko Suznjevic, and Lea Skorin-Kapov. The impact of video encoding parameters and game type on qoe for cloud gaming: A case study using the steam platform. In Proc. IEEE International Conference on Quality of Multimedia Experience (QoMEX), 2015.
  • [68] Ammar Tahir and Radhika Mittal. Enabling users to control their internet. In Proc. USENIX NSDI, pages 555–573, 2023.
  • [69] C-H Tai, Jiang Zhu, and Nandita Dukkipati. Making large scale deployment of rcp practical for real networks. In Proc. IEEE INFOCOM, 2008.
  • [70] Pratiksha Thaker, Matei Zaharia, and Tatsunori Hashimoto. Don’t hate the player, hate the game: Safety and utility in multi-agent congestion control. In Proc. ACM HotNets, 2021.
  • [71] Haiping Wang, Zhenhua Yu, Ruixiao Zhang, Siping Tao, Hebin Yu, and Shu Shi. Twinstar: A practical multi-path transmission framework for ultra-low latency video delivery. In Proc. ACM Multimedia, pages 9234–9242, 2023.
  • [72] Ranysha Ware, Matthew K Mukerjee, Srinivasan Seshan, and Justine Sherry. Beyond jain’s fairness index: Setting the bar for the deployment of congestion control algorithms. In Proc. ACM HotNets, pages 17–24, 2019.
  • [73] Peter Weidenbach and Johannes vom Dorp. Home router security report 2020. https://www.fkie.fraunhofer.de/content/dam/fkie/de/documents/HomeRouter/HomeRouterSecurity_2020_Bericht.pdf, 2020.
  • [74] Keith Winstein, Anirudh Sivaraman, and Hari Balakrishnan. Stochastic forecasts achieve high throughput and low delay over cellular networks. In Proc. USENIX NSDI, 2013.
  • [75] Yaxiong Xie, Fan Yi, and Kyle Jamieson. Pbe-cc: Congestion control via endpoint-centric, physical-layer bandwidth measurements. In Proc. ACM SIGCOMM, 2020.
  • [76] Zhaoqi Xiong and Noa Zilberman. Do switches dream of machine learning? toward in-network classification. In Proc. ACM HotNets, pages 25–33, 2019.
  • [77] Dongzhu Xu, Anfu Zhou, Xinyu Zhang, Guixian Wang, Xi Liu, Congkai An, Yiming Shi, Liang Liu, and Huadong Ma. Understanding operational 5g: A first measurement study on its coverage, performance and energy consumption. In Proc. ACM SIGCOMM, 2020.
  • [78] Xiaokun Xu and Mark Claypool. Measurement of cloud-based game streaming system response to competing tcp cubic or tcp bbr flows. In Proceedings of the 22nd ACM Internet Measurement Conference, pages 305–316, 2022.
  • [79] Jia Zhang, Enhuan Dong, Zili Meng, Yuan Yang, Mingwei Xu, Sijie Yang, Miao Zhang, and Yang Yue. Wisetrans: Adaptive transport protocol selection for mobile web service. In Proceedings of the Web Conference, 2021.
  • [80] Xu Zhang, Siddhartha Sen, Daniar Kurniawan, Haryadi Gunawi, and Junchen Jiang. E2e: embracing user heterogeneity to improve quality of experience on the web. In Proc. ACM SIGCOMM, pages 289–302, 2019.
  • [81] Yuhan Zhou, Tingfeng Wang, Liying Wang, Nian Wen, Rui Han, Jing Wang, Chenglei Wu, Jiafeng Chen, Longwei Jiang, Shibo Wang, et al. Augur: Practical mobile multipath transport service for low tail latency in real-time streaming. In To appear at USENIX NSDI, 2024.
  • [82] Xutong Zuo, Yong Cui, Xin Wang, and Jiayu Yang. Deadline-aware multipath transmission for streaming blocks. In Proc. IEEE INFOCOM, pages 2178–2187, 2022.

Appendix A Web Page Connection Analysis

Rank  Website          ACTIVE  IN_USE  OPEN
189   dailymail.co.uk  50      134     250
109   tumblr.com       39      82      153
89    w3schools.com    32      95      176
147   speedtest.net    28      87      137
113   cnn.com          27      118     194
186   namu.wiki        27      112     192
173   indiatimes.com   22      99      136
106   rakuten.co.jp    20      68      97
35    fandom.com       19      68      97
7     yahoo.com        19      42      63
Table 2. Websites in Top 200 that have the highest number of ACTIVE flows.
Figure 17. The screenshot of dailymail.co.uk, loaded in November 2023.

Many ACTIVE flows. In this section, we provide more details about the Web traces we measured in § 2. Tab. 2 presents the websites that have the highest number of concurrent ACTIVE flows. We can see that dailymail.co.uk has 50 concurrent ACTIVE flows and 250 OPEN sockets at the same time. This is due to the complicated page structure of the homepage of dailymail.co.uk. We present a screenshot of the homepage of dailymail.co.uk in Fig. 17. We can clearly see that there are many objects on the homepage, both visible (images, texts, videos) and invisible (scripts, styles). Some objects depend on others, so the number of concurrent ACTIVE flows is lower than the number of concurrent OPEN sockets, but this still results in 50 flows.

(a) Connections per Web page.
(b) Size distributions.
Figure 18. Number of unique IPs/requests and their size for loading each of Alexa top 1000 websites.

Domains and requests. We further present the distribution of the number of unique HTTP requests and source IPs in Fig. 18(a), together with the size distribution in Fig. 18(b). Loading the homepage of a website contacts a median of 15 unique IPs, while flow sizes range from 100 bytes to 100KBytes with a median of 15KB. This is also the size of Web flows we used in § 7.3.

Counted by  HTTP/1.0    HTTP/1.1         HTTP/2.0    HTTP/3.0
By website  40 (2.37%)  961 (56.97%)     13 (0.77%)  673 (39.89%)
By request  49 (0.02%)  178341 (87.29%)  75 (0.04%)  25833 (12.64%)
Table 3. The distribution of HTTP versions counted by websites and requests. The sum over websites is greater than 1000 since loading one website may generate requests to different domains, which can use different HTTP versions.

Composition of HTTP versions. A straightforward explanation for the multiple flows generated when loading one Web page is the parallel connections introduced in HTTP/1.1. Our measurement in Fig. 18(a) does show the effect of parallel connections: the median number of connections counted by flow is 2× that of source IPs. However, the root cause is still the diverse objects on one page, as shown in Fig. 17. To better understand the composition of HTTP requests, we present the distribution of HTTP versions in Tab. 3. We can see that websites have a very diverse mix of HTTP versions, with HTTP/1.1 and HTTP/3.0 accounting for the majority.

Appendix B Fluid Model Analysis

In this section, we present the details about how we get the results in Tab. 1. We list the notations we will use in Tab. 4.

Parameters and variables:
$B$      Size of each new Web flow.
$N$      Number of new Web flows.
$k$      The responsiveness of a CCA.
$q_0$    The delay target that a CCA will try to achieve.
$C$      The link capacity.
$\tau$   The feedback loop of a CCA (usually one RTT).
$B_0$    The initial burst of a new flow (e.g., the initial cwnd [29]).
$P$      The scheduling policy.
Functions:
$s(t)$   Sending rate of the real-time flow at time $t$.
$r(t)$   Available bandwidth of the real-time flow at time $t$.
$p(t)$   Number of packets in the queue of the real-time flow.
$q(t)$   The queueing delay of the real-time flow.
Table 4. Notations

CCA model. We adopt a simplified delay-convergent CCA model [12, 8], where the delay-sensitive CCA has a target queueing delay, $q_0$. The CCA seeks to maintain its queueing delay around this target, increasing or decreasing its sending rate proportional to the difference between the current delay and the target:

(4) $\frac{{\rm d}s(t)}{{\rm d}t} = -k \cdot (q(t-\tau) - q_0)$

Here, $s(t)$ and $q(t)$ are the flow’s instantaneous sending rate and queueing delay at time $t$, and $\tau$ is the feedback loop of the CCA. $k$ is a coefficient representing the CCA’s responsiveness, indicating how aggressive the CCA is when the delay changes. We explain it quantitatively in Appx. B.5.

Delay model. Next, we analyze the number of packets in the queue, $p(t)$, at time $t$. At any $t>0$, this quantity satisfies the following relationship:

(5) $p(t) = p(0) + \int_0^t \left(s(t') - r(t')\right)\mathrm{d}t'$

where $p(0) = q_0\cdot C$ is the number of packets in the buffer in steady state, with $C$ being the link capacity. If $r(t)$ represents the service rate for the real-time flow at time $t$, then the queueing delay can be written as follows:

(6) $q(t) = \frac{p(t)}{r(t)}$

The real-time flow and the competing flows focus on different metrics. The real-time flow focuses on the maximum queueing delay, $q^{max}_{P}$, for a given scheduling policy $P$:

(7) $q^{max}_{P} = \max_{t>0}\ q(t)$

In this context, we find that $q^{max}_{P}$ serves as a good proxy for the duration of delay degradation since it establishes a lower bound on how quickly previously-queued packets of the real-time flow drain from the bottleneck queue.
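To make Eqs. 4 to 7 concrete, the following minimal sketch integrates the fluid model with a forward Euler step and reports the queueing delay at the first moment the sending rate falls to the service rate, i.e., the first peak of $q(t)$. It is illustrative only: the function name `first_delay_peak`, the step drop of $r(t)$ to a constant `r_new` at $t=0$, the normalization $C=1$, and all parameter values are assumptions rather than settings from the experiments.

```python
def first_delay_peak(r_new, C=1.0, q0=10.0, tau=40.0, k=0.001, dt=0.01, t_max=2000.0):
    """Queueing delay (ms) at the first time s(t) <= r(t) after a bandwidth drop.

    Hypothetical scenario: at t = 0 the service rate of the real-time flow
    drops from C to the constant r_new. Rates are normalized to C, times are in ms.
    """
    if not 0.0 < r_new < C:
        raise ValueError("r_new must lie strictly between 0 and C")
    delay_steps = round(tau / dt)   # feedback delay expressed in Euler steps
    s = C                           # sending rate has converged to C before the drop
    p = q0 * C                      # steady-state backlog, p(0) = q0 * C (Eq. 5)
    q_hist = []                     # q(t) samples, needed for the delayed term in Eq. 4
    for i in range(int(t_max / dt)):
        q = p / r_new               # Eq. 6
        q_hist.append(q)
        if i > 0 and s <= r_new:    # backlog stops growing here: first peak of q(t)
            return q
        # Before one feedback loop has elapsed the CCA still observes the old delay q0.
        q_delayed = q_hist[i - delay_steps] if i >= delay_steps else q0
        s += -k * (q_delayed - q0) * dt   # Eq. 4
        p += (s - r_new) * dt             # Eq. 5 in differential form
    raise RuntimeError("sending rate never dropped to the service rate")


peak_half = first_delay_peak(r_new=0.5)      # bandwidth halved
peak_quarter = first_delay_peak(r_new=0.25)  # bandwidth cut to a quarter
```

Under these assumed values, halving the available bandwidth yields a first peak of roughly 74 ms, and the deeper cut yields a larger one, which matches the intuition that an abrupt reallocation inflates the delay before the CCA can react.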

The Web flows will focus on flow completion time (FCT), $T$, which can be expressed as follows:

(8) $\int_0^T \left(C - r(t')\right)\mathrm{d}t' = N\cdot B$

Having established our two figures of merit (maximum queueing delay and FCT degradation relative to FQ), we evaluate four scheduling policies: FQ, FIFO, CBQ (1:1), and Confucius. We find that the available bandwidths for these policies satisfy the following relationships:

(9a) $r_{FQ}(t) = \frac{C}{N+1} \quad (t>0)$
(9b) $r_{FIFO}(t) \leqslant C\cdot\frac{Cq_0}{Cq_0+NB_0} \quad (t>0)$
(9c) $r_{CBQ}(t) = \frac{C}{2} \quad (t>0)$
(9d) $r_{\textsf{Confucius}}(t) = \max\left(\frac{C}{2}\cdot 2^{-\lambda t}, \frac{C}{N+1}\right) \quad (t>0)$

where for FIFO, $B_0$ is the initial burst size of these new flows (e.g., the initial congestion window in TCP). We then solve for the performance degradation of the real-time flow, $q^{max}_{P}$. For FCT, since FQ provides the 'fairest' bandwidth allocation (representing one extreme of the per-flow fairness), we use the FCT for Web flows under FQ, $T_{FQ}$, to normalize and calculate $T_P - T_{FQ}$ as the degree to which policy $P$ degrades Web flow performance relative to FQ. Below we analyze the four schedulers in detail.

B.1. Fair Queueing (FQ)

Substituting Eq. 9a into Eq. 4, and taking the derivatives, we have:

(10) $\frac{\mathrm{d}^2}{\mathrm{d}t^2}s(t) + k\cdot s(t-\tau) = k\frac{C}{N+1}$

Without loss of generality, we assume $s(\tau)=C$, meaning that before the $N$ flows join, the sending rate has converged to the link capacity. Since the measurement loop is usually much smaller than the control loop, i.e., $\tau \ll 1/k$, we solve the differential equation above as:

(11) $s(t) = C\left(1-\frac{1}{N+1}\right)\cos\left(\sqrt{k}(t-\tau)\right) + \frac{C}{N+1} \quad (t>\tau)$

Since we are considering the transient conditions with a small $t$, where $t$ is less than the first time at which $s(t)=r(t)$, we approximate the formula above with its Taylor expansion:

(12) $s(t) = C - C\frac{N}{N+1}\cdot\frac{k}{2}\cdot(t-\tau)^2 \quad (t>\tau)$
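The Taylor step can be spelled out as follows. This is a sketch that assumes the cosine term of Eq. 11 has amplitude $C\frac{N}{N+1}$, which is what the initial condition $s(\tau)=C$ requires, and uses $\cos x \approx 1 - \frac{x^2}{2}$ with $x = \sqrt{k}(t-\tau)$:

```latex
\begin{align*}
s(t) &= C\,\frac{N}{N+1}\cos\!\left(\sqrt{k}\,(t-\tau)\right) + \frac{C}{N+1} \\
     &\approx C\,\frac{N}{N+1}\left(1 - \frac{k}{2}(t-\tau)^2\right) + \frac{C}{N+1} \\
     &= C - C\,\frac{N}{N+1}\cdot\frac{k}{2}\,(t-\tau)^2 .
\end{align*}
```

The constant terms sum to $C$, so only the quadratic correction depends on $N$ and $k$.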

Combining this with Eq. 6, we have

(13) $q(t) = N\left(q_0 + \tau - \frac{N}{6k(N+1)}(t-\tau)^2\right)$

We then have the maximum queueing delay as:

(14) $q^{max}_{FQ} \geqslant q\left(\tau+\sqrt{\frac{2}{k}}\right) = N\left(\frac{2}{3}\sqrt{\frac{2}{k}} + q_0 + \tau\right)$

As $N$ increases, $q^{max}_{FQ}$ will also increase.

Meanwhile, by substituting the available bandwidth in Eq. 8 with Eq. 9a, we have $T_{FQ}$:

(15) $T_{FQ} = \left(1+\frac{1}{N}\right)\cdot\frac{NB}{C}$
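As a short worked derivation of Eq. 15: under FQ the Web flows jointly receive the constant rate $C - r_{FQ} = \frac{N}{N+1}C$, so the integral in Eq. 8 reduces to a product.

```latex
\begin{align*}
\int_0^{T_{FQ}} \left(C - \frac{C}{N+1}\right)\mathrm{d}t' &= N B \\
T_{FQ}\cdot\frac{N C}{N+1} &= N B \\
T_{FQ} &= \frac{(N+1)B}{C} = \left(1+\frac{1}{N}\right)\cdot\frac{NB}{C} .
\end{align*}
```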

B.2. FIFO

Since the share of available bandwidth is proportional to the share of buffer occupancy, we estimate $r_{FIFO}(t)$ as in Eq. 9b. Similar to FQ, we can get:

(16) $q(t) \geqslant \frac{1}{C}\left(\frac{NB_0}{q_0C}+1\right)\left(q_0C + \int_0^t s(t')\,\mathrm{d}t' - tC\frac{1}{\frac{NB_0}{q_0C}+1}\right)$

and then

(17) $q^{max}_{FIFO} \geqslant q\left(\tau+\sqrt{\frac{2}{k}}\right)$

Consequently

(18) $q^{max}_{FIFO} \geqslant \left(\frac{NB_0}{q_0C}+1\right)\left(\frac{2}{3}\sqrt{\frac{2}{k}} + q_0 + \tau\right)$

B.3. DRR

As we can see from Eq. 9c, the class-based allocation, written $r_{DRR}(t)$ in this subsection, is a special case of $r_{FQ}(t)$ with $N=1$. Therefore, according to the delay degradation result in Eq. 14, we have:

(19) $q^{max}_{DRR} \geqslant \frac{2}{3}\sqrt{\frac{2}{k}} + q_0 + \tau$

The FCT satisfies:

(20) $T_{DRR} = \frac{2NB}{C}$

In this case,

$T_{DRR} - T_{FQ} = \frac{(N-1)B}{C}$

grows without bound as $N$ and $B$ increase.

B.4. Confucius

For Confucius, we have:

(21) $r_{\textsf{Confucius}}(t) = \frac{C}{2}e^{-\lambda t} \quad (t>0)$

We can then solve for $s(t)$ (using the Laplace transform and the method of undetermined coefficients):

(22) $s(t) = Ae^{-\lambda(t-\tau)} + B\cos\sqrt{k}(t-\tau)$

where

(23) $A = C\cdot\frac{k}{2}\cdot\frac{1}{\lambda^2 + k\cdot e^{\lambda\tau}}$
(24) $B = C - A$

Still using the Taylor approximation:

(25) $s(t) = A\left(1-\lambda(t-\tau)\right) + B\left(1-\frac{1}{2}k(t-\tau)^2\right) = -\frac{B}{2}k(t-\tau)^2 - \lambda A(t-\tau) + A + B$

Denoting the root of $s(t)=0$ on $t>\tau$ as $t_0+\tau$ ($t_0>0$), we then have

(26) $q(t_0+\tau) = 2e^{\lambda(t_0+\tau)}\left(q_0 + \tau - \left(t_0 - \frac{\lambda A}{2C}t_0 - \frac{kB}{6C}t_0^3\right)\right)$

where $t_0$ satisfies:

(27) $t_0 = \frac{-\lambda A + \sqrt{(\lambda A)^2 + 2Bk(A+B)}}{Bk}$
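Eq. 27 is the positive root of the quadratic in Eq. 25. As a brief worked step, set $s(t_0+\tau)=0$, with $A$ and $B$ being the coefficients of Eqs. 23 and 24 (not the flow size):

```latex
\begin{align*}
-\frac{Bk}{2}\,t_0^2 - \lambda A\,t_0 + (A+B) &= 0 \\
Bk\,t_0^2 + 2\lambda A\,t_0 - 2(A+B) &= 0 \\
t_0 &= \frac{-2\lambda A \pm \sqrt{4(\lambda A)^2 + 8Bk(A+B)}}{2Bk}
     = \frac{-\lambda A \pm \sqrt{(\lambda A)^2 + 2Bk(A+B)}}{Bk} .
\end{align*}
```

Taking the $+$ sign gives the root with $t_0>0$ whenever $A$, $B$, and $k$ are positive.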

Thus, we have a bound of $q^{max}_{\textsf{Confucius}}$:

(28) $q^{max}_{\textsf{Confucius}} \approx q(t_0+\tau) = f(\lambda; k, \tau, q_0)$

which is independent of $B$ and $N$ and is therefore bounded. We expand $f$ as a series in $\lambda$:

(29) $f(\lambda) = F_0 + F_1\lambda + F_2\lambda^2 + o(\lambda^2)$
     $F_0 = 2q_0 + 6\tau + \frac{8}{2\sqrt{k}}$
     $F_1 = \frac{10}{3k} + 2q_0\tau + 2\tau^2 + \frac{4q_0}{\sqrt{k}} + \frac{16\tau}{3\sqrt{k}}$
     $F_2 = \frac{4q_0}{k} + \frac{6\tau}{k} + q_0\tau^2 + \tau^3 + \frac{6q_0\tau}{\sqrt{k}} + \frac{11\tau^2}{\sqrt{k}}$

Given that $\frac{1}{k} \ll q_0, \tau$, we can simplify and upper bound them into:

(30) $q^{max}_{\textsf{Confucius}} \leqslant 6q_0 + 15\tau + \frac{8\lambda}{k} + \frac{(10q_0+15\tau)\lambda^2}{k}$

The FCT difference over the fair share for new flows is also bounded, in contrast to other baselines. The FCT $T$ of each of the $N$ flows with $B$ bytes follows Eq. 8.

Recall that the bandwidth left for the Web flows is $C - r_{\textsf{Confucius}}(t) = \min\left(C-\frac{C}{2}2^{-\lambda t}, \frac{N}{N+1}C\right)$; we thus have

(31) $T_{\textsf{Confucius}} = \frac{(N+1)B}{C} + \frac{1}{\lambda}\cdot\left(\frac{1}{2} - \frac{1}{N}\log_2\frac{N+1}{2} - \frac{1}{2N}\right)$

where $t \geqslant \frac{1}{\lambda}\log_2\frac{N+1}{2}$. In this case,

(32) $T_{\textsf{Confucius}} - T_{FQ} \leqslant \frac{1}{\lambda}\cdot\left(\frac{1}{2} - \frac{1}{N}\log_2\frac{N+1}{2} - \frac{1}{2N}\right) \leqslant \frac{\log_2 e}{\lambda}$
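The FCT comparison can also be checked numerically. The sketch below integrates Eq. 8 directly for the allocations in Eqs. 9a, 9c, and 9d and compares the Confucius gap against the $\frac{\log_2 e}{\lambda}$ bound. The helper name `flow_completion_time` and the values of $C$, $N$, $B$, and $\lambda$ are assumptions chosen for illustration, not the settings used in the experiments.

```python
import math


def flow_completion_time(rt_rate, n_flows, flow_size, capacity, dt=1e-4, t_max=1e3):
    """Smallest T with integral_0^T (C - r(t)) dt >= N * B (Eq. 8), midpoint rule."""
    target = n_flows * flow_size
    served, t = 0.0, 0.0
    while served < target:
        if t > t_max:
            raise RuntimeError("Web flows did not finish within t_max")
        served += (capacity - rt_rate(t + dt / 2)) * dt
        t += dt
    return t


# Hypothetical workload: all values below are illustrative assumptions.
C = 2.5    # link capacity in MB/s (about 20 Mbps)
N = 9      # number of new Web flows
B = 0.5    # size of each Web flow in MB
LAM = 2.0  # decay rate lambda in 1/s


def r_fq(t):
    return C / (N + 1)                                  # Eq. 9a


def r_cbq(t):
    return C / 2                                        # Eq. 9c


def r_confucius(t):
    return max(C / 2 * 2.0 ** (-LAM * t), C / (N + 1))  # Eq. 9d


T_FQ = flow_completion_time(r_fq, N, B, C)
T_CBQ = flow_completion_time(r_cbq, N, B, C)
T_CONF = flow_completion_time(r_confucius, N, B, C)
GAP_BOUND = math.log2(math.e) / LAM                     # right-hand side of Eq. 32

print(f"T_FQ={T_FQ:.3f}s  T_CBQ={T_CBQ:.3f}s  T_Confucius={T_CONF:.3f}s")
print(f"Confucius gap={T_CONF - T_FQ:.3f}s  bound={GAP_BOUND:.3f}s")
```

With these assumed values the FQ and CBQ results agree with the closed forms in Eqs. 15 and 20, and the Confucius gap over FQ stays below the bound while the CBQ gap scales with $(N-1)B/C$.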
Figure 19. The theoretical estimation from Confucius under different parameter settings. Panels: (a) changing $k$; (b) changing $\tau$; (c) changing $q_0$.

We further plot the unsimplified bound under different $k$ and other parameter settings in Fig. 19. Note that the theoretical bounds are much greater than the actual experiment results, as shown in § 7.6.

B.5. Responsiveness for CCAs

For different CCAs, we can fit their responsiveness $k$ based on their probing period in the steady state. From the differential equations in Eq. 4 and Eq. 6, during the steady state where $r(t) \equiv C$, we can solve that the sending rate $s(t)$ follows:

(33) $s(t) = C + A\cos(\sqrt{k}t + \varphi)$

where $A$ and $\varphi$ are undetermined coefficients. In this case, the probing period of a CCA is $\frac{2\pi}{\sqrt{k}}$. From the respective designs of the CCAs, the probing period for Copa is 5 RTTs, and for BBR is 8 RTTs. For example, when the RTT is 40 ms, we have $k_{Copa} = 0.001\ \mathrm{ms}^{-2}$ and $k_{BBR} = 0.0004\ \mathrm{ms}^{-2}$.
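The fitted values follow from inverting the probing period $\frac{2\pi}{\sqrt{k}}$. A minimal sketch of that arithmetic, where the function name `responsiveness` is a hypothetical helper introduced here:

```python
import math


def responsiveness(probing_period_rtts, rtt_ms):
    """Invert period = 2*pi/sqrt(k) to obtain k in ms^-2."""
    period_ms = probing_period_rtts * rtt_ms
    return (2 * math.pi / period_ms) ** 2


k_copa = responsiveness(5, 40)  # Copa probes every 5 RTTs
k_bbr = responsiveness(8, 40)   # BBR probes every 8 RTTs
print(f"k_Copa = {k_copa:.6f} ms^-2, k_BBR = {k_bbr:.6f} ms^-2")
```

Rounded to one significant digit, these reproduce the 0.001 and 0.0004 values quoted above for a 40 ms RTT.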

Appendix C Supplementary Experiments

We further evaluate the performance of Confucius in a series of microbenchmarking settings.

Figure 20. The trade-off between the real-time (RT) flow (stall duration) and Web flows (page loading time) on bandwidth traces C2 (4G). Panels: (a) the RT flow uses Copa; (b) the RT flow uses GCC; (c) the RT flow uses BBR. We mark baselines in green if they rely on labels from end hosts, and in blue if not.
Figure 21. The trade-off between the real-time (RT) flow (stall duration) and Web flows (page loading time) on bandwidth traces C3 (5G). Panels: (a) the RT flow uses Copa; (b) the RT flow uses GCC; (c) the RT flow uses BBR. We mark baselines in green if they rely on labels from end hosts, and in blue if not.
Figure 22. The trade-off between the real-time (RT) flow (stall duration) and Web flows (page loading time) on bandwidth traces W1 (Office WiFi). Panels: (a) the RT flow uses Copa; (b) the RT flow uses GCC; (c) the RT flow uses BBR. We mark baselines in green if they rely on labels from end hosts, and in blue if not.
Figure 23. The trade-off between the real-time (RT) flow (stall duration) and Web flows (page loading time) on bandwidth traces W2 (Restaurant WiFi). Panels: (a) the RT flow uses Copa; (b) the RT flow uses GCC; (c) the RT flow uses BBR. We mark baselines in green if they rely on labels from end hosts, and in blue if not.

C.1. Results for other traces

Figure 24. The bandwidth distribution of five trace datasets we use in this paper. X-axes are log-scaled.

We further present the results of Confucius over other bandwidth trace datasets (C2: Cellular 4G; C3: Cellular 5G; W1: Office WiFi; W2: Restaurant WiFi) in Figs. 20, 21, 22 and 23. The experiment setting follows the one in Fig. 10. The average and standard deviation of bandwidth of these traces are presented in Fig. 24. Confucius consistently pushes forward the Pareto frontier of the baselines that do not require labels (blue baselines). This demonstrates the robustness of Confucius across different bandwidth datasets.

C.2. Fairness

Classful schemes protect the real-time flow. Examples include CBQ, which splits packets into classful queues with configurable service rates, and strict priority, which only dequeues lower-priority packets when the high-priority queue is empty. However, classful schemes also result in unfair allocations because they overpenalize (or even starve) Web traffic, which then experiences high page load times (PLTs), as shown in Fig. 4(c). While, in theory, CBQ could be configured to be fair, doing so requires knowledge of the exact workload (the ratio of flows between classes) over very short time intervals, which is infeasible in practice. For example, in Fig. 25 we measure the fairness that different schedulers provide while changing the number of flows competing with the real-time flow. Modifying CBQ's configuration improves JFI only for a subset of the workloads: CBQ (1:1) works well when there are two flows competing, while CBQ (1:5) achieves a good JFI when there are five competing flows, and both degrade as the number of flows changes. Even if we change the ratio, the phenomenon persists: the JFI of CBQ (1:5) is the highest when there are 5 new flows, but degrades drastically in other workloads. In contrast, Confucius keeps the JFI relatively consistent across different numbers of competitors.
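The sensitivity of a fixed class ratio to the workload can be illustrated with a small calculation of Jain's fairness index, $\mathrm{JFI} = \frac{(\sum_i x_i)^2}{n\sum_i x_i^2}$. The sketch below is an idealized model and not the measurement behind Fig. 25: it assumes every flow is always backlogged, that the real-time flow sits alone in one class while all Web flows share the other class equally, and the helper names `jain_index`, `cbq_rates`, and `fq_rates` are hypothetical.

```python
def jain_index(rates):
    """Jain's fairness index: 1.0 for equal shares, 1/n when one flow takes everything."""
    total = sum(rates)
    return total * total / (len(rates) * sum(x * x for x in rates))


def cbq_rates(n_web, web_weight, capacity=1.0):
    """Per-flow rates under an idealized CBQ (1:web_weight) split between two classes."""
    rt_rate = capacity / (1 + web_weight)          # class holding the real-time flow
    web_rate = (capacity - rt_rate) / n_web        # Web class shared equally by its flows
    return [rt_rate] + [web_rate] * n_web


def fq_rates(n_web, capacity=1.0):
    """Per-flow rates under ideal fair queueing."""
    return [capacity / (n_web + 1)] * (n_web + 1)


for n_web in (1, 2, 5, 10):
    print(
        n_web,
        round(jain_index(cbq_rates(n_web, 1)), 3),   # CBQ (1:1)
        round(jain_index(cbq_rates(n_web, 5)), 3),   # CBQ (1:5)
        round(jain_index(fq_rates(n_web)), 3),       # FQ reference
    )
```

In this model each ratio is perfectly fair for exactly one workload (one Web flow for 1:1, five Web flows for 1:5) and falls off on either side, which is the qualitative behavior described above.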

Figure 25. Jain's fairness index (JFI) as the workload changes. The fairness of classful solutions (e.g., CBQ) is heavily sensitive to workload variations. For instance, CBQ with different weights (1:1 or 1:5) will result in poor fairness (JFI < 0.9) in certain workloads. The Y-axis is not linearly scaled.
Figure 26. The hysteresis design in Confucius (§ 5.1) is able to absorb the fluctuations caused by probing from CCAs.
Figure 27. When the bottleneck is elsewhere, Confucius maintains the same performance as existing mechanisms.

C.3. Working with Bandwidth Probing

Some recent CCAs periodically probe the available bandwidth by overshooting the network, which might introduce noise when Confucius classifies flows by buffer occupancy. Recent examples for video streaming include Sprout [74], PCC (probing up to 5%) [28], and BBR (probing 25%) [20]. We evaluate how Confucius handles bandwidth probing from CCAs. We run one BBR flow, the most aggressive of these bandwidth-probing CCAs, and vary the RTT from 20 ms to 160 ms, since the probing period is counted in units of RTT. As shown in Fig. 27, with the other settings the same as Fig. 28, the queue fluctuations never cross the threshold for reclassifying the flow. This is due to the hysteresis design in § 5.1: Confucius deliberately makes conservative decisions in the classification of flows to smooth out the noise. This can also be validated from Fig. 13(b): the classification results remain stable even though BBR periodically probes the bandwidth. Therefore, Confucius is able to work well with bandwidth-probing CCAs.

C.4. Working with Different Bottleneck

Figure 28. Experimental setup for multiple bottlenecks.

We further evaluate the end-to-end performance when the bottleneck is not where Confucius is deployed. Confucius is able to reduce the latency volatility when it is deployed on the bottleneck router. Our further experiments show that Confucius does not introduce side effects when the bottleneck is before or after the router deployed with Confucius. We still deploy queue management mechanisms to the router before link B and respectively rate-limit the link A, B, and C in Fig. 28 to 20 Mbps:

  • Btlnk-A. When link A is limited while the other two links are set to 100 Mbps, the bottleneck is before the router running Confucius.

  • Btlnk-B. When link B is limited, Confucius is at the bottleneck. This is the case we mainly evaluated in this section.

  • Btlnk-C. When link C is limited, the bottleneck is after the router running Confucius.

Unmanaged routers use FIFO as their default mechanism. As shown in Fig. 27, the performance is only affected by the mechanism deployed at the bottleneck. When Confucius is not at the bottleneck (e.g., when link A or C is limited), the performance is the same no matter which mechanism is deployed at link B. It is worth noting that, as discussed in a series of papers [49, 17], last-mile routers (e.g., cellular base stations, home wireless APs) are the bottleneck for most congestion, in which case deploying Confucius will achieve significant performance benefits.
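The three configurations can be sketched as a small path model in which the end-to-end bottleneck is simply the slowest link. The function names are hypothetical, and the sketch assumes that the two unlimited links run at 100 Mbps in every configuration (the text states this explicitly only for Btlnk-A).

```python
LIMITED_MBPS = 20     # rate of the rate-limited link (from the text)
UNLIMITED_MBPS = 100  # assumed rate of the other two links in every scenario
MANAGED_LINK = "B"    # the queue management mechanism sits on the router before link B


def make_scenario(limited_link):
    """Link rates for the scenario in which `limited_link` is rate-limited."""
    return {
        name: LIMITED_MBPS if name == limited_link else UNLIMITED_MBPS
        for name in ("A", "B", "C")
    }


def bottleneck_link(rates_mbps):
    """Return (link name, rate) of the slowest link on the path."""
    return min(rates_mbps.items(), key=lambda kv: kv[1])


def mechanism_at_bottleneck(rates_mbps):
    """True when the managed router is the one feeding the bottleneck link."""
    return bottleneck_link(rates_mbps)[0] == MANAGED_LINK


scenarios = {f"Btlnk-{link}": make_scenario(link) for link in ("A", "B", "C")}

for name, rates in scenarios.items():
    link, rate = bottleneck_link(rates)
    where = "at" if mechanism_at_bottleneck(rates) else "away from"
    print(f"{name}: bottleneck is link {link} ({rate} Mbps), {where} the managed router")
```

Only in Btlnk-B does the queue build up at the managed router. In the other two cases the queue forms at an unmanaged FIFO router, which is consistent with the observation that the mechanism deployed at link B then has no effect.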

(a) Stall duration of the real-time (existing) flow.
(b) PLT of Web (new) flows.
Figure 29. We increase the number of simultaneous real-time flows, and measure the results again with the Alexa dataset.

C.5. Multiple Real-time Flows Competition

We further evaluate the performance when multiple real-time flows run simultaneously. We reproduce the experiments in Fig. 10(a) but vary the number of real-time flows from 1 to 5. The average duration of delay degradation of the real-time flows and the PLT of the Web flows are presented in Fig. 29. Confucius provides consistent performance for multiple real-time flows at the same time: the delay degradation is consistently negligible regardless of the number of concurrent real-time flows, and the PLT stays roughly on par with the baselines. Note that since Confucius is designed for last-mile routers, 5 concurrent flows should cover most scenarios [49].

Appendix D Related Work

Queue management solutions. There are numerous efforts on queue management for routers. Besides the solutions we introduced in § 2 and § 7.1, many more AQMs have been proposed, dating back to the 2000s [31, 32, 46, 23, 57]. As we discussed in § 2, these AQMs cannot meet the requirement of providing consistent performance and fairness during transient events. At the same time, recent delay- or rate-based CCAs, which are commonly used by real-time flows, are not responsive to such drop-based or ECN-marking-based AQMs. Further, datacenter flow scheduling schemes [14, 9] and buffer management schemes [7] are designed for homogeneous flows (sometimes with labelled packets) and are not suitable for heterogeneous flows in home routers in the wide-area network.

Optimizations for latency consistency. Multiple schemes aim to offer consistently low latency for latency-sensitive applications such as videoconferencing, at the end hosts [36, 21, 13] and/or in the network [49, 38]. There are also application-specific solutions such as frame-rate or bit-rate adaptation [51, 36] and latency compensation [63]. Confucius is orthogonal to such solutions.

Inter-flow fairness. Fairness across flows dates back to the birth of congestion control [42]. Recent work analyzes fairness in different scenarios [58] or defines fairness for different applications [70, 72]. There are also measurement studies investigating inter-CCA fairness with emerging CCAs [64, 54, 40]. Confucius, for its part, maintains long-term fairness across flows.