
Confucius: Achieving Consistent Low Latency with
Practical Queue Management for Real-Time Communications

Zili Meng Hong Kong University of Science and Technology zilim@ust.hk , Nirav Atre Carnegie Mellon University natre@cs.cmu.edu , Mingwei Xu Tsinghua University xumw@tsinghua.edu.cn , Justine Sherry Carnegie Mellon University sherry@cs.cmu.edu and Maria Apostolaki Princeton University apostolaki@princeton.edu

Abstract.

Real-time communication applications require consistently low latency, which is often disrupted by latency spikes caused by competing flows, especially Web traffic. We identify the root cause of disruptions in such cases as the mismatch between the abrupt bandwidth allocation adjustment of queue scheduling and gradual congestion window adjustment of congestion control. For example, when a sudden burst of new Web flows arrives, queue schedulers abruptly shift bandwidth away from the existing real-time flow(s). The real-time flow will need several RTTs to converge to the new available bandwidth, during which severe stalls occur. In this paper, we present Confucius, a practical queue management scheme designed for offering real-time traffic with consistently low latency regardless of competing flows. Confucius slows down bandwidth adjustment to match the reaction of congestion control, such that the end host can reduce the sending rate without incurring latency spikes. Importantly, Confucius does not require the collaboration of end-hosts (e.g., labels on packets), nor manual parameter tuning to achieve good performance. Extensive experiments show that Confucius outperforms existing practical queueing schemes by reducing the stall duration by more than 50%, while the competing flows also fairly enjoy on-par performance.

1. Introduction

Real-time (RT) video communications, including a range of applications from video conferencing to cloud gaming and VR/AR streaming, are becoming the dominant traffic on the Internet. These applications require low and consistent latency to maximize the user experience (sigcomm2022zhuge, ).

Significant research has been dedicated to ensuring a satisfactory user experience through minimizing and stabilizing the end-to-end latency. Indeed, congestion control algorithms (CCAs) reduce the queueing delay (ton2017webrtc, ; ray2022sqp, ; nsdi2018copa, ); forward error correction (FEC) improves the loss recovery (nsdi2023tambur, ; nsdi2024hairpin, ); multiple path transport mitigates fluctuation in wireless settings (mm23twinstar, ; sigcomm2023cellfusion, ; sigcomm2023converge, ); while co-design with the video codec (nsdi2023afr, ; nsdi2018salsify, ) and wireless routers (sigcomm2022zhuge, ; mmsys2015macadapt, ) controls the delay in these components. Unfortunately, these works mainly focus on how to mitigate the effect of network fluctuations after the fact, instead of addressing their root cause. As a result, latency fluctuations still routinely occur, causing stalls and deterioration of the performance of the real-time flow (imc2021can, ; imc2022enabling, ).

Figure 1. The scenario where the real-time flow is affected by competing flows. When Web flows join the competition with the real-time flow, the available bandwidth of the real-time flow will be immediately reduced. Note that even loading one Web page can have tens of concurrent active flows.

In this paper, we show that unpredictable flow competition in the network layer can cause severe network fluctuations that drastically affect real-time flows (§ 2). For instance, loading a single Web page creates nine concurrent Internet connections (on average), sharply reducing the available bandwidth for the competing real-time flows and causing stalls in multiple practical settings such as home routers, as shown in Fig. 1. Congestion control alleviates the issue by reducing the real-time flow’s congestion window or sending rate after end hosts observe latency increases or packet loss, but by then it is already far too late. Indeed, congestion control takes several RTTs to react and converge to the new available bandwidth, and the packets sent in excess of the allocated bandwidth during this convergence period increase the end-to-end delay. Such endpoint-based optimizations, in general, cannot fundamentally prevent this performance degradation since the onset of competing traffic is unpredictable.

A natural solution to flow competition is to manage the router queue and prevent the available bandwidth of the real-time flow from shrinking. Researchers have tried to achieve this for several decades. In differentiated services (DiffServ) (rfc4594diffserv, ) (including L4S (rfc9330l4s, )), the router recognizes the pre-defined labels (priorities) in packet headers and schedules packets based on these labels. However, such a design is not incentive-compatible in practice: applications have the incentive to mark their packets with higher priority, which eventually leads to the Tragedy of the Commons – routers will not respect the labels, and endpoints cannot count on using them. Another category of solutions on the router is active queue management (AQM), which tries to notify the sender in advance before the queue builds up (cacm2012codel, ; sigcomm1997red, ; sigcomm2022zhuge, ). We demonstrate in § 2 that these are still reactive mechanisms and cannot prevent stalls from happening.

We argue that the root cause for stalls in the real-time flow is the mismatch in reaction time between the bandwidth allocation mechanisms on routers and rate adaptation mechanisms on endpoints. In Fig. 1, when nine new flows triggered by a single website (in yellow) suddenly compete with an existing real-time flow (in blue), the available bandwidth of the real-time flow is immediately reduced to 1/10 of what it was. However, the sender’s congestion window needs several round-trip times (RTTs) to gradually adjust to match the new available bandwidth. During this adjustment period, packets that are sent in excess of the allocated bandwidth will induce congestion, resulting in bufferbloat and stalls. While the Web flows will complete within one or two seconds and relinquish their bandwidth share, the real-time flow will have already experienced significant degradation. We find that existing queue scheduling and management algorithms ignore the transient temporal behaviors during the network change, leading to the stalls. This highlights a critical need for a queue management scheme that takes into account the convergence time of the congestion control to prevent such stalls.

To this end, we designed Confucius,¹ a practical queue management scheme that aims at providing consistent low latency for real-time flows independently of competing flows at the bottleneck. Instead of abruptly changing bandwidth allocations when a burst of new flows arrives, Confucius gradually adjusts the service rates to give existing flows a few RTTs to detect the change in network conditions and adjust their congestion windows. In this case, the excessively sent packets will be reduced and the latency for the real-time flow can be maintained.

¹ Confucius’ (the philosopher) educational philosophy is teaching students by their essences. In this paper, we serve the flows by their essences.

We design Confucius to fulfill three fundamental requirements related to consistency, fairness and incentive-compatibility (§ 3.1). First, Confucius needs to provide latency consistency to real-time flows independently of the number, rate, or congestion control of the competing flows. Confucius achieves this by offering a theoretical upper bound for the latency fluctuation experienced by real-time flows, which we also validate through experiments. Second, Confucius should eventually be fair. For instance, in Fig. 1, the performance of Web flows should not be sacrificed. To achieve this, Confucius smoothly moves service rates towards the fair allocation within a few RTTs. Finally, Confucius’s classification of real-time flows should be practical and on-router, without relying on end hosts for traffic classification. Unfortunately, the alternative – flow classification algorithms – is usually expensive and sensitive to protocols (hotnets2019inc, ). Confucius classifies flows by how aggressively they occupy the buffer at the bottleneck router, a metric that directly reflects how important low latency is to a flow.

We implement Confucius both in the NS-3 simulator and as kernel modules on Linux-based routers. Note that Confucius is designed for last-mile routers (e.g., home routers in Fig. 1) where the competition can lead to congestion, and the computation is more flexible since home routers are mainly Linux-based.² With real-world bandwidth and Web page traces, we show that compared to FqCoDel (Linux’s default) (rfc8290fqcodel, ), Confucius simultaneously reduces the stall duration of the real-time flow by 60%-69% and the loading time of Web pages by 39%-48% for the top 1000 websites. Compared to other baselines that do not require labels from end hosts, Confucius can still effectively reduce the stall duration of the real-time flows by at least 21% (§ 7.2) with negligible computation overhead (§ 7.5). In the meantime, long-lived, bulk transfers experience no degradation at all relative to fair queueing, and the impact on the flow completion time of short Web flows is limited to at most 10% even compared with shortest job first (SJF, which strictly prioritizes the Web flows). We will release all traces and code of this paper.

² A measurement shows that 91% of home routers are Linux-based (HomeRouterSecurity, ).

2. Motivation

We start by describing recent trends that call for consistent low latency (§ 2.1). Next, we explain via an intuitive example why existing solutions fail to achieve consistent low latency under flow competition (§ 2.2, 2.3 and 2.4).

2.1. The rise of real-time traffic

While the Internet has always been shared among multiple applications, the proliferation of real-time communication applications (e.g., videoconferencing, cloud gaming, virtual reality) has made sharing of bottleneck links particularly challenging. Real-time applications require not just low latency but consistently low latency while sending at moderate to high throughputs (ranging from tens to hundreds of Mbps) (sigcomm2022zhuge, ; sigcomm2020tack, ; imc2022measurement, ). For real-time applications, latency consistency is extremely critical to user experiences. For example, a transient increase in latency to 200 ms might cause cloud gaming users to lose (qomex2015cgsteam, ). Therefore, controlling the latency fluctuation and achieving a consistent low latency for real-time applications is essential.

Setting & Scope: This work focuses on end-user access points (e.g., wireless or wired home routers), where it is well-known that congestion and latency fluctuation are frequent (sigcomm2022zhuge, ; ccr2017lastmile, ; imc2020persistent, ). Despite recent advances in wireless technologies such as 5G and WiFi 6, the last-mile access routers are still likely to be the cause of jitter, irrespective of whether the last-mile is wired (nsdi2023crab, ) or wireless (sigcomm2020measure5g, ; imc2017fastack, ; sigcomm2022zhuge, ). As most such routers are Linux-based (linux-router, ; HomeRouterSecurity, ), they allow for flexible traffic management in software, which is a great opportunity for innovation. Our experiments and data involve applications used in those settings. We also include a benchmark on a Linux-based router (§ 7.5). Congestion in other settings (e.g., losses in the Internet core (sigcomm2018inferring, ) or datacenters (nsdi2015pias, )) is out of scope for this work.

2.2. Motivating example

To better illustrate the problem and the limitations of existing approaches, we revisit the example of Fig. 1 in detail. Consider a user who is on a video call, and their housemate (with whom they share the home router) decides to load a Web page. Technically, one existing real-time flow on the bottleneck will compete with the new flows from one Web page. We simulate the real-time flow’s delay of each video frame using NS-3 and present the results in Fig. 2 (details in § 6). Before considering other queue management mechanisms, let us focus on the performance of FIFO (square markers).

Before t=0s, the sending rate of the real-time flow has converged, with the video frame delay fluctuating around 60ms, which is much lower than the stall threshold (190ms³) required by the application. However, when flows from loading the homepage of amazon.com join the competition on the bottleneck, the end-to-end delay for the real-time flow sharply increases. Using FIFO, the delay goes up to more than 400 ms, and stays above the threshold for almost one second, during which a stall occurs and the user experience is impaired. When using FqCoDel, the delay of the real-time flow is even worse since fair queueing shifts more bandwidth away and CoDel drops more packets. We find that the delay always spikes regardless of the underlying CCA (§ 7.2).

³ This is the recommended network delay for video chats by ITU (itu-ddl, ).

2.3. Root cause analysis

We argue that the delay spike is caused by (i) the burst of flows and packets from the competing Web page; (ii) the abrupt reallocation of the available bandwidth by queue management; and (iii) the gradual reaction from the congestion control. Next, we will elaborate on how these common factors result in performance degradation, and explain the limitations of existing works.

The source of the burst: One Web page triggers multiple, concurrently-active flows. To understand the burst in Fig. 2, we measure the flows triggered by amazon.com over time. Concretely, we measure the number of sockets that are marked OPEN and IN_USE by NetLog (netlog, ) from Chrome. We also measure the number of active flows that receive bytes every 10 ms through packet captures (ACTIVE). As shown in Fig. 3(a), loading only the homepage generates up to 68 flows in total, where up to 12 flows run simultaneously. This is due to the Web design of hosting different objects (e.g., images, videos, ads, scripts) in various domains. Note that this is not due to the parallel connections in HTTP/1.1 – we show in Appx. A that more than half of the flows go to different unique IPs.

This pattern of triggering multiple flows to load one page is shared across different websites. We measured the homepages of the Top 1000 websites in November 2023 from the saved Alexa list and present the distribution in Fig. 3(b). We find that the median number of concurrent ACTIVE flows is 8 while the 90th percentile is 19. The highest one in the Top 200, dailymail.co.uk, has up to 50 active flows and 250 open sockets at the same time. We present the structure and list some famous websites in Appx. A. Moreover, for some websites (e.g., Wikipedia and Google), loading other pages triggers more flows compared to the almost blank home page, which will further exacerbate the degradation experienced by the real-time flow.

The cause of the delay spike: Queue schedulers sharply reallocate service rates. Queue management typically reacts to the instantaneous conditions of all flows in the queue. Revisiting our example, when the page loading starts, tens of packets of Web flows immediately arrive at the bottleneck, creating a queue. At the same time, the real-time flow only has a few packets in the queue since it always tries to keep the queue near-empty (nsdi2018copa, ; ton2017webrtc, ). We illustrate the bandwidth share of different queue management schemes in Fig. 4. For FIFO (Fig. 4(a)), the service rates for different flows are proportional to the number of bytes per flow in the queue; thus, the available bandwidth for the real-time flow will be drastically reduced. Fair queueing (FQ, Fig. 4(b)) makes matters worse and allocates even less bandwidth to the real-time flow, since the short Web flows far outnumber the real-time flow. Concretely, in the amazon.com example, 12 new flows joining the fair queueing router will directly reduce the available bandwidth of the real-time flow to 1/13.
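The two allocation rules described above can be made concrete with a minimal sketch. It assumes idealized FQ (equal share per active flow) and idealized FIFO (share proportional to queued bytes); the function names and the packet counts and sizes in the usage example are hypothetical and chosen only for illustration.

```python
def fq_share(n_existing: int, n_new: int) -> float:
    """Per-flow bandwidth fraction under idealized fair queueing."""
    return 1.0 / (n_existing + n_new)


def fifo_share(flow_bytes: int, other_bytes: int) -> float:
    """Bandwidth fraction under FIFO, proportional to a flow's queued bytes."""
    return flow_bytes / (flow_bytes + other_bytes)


if __name__ == "__main__":
    # One real-time flow joined by 12 Web flows, as in the amazon.com example.
    print("FQ share:", fq_share(1, 12))  # 1/13 of the link

    # Hypothetical queue snapshot: 4 real-time packets of 1200 B versus
    # 12 Web flows with 2 packets of 1500 B each (assumed numbers).
    rt_bytes = 4 * 1200
    web_bytes = 12 * 2 * 1500
    print("FIFO share:", fifo_share(rt_bytes, web_bytes))
```

In both cases the real-time flow's share falls in a single step, which is the abrupt change that the CCA then has to chase.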

Such a sharp decrease in the available bandwidth causes a delay spike to the real-time flow. This is because the CCA needs to gradually probe and match its sending rate to the new available bandwidth, which takes several RTTs (dashed green line in Fig. 5). While the number of in-flight packets is converging to the new bandwidth-delay product, the excessive in-flight packets will cause bufferbloat and result in high end-to-end latency for the real-time flow.

Active queue management (AQM) algorithms, which notify the sender about the network conditions by dropping or marking ECN on packets, cannot prevent stalls either. This is mainly because flows driven by different congestion control algorithms (CCAs) have different perceptions of congestion (e.g., delay, loss, rate). Therefore, as shown in Fig. 2, CoDel (cacm2012codel, ) leads to a latency spike even higher than that of FIFO. We observe similar limitations in other AQMs (§ 7.2).

The fact that is hard to change: Congestion control takes a longer time to converge. As we discussed, the issue is that when the competing flows join, the available bandwidth for the real-time flow drops immediately, but the end-to-end CCA cannot immediately reduce the in-flight packets to fit the new available bandwidth. End-to-end CCAs do not know how much to reduce and have to reduce step-by-step.

Some proposals are designed to help the CCA to quickly converge to the new available bandwidth, such as XCP (sigcomm2002xcp, ), RCP (infocom2008rcp, ), Kickass (icnp2016kickass, ), and ABC (nsdi2020abc, ). However, none of these proposals works unless both end hosts and routers collaboratively deploy the protocol; otherwise, they offer no improvement. This poses significant barriers to deployment on the Internet (sigcomm2022zhuge, ). Moreover, during the convergence of the CCA, the excessive in-flight packets also inflate the RTT. For most CCAs that update on RTT timescales (e.g., adjusting the sending rate every RTT), the update period will, in turn, inflate after the first several packets. In the example in Fig. 2, before the Web flows join, the RTT for the real-time flow is around 40 ms. However, during the competition, the RTT inflates to hundreds of milliseconds. Putting all factors together, we can see that for all baselines that do not require labels, the delay spike of the real-time flow goes up to at least 400 ms.

2.4. Limitations of related works

One line of solutions is DiffServ (rfc4594diffserv, ), which labels the flows of interest in advance and schedules them differently on the router using StrictPriority or weighted classes as shown in Fig. 2. This also includes the recent proposal L4S (rfc9330l4s, ) which we will later evaluate in § 7.2. While this is deployable in datacenters (sigcomm2016karuna, ), it is not practical on the Internet. End hosts have the incentive to fake their labels if that could help their flows have better performance. It is also challenging to coordinate the end host and router on the Internet in the real world since they usually belong to different entities. Even with perfect labels, achieving optimal performance requires optimal allocation of bandwidth across the different classes of traffic. To understand why this is challenging, consider some canonical solutions. StrictPriority, albeit guaranteeing the latency for the real-time flow, will drastically harm the performance of competing Web flows (§ 7.2). Allocating bandwidth for different classes using pre-defined weights needs accurate estimation of the bandwidth demands from both classes, and inaccurate estimation easily leads to unfairness or latency spikes. For example, if we set the ratio between the real-time flows and Web flows to 1:1, the Web flows will suffer from degraded PLT since they cannot obtain their fair shares (Fig. 4(c)), while a ratio of 1:5 still leads to a latency spike for the real-time flow (Fig. 2).

Several further mechanisms exist, which, unfortunately, still respond reactively to network changes. Zhuge (sigcomm2022zhuge, ) reduces the feedback loop between the router and the endpoint from one RTT to sub-RTT levels, but CCA convergence still requires multiple RTTs (§ 2.3). Using the example in Fig. 5, Zhuge tightens the turning point of the green dashed line, but the dominant contributor to delay – the time it takes for the green dashed line to converge to the blue line – persists. FEC is designed for loss recovery (nsdi2023tambur, ; nsdi2024hairpin, ) and is hardly helpful in our example since most of them have no loss at all. Multipath transport will switch to the new path (sigcomm2023converge, ; sigcomm2023cellfusion, ; nsdi2024augur, ), but this also occurs only after the sender observes drastic degradation in the current path. Real-time flows still have to suffer from stalls during the adaptation period. Bandwidth estimations from the wireless link layer and below (sigcomm2022zhuge, ; sigcomm2020pbecc, ) are not effective either since the link capacity does not change during the competition.

3. Confucius Design

Our previous observations motivate Confucius, a practical queue management scheme for achieving consistent and low latency for real-time flows that is designed to work on home routers. We describe Confucius’s design requirements in § 3.1 before we give an overview of Confucius in § 3.2.

3.1. Design Requirements

R1: The performance of the real-time flow should be robust to any competing flows. Confucius stands out among queue management algorithms in that it theoretically guarantees worst-case performance, no matter what congestion control algorithms and competing flows are. This will, in consequence, fundamentally address the root cause of latency fluctuation induced by unpredictable competing traffic. It is easy to vaguely describe Confucius as ‘controlling latency fluctuations’ but it is harder to formulate this into a rigorous service model. We theoretically calculate performance bounds for a few classes of applications that might use Confucius. We demonstrate that with Confucius, real-time flows have a near-constant bound of latency degradation (around 250 ms in § 7.3), no matter how large and how many competing flows join the bottleneck.

R2: Latency consistency should not come at the cost of long-term fairness. Confucius should still follow per-flow fairness in the long run. To do so, Confucius moves rates towards a fair allocation quickly and pushes the blue solid line in Fig. 5 to match the green dashed line. In this case, the latency spike will be controlled and the bandwidth for the competing flows will be largely protected as well. Technically, Confucius adjusts the service rate of flows using exponentially weighted moving average (EWMA) (lucas1990ewma, ), as shown in Fig. 4(d). This allows the CCA to gradually react following the bandwidth share of Confucius, and also converges to the fair share in several RTTs. Note that the RTT is not inflated due to the excessive packets. Our experiments (§ 7.3) and theoretical analysis (§ 4.3) show that such a design can effectively achieve fairness and latency consistency.

R3: The identification of real-time flows should not rely on end hosts. A naive solution is to split the flows by their age. However, this is not practical since flows driven by different CCAs or having distinct objectives should not share the same queue either. Meanwhile, using FQ to split old flows cannot provide low latency to the bursty flows (macgregor2000deficits, ), which is usually the case for real-time video streaming. Thus, we still need to identify different types of flows. To make Confucius incentive-compatible and deployable in practice, we aim to identify the flows of interest at the router itself, without relying on end hosts. The performance improvements should be directly observed by the router vendor without going through endless coordination between end-host content providers and router vendors in IETF. In § 5, we illustrate how Confucius identifies flows based on their queue occupancy: built on the CCA evolution, real-time flows naturally occupy a small fraction of the buffer (e.g., GCC (ton2017webrtc, ; nsdi2018copa, )), while throughput-oriented flows are observed to be buffer-filling (e.g., Cubic). Confucius uses the queue occupancy to differentiate the flows in the queue.

3.2. Design Overview

At a high level, Confucius classifies flows to queues and strategically assigns a portion of the link capacity to each of them, as illustrated in Fig. 6.

To address goals R1 and R2, Confucius leverages a simple yet powerful insight from § 3.1: Upon the arrival of competitors, the reduction of the available bandwidth of existing flows is inevitable if we want to preserve long-term throughput fairness. Yet, we can gradually and cautiously control the reduction of the available bandwidth during the transient period. Thus, we can eliminate the mismatch between the sending rate of the CCA and the service rate at the bottleneck link for existing real-time flows, thereby taming the latency fluctuation. We expand on this insight and the EWMA reweighting mechanism in § 4.

For R3, by grouping flows with similar queue occupancy into the same queue, flows with different queue occupancies will not affect each other. Meanwhile, with a fixed number of queues to schedule between (instead of per-flow queues such as FQ), latency-sensitive flows will have a consistent latency. Thus, Confucius uses a set of queues ( $Q_{1},Q_{2},Q_{3}$ ), each designed to accommodate old flows with different buffer occupancies, and a separate queue ( $Q_{new}$ ) dedicated to new flows. It then adopts a Deficit-Weighted Round-Robin (DWRR) algorithm to schedule between these queues. Finally, Confucius periodically measures flow characteristics and reclassifies flows using a hysteresis-based mechanism to further increase robustness in practice (§ 5).
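To illustrate the scheduling step, the following is a minimal sketch of one round of textbook DWRR over named queues. It is not the paper's kernel implementation: the function name `dwrr_round`, the 1500-byte base quantum, and the weights in the usage example are assumptions made for illustration. Packets are represented only by their sizes in bytes.

```python
from collections import deque


def dwrr_round(queues, weights, deficits, base_quantum=1500):
    """Run one DWRR round and return the (queue_name, packet_size) pairs sent.

    queues:   dict mapping queue name -> deque of packet sizes (bytes)
    weights:  dict mapping queue name -> scheduling weight
    deficits: dict mapping queue name -> deficit counter (updated in place)
    """
    sent = []
    for name, q in queues.items():
        if not q:
            deficits[name] = 0  # an idle queue does not accumulate credit
            continue
        # Each backlogged queue earns credit in proportion to its weight.
        deficits[name] += weights[name] * base_quantum
        # Send head-of-line packets while the credit covers them.
        while q and q[0] <= deficits[name]:
            size = q.popleft()
            deficits[name] -= size
            sent.append((name, size))
        if not q:
            deficits[name] = 0
    return sent


if __name__ == "__main__":
    # Hypothetical snapshot: Q_new currently holds twice the weight of Q1.
    queues = {"Q1": deque([1500] * 4), "Q_new": deque([1500] * 4)}
    weights = {"Q1": 1.0, "Q_new": 2.0}
    deficits = {"Q1": 0, "Q_new": 0}
    print(dwrr_round(queues, weights, deficits))
```

Because the bandwidth split follows the weights, changing a queue's weight over time is enough to change its service rate smoothly, without touching the scheduler itself.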

4. Age-aware Flow Weights Adjustment

In this section, we explain the benefits of exponential bandwidth reallocation (§ 4.1) and dive into Confucius’ weight adjustment (§ 4.2). We then analytically show that it guarantees bounded performance degradation, both for existing real-time flows and newly-arrived competing flows (§ 4.3).

4.1. Exponential bandwidth re-allocation

We first quantitatively demonstrate the advantage of gradually controlling the real-time flow’s bandwidth allocation compared to directly cutting its available bandwidth to its fair share. We measured the stall duration $y$ for the real-time flow in the scenario of a sudden reduction of available bandwidth for four low-latency CCAs (§ 4). Concretely, $y$ denotes the total time during which the end-to-end delay exceeds 190 ms. We plot $y$ as a function of the Available Bandwidth Reduction Factor (ABRF, the factor by which the available bandwidth is reduced) for different CCAs in Fig. 7(a). We find that CCAs respond poorly to sudden, large reductions in bandwidth. For instance, reducing GCC’s available bandwidth to 1/16 of its initial value (i.e., $ABRF=16$ ) results in a stall of $y>10$ seconds. The relationship between the stall duration and ABRF ( $y=f_{CCA}(ABRF)$ ) is super-linear.

To avoid such stalls, Confucius gradually reduces the available bandwidth for the real-time flow. For instance, to achieve a final ABRF of 16, we can reduce the available bandwidth four times, each by half. Fig. 7(b) demonstrates, in the ideal case, the value proposition of this approach. Compared to the super-linear stall duration (solid line copied from Fig. 7(a)), exponentially reducing the sending rate will only increase the stall duration logarithmically with the ABRF (modulated by $f_{CCA}(2)$ , a small constant).
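As a worked instance of the halving argument above, assume the idealized case in which each halving step costs the same stall $f_{CCA}(2)$ and the steps do not interact (a simplification made here only for illustration). Reaching a final reduction factor $ABRF$ then takes $\log_{2}(ABRF)$ steps, so

$$y_{gradual}(ABRF)\approx\log_{2}(ABRF)\cdot f_{CCA}(2),\qquad y_{gradual}(16)\approx 4\,f_{CCA}(2),$$

which is to be compared against the one-shot cost $f_{CCA}(16)$ of cutting the bandwidth to 1/16 in a single step.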

Such a smooth reallocation of available bandwidth allows the CCA to learn the reduced bandwidth allocation, and is also robust to the number or size of competing flows. No matter how many flows compete with the real-time flow, the curve of the available bandwidth of the real-time flow is fixed so the delay will remain the same. Meanwhile, adjusting the bandwidth share exponentially yields fast convergence to the fair share, satisfying requirements R1 and R2 together. We prove in § 4.3, that Confucius guarantees that the long-term fairness will not be impaired, and the degradation of the performance for new flows will always be within a constant, additive factor of the FCT under a strictly fair allocation.
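The intuition can be checked with a toy fluid model, sketched below. This is not the NS-3 setup used in the paper. It assumes a sender that multiplies its rate by `beta` once per RTT while it has a backlog and adds `alpha` otherwise, a fixed 40 ms update interval, one existing flow, and 15 new flows of equal age. All of these parameters and function names are hypothetical. The gradual share follows from giving the new flows a total weight of $\min(2^{\lambda t},N)$, with $\lambda=0.004$ ms$^{-1}$.

```python
def abrupt_share(t_ms, n_new=15):
    """Existing flow's link fraction when cut to its fair share at once."""
    return 1.0 / (1 + n_new)


def gradual_share(t_ms, n_new=15, lam=0.004):
    """Existing flow's link fraction when new flows ramp up exponentially."""
    return 1.0 / (1.0 + min(2.0 ** (lam * t_ms), n_new))


def peak_queueing_delay_ms(share_fn, rtt_ms=40.0, duration_ms=3000.0,
                           beta=0.7, alpha=0.05):
    """Peak queueing delay (ms) of the existing flow in the toy model."""
    rate = 1.0    # sending rate, as a fraction of link capacity
    queue = 0.0   # backlog, in units of (link capacity x 1 ms)
    peak = 0.0
    t = 0.0
    while t < duration_ms:
        share = share_fn(t)
        # Backlog grows by whatever is sent beyond the allocated share.
        queue = max(0.0, queue + (rate - share) * rtt_ms)
        peak = max(peak, queue / share)
        if queue > 0.0:
            rate *= beta                    # back off while a queue exists
        else:
            rate = min(1.0, rate + alpha)   # probe upward when drained
        t += rtt_ms
    return peak


if __name__ == "__main__":
    print("abrupt :", peak_queueing_delay_ms(abrupt_share))
    print("gradual:", peak_queueing_delay_ms(gradual_share))
```

In this toy setting the abrupt cut builds a far larger backlog than the gradual one, because the sender's step-by-step back-off lags the instantaneous drop in its share.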

4.2. Adjustment Mechanism

To assign service rates to queues, Confucius uses the following process. For each flow, $f$ , Confucius computes a weight, $w_{f}$ , to represent its share of the bandwidth (service rate). Confucius groups new flows into a separate queue called $Q_{new}$ (depicted in Fig. 6). All existing flows which are mapped to other queues are assigned flow weights of $w_{f}=1$ , and are collectively denoted as set $\mathcal{F}_{ext}$ . The flow weights of all flows in $Q_{new}$ are computed as follows:

$$w_{f}=\min\left(\frac{|\mathcal{F}_{ext}|}{|Q_{new}|}\cdot 2^{\lambda t},\ 1\right),\quad f\in Q_{new}\tag{1}$$

Then, for a given queue, $Q$ , the weight is the sum of weights of all flows in $Q$ . There are several considerations in Eq. 1:

Age-aware exponential adjustment $\left(2^{\lambda t}\right)$ . As described in § 3.1, Confucius exponentially increases the weights of new flows, where the bandwidth shares are illustrated in Fig. 4(d). Here, $t$ represents the age (in milliseconds) of the new flow, and $\lambda$ is a parameter that controls the speed of the rate adjustment of new flows – their flow weights double every $\frac{1}{\lambda}$ milliseconds. A large $\lambda$ (e.g., $\lambda\to\infty$ ) leads to abrupt reductions in available bandwidth and causes latency spikes, while a small $\lambda$ (e.g., $\lambda=0$ ) results in unfairness for new flows. Consequently, we configure $\lambda$ so that the available bandwidth for the real-time flow drops as fast as possible without outpacing the responsiveness of the underlying CCAs.

Moreover, different CCAs have different response times to congestion. For example, Copa needs 5 RTTs to reduce its sending rate, while BBR’s response time is dictated by its probing interval of 6-8 RTTs. To deal with the heterogeneity of CCAs on the Internet (sigmetrics2020gordon, ), we set $\lambda$ as the inverse of the response time of the least responsive CCA among common latency-sensitive CCAs. This ensures that even the least responsive CCA can smoothly react to bandwidth changes. Recall that we measure how different CCAs respond to bandwidth reductions in Fig. 7(a), which shows BBR being the least responsive CCA: When the ABRF is 2, BBR suffers from the longest stall compared with other CCAs due to its long probing period of 6-8 RTTs. Thus, given a typical RTT of 30-50 ms for Web services (www2021wisetrans, ), we set $\lambda=0.004$ ms $^{-1}$ to have a doubling interval of $\frac{1}{\lambda}=250$ ms, matching BBR’s probing period. Experiments in § 7.2 demonstrate satisfactory results for not only BBR but also other CCAs.

Initial weight $\left(\frac{|\mathcal{F}_{ext}|}{|Q_{new}|}\right)$ . To allocate sufficient share for new flows, we scale the initial weight of new flows with the number of existing flows. For each new flow, we set the initial weight to $\frac{|\mathcal{F}_{ext}|}{|Q_{new}|}$ , where $|\mathcal{F}_{ext}|$ and $|Q_{new}|$ are the numbers of existing and new flows, respectively. This can limit the bandwidth reduction for existing flows to be less aggressive than a factor-of-2 reduction. In this case, the stall duration can logarithmically scale from $f_{CCA}(2)$ , as shown in Fig. 7(b).

Upper bound $\left(\min(...,\ 1)\right)$ . Confucius uses a flow weight threshold of $1$ to ‘age out’ new flows from the $Q_{new}$ queue. Once the flow weight of a flow reaches 1, the flow is no longer considered new and is moved to one of the other queues based on the output of the Flow Classifier (§ 5).
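The three components of Eq. 1 can be put together in a short sketch. The function names are hypothetical, and the sketch assumes that all new flows arrive at the same time (so they share one age $t$) and that each flow's bandwidth share is proportional to its weight. The default $\lambda=0.004$ ms$^{-1}$ is the value from the text.

```python
import math

LAMBDA_PER_MS = 0.004  # from the text: weights double every 1/lambda = 250 ms


def new_flow_weight(n_existing, n_new, age_ms, lam=LAMBDA_PER_MS):
    """Eq. 1: weight of one flow in Q_new at the given age."""
    return min((n_existing / n_new) * 2.0 ** (lam * age_ms), 1.0)


def existing_flow_share(n_existing, n_new, age_ms, lam=LAMBDA_PER_MS):
    """Link fraction of one existing flow (weight 1), assuming shares are
    proportional to weights and all new flows have the same age."""
    total_new = n_new * new_flow_weight(n_existing, n_new, age_ms, lam)
    return 1.0 / (n_existing + total_new)


def age_out_ms(n_existing, n_new, lam=LAMBDA_PER_MS):
    """Age at which a new flow's weight reaches the upper bound of 1."""
    if n_new <= n_existing:
        return 0.0
    return math.log2(n_new / n_existing) / lam


if __name__ == "__main__":
    # One real-time flow joined by 12 Web flows, as in the amazon.com example.
    for age in (0, 250, 500, 1000):
        print(age, "ms ->", existing_flow_share(1, 12, age))
    print("new flows age out after", age_out_ms(1, 12), "ms")
```

Under these assumptions the existing flow starts at half of the link rather than 1/13, and only reaches its fair share once the new flows' weights hit the upper bound.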

4.3. Theoretical Analysis

We still follow the same example in § 2.2. Consider one real-time flow running by itself on a bottleneck link. At $t=0$ , $N$ new flows, each with size $B$ , join the same bottleneck link and compete with the existing flow. $B_{0}$ is the initial congestion window for Web flows. We show that Confucius guarantees bounded stall for the existing real-time flow while yielding FCTs for Web flows within a constant additive factor of what FQ provides. For simplicity, we summarize the results in Tab. 1 and leave the analytical details to Appx. B.

For FQ and FIFO, we observe that the stall duration ( $q^{max}_{P}$ ) scales linearly with the number of new flows, $N$ , and is therefore unbounded; $N$ can exceed 100 for some Web pages (Fig. 3). This is quite straightforward: when $N$ flows start to compete with the real-time flow, the available bandwidth of the real-time flow drops to $1/N$ . Intuitively, the larger $N$ is, the more the available bandwidth for the real-time flow drops, resulting in drastic delay fluctuation.

Policy $P$	Maximum queueing delay $q^{max}_{P}$	FCT degradation $T_{P}-T_{FQ}$
FQ	$\approx{\color[rgb]{1,0,0}\textit{N}}\left(\frac{2}{3}\sqrt{\frac{2}{k}}+q_{0}+\tau\right)$	$0$
FIFO	$\approx\left(\frac{{\color[rgb]{1,0,0}\textit{N}}B_{0}}{q_{0}C}+1\right)\left(\frac{2}{3}\sqrt{\frac{2}{k}}+q_{0}+\tau\right)$	$\lessapprox 0$
CBQ	$\approx\frac{2}{3}\sqrt{\frac{2}{k}}+q_{0}+\tau$	$\approx\frac{({\color[rgb]{1,0,0}\textit{N}}-1){\color[rgb]{1,0,0}\textit{B}}}{C}$
Confucius	$\approx 6q_{0}+15\tau+\frac{8\lambda}{k}+\frac{(10q_{0}+15\tau)\lambda^{2}}{k}$	$\approx\frac{\log_{2}e}{\lambda}$

Table 1. Approximations for different schedulers $P$ on their maximum queueing delay ( $q^{max}_{P}$ ) and FCT degradation against FQ ( $T_{P}-T_{FQ}$ ). Confucius has a bounded performance degradation for all flows. In the competition, existing schedulers have either unbounded delay or unbounded FCT degradation. The terms that are unbounded under workload changes ( $N$ and $B$ ) are marked in red.

For class-based queues (CBQ, weighted class), pre-labeling the real-time flow enables the scheduler to allocate a fixed bandwidth share to the real-time flow, resulting in a constant stall. However, if the weights are not accurate (i.e., they do not match the traffic ratio), CBQ converges unfairly, and the FCT degradation for new flows becomes unbounded (§ 2.4).

Finally, Confucius yields bounded performance degradation for both sets of flows. On one hand, Confucius ensures that the stall for real-time flows is a constant that depends only on the CCA’s latency sensitivity (denoted by $q_{0}$ ), the responsiveness ( $k$ ), the feedback loop ( $\tau$ ), and Confucius’s decay parameter ( $\lambda$ ).⁴ On the other hand, Confucius also ensures that the FCT degradation for new flows is bounded by an additive constant determined by the decay parameter ( $\lambda$ ), which becomes negligible as flow sizes increase.

⁴ When using Copa with an RTT of 40 ms, $q_{\textsf{Confucius}}^{max}$ is $\approx$ 640 ms. As we show experimentally in § 7.2, the actual delay using Confucius is much lower.
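To give a sense of scale for the FCT bound in Tab. 1, the expression can be evaluated at the value $\lambda=0.004$ ms ${}^{-1}$ chosen in § 4.2. This is a worked evaluation of the tabulated approximation only, not a measured result.

```latex
\[
T_{\textsf{Confucius}} - T_{FQ}
\;\approx\; \frac{\log_{2} e}{\lambda}
= \frac{1}{\lambda \ln 2}
\approx \frac{1.4427}{0.004\,\text{ms}^{-1}}
\approx 361\,\text{ms}.
\]
```

The bound contains neither $N$ nor $B$, which is why its relative impact shrinks as flows grow larger.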

5. Occupancy-aware Flow Classification

As described in § 3.2, Confucius seeks to classify flows into groups, each with a dedicated queue based on how aggressively they consume buffer space. We find that flows implicitly demonstrate their preferences and objectives based on how they utilize the bottleneck queue. We measure the buffer occupancy of 7 CCAs (the top-5 CCAs used in websites (sigmetrics2020gordon, ) plus two recent latency-sensitive CCAs, GCC and Copa), over real-world bandwidth traces (§ 7.1). We further measure the network RTT at the sender and the queue utilization on the bottleneck router. A lower RTT indicates that this CCA is more latency-sensitive. As we can see in Fig. 9, GCC, Copa, and Vegas have a low network RTT. Such CCAs achieve low latency by trying to keep the bottleneck queue as short as they can. Real-time applications can choose these CCAs to achieve low latency. In contrast, throughput-oriented CCAs (Cubic, Yeah, and Illinois) will maximize the queue utilization for high throughput. This allows us to identify the latency sensitivity of flows by their queue occupancy: if one flow has a low queue occupancy at the bottleneck, it indicates that (i) that flow tries to not overutilize the queue; and (ii) that flow can co-exist with other flows with similar behaviors.

In this section, we present our hysteresis-based mechanism to robustly identify the flows (§ 5.1) and our implementation considerations (§ 5.2).

5.1. Hysteresis-based Adjustment

Confucius puts short flows into a separate queue $Q_{new}$ and classifies long flows with different buffer occupancy aggressiveness into $Q_{1},Q_{2},\cdots,Q_{n}$ . Queue indices increase with the buffer target, i.e., $Q_{1}$ will be shorter than $Q_{3}$ , as shown in Fig. 6. Each queue $Q_{i}$ targets a buffer occupancy of $q_{0}^{(i)}$ . We robustly classify flows as follows:

Classification of new flows. The buffer aggressiveness of a flow may take a long time to manifest. Thus, Confucius does not characterize short flows lasting only a few RTTs (§ 2). When a new flow is ready to be moved out of the new-flow queue $Q_{new}$ into one of the old queues (i.e., its weight reaches one, as elaborated in § 4.2), we measure the buffer occupancy of that flow, $q_{f}$ , i.e., the number of packets in the queue that belong to flow $f$ . We then find the queue $i$ with the nearest target $q_{0}^{(i)}$ to accommodate this flow.
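The nearest-target rule for a flow leaving $Q_{new}$ can be sketched as follows. This is an illustrative sketch only: the target values are hypothetical placeholders, and it assumes that $q_{f}$ and the targets $q_{0}^{(i)}$ are expressed in the same unit (here, a fraction of the buffer).

```python
from typing import Sequence


def nearest_queue(q_f: float, targets: Sequence[float]) -> int:
    """Return the 1-based index i of the queue whose target q0^(i)
    is closest to the measured occupancy q_f of the flow."""
    if not targets:
        raise ValueError("at least one queue target is required")
    best = min(range(len(targets)), key=lambda i: abs(q_f - targets[i]))
    return best + 1


if __name__ == "__main__":
    # Hypothetical targets for Q_1, Q_2, Q_3 (not values from the paper).
    example_targets = [0.15, 0.50, 0.90]
    for occupancy in (0.10, 0.60, 0.95):
        print(occupancy, "-> Q_%d" % nearest_queue(occupancy, example_targets))
```

Ties are resolved in favor of the lower-indexed queue here, which is a choice of this sketch rather than something the text specifies.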

Periodic adaptation. Confucius periodically examines flows and queues and moves flows accordingly in two steps. While seemingly complex, these operations are well within the capabilities of Linux-based routers (§ 6).

Intra-queue examination identifies outstanding flows among other flows in the current queue. Confucius compares the share of the buffer that each flow occupies ( $\frac{q_{f}}{\sum_{g\in Q_{i}}q_{g}}$ ) with its fair share ( $\frac{1}{|Q_{i}|}$ ). If the buffer occupancy of a flow is larger than its fair share:

\begin{equation}
\small\frac{q_{f}}{\sum_{g\in Q_{i}}q_{g}}\geqslant\frac{1}{|Q_{i}|}+\alpha \tag{2}
\end{equation}

the flow is too aggressive for the current queue, where $\alpha>0$ is a hysteresis margin. Confucius will promote that flow from queue $Q_{i}$ to $Q_{i+1}$ to keep $Q_{i}$ near its control target. Similarly, a flow with an outstandingly lower buffer occupancy, i.e.:

\begin{equation}
\small\frac{q_{f}}{\sum_{g\in Q_{i}}q_{g}}\leqslant\frac{1}{|Q_{i}|}-\alpha \tag{3}
\end{equation}

will be demoted from queue $Q_{i}$ to $Q_{i-1}$ . Here we set $\alpha$ to 10% based on our previous observations in Fig. 9. Our evaluation in § 7 shows that the performance of Confucius is not sensitive to the workloads and CCAs.
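The promotion and demotion rules of Eq. (2) and Eq. (3) can be sketched in a few lines of Python. The function name, the dictionary-based interface, and the clamping at the first and last queue are assumptions of this sketch and are not described in the text.

```python
from typing import Dict

ALPHA = 0.10  # hysteresis margin from the text (10%)


def intra_queue_moves(occupancy: Dict[str, int], queue_index: int,
                      num_queues: int, alpha: float = ALPHA) -> Dict[str, int]:
    """Return {flow: new_queue_index} for flows that should change queue.

    occupancy maps each flow in Q_i to its number of buffered packets q_f.
    Assumption: flows in Q_1 cannot be demoted and flows in Q_n cannot be
    promoted (the text does not discuss these boundary cases).
    """
    total = sum(occupancy.values())
    if not occupancy or total == 0:
        return {}
    fair_share = 1.0 / len(occupancy)
    moves = {}
    for flow, q_f in occupancy.items():
        share = q_f / total
        if share >= fair_share + alpha and queue_index < num_queues:
            moves[flow] = queue_index + 1   # Eq. (2): too aggressive
        elif share <= fair_share - alpha and queue_index > 1:
            moves[flow] = queue_index - 1   # Eq. (3): too conservative
    return moves


if __name__ == "__main__":
    print(intra_queue_moves({"a": 70, "b": 20, "c": 10}, queue_index=2, num_queues=3))
    print(intra_queue_moves({"a": 50, "b": 50}, queue_index=2, num_queues=3))
```

Flows whose share stays within $\alpha$ of the fair share are left where they are, which is what keeps the classification from oscillating.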

Queue-level examination checks if the length of a queue fits the queue’s control target. If the length of a queue exceeds the safe region between its control target and that of the neighboring queue, Confucius moves all flows in the current queue to that neighboring queue, as shown in Fig. 9. This is needed because the intra-queue examination only considers the relative occupancy across flows. Thus, it cannot identify the case where the flows in the current queue are comparably aggressive to one another but collectively more aggressive than the target of this queue. For example, assume that two Cubic flows were previously classified into $Q_{1}$ (the least aggressive) due to being throttled elsewhere. When these Cubic flows start to be aggressive, Confucius needs to move them to a different queue to protect incoming latency-sensitive flows.

5.2. Design Considerations

In practice, Confucius has the following two design considerations.

Number of queues to set. We observe that the CCAs are concentrated in three clusters (circles in Fig. 9). Concretely, GCC, Copa, and Vegas have a queue occupancy of less than 20%; Cubic, Illinois, and Yeah have a queue occupancy of more than 80%; and BBR stays in between. Therefore, we set three queues and use the average queue occupancy in these three clusters as our targets $\{q_{0}^{(i)}\}$ . We expect other CCAs to fall into one of these three representative categories; if not, we can configure Confucius to work with more queues.

Variation of buffer aggressiveness. A flow’s buffer aggressiveness can change over time. For example, a Cubic flow throttled or congested elsewhere (on a different router) will not be aggressive in buffer occupancy (although Cubic, the algorithm, would be). Such a Cubic flow can share the queue with other delay-sensitive flows. However, when the bottleneck moves to the current router, this Cubic flow becomes aggressive in buffer occupancy and can no longer share the queue. Our reclassification mechanism is capable of correctly moving the flows, as evaluated in § 7.6.

6. Confucius implementation

Implementing Confucius in the Linux kernel poses several challenges. We discuss them and our solutions below.

Order preservation during reclassification. Flows can be moved to another class at runtime. Thus, we need to preserve the packet order of a flow while Confucius reclassifies it. To this end, we adopt a virtual class design in Confucius. During the enqueue process of new packets, we bind the sk_buff to its flow. During the dequeue process, we search all flows that are bound to the selected class and dequeue the packet with the earliest enqueue time. In this way, when moving a flow to another class, we only need to rebind the pointer of the flow from the previous class to the new class.
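The virtual class idea can be illustrated with a small user-space Python model. This is not the kernel code: the class and method names are hypothetical, packets are plain Python objects rather than sk_buff structures, and a monotonically increasing sequence number stands in for the enqueue time.

```python
import itertools
from collections import deque


class VirtualClassScheduler:
    """Packets are stored per flow; a flow is only *bound* to a class.

    Moving a flow to another class changes one binding and never touches
    the buffered packets, so per-flow packet order is preserved.
    """

    def __init__(self):
        self._packets = {}     # flow_id -> deque of (enqueue_seq, packet)
        self._binding = {}     # flow_id -> class_id
        self._seq = itertools.count()  # stand-in for the enqueue timestamp

    def enqueue(self, flow_id, packet, default_class="Q_new"):
        self._binding.setdefault(flow_id, default_class)
        self._packets.setdefault(flow_id, deque()).append((next(self._seq), packet))

    def rebind(self, flow_id, new_class):
        """Reclassify a flow by rebinding it; buffered packets stay in place."""
        self._binding[flow_id] = new_class

    def dequeue(self, class_id):
        """Dequeue the earliest-enqueued packet among flows bound to class_id."""
        best_flow, best_seq = None, None
        for flow_id, bound_class in self._binding.items():
            queue = self._packets.get(flow_id)
            if bound_class != class_id or not queue:
                continue
            head_seq = queue[0][0]
            if best_seq is None or head_seq < best_seq:
                best_flow, best_seq = flow_id, head_seq
        if best_flow is None:
            return None
        return self._packets[best_flow].popleft()[1]


if __name__ == "__main__":
    sched = VirtualClassScheduler()
    sched.enqueue("f1", "a")
    sched.enqueue("f2", "b")
    sched.enqueue("f1", "c")
    sched.rebind("f1", "Q_1")          # reclassify f1 while packets are queued
    print(sched.dequeue("Q_1"), sched.dequeue("Q_1"), sched.dequeue("Q_new"))
```

The linear scan over flows in `dequeue` keeps the sketch short; a real implementation would likely use a more efficient per-class index.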

Reducing computational overhead. To implement Confucius in the Linux kernel, we need to keep its computational overhead strictly low. Specifically, we adopt the following two optimizations:

(i) Bit shifts for exponential operations. Confucius reweights flows based on their ages with an exponential function, yet floating-point arithmetic in the kernel is expensive. Therefore, we quantize the weight of new flows in units of $\frac{1}{128}$ and use bit shifts for the exponential weight updates, i.e., left-shifting the weight by one bit every $\frac{1}{\lambda}$ milliseconds.

(ii) Periodic reweighting and reclassification. Reweighting and reclassification are not necessary for each packet. For reweighting, we only need to reweight a flow every $\frac{1}{\lambda}$ milliseconds; with $\lambda=0.004$ , this means reweighting every 250 ms. For reclassification, after moving a flow to a new class, we should observe the result for at least one RTT to measure the queue utilization and the behavior of the flow in the new class. Therefore, we also reclassify flows periodically, with a reclassification interval of 100 ms.
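The fixed-point reweighting in (i) can be sketched as follows, in Python for readability rather than kernel C. The names are hypothetical, and two details are assumptions of this sketch: the initial weight is clamped to at least one unit so that shifting can make progress, and the shift count is capped the way a fixed-width kernel integer would require.

```python
WEIGHT_ONE = 128            # weight 1.0 in units of 1/128
DOUBLING_INTERVAL_MS = 250  # 1/lambda for lambda = 0.004 per ms
MAX_SHIFTS = 7              # 1 << 7 == 128, so more shifts are never needed


def initial_weight_fp(n_existing: int, n_new: int) -> int:
    """Quantized initial weight |F_ext| / |Q_new| in units of 1/128.

    Assumption: clamp to [1, 128]; a weight of 0 would never grow by shifting.
    """
    if n_new <= 0:
        raise ValueError("n_new must be positive")
    weight = (n_existing * WEIGHT_ONE) // n_new
    return max(1, min(weight, WEIGHT_ONE))


def weight_fp_at(age_ms: int, w0: int) -> int:
    """Weight after age_ms: one left shift per elapsed doubling interval,
    capped at 128 (weight 1.0), using integer arithmetic only."""
    shifts = min(age_ms // DOUBLING_INTERVAL_MS, MAX_SHIFTS)
    return min(w0 << shifts, WEIGHT_ONE)


if __name__ == "__main__":
    w0 = initial_weight_fp(n_existing=1, n_new=8)   # 16, i.e. 16/128 = 1/8
    for age in (0, 249, 250, 500, 750):
        print(age, weight_fp_at(age, w0))
```

Note that this produces a stepwise approximation of $2^{\lambda t}$: the weight is constant within each 250 ms interval and doubles at its boundary.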

7. Evaluation

We first present our experimental setup (§ 7.1); then we evaluate Confucius by answering the following questions:

• How does Confucius behave compared to baselines on real-world Web and bandwidth traces? Confucius reduces the stall duration of a real-time flow by 21% to 87% with various CCAs while maintaining comparable FCTs (§ 7.2).
• How sensitive is Confucius to workloads? Confucius is consistently performant with different sizes and numbers of Web flows, following our theoretical analysis (§ 7.3).
• How does Confucius scale to multiple flows with different CCAs? We demonstrate that Confucius can correctly separate coexisting flows with different CCAs and provide consistent performance to all of them (§ 7.4).
• How does Confucius perform in the testbed prototype? We implement Confucius in the Linux kernel and show that Confucius reduces the stall duration by more than 60% with reasonable overhead over real Web traces (§ 7.5).
• We further show that Confucius can outperform baselines when working with multiple real-time flows, bandwidth-probing CCAs, and different bottlenecks (§ 7.6).

7.1. Experiment Setup

Ns-3 setup. In § 7.2, 7.3 and 7.4, we evaluate the performance of Confucius with ns-3.34. We use the example in Fig. 1 and limit the capacity of the bottleneck link based on the bandwidth traces from (sigcomm2022zhuge, ). The dataset contains 3 sets of cellular traces (C1: mixed; C2: 4G; C3: 5G) and 2 sets of WiFi traces (W1: Office; W2: Restaurant), where the average bandwidth ranges from 22 Mbps to 375 Mbps (details in § C.1). The round-trip propagation delay is set to 40 ms based on (sigcomm2022zhuge, ). We adopt the RTC library in ns-3 from (nsdi2024hairpin, ; sigcomm2022zhuge, ). We evaluate the real-time flow with different delay-sensitive CCAs, including Copa (nsdi2018copa, ) (used by Meta Live (flive-quic, )), GCC (ton2017webrtc, ) (used by WebRTC), and BBR (queue2016bbr, ). The Web flows use the most widely deployed CCA (sigmetrics2020gordon, ), Cubic (sigops2008cubic, ).

Web traces. To compose a realistic and relevant dataset of Web traffic, we collected the Alexa Top-1000 websites⁵ (alexa-top1k, ). We use Selenium (selenium, ) to automatically load Web pages through Google Chrome (version 120.0.6099.218), and use NetLog (netlog, ) and browsermobproxy (browsermob, ) to record the packets and socket states. The measurement was run in November 2023, with the resulting distribution shown in Fig. 3. The version of HTTP is negotiated with the website, where the majority is HTTP/1.1 and HTTP/3.0. We show in Appx. A that although the parallel connections of HTTP/1.1 contribute to the concurrency, the majority of concurrency still comes from the diverse objects on the Web page. We replay the Web traces to test a variety of scenarios.

⁵ Although the Alexa Top website list has been deprecated, we still use this list since it is the most well-known list for top websites.

Baselines. We compare Confucius with multiple schedulers, categorized and listed below. We use the default parameters in the Linux kernel 4.4.0 and ns-3.34 for baselines.

Schedulers that do not require labels (note that Confucius does not require labels either):

(1) FIFO and (2) FQ+CoDel (rfc8290fqcodel, ), the default qdiscs in Linux (before and after systemd v217 (systemd-qdisc, ), respectively).
(3) FQ+FIFO is fair queueing without an AQM.
(4) CoDel (cacm2012codel, ) and (5) RED (ton1993red, ) drop packets before the queue overflows to notify the sender.
(6) SJF (shortest job first) prioritizes short flows over long flows, which is exactly the opposite of what Confucius tries to do. We take the implementation from PIAS (nsdi2015pias, ).
(7) HHF (sigcomm2002hhf, ) (heavy-hitter filter) differentiates between small flows and heavy hitters and schedules them separately.

Schedulers that require labels:

(8) DualQ (rfc9332dualq, ) is a recently proposed scheduler in L4S (rfc9330l4s, ) that protects latency-sensitive flows with labels, using the DSCP bits to identify the traffic and notify the sender.
(9) DualQ+Prague. DualQ provides ECN signals and, by design, works best with TCP Prague (briscoe-iccrg-prague-congestion-control-03, ) in the L4S framework. We adopt the implementation from (prague-ns3, ).
(10) CBQ (1:1) and (11) CBQ (1:5) are weighted class-based queues, which put flows into different classes based on application labels. We set the weights of the two classes (RT:Web) to 1:1 and 1:5, respectively, and evaluate both.
(12) StrictPriority strictly prioritizes traffic from real-time flows if they are labeled accordingly.

Metrics. We focus on the following metrics in experiments.

• Stall duration for video frames is the duration for which the delay of the video frame is greater than 190 ms. This reflects users’ experiences on video stalls (itu-ddl, ; sigcomm2022zhuge, ; infocom2022dams, ). We use this metric to evaluate how the RT flow is affected.
• Page Load Time (PLT) is the time until the last HTTP request in a Web page is completed. We use this metric to evaluate the performance of Web traffic. PLT degradation refers to the increase in PLT compared to FQ.

Besides, we also evaluate other metrics in different experiments, which we will elaborate on accordingly.
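One way to compute the stall duration metric from a per-frame delay log is sketched below. The accounting rule is an assumption of this sketch (each stalled frame is charged the time until the next frame is captured), and the function name and input format are hypothetical; only the 190 ms threshold comes from the text.

```python
from typing import Sequence, Tuple

STALL_THRESHOLD_MS = 190.0  # frame-delay threshold from the metric definition


def stall_duration_ms(frames: Sequence[Tuple[float, float]],
                      threshold_ms: float = STALL_THRESHOLD_MS) -> float:
    """Total time during which the frame delay exceeds the threshold.

    frames: (capture_time_ms, delay_ms) pairs sorted by capture time.
    Assumption: a frame whose delay exceeds the threshold contributes the
    interval until the next frame; the last frame contributes nothing.
    """
    total = 0.0
    for (t, delay), (t_next, _) in zip(frames, frames[1:]):
        if delay > threshold_ms:
            total += t_next - t
    return total


if __name__ == "__main__":
    # Synthetic example: frames every 33 ms, two of them delayed beyond 190 ms.
    log = [(0, 50), (33, 200), (66, 250), (99, 100), (132, 60)]
    print(stall_duration_ms(log))
```

The example log is synthetic and serves only to show the accounting; it is not taken from the traces used in the evaluation.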

7.2. Confucius under a realistic workload

Simulation scenario. We have a long-running real-time flow from the RTC module in ns-3. We then randomly select a website and replay the traces we collected to compete with the real-time flow. We set the interval of loading two websites to 53 seconds, which is the average Web page viewing time (page-stay, ). In each run, we measure the duration where the frame delay of the video flow is larger than 190 ms (stall). We also measure the Web PLT for websites. We repeat the experiment for three CCAs for the real-time flow. In this subsection, we present the results over C1 bandwidth traces and leave others to Appx. C.1. We present the average PLTs and stall durations in Fig. 10, and dive into distributions later.

Confucius strikes a balance between video and Web performance that is consistent across CCAs. In Fig. 10(a), we observe that schedulers not relying on labels from the end host (marked in blue) suffer from long video stalls. For example, when the real-time flow adopts Copa, using FQ+CoDel or FIFO, the real-time flow experiences a stall of 200 ms to 300 ms on average when loading different websites. AQMs such as CoDel further impair the performance of the Web flows since AQMs drop packets for those flows. In contrast, Confucius reduces the stall duration to less than 100 ms, an improvement of more than half over all other baselines that do not require labels.

Schedulers requiring labels (marked in green) protect prelabeled RT flows, but considerably degrade the PLT for the Web traffic. DualQ+Prague improves on DualQ since the CCA on the end host can react more effectively, but it still incurs a considerable penalty on Web flows because it drops their packets (they are non-scalable flows in DualQ’s design). Note that even when using StrictPriority, the real-time flow still suffers from 60 ms degradation on average due to bandwidth fluctuation. Confucius is almost on par with schemes relying on end-host labels in terms of the delay of the real-time flows. Remember that it is unrealistic to assume that an end host will correctly label all traffic (§ 2.4).

Most importantly, Confucius does not incur too much penalty on the PLT of Web pages, and pushes the Pareto front of the schedulers not requiring labels (the dashed blue line) forward. Confucius reduces the PLT compared to CoDel, RED, FQ+CoDel, and HHF since Confucius gently adjusts the bandwidth share for the Web flows. Even compared to FQ+FIFO, Confucius only increases the average PLT by up to 8% (up to 88 ms) in three subfigures, which is much smaller than the improvement on the stall duration. The improvement when using BBR is not as significant as the other two CCAs – this is due to the suboptimal performance of BBR in controlling the delay for the real-time flow, as we also saw in Fig. 9.

Confucius protects the real-time flow when competing with traffic from 86% of the websites. We further break down the results of Fig. 10(a) by website in Fig. 11. In Fig. 11(a), we present the number of websites that affect the delay of the real-time flow, i.e., that cause a non-negligible stall (frame delay above 190 ms). The lower the number, the better the scheduler performs. When using Confucius, the real-time flow suffers from stalls when competing with only 136 out of 1000 websites. For the other baselines that do not require labels, however, the number ranges from 288 websites (FIFO) to 537 websites (SJF); Confucius reduces the number by 53%-75%. Even for baselines requiring labels, excluding CBQ (1:5), the number ranges from 101 to 143. This further shows that Confucius achieves protection for the real-time flow that is comparable to, and sometimes even better than, that of label-based solutions.

We also measure the number of websites suffering from a PLT longer than 2 seconds, which is the threshold for good user experience (sigcomm2019e2e, ). When using FQ+CoDel, 227 websites suffer from long PLTs, while Confucius reduces this number by almost half, to 127. Confucius’s results are comparable to FQ+FIFO, demonstrating the fairness of Confucius for competing Web flows. For baselines requiring labels that behave well in Fig. 11(a), at least 198 websites suffer from long PLTs. The closest label-based baseline is CBQ (1:1), which we can also see from Fig. 10. Besides the unrealistic label requirement, we will later show in § 7.3 that CBQ (1:1) does not scale to the variation of workloads. Fig. 10 evaluates Web flows; if the competing flows are FTP flows instead, the performance will degrade drastically since CBQ (1:1) adopts a fixed ratio between classes.

Confucius controls the delay and PLT following the theoretical analysis. Fig. 12(a) further presents the distribution of stall duration when the video flow encounters Web flows from different websites in the dataset. With FQ+CoDel or FIFO, the stall for the real-time flow lasts longer than 500 ms for 12% (FIFO) to 18% (FQ+CoDel) of websites, whereas the number for Confucius is 1%. Moreover, with Confucius, the real-time flow does not experience any stall when competing with 95% of the websites, comparable to CBQ. Importantly, besides the PLT measured in Fig. 11(b), Confucius does not over-penalize Web traffic: 60% of websites do not suffer from any penalty against FQ, as shown in Fig. 12(b), which mostly corroborates our previous theoretical analysis. We further present the distribution of the maximum delay experienced by the real-time flow in Fig. 12(c). The fraction of websites with a maximum delay of $>$ 500 ms is 1% using Confucius, while for FIFO and FQ+CoDel it is 5% and 18%, respectively. This further demonstrates that Confucius can control the latency fluctuation not only in terms of the stall duration but also directly in terms of the raw delay. The network delay of all packets of the real-time flow in Fig. 12(d) corroborates this as well. The results when using GCC and BBR are similar.

7.3. Confucius under workload changes

In this subsection, we test our theoretical analysis by investigating whether Confucius can provide consistent performance in controlled workloads. We vary the workload by changing the number of flows in a Web page and the size of Web flows. We measure the stall duration in different scenarios and the degradation on the PLT against FQ.

Confucius is bounded by theoretical thresholds, confirming our analysis. We vary the number of Web flows from 5 to 100, each with the size of 15KB (medium flow size in our measurement), and summarize our results in Fig. 14(a). The stall duration for FQ and FIFO increases with the number of flows: when the number of Web flows goes to 60, the real-time flow experiences a stall for more than half a second. On the contrary, Confucius maintains zero stall in this setting, similar to CBQ. We further compare the experimental results with our previous analysis in § 4.3. As we can see in the dashed line in Fig. 14, the experimental results corroborate our theoretical analysis for Confucius in Tab. 1.

We further change the size of Web flows (from short flows to long flows) to see whether Confucius can handle all types of competing traffic. As the flow size increases, the competing flows change from short flows (e.g., Web) to long flows (e.g., FTP). We vary the size of Web flows from 15KB to 9MB, and run 5 flows of the same size to compete with the real-time flow. When using FIFO, the real-time flow suffers from drastic stalls because FIFO fails to provide inter-CCA fairness across flows, as shown in Fig. 15(a). The real-time flow using FQ also has a long stall of hundreds of milliseconds. In contrast, Confucius is still able to achieve negligible stall for the real-time flow and bounded PLT degradation for Web flows at the same time.

7.4. Heterogeneous Flow Classification

In this subsection, we zoom in on Confucius’s flow classification mechanism. We find that Confucius can accurately group flows of the same/similar CCA together without any prior knowledge or labels from end hosts, which in turn leads to better performance compared to the baselines.

We simultaneously run four long flows of different CCAs: one Cubic flow, one BBR flow, one GCC flow, and one Copa flow for 100 seconds. We plot the network delay for each flow over time in Fig. 13(a). We can clearly see that Copa and GCC enjoy a consistent low latency around 40-60 ms, even when they are competing with BBR and Cubic flows.

To understand Confucius’s superior performance, we look at its classification results over time and present them in Fig. 13(b). The four bars represent the classification results of Confucius for the four flows over time, while the three colors indicate which queue each flow is classified into. Confucius classifies flows using different CCAs into the correct queues: the Copa and GCC flows are stably put into the low-occupancy queue ( $Q_{1}$ , blue), the BBR flow into the medium-occupancy queue ( $Q_{2}$ , yellow), and the Cubic flow into the high-occupancy queue ( $Q_{3}$ , green). This follows our previous observation in Fig. 9: Copa and GCC both demonstrate similarly low buffer occupancy, Cubic occupies the buffer aggressively, and BBR falls in the middle. We make two further observations. First, the Cubic flow can temporarily be in the same queue as BBR, as shown by the yellow lines in the green bar in Fig. 13(b). This is expected, as the Cubic flow has (at times) a low queue occupancy in its probing period. Second, flows with different CCAs can co-exist in the same queue as long as they have similar buffer occupancy. In this experiment, the Copa and GCC flows are put into the same queue since they have similar buffer occupancy. As we can see in Fig. 13(a), these two flows still have consistently low latency throughout.

We also measure Jain’s fairness index (JFI) to quantify the fairness of different schemes. We compare the results (the delay of the Copa and GCC flows, and the JFI of all flows) in the same experiment with other schedulers in Fig. 13(c). With Confucius, the Copa and GCC flows also enjoy a reasonably fair share of the bandwidth, as with FQ: the JFI in this experiment is 0.98, where a JFI closer to one indicates better fairness.
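For reference, Jain's fairness index over $n$ per-flow throughputs $x_{1},\dots,x_{n}$ is $\left(\sum_{i}x_{i}\right)^{2}/\left(n\sum_{i}x_{i}^{2}\right)$. The following minimal Python sketch evaluates this standard formula; the function name and the example inputs are illustrative and are not the measurements behind Fig. 13(c).

```python
from typing import Iterable


def jain_fairness_index(throughputs: Iterable[float]) -> float:
    """Jain's fairness index: (sum x)^2 / (n * sum x^2).

    Ranges from 1/n (one flow takes everything) to 1 (perfectly equal shares).
    """
    xs = list(throughputs)
    if not xs:
        raise ValueError("at least one throughput value is required")
    sum_sq = sum(x * x for x in xs)
    if sum_sq == 0:
        raise ValueError("index is undefined when all throughputs are zero")
    return sum(xs) ** 2 / (len(xs) * sum_sq)


if __name__ == "__main__":
    print(jain_fairness_index([10, 10, 10, 10]))  # equal shares
    print(jain_fairness_index([1, 0, 0, 0]))      # one flow takes everything
```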

7.5. Testbed Experiments

We implement Confucius as a kernel module of queue disciplines (qdisc) in traffic control in Linux kernel 4.4.0 (1.4k LoCs) and evaluate the performance of Confucius on a software router based on Intel Xeon E5-2620 v4 CPU. We run the official implementation of Copa (copa-ccp, ). We find that Confucius achieves significant benefits in kernel implementations while only adding marginal processing delay.

We stream the video frames using socket and TCP Copa, and measure the end-to-end delay for the real-time flow. We then set up an HTTP server based on Python to replay the Web traces we collected. We also measure the computational overhead of Confucius and the baselines. We record the processing time for the enqueue and dequeue operation in Linux tc using printk, where the reweight and reclassification in Confucius are both implemented.

As shown in Fig. 16(a), Confucius reduces the stall duration by more than 60% without the need for labels on each packet. This result is similar to our simulation in Fig. 10(a). Moreover, from our experiments, 86% of websites when using Confucius do not suffer from stall at all. In contrast, this number is only 56% and 30% for FIFO and FQ.

We vary the number of long-running flows to measure the overhead of Confucius. Note that the processing time of Confucius is insensitive to the number of short flows, as they all belong to the new-flow queue. As shown in Fig. 16(b), Confucius slightly increases the processing time for each packet compared to FQ. Even for 100 concurrent long-running flows, the per-packet processing time of Confucius is still 5 $\mu s$ , corresponding to a processing rate of 200 kpps, which is of the same order of magnitude as FQ. Since Confucius is mainly designed for last-mile routers such as home routers, this can satisfy the daily usage of home access points or last-mile routers. We stress that the kernel implementation of Confucius can be further optimized for high-performance execution. We leave the exploration of Confucius with far more flows (e.g., on routers in the core network) to future work.

7.6. Microbenchmarks

We further evaluate the performance of Confucius in a series of microbenchmarking settings. In Appx. C.3, we demonstrate that the hysteresis mechanism of Confucius (§ 5.1) can work with bandwidth-probing CCAs (e.g., BBR) and stably and correctly classify flows. We further show in Appx. C.2 that Confucius does not have side effects on fairness, and investigate in Appx. C.4 the case where the bottleneck is not the router on which Confucius is deployed. Even if multiple real-time flows are competing simultaneously, Confucius can still handle those flows and provide significant performance improvements against baselines (Appx. C.5).

8. Limitations

We discuss some other related work besides § 2.3 in Appx. D, and outline some limitations of Confucius here.

Applications using latency-sensitive CCAs. In this paper, we assume that real-time applications will use latency-sensitive CCAs. For example, video conferencing applications will use CCAs such as GCC and Copa but not Cubic (sigcomm2022zhuge, ). This generally holds since the operators of applications optimize towards their goals, and low latency is clearly a goal of real-time applications. If an application does not follow this, the application’s CCA itself is problematic, and the effect of Confucius will be limited.

Web flow characteristics for mobile applications. The measurement of Web flows in § 2 and Appx. A is based on loading Web pages from a desktop browser (Google Chrome). We do not conduct measurements over mobile apps due to device limitations. However, the root cause contributing to the bursts still exists in mobile apps: one page contains diverse objects (images, videos, scripts) from different domains. We leave the investigation of Web pages on mobile apps to future work.

Scaling Confucius to core routers. We mainly discuss the bottleneck at the edge routers in this paper. This is because when the network delay increases, it is more likely to happen at the edge (sigcomm2022zhuge, ). In core routers, due to the high line rate, 100 new flows is not a big number and will likely not result in drastic fluctuation of the available bandwidth. Meanwhile, tracking per-flow state, as FQ does, is already burdensome in core routers and is therefore out of the scope of Confucius. Nevertheless, encouraged by recent sophisticated AQMs on high-performance switches (sigcomm2022abm, ), it might be possible to extend Confucius to core routers.

9. Conclusion

We propose Confucius, the first queue management scheme to protect real-time flows from competing flows without requiring any labels from end hosts. Confucius achieves this by grouping flows based on their latency preferences, which it infers by observing their buffer occupancy over time. Confucius gradually adjusts the service rates of flows to match the reaction of congestion control. Doing so allows Confucius to mitigate latency spikes of real-time flows. Extensive evaluation shows that Confucius protects the real-time flows from stalls when competing with 86% of websites, almost double that of numerous baselines.

This work does not raise any ethical issues.

References

[1] Gsoc2020prague - nsnam. https://www.nsnam.org/wiki/GSOC2020Prague.
[2] lightbody/browsermob-proxy: A free utility to help web developers watch and manipulate network traffic from their ajax applications. https://github.com/lightbody/browsermob-proxy.
[3] Netlog: Chrome’s network logging system. https://www.chromium.org/developers/design-documents/network-stack/netlog/.
[4] [systemd-devel] [announce] systemd 217. https://lists.freedesktop.org/archives/systemd-devel/2014-October/024662.html#:~:text=The%20default%20sysctl.d/%20snippets%20will%20now%20set%3A, 2014.
[5] Alexa Top Websites >> ExpiredDomains.net. https://member.expireddomains.net/domains/researchalexamillion/, 2022.
[6] Selenium. https://www.selenium.dev/, 2022.
[7] Vamsi Addanki, Maria Apostolaki, Manya Ghobadi, Stefan Schmid, and Laurent Vanbever. ABM: Active buffer management in datacenters. In Proc. ACM SIGCOMM, 2022.
[8] Mohammad Alizadeh, Abdul Kabbani, Berk Atikoglu, and Balaji Prabhakar. Stability analysis of QCN: the averaging principle. In Proc. ACM SIGMETRICS, 2011.
[9] Mohammad Alizadeh, Shuang Yang, Milad Sharif, Sachin Katti, Nick McKeown, Balaji Prabhakar, and Scott Shenker. pFabric: Minimal near-optimal datacenter transport. In Proc. ACM SIGCOMM, 2013.
[10] Anurag. What os does the router use? is it linux? - quora. https://www.quora.com/What-OS-does-the-router-use-Is-it-Linux#:~:text=Yes%20most%20of%20the%20router,the%20router%20as%20pre%2Dinstalled.
[11] Venkat Arun. Implementation of the copa congestion control algorithm using ccp. https://github.com/venkatarun95/ccp_copa, 2020.
[12] Venkat Arun, Mohammad Alizadeh, and Hari Balakrishnan. Starvation in End-to-End Congestion Control. In Proc. ACM SIGCOMM, 2022.
[13] Venkat Arun and Hari Balakrishnan. Copa: Practical delay-based congestion control for the internet. In Proc. USENIX NSDI, 2018.
[14] Wei Bai, Li Chen, Kai Chen, Dongsu Han, Chen Tian, and Hao Wang. Information-agnostic flow scheduling for commodity data centers. In Proc. USENIX NSDI, 2015.
[15] Vaibhav Bajpai, Steffie Jacob Eravuchira, and Jürgen Schönwälder. Dissecting last-mile latency characteristics. ACM SIGCOMM Computer Communication Review, 47(5):25–34, 2017.
[16] Fred Baker, Jozef Babiarz, and Kwok Ho Chan. Configuration Guidelines for DiffServ Service Classes. IETF RFC 4594, 2006.
[17] Nimantha Baranasuriya, Vishnu Navda, Venkata N Padmanabhan, and Seth Gilbert. Qprobe: Locating the bottleneck in cellular communication. In Proc. ACM CoNEXT, pages 1–7, 2015.
[18] Apurv Bhartia, Bo Chen, Feng Wang, Derrick Pallas, Raluca Musaloiu-E, Ted Tsung-Te Lai, and Hao Ma. Measurement-based, practical techniques to improve 802.11 ac performance. In Proc. ACM IMC, 2017.
[19] Bob Briscoe, Koen De Schepper, Marcelo Bagnulo, and Greg White. Low Latency, Low Loss, and Scalable Throughput (L4S) Internet Service: Architecture. RFC 9330, January 2023.
[20] Neal Cardwell, Yuchung Cheng, C Stephen Gunn, Soheil Hassas Yeganeh, and Van Jacobson. Bbr: Congestion-based congestion control. ACM Queue, 2016.
[21] Gaetano Carlucci, Luca De Cicco, Stefan Holmer, and Saverio Mascolo. Congestion control for web real-time communication. IEEE/ACM Transactions on Networking, 2017.
[22] Hyunseok Chang, Matteo Varvello, Fang Hao, and Sarit Mukherjee. Can you see me now? a measurement study of zoom, webex, and meet. In Proceedings of the 21st ACM Internet Measurement Conference, pages 216–228, 2021.
[23] Gwyn Chatranon, Miguel A Labrador, and Sujata Banerjee. Black: detection and preferential dropping of high bandwidth unresponsive flows. In Proc. IEEE ICC, 2003.
[24] Li Chen, Kai Chen, Wei Bai, and Mohammad Alizadeh. Scheduling mix-flows in commodity datacenters with karuna. In Proc. ACM SIGCOMM, 2016.
[25] Wei Chen, Liangping Ma, and Chien-Chung Shen. Congestion-aware mac layer adaptation to improve video teleconferencing over wi-fi. In Proceedings of ACM Multimedia Systems Conference (MMSys), 2015.
[26] Amogh Dhamdhere, David D Clark, Alexander Gamero-Garrido, Matthew Luckie, Ricky KP Mok, Gautam Akiwate, Kabir Gogia, Vaibhav Bajpai, Alex C Snoeren, and Kc Claffy. Inferring persistent interdomain congestion. In Proc. ACM SIGCOMM, pages 1–15, 2018.
[27] Sandesh Dhawaskar Sathyanarayana, Kyunghan Lee, Dirk Grunwald, and Sangtae Ha. Converge: Qoe-driven multipath video conferencing over webrtc. In Proc. ACM SIGCOMM, pages 637–653, 2023.
[28] Mo Dong, Qingxi Li, Doron Zarchy, P Brighten Godfrey, and Michael Schapira. Pcc: Re-architecting congestion control for consistent high performance. In Proc. USENIX NSDI, 2015.
[29] Nandita Dukkipati, Tiziana Refice, Yuchung Cheng, Jerry Chu, Tom Herbert, Amit Agarwal, Arvind Jain, and Natalia Sutin. An argument for increasing tcp’s initial congestion window. ACM SIGCOMM Computer Communication Review, pages 26–33, 2010.
[30] Cristian Estan and George Varghese. New directions in traffic measurement and accounting. In Proc. ACM SIGCOMM, pages 323–336, 2002.
[31] Wu-chang Feng, Dilip D Kandlur, Debanjan Saha, and Kang G Shin. Stochastic fair blue: A queue management algorithm for enforcing fairness. In Proc. IEEE INFOCOM, 2001.
[32] Wu-chun Feng, Apu Kapadia, and Sunil Thulasidasan. Green: proactive queue management over a best-effort network. In Proc. IEEE GLOBECOM, 2002.
[33] Marcel Flores, Alexander Wenzel, and Aleksandar Kuzmanovic. Enabling router-assisted congestion control on the internet. In Proc. IEEE ICNP, 2016.
[34] Sally Floyd and Van Jacobson. Random early detection gateways for congestion avoidance. IEEE/ACM Transactions on Networking, 1993.
[35] Romain Fontugne, Anant Shah, and Kenjiro Cho. Persistent last-mile congestion: not so uncommon. In Proc. ACM IMC, pages 420–427, 2020.
[36] Sadjad Fouladi, John Emmons, Emre Orbay, Catherine Wu, Riad S Wahby, and Keith Winstein. Salsify: Low-latency network video through tighter integration between a video codec and a transport protocol. In Proc. USENIX NSDI, 2018.
[37] Nitin Garg. Copa congestion control for video performance - engineering at meta. https://engineering.fb.com/2019/11/17/video-engineering/copa/, 2019.
[38] Prateesh Goyal, Anup Agarwal, Ravi Netravali, Mohammad Alizadeh, and Hari Balakrishnan. Abc: A simple explicit congestion controller for wireless networks. In Proc. USENIX NSDI, 2020.
[39] Sangtae Ha, Injong Rhee, and Lisong Xu. Cubic: a new tcp-friendly high-speed tcp variant. ACM SIGOPS Operating Systems Review, 2008.
[40] Mario Hock, Roland Bless, and Martina Zitterbart. Experimental evaluation of bbr congestion control. In Proc. IEEE ICNP, 2017.
[41] Toke Høiland-Jørgensen, Paul McKenney, Dave Taht, Jim Gettys, and Eric Dumazet. The Flow Queue CoDel Packet Scheduler and Active Queue Management Algorithm. IETF RFC 8290.
[42] Van Jacobson. Congestion avoidance and control. In Proc. ACM SIGCOMM, 1988.
[43] Dina Katabi, Mark Handley, and Charlie Rohrs. Congestion control for high bandwidth-delay product networks. In Proc. ACM SIGCOMM, 2002.
[44] Tong Li, Kai Zheng, Ke Xu, Rahul Arvind Jadhav, Tao Xiong, Keith Winstein, and Kun Tan. Tack: Improving wireless transport performance by taming acknowledgments. In Proc. ACM SIGCOMM, 2020.
[45] Dong Lin and Robert Morris. Dynamics of random early detection. In Proc. ACM SIGCOMM, 1997.
[46] Chengnian Long, Bin Zhao, Xinping Guan, and Jun Yang. The yellow active queue management algorithm. Elsevier Computer Networks, 2005.
[47] James M Lucas and Michael S Saccucci. Exponentially weighted moving average control schemes: properties and enhancements. Technometrics, 32(1):1–12, 1990.
[48] Mike H MacGregor and Weiguang Shi. Deficits for bursty latency-critical flows: DRR++. In Proc. IEEE International Conference on Networks (ICON), pages 287–293.
[49] Zili Meng, Yaning Guo, Chen Sun, Bo Wang, Justine Sherry, Hongqiang Harry Liu, and Mingwei Xu. Achieving Consistent Low Latency for Wireless Real Time Communications with the Shortest Control Loop. In Proc. ACM SIGCOMM, 2022.
[50] Zili Meng, Xiao Kong, Jing Chen, Bo Wang, Mingwei Xu, Rui Han, Honghao Liu, Venkat Arun, Hongxin Hu, and Xue Wei. Hairpin: Rethinking packet loss recovery in edge-based interactive video streaming. In To appear at USENIX NSDI, 2024.
[51] Zili Meng, Tingfeng Wang, Yixin Shen, Bo Wang, Mingwei Xu, Rui Han, Honghao Liu, Venkat Arun, Hongxin Hu, and Xue Wei. Enabling high quality real-time communications with adaptive frame-rate. In Proc. USENIX NSDI, 2023.
[52] Oliver Michel, Satadal Sengupta, Hyojoon Kim, Ravi Netravali, and Jennifer Rexford. Enabling passive measurement of zoom performance in production networks. In Proceedings of the 22nd ACM Internet Measurement Conference, pages 244–260, 2022.
[53] Ayush Mishra, Xiangpeng Sun, Atishya Jain, Sameer Pande, Raj Joshi, and Ben Leong. The great internet tcp congestion control census. In Proc. ACM SIGMETRICS, 2020.
[54] Sándor Molnár, Balázs Sonkoly, and Tuan Anh Trinh. A comprehensive tcp fairness analysis in high speed networks. Computer Communications, 32(13-14):1460–1484, 2009.
[55] Yunzhe Ni, Zhilong Zheng, Xianshang Lin, Fengyu Gao, Xuan Zeng, Yirui Liu, Tao Xu, Hua Wang, Zhidong Zhang, Senlang Du, et al. Cellfusion: Multipath vehicle-to-cloud video streaming with network coding in the wild. In Proc. ACM SIGCOMM, pages 668–683, 2023.
[56] Kathleen Nichols and Van Jacobson. Controlling queue delay. Communications of the ACM, 2012.
[57] Rong Pan, Lee Breslau, Balaji Prabhakar, and Scott Shenker. Approximate fairness through differential dropping. ACM SIGCOMM Computer Communication Review, 33(2):23–39, 2003.
[58] Adithya Abraham Philip, Ranysha Ware, Rukshani Athapathu, Justine Sherry, and Vyas Sekar. Revisiting tcp congestion control throughput models & fairness properties at scale. In Proc. ACM IMC, pages 96–103, 2021.
[59] Devdeep Ray, Connor Smith, Teng Wei, David Chu, and Srinivasan Seshan. Sqp: Congestion control for low-latency interactive video streaming. arXiv preprint arXiv:2207.11857, 2022.
[60] ITU Recommendations. G.1070 : Opinion model for video-telephony applications. https://www.itu.int/rec/T-REC-G.1070, 2018.
[61] Jacqueline Renouard. The average time spent on a website: Increase visitor engagement. website/#:~:text=The%20average%20time%20spent%20on%20a%20web%20page%20ranges%20depending,industries%2C%20is%20around%2053%20seconds., 2023.
[62] Michael Rudow, Francis Y Yan, Abhishek Kumar, Ganesh Ananthanarayanan, Martin Ellis, and KV Rashmi. Tambur: Efficient loss recovery for videoconferencing via streaming codes. In Proc. USENIX NSDI, pages 953–971, 2023.
[63] Saeed Shafiee Sabet, Steven Schmidt, Saman Zadtootaghaj, Babak Naderi, Carsten Griwodz, and Sebastian Möller. A latency compensation technique based on game characteristics to mitigate the influence of delay on cloud gaming quality of experience. In Proc. ACM MMSys, 2020.
[64] Kanon Sasaki, Masato Hanai, Kouto Miyazawa, Aki Kobayashi, Naoki Oda, and Saneyasu Yamaguchi. Tcp fairness among modern tcp congestion control algorithms including tcp bbr. In 2018 IEEE 7th international conference on cloud networking (CloudNet), pages 1–4. IEEE, 2018.
[65] Koen De Schepper, Bob Briscoe, and Greg White. Dual-Queue Coupled Active Queue Management (AQM) for Low Latency, Low Loss, and Scalable Throughput (L4S). RFC 9332, January 2023.
[66] Koen De Schepper, Olivier Tilmans, Bob Briscoe, and Vidhi Goel. Prague Congestion Control. Internet-Draft draft-briscoe-iccrg-prague-congestion-control-03, Internet Engineering Task Force, October 2023. Work in Progress.
[67] Ivan Slivar, Mirko Suznjevic, and Lea Skorin-Kapov. The impact of video encoding parameters and game type on qoe for cloud gaming: A case study using the steam platform. In Proc. IEEE International Conference on Quality of Multimedia Experience (QoMEX), 2015.
[68] Ammar Tahir and Radhika Mittal. Enabling users to control their internet. In Proc. USENIX NSDI, pages 555–573, 2023.
[69] C-H Tai, Jiang Zhu, and Nandita Dukkipati. Making large scale deployment of rcp practical for real networks. In Proc. IEEE INFOCOM, 2008.
[70] Pratiksha Thaker, Matei Zaharia, and Tatsunori Hashimoto. Don’t hate the player, hate the game: Safety and utility in multi-agent congestion control. In Proc. ACM HotNets, 2021.
[71] Haiping Wang, Zhenhua Yu, Ruixiao Zhang, Siping Tao, Hebin Yu, and Shu Shi. Twinstar: A practical multi-path transmission framework for ultra-low latency video delivery. In Proc. ACM Multimedia, pages 9234–9242, 2023.
[72] Ranysha Ware, Matthew K Mukerjee, Srinivasan Seshan, and Justine Sherry. Beyond jain’s fairness index: Setting the bar for the deployment of congestion control algorithms. In Proc. ACM HotNets, pages 17–24, 2019.
[73] Peter Weidenbach and Johannes vom Dorp. Home router security report 2020. https://www.fkie.fraunhofer.de/content/dam/fkie/de/documents/HomeRouter/HomeRouterSecurity_2020_Bericht.pdf, 2020.
[74] Keith Winstein, Anirudh Sivaraman, and Hari Balakrishnan. Stochastic forecasts achieve high throughput and low delay over cellular networks. In Proc. USENIX NSDI, 2013.
[75] Yaxiong Xie, Fan Yi, and Kyle Jamieson. Pbe-cc: Congestion control via endpoint-centric, physical-layer bandwidth measurements. In Proc. ACM SIGCOMM, 2020.
[76] Zhaoqi Xiong and Noa Zilberman. Do switches dream of machine learning? toward in-network classification. In Proc. ACM HotNets, pages 25–33, 2019.
[77] Dongzhu Xu, Anfu Zhou, Xinyu Zhang, Guixian Wang, Xi Liu, Congkai An, Yiming Shi, Liang Liu, and Huadong Ma. Understanding operational 5g: A first measurement study on its coverage, performance and energy consumption. In Proc. ACM SIGCOMM, 2020.
[78] Xiaokun Xu and Mark Claypool. Measurement of cloud-based game streaming system response to competing tcp cubic or tcp bbr flows. In Proceedings of the 22nd ACM Internet Measurement Conference, pages 305–316, 2022.
[79] Jia Zhang, Enhuan Dong, Zili Meng, Yuan Yang, Mingwei Xu, Sijie Yang, Miao Zhang, and Yang Yue. Wisetrans: Adaptive transport protocol selection for mobile web service. In Proceedings of the Web Conference, 2021.
[80] Xu Zhang, Siddhartha Sen, Daniar Kurniawan, Haryadi Gunawi, and Junchen Jiang. E2e: embracing user heterogeneity to improve quality of experience on the web. In Proc. ACM SIGCOMM, pages 289–302, 2019.
[81] Yuhan Zhou, Tingfeng Wang, Liying Wang, Nian Wen, Rui Han, Jing Wang, Chenglei Wu, Jiafeng Chen, Longwei Jiang, Shibo Wang, et al. Augur: Practical mobile multipath transport service for low tail latency in real-time streaming. In To appear at USENIX NSDI, 2024.
[82] Xutong Zuo, Yong Cui, Xin Wang, and Jiayu Yang. Deadline-aware multipath transmission for streaming blocks. In Proc. IEEE INFOCOM, pages 2178–2187, 2022.

Appendix A Web Page Connection Analysis

Rank	Website	ACTIVE	IN_USE	OPEN
189	dailymail.co.uk	50	134	250
109	tumblr.com	39	82	153
89	w3schools.com	32	95	176
147	speedtest.net	28	87	137
113	cnn.com	27	118	194
186	namu.wiki	27	112	192
173	indiatimes.com	22	99	136
106	rakuten.co.jp	20	68	97
35	fandom.com	19	68	97
7	yahoo.com	19	42	63

Table 2. Websites in Top 200 that have the highest number of ACTIVE flows.

Many ACTIVE flows. In this section we provide more details about the Web traces we measured in § 2. Tab. 2 presents the websites that have the highest number of concurrent ACTIVE flows. We can see that dailymail.co.uk has 50 concurrent ACTIVE flows and 250 OPEN sockets at the same time. This is due to the complicated structure of the dailymail.co.uk homepage, a screenshot of which is shown in Fig. 17. The homepage clearly contains many objects, both visible (images, text, videos) and invisible (scripts, styles). Some objects depend on others, so there are fewer concurrent ACTIVE flows than concurrent OPEN sockets, but this still results in 50 flows.

Domains and requests. We further present the distribution of the number of unique HTTP requests and source IPs in Fig. 18(a), together with the size distribution in Fig. 18(b). Loading the homepage of a website contacts a median of 15 unique IPs, while flow sizes range from 100 bytes to 100 KB with a median of 15 KB. This is also the size of the Web flows we used in § 7.3.

	HTTP/1.0	HTTP/1.1	HTTP/2.0	HTTP/3.0
By website	40 (2.37%)	961 (56.97%)	13 (0.77%)	673 (39.89%)
By request	49 (0.02%)	178341 (87.29%)	75 (0.04%)	25833 (12.64%)

Table 3. The distribution of HTTP versions counted by websites and by requests. The sum over websites is greater than 1000 since loading one website may generate requests to different domains, which can use different HTTP versions.
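As a quick sanity check on Tab. 3, the percentages can be recomputed from the raw counts. The minimal Python sketch below does this; the variable and function names are ours and purely illustrative. It also makes the caption's remark concrete: the per-website counts add up to 1687 rather than 1000.

```python
# Counts copied from Tab. 3; all names below are illustrative, not from the paper.
VERSIONS = ["HTTP/1.0", "HTTP/1.1", "HTTP/2.0", "HTTP/3.0"]
BY_WEBSITE = [40, 961, 13, 673]
BY_REQUEST = [49, 178341, 75, 25833]


def shares(counts):
    """Return each count as a percentage of the row total, rounded to 2 decimals."""
    total = sum(counts)
    return [round(100 * c / total, 2) for c in counts]


for label, row in (("By website", BY_WEBSITE), ("By request", BY_REQUEST)):
    # Print the row total followed by the per-version percentage shares.
    print(label, sum(row), dict(zip(VERSIONS, shares(row))))
```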

Composition of HTTP versions. A straightforward explanation for the multiple flows generated by loading one Web page is the parallel connections introduced in HTTP/1.1. Our measurement in Fig. 18(a) does show the effect of parallel connections: the median number of connections counted by flow is 2x that counted by source IP. However, the root cause is still the diverse objects on one page, as shown in Fig. 17. To better understand the composition of HTTP requests, we present the distribution of HTTP versions in Tab. 3. Websites have a very diverse mix of HTTP versions, with HTTP/1.1 and HTTP/3.0 accounting for the majority.

Appendix B Fluid Model Analysis

In this section, we present the details about how we get the results in Tab. 1. We list the notations we will use in Tab. 4.

Parameters and variables:
$B$	Size of each new Web flow.
$N$	Number of new Web flows.
$k$	The responsiveness of a CCA.
$q_{0}$	The delay target that a CCA will try to achieve.
$C$	The link capacity.
$\tau$	The feedback loop of a CCA (usually one RTT).
$B_{0}$	The initial burst of a new flow (e.g., the initial cwnd [29]).
$P$	The scheduling policy.
Functions:
$s(t)$	Sending rate of the real-time flow at time $t$ .
$r(t)$	Available bandwidth of the real-time flow at time $t$ .
$p(t)$	Number of packets in the queue of the real-time flow.
$q(t)$	The queueing delay of the real-time flow.

Table 4. Notations

CCA model. We adopt a simplified delay-convergent CCA model [12, 8], where the delay-sensitive CCA has a target queueing delay, $q_{0}$ . The CCA seeks to maintain its queueing delay around this target, increasing or decreasing its sending rate proportional to the difference between the current delay and the target:

(4)

\small\frac{{\rm d}s(t)}{{\rm d}t}=-k\cdot(q(t-\tau)-q_{0})

Here, $s(t)$ and $q(t)$ are the flow’s instantaneous sending rate and queueing delay at time $t$ , and $\tau$ is the feedback loop of the CCA. $k$ is a coefficient representing the CCA’s responsiveness, indicating how aggressive the CCA is when the delay changes. We explain it quantitatively in Appx. B.5.

Delay model. Next, we analyze the number of packets in the queue, $p(t)$ , at time $t$ . At any $t>0$ , this quantity satisfies the following relationship:

(5)

\small p(t)=p(0)+\int_{0}^{t}\left(s(t^{\prime})-r(t^{\prime})\right){\rm d}t^{\prime}

where $p(0)=q_{0}\cdot C$ is the number of packets in the buffer in steady state with $C$ being the link capacity. If $r(t)$ represents the service rate for the real-time flow at time $t$ , then the queueing delay can be written as follows:

(6)

\small q(t)=\frac{p(t)}{r(t)}

The real-time flow and the competing flows focus on different metrics. The real-time flow focuses on the maximum queueing delay, $q^{max}_{P}$ , for a given scheduling policy $P$ :

(7)

\small q^{max}_{P}=\max_{t>0}\ q(t)

In this context, we find that $q^{max}_{P}$ serves as a good proxy for the duration of delay degradation since it establishes a lower bound on how quickly previously-queued packets of the real-time flow drain from the bottleneck queue.

The Web flows will focus on flow completion time (FCT), $T$ , which can be expressed as follows:

(8)

\small\int_{0}^{T}\left(C-r(t^{\prime})\right){\rm d}t^{\prime}=N\cdot B

Having established our two figures of merit (maximum queueing delay and FCT degradation relative to FQ), we evaluate four scheduling policies: FQ, FIFO, CBQ (1:1), and Confucius. We find that the available bandwidths for these policies satisfy the following relationships:

(9a)	$\displaystyle r_{FQ}(t)$	$\displaystyle=\ \textstyle{\frac{C}{N+1}}\quad$	$\displaystyle(t>0)$
(9b)	$\displaystyle r_{FIFO}(t)$	$\displaystyle\leqslant\ \textstyle{C\cdot\frac{Cq_{0}}{Cq_{0}+NB_{0}}}\quad$	$\displaystyle(t>0)$
(9c)	$\displaystyle r_{CBQ}(t)$	$\displaystyle=\ \textstyle{\frac{C}{2}}\quad$	$\displaystyle(t>0)$
(9d)	$\displaystyle r_{\textsf{Confucius}}(t)$	$\displaystyle=\ \textstyle{\max\left(\frac{C}{2}\cdot 2^{-\lambda t},\frac{C}{N+1}\right)}\quad$	$\displaystyle(t>0)$

where for FIFO, $B_{0}$ is the initial burst size of these new flows (e.g., the initial congestion window in TCP). We then solve for the performance degradation of the real-time flow, $q^{max}_{P}$ . For FCT, since FQ provides the ‘fairest’ bandwidth allocation (representing one extreme of the per-flow fairness), we use the FCT for Web flows under FQ, $T_{FQ}$ to normalize and calculate $T_{P}-T_{FQ}$ as the degree to which policy $P$ degrades Web flow performance relative to FQ. Below we analyze four schedulers in detail.
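Before turning to the closed-form analysis, the model of Eqs. 4 to 6 can also be explored numerically. The following is a minimal forward-Euler sketch, not the evaluation code of the paper. It assumes $C$ is normalized to 1, times are in ms, and the illustrative values $k=0.001$, $q_{0}=10$ ms, $\tau=40$ ms and $\lambda=0.005$ per ms; it also clamps $s(t)$ at zero and reports the first local maximum of $q(t)$ after the Web flows arrive. FIFO is omitted because Eq. 9b only gives an upper bound on $r(t)$. All function names are hypothetical.

```python
def peak_delay(rate_fn, k=0.001, q0=10.0, tau=40.0, horizon=5000.0, dt=0.05):
    """First peak of q(t) from an Euler integration of Eqs. 4-6 (C normalized to 1).

    rate_fn(t) is the real-time flow's service rate r(t) as a fraction of C.
    The backlog p is measured in ms of link time, so p(0) = q0 * C = q0.
    """
    lag = int(round(tau / dt))      # feedback delay tau, in steps
    s, p = 1.0, q0                  # s(0) = C and steady-state backlog
    history = [q0] * lag            # q(t) = q0 before the Web flows arrive
    q_prev = 0.0
    for i in range(int(horizon / dt)):
        r = rate_fn(i * dt)
        q = p / r                                       # Eq. 6
        if q < q_prev:                                  # delay started to fall
            return q_prev
        q_prev = q
        history.append(q)
        s = max(0.0, s - k * (history[i] - q0) * dt)    # Eq. 4 with delayed q
        p = max(0.0, p + (s - r) * dt)                  # Eq. 5
    return q_prev


def fq(n):
    return lambda t: 1.0 / (n + 1)                      # Eq. 9a


def cbq():
    return lambda t: 0.5                                # Eq. 9c


def confucius(n, lam=0.005):
    return lambda t: max(0.5 * 2.0 ** (-lam * t), 1.0 / (n + 1))   # Eq. 9d


for n in (1, 5, 20):
    print(n, peak_delay(fq(n)), peak_delay(cbq()), peak_delay(confucius(n)))
```

Under these assumed parameters, the FQ peak grows with $N$, whereas the Confucius peak is the same for $N=20$ and $N=100$, because the floor $\frac{C}{N+1}$ in Eq. 9d is not reached before the first peak.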

B.1. Fair Queueing (FQ)

Substituting Eq. 9a into Eq. 4, and taking the derivatives, we have:

(10)

\small\frac{{\rm d^{2}}}{{\rm d}t^{2}}s(t)+k\cdot s(t-\tau)=k\frac{C}{N+1}

Without loss of generality, we assume $s(\tau)=C$ , meaning that before the $N$ flows join, the sending rate has converged to the link capacity. Since the measurement loop is usually much smaller than the control loop, i.e., $\tau\ll 1/k$ , we can solve the differential equation above as:

(11)

\small s(t)=C\left(1-\frac{1}{N+1}\right)\cos\left(\sqrt{k}(t-\tau)\right)+\frac{C}{N+1}\quad(t>\tau)

Since we are considering the transient conditions with a small $t$ , where $t$ is less than the first time at which $s(t)=r(t)$ , we approximate the formula above with its Taylor expansion:

(12)

\small s(t)=C-C\frac{N}{N+1}\cdot\frac{k}{2}\cdot(t-\tau)^{2}\quad(t>\tau)

Combining this with Eqs. 5 and 6, we have

(13)

\small q(t)=N\left(q_{0}+\tau+(t-\tau)-\frac{k}{6}(t-\tau)^{3}\right)

We then have the maximum queue delay as:

(14)

\small q^{max}_{FQ}\geqslant q\left(\tau+\sqrt{\frac{2}{k}}\right)=N\left(\frac{2}{3}\sqrt{\frac{2}{k}}+q_{0}+\tau\right)

As $N$ increases, $q^{max}_{FQ}$ will also increase.

Meanwhile, by substituting the available bandwidth in Eq. 8 with Eq. 9a, we have $T_{FQ}$ :

(15)

\small T_{FQ}=\left(1+\frac{1}{N}\right)\cdot\frac{NB}{C}
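The FQ results can be checked numerically. The sketch below encodes the Taylor approximation of Eq. 12, the delay bound of Eq. 14, and the FCT of Eq. 15. It then verifies that the approximated sending rate equals the fair share $\frac{C}{N+1}$ at $t=\tau+\sqrt{2/k}$, which is why the bound is evaluated at that instant, and that $T_{FQ}$ satisfies Eq. 8. The parameter values ($q_{0}=10$ ms, $\tau=40$ ms, $B=15$ KB, $C=20$ Mbps) are illustrative assumptions, and the function names are ours.

```python
import math


def fq_taylor_rate(t, n, k, tau, c=1.0):
    """Eq. 12: second-order approximation of s(t) for t > tau."""
    return c - c * n / (n + 1) * (k / 2) * (t - tau) ** 2


def fq_peak_delay_bound(n, k, q0, tau):
    """Right-hand side of Eq. 14, in the same time unit as q0 and tau."""
    return n * ((2 / 3) * math.sqrt(2 / k) + q0 + tau)


def fq_fct(n, b, c):
    """Eq. 15: T_FQ = (1 + 1/N) * N * B / C."""
    return (1 + 1 / n) * n * b / c


# Assumed values: k in ms^-2, times in ms, B in bits, C in bits per ms.
N, K, Q0, TAU = 20, 0.001, 10.0, 40.0
B_BITS, C_BITS_PER_MS = 15_000 * 8, 20_000.0

t_peak = TAU + math.sqrt(2 / K)
print("s(t_peak)/C =", fq_taylor_rate(t_peak, N, K, TAU))
print("delay bound (ms) =", fq_peak_delay_bound(N, K, Q0, TAU))
print("T_FQ (ms) =", fq_fct(N, B_BITS, C_BITS_PER_MS))
```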

B.2. FIFO

Since the share of available bandwidth is proportional to the share of buffer occupancy, we estimate $r_{FIFO}(t)$ as in Eq. 9b. Similar to FQ, we can get:

(16)

\small q(t)\geqslant\frac{1}{C}\left(\frac{NB}{q_{0}C}\right)\left(q_{0}C+\int_{0}^{t}s(t^{\prime}){\rm d}t^{\prime}-tC\frac{1}{\frac{NB}{q_{0}C}+1}\right)

and then

(17)

\small q^{max}_{FIFO}\geqslant q\left(\tau+\sqrt{\frac{2}{k}}\right)

Consequently

(18)

\small q^{max}_{FIFO}\geqslant\left(\frac{NB_{0}}{q_{0}C}+1\right)\left(\frac{2}{3}\sqrt{\frac{2}{k}}+q_{0}+\tau\right)

B.3. CBQ (1:1)

As we can see from Eq. 9c, $r_{CBQ}(t)$ is a special case of $r_{FQ}(t)$ with $N=1$ . Therefore, according to the delay degradation result in Eq. 14, we have:

(19)

\small q^{max}_{CBQ}\geqslant\frac{2}{3}\sqrt{\frac{2}{k}}+q_{0}+\tau

The FCT satisfies:

(20)

\small T_{CBQ}=\frac{2NB}{C}

In this case,

\small T_{CBQ}-T_{FQ}=\frac{(N-1)B}{C}

grows without bound as $N$ and $B$ increase.

B.4. Confucius

For Confucius, we have:

(21)

\small r_{\textsf{Confucius}}(t)=\frac{C}{2}e^{-\lambda t}\quad(t>0)

We can then solve for $s(t)$ (using the Laplace transform and the method of undetermined coefficients):

(22)

\small s(t)=Ae^{-\lambda(t-\tau)}+B\cos\sqrt{k}(t-\tau)

where

(23)	$\displaystyle A=C\cdot\frac{k}{2}\cdot\frac{1}{\lambda^{2}+k\cdot e^{\lambda\tau}}$
(24)	$\displaystyle B=C-A$

Again using a Taylor approximation:

(25)

\small\begin{array}[]{cl}s(t)&=A\left(1-\lambda(t-\tau)\right)+B\left(1-\frac{1}{2}k(t-\tau)^{2}\right)\\ &=-\frac{B}{2}k(t-\tau)^{2}-\lambda A(t-\tau)+A+B\end{array}

Denoting the root of $s(t)=0$ on $t>\tau$ as $t_{0}+\tau$ ( $t_{0}>0$ ), we then have

(26)

\small q(t_{0}+\tau)=2e^{\lambda(t_{0}+\tau)}\left(q_{0}+\tau-\left(t_{0}-\frac{\lambda A}{2C}t_{0}-\frac{kB}{6C}t_{0}^{3}\right)\right)

where $t_{0}$ satisfies:

(27)

\small t_{0}=\frac{-\lambda A+\sqrt{(\lambda A)^{2}+2Bk(A+B)}}{Bk}

Thus, we have a bound of $q^{max}_{\textsf{Confucius}}$ :

(28)

\small q^{max}_{\textsf{Confucius}}\approx q(t_{0}+\tau)=f(\lambda;k,\tau,q_{0})

which is independent of the flow size $B$ and the number of flows $N$ , and is therefore bounded. We expand $f$ as a series in $\lambda$ :

(29)

\small\begin{array}[]{rl}f(\lambda)&=F_{0}+F_{1}\lambda+F_{2}\lambda^{2}+o(\lambda^{2})\\ F_{0}&=2q_{0}+6\tau+\frac{8}{2\sqrt{k}}\\ F_{1}&=\frac{10}{3k}+2q_{0}\tau+2\tau^{2}+\frac{4q_{0}}{\sqrt{k}}+\frac{16\tau}{3\sqrt{k}}\\ F_{2}&=\frac{4q_{0}}{k}+\frac{6\tau}{k}+q_{0}\tau^{2}+\tau^{3}+\frac{6q_{0}\tau}{\sqrt{k}}+\frac{11\tau^{2}}{\sqrt{k}}\end{array}

Given that $\frac{1}{k}\ll q_{0},\tau$ , we can simplify the expansion and upper-bound it as:

(30)

\small q^{max}_{\textsf{Confucius}}\leqslant 6q_{0}+15\tau+\frac{8\lambda}{k}+\frac{(10q_{0}+15\tau)\lambda^{2}}{k}
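The coefficients and the root used in this derivation are easy to evaluate numerically. The sketch below computes $A$ and $B$ from Eqs. 23 and 24 and $t_{0}$ from Eq. 27, checks that $t_{0}$ is indeed a root of the Taylor-approximated $s(t)$ in Eq. 25, and evaluates the simplified bound of Eq. 30. The parameter values ($C=1$, $k=0.001$, $\lambda=0.005$ per ms, $\tau=40$ ms, $q_{0}=10$ ms) are assumptions chosen only for illustration, and the function names are ours.

```python
import math


def confucius_coefficients(c, k, lam, tau):
    """Return (A, B, t0) from Eqs. 23, 24 and 27. B is the coefficient, not the flow size."""
    a = c * (k / 2) / (lam ** 2 + k * math.exp(lam * tau))                        # Eq. 23
    b = c - a                                                                     # Eq. 24
    t0 = (-lam * a + math.sqrt((lam * a) ** 2 + 2 * b * k * (a + b))) / (b * k)   # Eq. 27
    return a, b, t0


def taylor_rate(u, a, b, k, lam):
    """Eq. 25 written in terms of u = t - tau."""
    return -(b / 2) * k * u ** 2 - lam * a * u + a + b


def simplified_delay_bound(q0, tau, k, lam):
    """Right-hand side of Eq. 30."""
    return 6 * q0 + 15 * tau + 8 * lam / k + (10 * q0 + 15 * tau) * lam ** 2 / k


C, K, LAM, TAU, Q0 = 1.0, 0.001, 0.005, 40.0, 10.0   # assumed, illustrative values
A, B_COEF, T0 = confucius_coefficients(C, K, LAM, TAU)
print("A =", A, "B =", B_COEF, "t0 (ms) =", T0)
print("s(t0 + tau) =", taylor_rate(T0, A, B_COEF, K, LAM))
print("Eq. 30 bound (ms) =", simplified_delay_bound(Q0, TAU, K, LAM))
```

Note that neither $N$ nor the flow size appears among the inputs, which is the sense in which the Confucius bound does not depend on the workload.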

The FCT difference over the fair share for new flows is also bounded compared to other baselines. The FCT $T$ of $N$ flows of $B$ bytes each follows Eq. 8.

Recall from Eq. 9d that the bandwidth left for the Web flows is $C-r_{\textsf{Confucius}}(t)=\min(C-\frac{C}{2}2^{-\lambda t},\frac{N}{N+1}C)$ ; we thus have

(31)

\small T_{\textsf{Confucius}}=\frac{(N+1)B}{C}+\frac{1}{\lambda}\cdot\left(\frac{1}{2}-\frac{1}{N}\log_{2}\frac{N+1}{2}-\frac{1}{2N}\right)

where $T\geqslant\frac{1}{\lambda}\log_{2}\frac{N+1}{2}$ , i.e., the Web flows finish after the real-time flow's share has decayed to $\frac{C}{N+1}$ . In this case,

(32)

\small T_{\textsf{Confucius}}-T_{FQ}\leqslant\frac{1}{\lambda}\cdot\left(\frac{1}{2}-\frac{1}{N}\log_{2}\frac{N+1}{2}-\frac{1}{2N}\right)\leqslant\frac{\log_{2}e}{\lambda}

We further plot the unsimplified bound for different $k$ and other parameter settings in Fig. 19. Note that the theoretical bounds are much greater than the actual experimental results, as shown in § 7.6.
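To make the contrast in FCT degradation concrete, the sketch below evaluates the middle expression of Eq. 32 and compares it with the gap $\frac{(N-1)B}{C}$ derived in Appx. B.3 for the 1:1 classful scheduler. The values $\lambda=0.005$ per ms, $B=15$ KB and $C=20$ Mbps are assumptions for illustration only, and the function names are ours.

```python
import math


def confucius_fct_gap_bound(n, lam):
    """Middle expression of Eq. 32 (same time unit as 1/lam)."""
    return (1 / lam) * (0.5 - math.log2((n + 1) / 2) / n - 1 / (2 * n))


def classful_fct_gap(n, b, c):
    """(N - 1) * B / C: FCT gap to FQ of the 1:1 classful scheduler in Appx. B.3."""
    return (n - 1) * b / c


LAM = 0.005                                        # assumed decay rate, per ms
B_BITS, C_BITS_PER_MS = 15_000 * 8, 20_000.0       # 15 KB flows on a 20 Mbps link
CAP = math.log2(math.e) / LAM                      # right-most term of Eq. 32

for n in (1, 2, 10, 100, 1000):
    print(n, confucius_fct_gap_bound(n, LAM), classful_fct_gap(n, B_BITS, C_BITS_PER_MS), CAP)
```

With these assumed values, the classful gap for $N=100$ is 594 ms and keeps growing with $N$, while the Confucius expression never exceeds the cap of about 289 ms.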

B.5. Responsiveness for CCAs

For different CCAs, we can fit their responsiveness $k$ based on their probing period in the steady state. From the differential equations in Eq. 4 and Eq. 6, during the steady state where $r(t)\equiv C$ , we can solve that the sending rate $s(t)$ follows:

(33)

\small s(t)=C+A\cos(\sqrt{k}t+\varphi)

where $A$ and $\varphi$ are undetermined coefficients. The probing period of a CCA is therefore $\frac{2\pi}{\sqrt{k}}$ . By design, the probing period is 5 RTTs for Copa and 8 RTTs for BBR. For example, when the RTT is 40 ms, we have $k_{Copa}\approx 0.001~\mathrm{ms}^{-2}$ and $k_{BBR}\approx 0.0004~\mathrm{ms}^{-2}$ .
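The fit can be reproduced by inverting the probing period $\frac{2\pi}{\sqrt{k}}$ for $k$. The short sketch below does so for the two examples in the text; the function name is ours.

```python
import math


def responsiveness(probing_period_rtts, rtt_ms):
    """Solve 2*pi/sqrt(k) = probing period for k, in ms^-2."""
    period_ms = probing_period_rtts * rtt_ms
    return (2 * math.pi / period_ms) ** 2


RTT_MS = 40.0
k_copa = responsiveness(5, RTT_MS)   # Copa probes every 5 RTTs
k_bbr = responsiveness(8, RTT_MS)    # BBR probes every 8 RTTs
print(k_copa, k_bbr)
```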

Appendix C Supplementary Experiments

We further evaluate the performance of Confucius in a series of microbenchmarking settings.

C.1. Results for other traces

We further present the results of Confucius over other bandwidth trace datasets (C2: Cellular 4G; C3: Cellular 5G; W1: Office WiFi; W2: Restaurant WiFi) in Figs. 20, 21, 22 and 23. The experiment setting follows the one in Fig. 10. The average and standard deviation of bandwidth of these traces are presented in Fig. 24. We can clearly see that Confucius always pushes forward the Pareto-optimal frontier of the baselines that do not require labels (blue baselines). This demonstrates the robustness of Confucius across different bandwidth datasets.

C.2. Fairness

Classful schemes, such as CBQ (which splits packets into classful queues with configurable service rates) or strict priority (which only dequeues lower-priority packets when the high-priority queue is empty), protect the real-time flow. However, classful schemes also result in unfair allocations because they overpenalize (or even starve) Web traffic, which experiences high page load times (PLTs) as shown in Fig. 4(c). While, in theory, CBQ could be configured to be fair, doing so requires knowledge of the exact workload (the ratio of flows between classes) over very short time intervals, which is infeasible in practice. For example, in Fig. 25 we measure the fairness, in terms of Jain's fairness index (JFI), that different schedulers provide while varying the number of flows competing with the real-time flow. Modifying CBQ's configuration improves JFI only for a subset of the workloads: CBQ (1:1) works well when there are two flows competing, while CBQ (1:5) achieves a good JFI when there are five competing flows; both degrade as the number of flows changes. Even if we change the ratio, the phenomenon persists: the JFI of CBQ (1:5) is highest when there are 5 new flows, but degrades drastically for other workloads. In contrast, Confucius keeps the JFI relatively consistent across different numbers of competitors.
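The sensitivity of a fixed CBQ ratio to the workload can be illustrated with a small calculation of Jain's fairness index. The sketch below is an idealized model, not the experiment of Fig. 25: it assumes one real-time flow plus a number of always-backlogged competitors, with the competitors splitting their class's share equally. The function names are ours.

```python
def jain_index(rates):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2); 1.0 means perfectly fair."""
    total = sum(rates)
    return total * total / (len(rates) * sum(x * x for x in rates))


def cbq_shares(other_weight, n_competitors):
    """Idealized per-flow shares under CBQ (1:other_weight) with backlogged flows."""
    rt_share = 1.0 / (1 + other_weight)
    competitor_share = (other_weight / (1 + other_weight)) / n_competitors
    return [rt_share] + [competitor_share] * n_competitors


for weight in (1, 5):
    row = [round(jain_index(cbq_shares(weight, n)), 3) for n in (1, 2, 5, 10)]
    print(f"CBQ (1:{weight})", row)
```

In this idealized setting, each ratio is perfectly fair only for the one workload it was tuned for, which matches the qualitative trend described above.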

C.3. Working with Bandwidth Probing

Some recent CCAs periodically probe the available bandwidth by overshooting the network, which might introduce noise when Confucius classifies flows by their buffer occupancy. Recent examples for video streaming include Sprout [74], PCC (probing up to 5%) [28], and BBR (probing 25%) [20]. We evaluate how well Confucius handles bandwidth probing from CCAs. We run one BBR flow, which is the most aggressive of these bandwidth-probing CCAs, and vary the RTT from 20 ms to 160 ms, since the probing period is measured in RTTs. As shown in Fig. 27, with the other settings the same as in Fig. 28, the queue fluctuations never cross the threshold for reclassifying the flow. This is due to the hysteresis design in § 5.1: Confucius deliberately makes conservative classification decisions to smooth out the noise. This can also be validated from Fig. 13(b): the classification results remain stable even though BBR periodically probes the bandwidth. Therefore, Confucius works well with bandwidth-probing CCAs.
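The general idea of hysteresis can be sketched with a two-threshold classifier. This is a generic illustration and not the rule of § 5.1: the thresholds, the labels, and the assumption that low buffer occupancy indicates a real-time flow are all hypothetical. It shows why short probing excursions that stay between the two thresholds do not flip the label.

```python
def classify(occupancy_samples, high, low, label="real-time"):
    """Two-threshold (hysteresis) classifier over buffer-occupancy samples.

    A "real-time" flow is relabelled "bulk" only when occupancy rises above
    `high`, and relabelled back only when it falls below `low` (low < high).
    Thresholds and labels are hypothetical, chosen for illustration only.
    """
    labels = []
    for occupancy in occupancy_samples:
        if label == "real-time" and occupancy > high:
            label = "bulk"
        elif label == "bulk" and occupancy < low:
            label = "real-time"
        labels.append(label)
    return labels


# Periodic probing bumps that stay below `high` leave the label unchanged.
print(classify([2, 6, 9, 6, 3, 7], high=10, low=1))
# A sustained excursion flips the label, which only returns once occupancy drops below `low`.
print(classify([2, 12, 6, 3, 0.5], high=10, low=1))
```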

C.4. Working with Different Bottleneck

We further evaluate the end-to-end performance when the bottleneck is not where Confucius is deployed. Confucius reduces latency volatility when it is deployed on the bottleneck router. Our further experiments show that Confucius does not introduce side effects when the bottleneck is before or after the router on which Confucius is deployed. We still deploy the queue management mechanisms on the router before link B and rate-limit links A, B, and C in Fig. 28 to 20 Mbps in turn:

• Btlnk-A. When link A is limited while the other two links are set to 100 Mbps, the bottleneck is before the place of Confucius.
• Btlnk-B. The case when link B is limited is what we mainly evaluated in this section, where Confucius is at the bottleneck.
• Btlnk-C. When link C is limited, the bottleneck is after the place of Confucius.

Unmanaged routers use FIFO as their default mechanism. As shown in Fig. 27, the performance is only affected by the mechanism deployed at the bottleneck. When Confucius is not at the bottleneck (e.g., the bottleneck is link A or C), the performance is the same no matter what mechanism is deployed at link B. It is worth noting that, as discussed in a series of papers [49, 17], last-mile routers (e.g., cellular base stations, home wireless APs) are the bottleneck in most cases of congestion, in which case deploying Confucius will achieve significant performance benefits.

C.5. Multiple Real-time Flows Competition

We further evaluate the performance when there are multiple real-time flows running simultaneously. We reproduce the experiments in Fig. 10(a) but change the number of real-time flows from 1 to 5. The average duration of delay degradation of real-time flows and the PLT of Web flows are presented in Fig. 29. Confucius provides consistent performance for multiple real-time flows at the same time: the delay degradation is consistently negligible regardless of the number of concurrent real-time flows, and the PLT stays roughly the same relative to the baselines. Note that since Confucius is designed for last-mile routers, 5 concurrent flows should cover most scenarios [49].

Appendix D Related Work

Queue management solutions. There are numerous efforts on queue management for routers. Besides the solutions we introduced in § 2 and § 7.1, many more AQMs were proposed back in the 2000s [31, 32, 46, 23, 57]. As we discussed in § 2, these AQMs cannot meet the requirement of providing consistent performance and fairness during transient events. At the same time, recent delay- or rate-based CCAs, which are commonly used in real-time flows, are not responsive to such drop-based or ECN-marking-based AQMs. Further, datacenter flow scheduling schemes [14, 9] or buffer management [7] are designed for homogeneous flows (sometimes with labelled packets) and are not suitable for heterogeneous flows in home routers in the wide-area network.

Optimizations for latency consistency. Multiple schemes aim to offer consistently low latency for latency-sensitive applications such as videoconferencing, either at the end hosts [36, 21, 13] or in the network [49, 38]. There are also application-specific solutions such as frame-rate or bit-rate adaptation [51, 36] and latency compensation [63]. Confucius is orthogonal to such solutions.

Inter-flow fairness. Fairness across flows dates back to the birth of congestion control [42]. Recent work analyzes fairness in different scenarios [58] or defines fairness with different applications [70, 72]. There are also measurement studies of inter-CCA fairness with emerging CCAs [64, 54, 40]. Confucius, in turn, is able to maintain long-term fairness across flows.
