tc-hfsc(7) — Linux manual page

NAME | HISTORY & INTRODUCTION | ABBREVIATIONS | BASICS OF HFSC | REALTIME CRITERION | LINKSHARING CRITERION | UPPERLIMIT CRITERION | SEPARATE LS / RT SCs | CORNER CASES | LINUX AND TIMER RESOLUTION | CAVEAT: RANDOM ONLINE EXAMPLES | LAYER2 ADAPTATION | SEE ALSO | AUTHOR | COLOPHON

TC-HFSC(7)                        Linux                       TC-HFSC(7)

NAME         top

       tc-hfcs - Hierarchical Fair Service Curve

HISTORY & INTRODUCTION         top

       HFSC (Hierarchical Fair Service Curve) is a network packet
       scheduling algorithm that was first presented at SIGCOMM'97.
       Developed as a part of ALTQ (ALTernative Queuing) on NetBSD,
       found its way quickly to other BSD systems, and then a few years
       ago became part of the linux kernel. Still, it's not the most
       popular scheduling algorithm - especially if compared to HTB -
       and it's not well documented for the enduser. This introduction
       aims to explain how HFSC works without using too much math
       (although some math it will be inevitable).

       In short HFSC aims to:

           1)  guarantee precise bandwidth and delay allocation for all
               leaf classes (realtime criterion)

           2)  allocate excess bandwidth fairly as specified by class
               hierarchy (linkshare & upperlimit criterion)

           3)  minimize any discrepancy between the service curve and
               the actual amount of service provided during linksharing

       The main "selling" point of HFSC is feature (1), which is
       achieved by using nonlinear service curves (more about what it
       actually is later). This is particularly useful in VoIP or games,
       where not only a guarantee of consistent bandwidth is important,
       but also limiting the initial delay of a data stream. Note that
       it matters only for leaf classes (where the actual queues are) -
       thus class hierarchy is ignored in the realtime case.

       Feature (2) is well, obvious - any algorithm featuring class
       hierarchy (such as HTB or CBQ) strives to achieve that. HFSC does
       that well, although you might end with unusual situations, if you
       define service curves carelessly - see section CORNER CASES for
       examples.

       Feature (3) is mentioned due to the nature of the problem. There
       may be situations where it's either not possible to guarantee
       service of all curves at the same time, and/or it's impossible to
       do so fairly. Both will be explained later. Note that this is
       mainly related to interior (aka aggregate) classes, as the leafs
       are already handled by (1). Still, it's perfectly possible to
       create a leaf class without realtime service, and in such a case
       the caveats will naturally extend to leaf classes as well.

ABBREVIATIONS         top

       For the remaining part of the document, we'll use following
       shortcuts:

           RT - realtime
           LS - linkshare
           UL - upperlimit
           SC - service curve

BASICS OF HFSC         top

       To understand how HFSC works, we must first introduce a service
       curve.  Overall, it's a nondecreasing function of some time unit,
       returning the amount of service (an allowed or allocated amount
       of bandwidth) at some specific point in time. The purpose of it
       should be subconsciously obvious: if a class was allowed to
       transfer not less than the amount specified by its service curve,
       then the service curve is not violated.

       Still, we need more elaborate criterion than just the above
       (although in the most generic case it can be reduced to it). The
       criterion has to take two things into account:

           •   idling periods

           •   the ability to "look back", so if during current active
               period the service curve is violated, maybe it isn't if
               we count excess bandwidth received during earlier active
               period(s)

       Let's define the criterion as follows:

           (1) For each t1, there must exist t0 in set B, so S(t1-t0) <= w(t0,t1)

       Here 'w' denotes the amount of service received during some time
       period between t0 and t1. B is a set of all times, where a
       session becomes active after idling period (further denoted as
       'becoming backlogged'). For a clearer picture, imagine two
       situations:

           a)  our session was active during two periods, with a small
               time gap between them

           b)  as in (a), but with a larger gap

       Consider (a): if the service received during both periods meets
       (1), then all is well. But what if it doesn't do so during the
       2nd period? If the amount of service received during the 1st
       period is larger than the service curve, then it might compensate
       for smaller service during the 2nd period and the gap - if the
       gap is small enough.

       If the gap is larger (b) - then it's less likely to happen
       (unless the excess bandwidth allocated during the 1st part was
       really large). Still, the larger the gap - the less interesting
       is what happened in the past (e.g. 10 minutes ago) - what matters
       is the current traffic that just started.

       From HFSC's perspective, more interesting is answering the
       following question: when should we start transferring packets, so
       a service curve of a class is not violated. Or rephrasing it: How
       much X() amount of service should a session receive by time t, so
       the service curve is not violated. Function X() defined as below
       is the basic building block of HFSC, used in: eligible, deadline,
       virtual-time and fit-time curves. Of course, X() is based on
       equation (1) and is defined recursively:

           •   At the 1st backlogged period beginning function X is
               initialized to generic service curve assigned to a class

           •   At any subsequent backlogged period, X() is:
               min(X() from previous period ; w(t0)+S(t-t0) for t>=t0),
               ... where t0 denotes the beginning of the current
               backlogged period.

       HFSC uses either linear, or two-piece linear service curves. In
       case of linear or two-piece linear convex functions (first slope
       < second slope), min() in X's definition reduces to the 2nd
       argument. But in case of two-piece concave functions, the 1st
       argument might quickly become lesser for some t>=t0. Note, that
       for some backlogged period, X() is defined only from that
       period's beginning. We also define X^(-1)(w) as smallest t>=t0,
       for which X(t) = w. We have to define it this way, as X() is
       usually not an injection.

       The above generic X() can be one of the following:

           E() In realtime criterion, selects packets eligible for
               sending. If none are eligible, HFSC will use linkshare
               criterion. Eligible time 'et' is calculated with
               reference to packets' heads ( et = E^(-1)(w) ). It's
               based on RT service curve, but in case of a convex curve,
               uses its 2nd slope only.

           D() In realtime criterion, selects the most suitable packet
               from the ones chosen by E(). Deadline time 'dt'
               corresponds to packets' tails (dt = D^(-1)(w+l), where
               'l' is packet's length). Based on RT service curve.

           V() In linkshare criterion, arbitrates which packet to send
               next. Note that V() is function of a virtual time - see
               LINKSHARE CRITERION section for details. Virtual time
               'vt' corresponds to packets' heads (vt = V^(-1)(w)).
               Based on LS service curve.

           F() An extension to linkshare criterion, used to limit at
               which speed linkshare criterion is allowed to dequeue.
               Fit-time 'ft' corresponds to packets' heads as well
               (ft = F^(-1)(w)). Based on UL service curve.

       Be sure to make clean distinction between session's RT, LS and UL
       service curves and the above "utility" functions.

REALTIME CRITERION         top

       RT criterion ignores class hierarchy and guarantees precise
       bandwidth and delay allocation. We say that a packet is eligible
       for sending, when the current real time is later than the
       eligible time of the packet. From all eligible packets, the one
       most suited for sending is the one with the shortest deadline
       time. This sounds simple, but consider the following example:

       Interface 10Mbit, two classes, both with two-piece linear service
       curves:

           •   1st class - 2Mbit for 100ms, then 7Mbit (convex - 1st
               slope < 2nd slope)

           •   2nd class - 7Mbit for 100ms, then 2Mbit (concave - 1st
               slope > 2nd slope)

       Assume for a moment, that we only use D() for both finding
       eligible packets, and choosing the most fitting one, thus
       eligible time would be computed as D^(-1)(w) and deadline time
       would be computed as D^(-1)(w+l). If the 2nd class starts sending
       packets 1 second after the 1st class, it's of course impossible
       to guarantee 14Mbit, as the interface capability is only 10Mbit.
       The only workaround in this scenario is to allow the 1st class to
       send the packets earlier that would normally be allowed. That's
       where separate E() comes to help. Putting all the math aside (see
       HFSC paper for details), E() for RT concave service curve is just
       like D(), but for the RT convex service curve - it's constructed
       using only RT service curve's 2nd slope (in our example
        7Mbit).

       The effect of such E() - packets will be sent earlier, and at the
       same time D() will be updated - so the current deadline time
       calculated from it will be later. Thus, when the 2nd class starts
       sending packets later, both the 1st and the 2nd class will be
       eligible, but the 2nd session's deadline time will be smaller and
       its packets will be sent first. When the 1st class becomes idle
       at some later point, the 2nd class will be able to "buffer" up
       again for later active period of the 1st class.

       A short remark - in a situation, where the total amount of
       bandwidth available on the interface is larger than the allocated
       total realtime parts (imagine a 10 Mbit interface, but
       1Mbit/2Mbit and 2Mbit/1Mbit classes), the sole speed of the
       interface could suffice to guarantee the times.

       Important part of RT criterion is that apart from updating its
       D() and E(), also V() used by LS criterion is updated. Generally
       the RT criterion is secondary to LS one, and used only if there's
       a risk of violating precise realtime requirements. Still, the
       "participation" in bandwidth distributed by LS criterion is
       there, so V() has to be updated along the way. LS criterion can
       than properly compensate for non-ideal fair sharing situation,
       caused by RT scheduling. If you use UL service curve its F() will
       be updated as well (UL service curve is an extension to LS one -
       see UPPERLIMIT CRITERION section).

       Anyway - careless specification of LS and RT service curves can
       lead to potentially undesired situations (see CORNER CASES for
       examples). This wasn't the case in HFSC paper where LS and RT
       service curves couldn't be specified separately.

LINKSHARING CRITERION         top

       LS criterion's task is to distribute bandwidth according to
       specified class hierarchy. Contrary to RT criterion, there're no
       comparisons between current real time and virtual time - the
       decision is based solely on direct comparison of virtual times of
       all active subclasses - the one with the smallest vt wins and
       gets scheduled. One immediate conclusion from this fact is that
       absolute values don't matter - only ratios between them (so for
       example, two children classes with simple linear 1Mbit service
       curves will get the same treatment from LS criterion's
       perspective, as if they were 5Mbit). The other conclusion is,
       that in perfectly fluid system with linear curves, all virtual
       times across whole class hierarchy would be equal.

       Why is VC defined in term of virtual time (and what is it)?

       Imagine an example: class A with two children - A1 and A2, both
       with let's say 10Mbit SCs. If A2 is idle, A1 receives all the
       bandwidth of A (and update its V() in the process). When A2
       becomes active, A1's virtual time is already far later than A2's
       one. Considering the type of decision made by LS criterion, A1
       would become idle for a long time. We can workaround this
       situation by adjusting virtual time of the class becoming active
       - we do that by getting such time "up to date". HFSC uses a mean
       of the smallest and the biggest virtual time of currently active
       children fit for sending. As it's not real time anymore
       (excluding trivial case of situation where all classes become
       active at the same time, and never become idle), it's called
       virtual time.

       Such approach has its price though. The problem is analogous to
       what was presented in previous section and is caused by
       non-linearity of service curves:

       1)  either it's impossible to guarantee service curves and
           satisfy fairness during certain time periods:

           Recall the example from RT section, slightly modified (with
           3Mbit slopes instead of 2Mbit ones):

           •   1st class - 3Mbit for 100ms, then 7Mbit (convex - 1st
               slope < 2nd slope)

           •   2nd class - 7Mbit for 100ms, then 3Mbit (concave - 1st
               slope > 2nd slope)

           They sum up nicely to 10Mbit - the interface's capacity. But
           if we wanted to only use LS for guarantees and fairness - it
           simply won't work. In LS context, only V() is used for making
           decision which class to schedule. If the 2nd class becomes
           active when the 1st one is in its second slope, the fairness
           will be preserved - ratio will be 1:1 (7Mbit:7Mbit), but LS
           itself is of course unable to guarantee the absolute values
           themselves - as it would have to go beyond of what the
           interface is capable of.

       2)  and/or it's impossible to guarantee service curves of all
           classes at the same time [fairly or not]:

           This is similar to the above case, but a bit more subtle. We
           will consider two subtrees, arbitrated by their common (root
           here) parent:

           R (root) - 10Mbit

           A  - 7Mbit, then 3Mbit
           A1 - 5Mbit, then 2Mbit
           A2 - 2Mbit, then 1Mbit

           B  - 3Mbit, then 7Mbit

           R arbitrates between left subtree (A) and right (B). Assume
           that A2 and B are constantly backlogged, and at some later
           point A1 becomes backlogged (when all other classes are in
           their 2nd linear part).

           What happens now? B (choice made by R) will always get 7 Mbit
           as R is only (obviously) concerned with the ratio between its
           direct children. Thus A subtree gets 3Mbit, but its children
           would want (at the point when A1 became backlogged) 5Mbit +
           1Mbit. That's of course impossible, as they can only get
           3Mbit due to interface limitation.

           In the left subtree - we have the same situation as
           previously (fair split between A1 and A2, but violated
           guarantees), but in the whole tree - there's no fairness (B
           got 7Mbit, but A1 and A2 have to fit together in 3Mbit) and
           there's no guarantees for all classes (only B got what it
           wanted). Even if we violated fairness in the A subtree and
           set A2's service curve to 0, A1 would still not get the
           required bandwidth.

UPPERLIMIT CRITERION         top

       UL criterion is an extensions to LS one, that permits sending
       packets only if current real time is later than fit-time ('ft').
       So the modified LS criterion becomes: choose the smallest virtual
       time from all active children, such that fit-time < current real
       time also holds. Fit-time is calculated from F(), which is based
       on UL service curve. As you can see, its role is kinda similar to
       E() used in RT criterion. Also, for obvious reasons - you can't
       specify UL service curve without LS one.

       The main purpose of the UL service curve is to limit HFSC to
       bandwidth available on the upstream router (think adsl home
       modem/router, and linux server as NAT/firewall/etc. with 100Mbit+
       connection to mentioned modem/router).  Typically, it's used to
       create a single class directly under root, setting a linear UL
       service curve to available bandwidth - and then creating your
       class structure from that class downwards. Of course, you're free
       to add a UL service curve (linear or not) to any class with LS
       criterion.

       An important part about the UL service curve is that whenever at
       some point in time a class doesn't qualify for linksharing due to
       its fit-time, the next time it does qualify it will update its
       virtual time to the smallest virtual time of all active children
       fit for linksharing. This way, one of the main things the LS
       criterion tries to achieve - equality of all virtual times across
       whole hierarchy - is preserved (in perfectly fluid system with
       only linear curves, all virtual times would be equal).

       Without that, 'vt' would lag behind other virtual times, and
       could cause problems. Consider an interface with a capacity of
       10Mbit, and the following leaf classes (just in case you're
       skipping this text quickly - this example shows behavior that
       doesn't happen):

       A - ls 5.0Mbit
       B - ls 2.5Mbit
       C - ls 2.5Mbit, ul 2.5Mbit

       If B was idle, while A and C were constantly backlogged, A and C
       would normally (as far as LS criterion is concerned) divide
       bandwidth in 2:1 ratio. But due to UL service curve in place, C
       would get at most 2.5Mbit, and A would get the remaining 7.5Mbit.
       The longer the backlogged period, the more the virtual times of A
       and C would drift apart. If B became backlogged at some later
       point in time, its virtual time would be set to
       (A's vt + C's vt)/2, thus blocking A from sending any traffic
       until B's virtual time catches up with A.

SEPARATE LS / RT SCs         top

       Another difference from the original HFSC paper is that RT and LS
       SCs can be specified separately. Moreover, leaf classes are
       allowed to have only either RT SC or LS SC. For interior classes,
       only LS SCs make sense: any RT SC will be ignored.

CORNER CASES         top

       Separate service curves for LS and RT criteria can lead to
       certain traps that come from "fighting" between ideal linksharing
       and enforced realtime guarantees. Those situations didn't exist
       in original HFSC paper, where specifying separate LS / RT service
       curves was not discussed.

       Consider an interface with a 10Mbit capacity, with the following
       leaf classes:

       A - ls 5.0Mbit, rt 8Mbit
       B - ls 2.5Mbit
       C - ls 2.5Mbit

       Imagine A and C are constantly backlogged. As B is idle, A and C
       would divide bandwidth in 2:1 ratio, considering LS service curve
       (so in theory - 6.66 and 3.33). Alas RT criterion takes priority,
       so A will get 8Mbit and LS will be able to compensate class C for
       only 2 Mbit - this will cause discrepancy between virtual times
       of A and C.

       Assume this situation lasts for a long time with no idle periods,
       and suddenly B becomes active. B's virtual time will be updated
       to (A's vt + C's vt)/2, effectively landing in the middle between
       A's and C's virtual time. The effect - B, having no RT
       guarantees, will be punished and will not be allowed to transfer
       until C's virtual time catches up.

       If the interface had a higher capacity, for example 100Mbit, this
       example would behave perfectly fine though.

       Let's look a bit closer at the above example - it "cleverly"
       invalidates one of the basic things LS criterion tries to achieve
       - equality of all virtual times across class hierarchy. Leaf
       classes without RT service curves are literally left to their own
       fate (governed by messed up virtual times).

       Also, it doesn't make much sense. Class A will always be
       guaranteed up to 8Mbit, and this is more than any absolute
       bandwidth that could happen from its LS criterion (excluding
       trivial case of only A being active). If the bandwidth taken by A
       is smaller than absolute value from LS criterion, the unused part
       will be automatically assigned to other active classes (as A has
       idling periods in such case). The only "advantage" is, that even
       in case of low bandwidth on average, bursts would be handled at
       the speed defined by RT criterion. Still, if extra speed is
       needed (e.g. due to latency), non linear service curves should be
       used in such case.

       In the other words: the LS criterion is meaningless in the above
       example.

       You can quickly "workaround" it by making sure each leaf class
       has RT service curve assigned (thus guaranteeing all of them will
       get some bandwidth), but it doesn't make it any more valid.

       Keep in mind - if you use nonlinear curves and irregularities
       explained above happen only in the first segment, then there's
       little wrong with "overusing" RT curve a bit:

       A - ls 5.0Mbit, rt 9Mbit/30ms, then 1Mbit
       B - ls 2.5Mbit
       C - ls 2.5Mbit

       Here, the vt of A will "spike" in the initial period, but then A
       will never get more than 1Mbit until B & C catch up. Then
       everything will be back to normal.

LINUX AND TIMER RESOLUTION         top

       In certain situations, the scheduler can throttle itself and
       setup so called watchdog to wakeup dequeue function at some time
       later. In case of HFSC it happens when for example no packet is
       eligible for scheduling, and UL service curve is used to limit
       the speed at which LS criterion is allowed to dequeue packets.
       It's called throttling, and accuracy of it is dependent on how
       the kernel is compiled.

       There're 3 important options in modern kernels, as far as timers'
       resolution goes: 'tickless system', 'high resolution timer
       support' and 'timer frequency'.

       If you have 'tickless system' enabled, then the timer interrupt
       will trigger as slowly as possible, but each time a scheduler
       throttles itself (or any other part of the kernel needs better
       accuracy), the rate will be increased as needed / possible. The
       ceiling is either 'timer frequency' if 'high resolution timer
       support' is not available or not compiled in, or it's hardware
       dependent and can go far beyond the highest 'timer frequency'
       setting available.

       If 'tickless system' is not enabled, the timer will trigger at a
       fixed rate specified by 'timer frequency' - regardless if high
       resolution timers are or aren't available.

       This is important to keep those settings in mind, as in scenario
       like: no tickless, no HR timers, frequency set to 100hz -
       throttling accuracy would be at 10ms. It doesn't automatically
       mean you would be limited to ~0.8Mbit/s (assuming packets at
       ~1KB) - as long as your queues are prepared to cover for timer
       inaccuracy. Of course, in case of e.g. locally generated UDP
       traffic - appropriate socket size is needed as well. Short
       example to make it more understandable (assume hardcore
       anti-schedule settings - HZ=100, no HR timers, no tickless):

       tc qdisc add dev eth0 root handle 1:0 hfsc default 1
       tc class add dev eth0 parent 1:0 classid 1:1 hfsc rt m2 10Mbit

       Assuming packet of ~1KB size and HZ=100, that averages to
       ~0.8Mbit - anything beyond it (e.g. the above example with
       specified rate over 10x larger) will require appropriate queuing
       and cause bursts every ~10 ms. As you can imagine, any HFSC's RT
       guarantees will be seriously invalidated by that.  Aforementioned
       example is mainly important if you deal with old hardware - as is
       particularly popular for home server chores. Even then, you can
       easily set HZ=1000 and have very accurate scheduling for typical
       adsl speeds.

       Anything modern (apic or even hpet msi based timers + 'tickless
       system') will provide enough accuracy for superb 1Gbit
       scheduling. For example, on one of my cheap dual-core AMD boards
       I have the following settings:

       tc qdisc add dev eth0 parent root handle 1:0 hfsc default 1
       tc class add dev eth0 parent 1:0 classid 1:1 hfsc rt m2 300mbit

       And a simple:

       nc -u dst.host.com 54321 </dev/zero
       nc -l -p 54321 >/dev/null

       ...will yield the following effects over a period of ~10 seconds
       (taken from /proc/interrupts):

       319: 42124229   0  HPET_MSI-edge  hpet2 (before)
       319: 42436214   0  HPET_MSI-edge  hpet2 (after 10s.)

       That's roughly 31000/s. Now compare it with HZ=1000 setting. The
       obvious drawback of it is that cpu load can be rather high with
       servicing that many timer interrupts. The example with 300Mbit RT
       service curve on 1Gbit link is particularly ugly, as it requires
       a lot of throttling with minuscule delays.

       Also note that it's just an example showing the capabilities of
       current hardware.  The above example (essentially a 300Mbit TBF
       emulator) is pointless on an internal interface to begin with:
       you will pretty much always want a regular LS service curve
       there, and in such a scenario HFSC simply doesn't throttle at
       all.

       300Mbit RT service curve (selected columns from mpstat -P ALL 1):

       10:56:43 PM  CPU  %sys     %irq   %soft   %idle
       10:56:44 PM  all  20.10    6.53   34.67   37.19
       10:56:44 PM    0  35.00    0.00   63.00    0.00
       10:56:44 PM    1   4.95   12.87    6.93   73.27

       So, in the rare case you need those speeds with only a RT service
       curve, or with a UL service curve: remember the drawbacks.

CAVEAT: RANDOM ONLINE EXAMPLES         top

       For reasons unknown (though well guessed), many examples you can
       google love to overuse UL criterion and stuff it in every node
       possible. This makes no sense and works against what HFSC tries
       to do (and does pretty damn well). Use UL where it makes sense:
       on the uppermost node to match upstream router's uplink capacity.
       Or in special cases, such as testing (limit certain subtree to
       some speed), or customers that must never get more than certain
       speed. In the last case you can usually achieve the same by just
       using a RT criterion without LS+UL on leaf nodes.

       As for the router case - remember it's good to differentiate
       between "traffic to router" (remote console, web config, etc.)
       and "outgoing traffic", so for example:

       tc qdisc add dev eth0 root handle 1:0 hfsc default 0x8002
       tc class add dev eth0 parent 1:0 classid 1:999 hfsc rt m2 50Mbit
       tc class add dev eth0 parent 1:0 classid 1:1 hfsc ls m2 2Mbit ul m2 2Mbit

       ... so "internet" tree under 1:1 and "router itself" as 1:999

LAYER2 ADAPTATION         top

       Please refer to tc-stab(8)

SEE ALSO         top

       tc(8), tc-hfsc(8), tc-stab(8)

       Please direct bugreports and patches to: <netdev@vger.kernel.org>

AUTHOR         top

       Manpage created by Michal Soltys (soltys@ziu.info)

COLOPHON         top

       This page is part of the iproute2 (utilities for controlling
       TCP/IP networking and traffic) project.  Information about the
       project can be found at 
       ⟨http://www.linuxfoundation.org/collaborate/workgroups/networking/iproute2⟩.
       If you have a bug report for this manual page, send it to
       netdev@vger.kernel.org, shemminger@osdl.org.  This page was
       obtained from the project's upstream Git repository
       ⟨https://git.kernel.org/pub/scm/network/iproute2/iproute2.git⟩ on
       2021-08-27.  (At that time, the date of the most recent commit
       that was found in the repository was 2021-08-18.)  If you
       discover any rendering problems in this HTML version of the page,
       or you believe there is a better or more up-to-date source for
       the page, or you have corrections or improvements to the
       information in this COLOPHON (which is not part of the original
       manual page), send a mail to man-pages@man7.org

iproute2                     31 October 2011                  TC-HFSC(7)

Pages that refer to this page: tc(8)tc-hfsc(8)tc-stab(8)