[net-next,v19,0/8] sched: Add Common Applications Kept Enhanced (cake) qdisc

Message ID 153089141290.14813.1010951705929696896.stgit@alrua-x1

Message

Toke Høiland-Jørgensen July 6, 2018, 3:37 p.m. UTC
This patch series adds the CAKE qdisc, and has been split up to ease
review.

I have attempted to split out each configurable feature into its own patch.
The first commit adds the base shaper and packet scheduler, while
subsequent commits add the optional features. The full userspace API and
most data structures are included in the first commit, but options not
understood in the base version will be ignored.

The result of applying the entire series is identical to the out-of-tree
version that has seen extensive testing in previous deployments, most
notably as an out-of-tree patch to OpenWrt. However, note that I have only
compile-tested the individual patches, so the whole series should be
considered as a unit.

---
Changelog

v19:
  - Rebase to current net-next.
  - Don't rely on the value of sch->q.qlen to break loops; fixes possible
    infinite loop on multi-queue devices.
  - Don't overwrite NAT flag when setting flow mode.

v18:
  - Rework classification logic in the diffserv case to always hash if
    filter doesn't select a queue, and to run TC filters before
    selecting the diffserv tin (allowing filter to influence this).
  - Make sure we always call qdisc_watchdog_init() in cake_init(), so we
    don't crash in cake_destroy().

v17:
  - Rebase to newest net-next and move the conntrack callback to
    nf_ct_hook
  - Fix a compile error when NF_CONNTRACK is unset.

v16:
  - Move conntrack lookup function into conntrack core and read it via
    RCU so it is only active when the nf_conntrack module is loaded.
    This avoids the module dependency on conntrack for NAT mode. Thanks
    to Pablo for the idea.

v15:
  - Handle ECN flags in ACK filter

v14:
  - Handle seqno wraps and DSACKs in ACK filter

v13:
  - Avoid ktime_t to scalar compares
  - Add class dumping and basic stats
  - Fail with ENOTSUPP when requesting NAT mode and conntrack is not
    available.
  - Parse all TCP options in ACK filter and make sure to only drop safe
    ones. Also handle SACK ranges properly.

v12:
  - Get rid of custom time typedefs. Use ktime_t for time and u64 for
    duration instead.

v11:
  - Fix overhead compensation calculation for GSO packets
  - Change configured rate to be u64 (I ran out of bits before I ran out
    of CPU when testing the effects of the above)
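
To illustrate the "ran out of bits" remark in v11: a small sketch of the ceiling a 32-bit rate field imposes. It assumes the configured rate is stored in bytes per second (an assumption, not stated in this changelog); the helper names are hypothetical and not taken from sch_cake.

```c
#include <stdbool.h>
#include <stdint.h>

/* Highest rate, in bits per second, that a u32 field can express
 * if the field holds bytes per second (assumed unit). */
static uint64_t max_u32_rate_bits_per_sec(void)
{
	return (uint64_t)UINT32_MAX * 8;
}

/* Hypothetical check: does a rate given in bytes per second
 * still fit in a 32-bit field? */
static bool rate_fits_u32(uint64_t bytes_per_sec)
{
	return bytes_per_sec <= UINT32_MAX;
}
```

Under that assumption the 32-bit ceiling is a little over 34 Gbit/s, so a 40 Gbit/s shaper (5,000,000,000 bytes per second) would not fit, while a u64 field leaves ample headroom.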

v10:
  - Christmas tree gardening (fix variable declarations to be in reverse
    line length order)
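
For readers unfamiliar with the "Christmas tree" convention mentioned in v10, here is a minimal sketch of local declarations ordered from the longest line to the shortest. The function itself is a made-up example and has nothing to do with sch_cake.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum an array of packet lengths. The local
 * declarations are sorted by line length, longest first. */
static uint64_t sum_lengths(const uint32_t *lens, size_t n)
{
	const uint32_t *end = lens + n;
	uint64_t total = 0;

	while (lens < end)
		total += *lens++;

	return total;
}
```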

v9:
  - Remove duplicated checks around kvfree() and just call it
    unconditionally.
  - Don't pass __GFP_NOWARN when allocating memory
  - Move options in cake_dump() that are related to optional features to
    later patches implementing the features.
  - Support attaching filters to the qdisc and use the classification
    result to select flow queue.
  - Support overriding diffserv priority tin from skb->priority

v8:
  - Remove inline keyword from function definitions
  - Simplify ACK filter; remove the complex state handling to make the
    logic easier to follow. This will potentially be a bit less efficient,
    but I have not been able to measure a difference.

v7:
  - Split up patch into a series to ease review.
  - Constify the ACK filter.

v6:
  - Fix 6in4 encapsulation checks in ACK filter code
  - Checkpatch fixes

v5:
  - Refactor ACK filter code and hopefully fix the safety issues
    properly this time.

v4:
  - Only split GSO packets if shaping at speeds <= 1Gbps
  - Fix overhead calculation code to also work for GSO packets
  - Don't re-implement kvzalloc()
  - Remove local header include from out-of-tree build (fixes kbuild-bot
    complaint).
  - Several fixes to the ACK filter:
    - Check pskb_may_pull() before deref of transport headers.
    - Don't run ACK filter logic on split GSO packets
    - Fix TCP sequence number compare to deal with wraparounds
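
As an illustration of the wraparound fix in the v4 list above, a minimal sketch of comparing 32-bit TCP sequence numbers in a wraparound-safe way. It uses the same signed-difference idiom as the kernel's before()/after() helpers; the names seq_before() and seq_after() are made up for this example, and this is not the code from the ACK filter itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if sequence number a comes before b, modulo 2^32.
 * The unsigned subtraction wraps; reading the result as signed
 * gives the direction of the shorter path between a and b. */
static bool seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* True if sequence number a comes after b, modulo 2^32. */
static bool seq_after(uint32_t a, uint32_t b)
{
	return seq_before(b, a);
}
```

A plain `a < b` would report 0xFFFFFFF0 as later than 0x10, even though the second value is only 32 bytes further along once the counter has wrapped.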

v3:
  - Use IS_REACHABLE() macro to fix compilation when sch_cake is
    built-in and conntrack is a module.
  - Switch the stats output to use nested netlink attributes instead
    of a versioned struct.
  - Remove GPL boilerplate.
  - Fix array initialisation style.

v2:
  - Fix kbuild test bot complaint
  - Clean up the netlink ABI
  - Fix checkpatch complaints
  - A few tweaks to the behaviour of cake based on testing carried out
    while writing the paper.

---

Toke Høiland-Jørgensen (8):
      sched: Add Common Applications Kept Enhanced (cake) qdisc
      sch_cake: Add ingress mode
      sch_cake: Add optional ACK filter
      netfilter: Add nf_ct_get_tuple_skb global lookup function
      sch_cake: Add NAT awareness to packet classifier
      sch_cake: Add DiffServ handling
      sch_cake: Add overhead compensation support to the rate shaper
      sch_cake: Conditionally split GSO segments


 include/linux/netfilter.h         |   11 
 include/uapi/linux/pkt_sched.h    |  114 +
 net/netfilter/core.c              |   15 
 net/netfilter/nf_conntrack_core.c |   36 
 net/sched/Kconfig                 |   11 
 net/sched/Makefile                |    1 
 net/sched/sch_cake.c              | 3019 +++++++++++++++++++++++++++++++++++++
 7 files changed, 3207 insertions(+)
 create mode 100644 net/sched/sch_cake.c

Comments

David Miller July 11, 2018, 5:56 a.m. UTC | #1
From: Toke Høiland-Jørgensen <toke@toke.dk>
Date: Fri, 06 Jul 2018 17:37:19 +0200

> This patch series adds the CAKE qdisc, and has been split up to ease
> review.
> 
> I have attempted to split out each configurable feature into its own patch.
> The first commit adds the base shaper and packet scheduler, while
> subsequent commits add the optional features. The full userspace API and
> most data structures are included in the first commit, but options not
> understood in the base version will be ignored.
> 
> The result of applying the entire series is identical to the out-of-tree
> version that has seen extensive testing in previous deployments, most
> notably as an out-of-tree patch to OpenWrt. However, note that I have only
> compile-tested the individual patches, so the whole series should be
> considered as a unit.

Ok, I decided to apply this even though there are still bits I'm not
100% happy with.

I don't like the netfilter dependency at all.

You can get the NAT addresses in other ways as I've tried to suggest
in the past.  Your scheme absolutely does not work with act_nat
in the packet scheduler, nor any NAT done by XDP/eBPF programs.

Toke Høiland-Jørgensen July 11, 2018, 8:40 p.m. UTC | #2
David Miller <davem@davemloft.net> writes:

> From: Toke Høiland-Jørgensen <toke@toke.dk>
> Date: Fri, 06 Jul 2018 17:37:19 +0200
>
>> This patch series adds the CAKE qdisc, and has been split up to ease
>> review.
>> 
>> I have attempted to split out each configurable feature into its own patch.
>> The first commit adds the base shaper and packet scheduler, while
>> subsequent commits add the optional features. The full userspace API and
>> most data structures are included in the first commit, but options not
>> understood in the base version will be ignored.
>> 
>> The result of applying the entire series is identical to the out-of-tree
>> version that has seen extensive testing in previous deployments, most
>> notably as an out-of-tree patch to OpenWrt. However, note that I have only
>> compile-tested the individual patches, so the whole series should be
>> considered as a unit.
>
> Ok, I decided to apply this even though there are still bits I'm not
> 100% happy with.

Yay, awesome, thanks! :)

> I don't like the netfilter dependency at all.
>
> You can get the NAT addresses in other ways as I've tried to suggest
> in the past. Your scheme absolutely does not work with act_nat in the
> packet scheduler, nor any NAT done by XDP/eBPF programs.

Just to reiterate why we didn't go with your suggestion of recording the
pre-NAT IP in the flow dissector as the packet comes in:

- It only works on egress; on ingress (with an ifb), packets hit the
  qdisc before NAT, so we need the stateful lookup in CAKE for this
  case, which is a common deployment scenario.

- It's not needed for act_nat (for 1-to-1 NAT, hashing on the post-NAT
  IP is fine), and it won't work for XDP (which would change the packets
  before the flow dissector sees them). This means that custom NAT
  solutions in TC BPF hooks are the only ones that would benefit; and
  they can just set the classifier to achieve the same thing.
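
To make the 1-to-1 point concrete, here is a toy sketch of how many distinct hosts a host-isolating scheduler would see if it keyed on the post-NAT source address. All names and addresses are hypothetical, and the "masquerade" case (many hosts behind one address) is added for contrast; none of this is code from CAKE.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*nat_fn)(uint32_t);

/* Hypothetical 1-to-1 NAT: every internal host gets its own
 * external address in 203.0.113.0/24. */
static uint32_t nat_one_to_one(uint32_t internal)
{
	return 0xCB007100u | (internal & 0xFFu);
}

/* Hypothetical masquerading NAT: all hosts share 203.0.113.1. */
static uint32_t nat_masquerade(uint32_t internal)
{
	(void)internal;
	return 0xCB007101u;
}

/* Count distinct addresses in a small array. */
static size_t distinct_hosts(const uint32_t *addrs, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		size_t j;

		for (j = 0; j < i; j++)
			if (addrs[j] == addrs[i])
				break;
		if (j == i)
			count++;
	}
	return count;
}

/* Number of hosts visible when keying on the post-NAT address.
 * Handles at most 16 hosts to keep the sketch simple. */
static size_t hosts_seen_after_nat(const uint32_t *internal, size_t n,
				   nat_fn nat)
{
	uint32_t mapped[16];

	if (n > 16)
		n = 16;
	for (size_t i = 0; i < n; i++)
		mapped[i] = nat(internal[i]);
	return distinct_hosts(mapped, n);
}
```

With 1-to-1 translation the hosts stay distinguishable after NAT, so the post-NAT address is a perfectly good key. With many-to-one translation they collapse into a single address, which is the case where the pre-NAT address has to be recovered some other way.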

Now, I'm absolutely not opposed to having this as a fallback egress-only
mechanism. I might even be convinced to write it myself if someone
demonstrates that they really need it :)

-Toke