[net-next,v5,00/11] ethtool: Add support for frame preemption

Message ID 20220520011538.1098888-1-vinicius.gomes@intel.com

Message

Vinicius Costa Gomes May 20, 2022, 1:15 a.m. UTC
Hi,

Please consider this an RFC of PATCH-like quality (in short, even in
the absence of comments, please do not apply this series as is); my
aim is to reach a consensus on the userspace API.

I also found some weirdness with the Intel I226 that I would like to
investigate further. So it's perhaps a good use of everyone's time to
have this series out, so people can take a look at the more
controversial parts while I investigate/fix those issues.

(The checkpatch.pl warnings about the spelling of "preemptible" are
ignored because that's the way it's spelled in IEEE 802.1Q-2018;
in IEEE 802.3-2018 it's "preemptable". It's a mess.)

Changes from v4:
 - Went back to exposing the per-queue frame preemption bits via
   ethtool-netlink only; doing it via taprio/mqprio was seen as too
   much trouble (Vladimir Oltean).
 - Fixed the documentation and reorganized the code/patches (Vladimir
   Oltean).

Changes from v3:
 - Added early support for sending/receiving verification frames
   (Vladimir Oltean). This is a bit more than RFC-quality, but I'm
   adding it so people can see how it fits together with the rest.
   The driver-specific bits are interesting because the hardware does
   the absolute minimum; the driver needs to do the heavy lifting.

 - Added support for setting preemptible/express traffic classes via
   tc-mqprio (Vladimir Oltean). mqprio's parsing of configuration
   options is... interesting, so comments here are going to be useful;
   I may have missed something.

Changes from v2:
 - Fixed some copy&paste mistakes, documentation formatting and
   slightly improved error reporting (Jakub Kicinski);

Changes from v1:
 - The minimum fragment size is now configured in bytes, to be more
   future proof in case the standard changes (the previous definition
   was '(X + 1) * 64', with X in [0..3]) (Michal Kubecek);
 - In taprio, frame preemption is now configured by traffic classes (was
   done by queues) (Jakub Kicinski, Vladimir Oltean);
 - Various netlink protocol validation improvements (Jakub Kicinski);
 - Dropped the IGC register dump for frame preemption registers, until a
   standardized way of exposing that is agreed upon (Jakub Kicinski);

Changes from RFC v2:
 - Reorganised the offload enabling/disabling on the driver side;
 - Added a few igc fixes;

Changes from RFC v1:
 - The per-queue preemptible/express setting is moved to applicable
   qdiscs (Jakub Kicinski and others);
 - "min-frag-size" now follows the 802.3br specification more closely,
   it's expressed as X in '64(1 + X) + 4' (Joergen Andreasen);

Another point that should be noted is the addition of the
TC_SETUP_PREEMPT offload type; the idea behind it is to allow other
qdiscs (I was thinking of mqprio) to also configure which traffic
classes should be marked as express/preemptible.
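
As a rough illustration of the shape this could take (the structure
layout and the foo_* names below are placeholders, not the actual
definitions from these patches), a driver would pick the new offload
type up in its .ndo_setup_tc callback:

struct tc_preempt_qopt_offload {
        u32 preemptible_tcs;    /* BIT(tc) set => traffic class tc is preemptible */
};

static int foo_setup_tc(struct net_device *dev, enum tc_setup_type type,
                        void *type_data)
{
        switch (type) {
        case TC_SETUP_PREEMPT:
                /* remap TCs to whatever the hardware understands and apply */
                return foo_setup_preempt(netdev_priv(dev), type_data);
        default:
                return -EOPNOTSUPP;
        }
}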

Original cover letter (lightly edited):

This is still an RFC for two main reasons. I want to confirm that
this approach (per-queue settings via qdiscs, device settings via
ethtool) looks good, even though there aren't many other options left ;-)
The other reason is that while testing this I found some weirdness
in the driver that I would need a bit more time to investigate.

(In case these patches are not enough to give an idea of how things
work, I can send the userspace patches, of course.)

The idea of this "hybrid" approach is that applications/users would do
the following steps to configure frame preemption:

$ tc qdisc replace dev $IFACE parent root handle 100 taprio \
      num_tc 3 \
      map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
      queues 1@0 1@1 2@2 \
      base-time $BASE_TIME \
      sched-entry S 0f 10000000 \
      preempt 1110 \
      flags 0x2 

The "preempt" parameter is the only difference, it configures which
traffic classes are marked as preemptible, in this example, traffic
class 0 is marked as "not preemptible", so it is express, the rest of
the four traffic classes are preemptible.

The next step of this example would be to enable frame preemption in
the device via ethtool and set the minimum fragment size to 192 bytes:

$ sudo ./ethtool --set-frame-preemption $IFACE fp on min-frag-size 192
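
For reference, mapping a byte value back to the 802.3br "additional
fragment size" multiplier could look roughly like the sketch below
(assuming the '(X + 1) * 64' relationship mentioned in the v1
changelog; this is not the helper from these patches). 192 bytes
corresponds to X = 2:

static u32 min_frag_size_to_mult(u32 min_frag_size_bytes)
{
        s32 mult = (s32)(min_frag_size_bytes / 64) - 1;

        /* 802.3br only defines multipliers 0..3 */
        return clamp_t(s32, mult, 0, 3);
}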

Cheers,


Vinicius Costa Gomes (11):
  ethtool: Add support for configuring frame preemption
  ethtool: Add support for Frame Preemption verification
  igc: Add support for receiving frames with all zeroes address
  igc: Set the RX packet buffer size for TSN mode
  igc: Optimze TX buffer sizes for TSN
  igc: Add support for receiving errored frames
  igc: Add support for enabling frame preemption via ethtool
  igc: Add support for setting frame preemption configuration
  igc: Add support for Frame Preemption verification
  igc: Check incompatible configs for Frame Preemption
  igc: Add support for exposing frame preemption stats registers

 Documentation/networking/ethtool-netlink.rst |  55 ++++
 drivers/net/ethernet/intel/igc/igc.h         |  29 ++-
 drivers/net/ethernet/intel/igc/igc_defines.h |  22 +-
 drivers/net/ethernet/intel/igc/igc_ethtool.c |  92 +++++++
 drivers/net/ethernet/intel/igc/igc_main.c    | 256 +++++++++++++++++++
 drivers/net/ethernet/intel/igc/igc_regs.h    |  10 +
 drivers/net/ethernet/intel/igc/igc_tsn.c     |  57 ++++-
 include/linux/ethtool.h                      |  26 ++
 include/uapi/linux/ethtool_netlink.h         |  20 ++
 net/ethtool/Makefile                         |   3 +-
 net/ethtool/common.c                         |  23 ++
 net/ethtool/netlink.c                        |  19 ++
 net/ethtool/netlink.h                        |   4 +
 net/ethtool/preempt.c                        | 188 ++++++++++++++
 14 files changed, 791 insertions(+), 13 deletions(-)
 create mode 100644 net/ethtool/preempt.c

Comments

Jakub Kicinski May 20, 2022, 10:34 p.m. UTC | #1
On Thu, 19 May 2022 18:15:27 -0700 Vinicius Costa Gomes wrote:
> Changes from v4:
>  - Went back to exposing the per-queue frame preemption bits via
>    ethtool-netlink only, via taprio/mqprio was seen as too much
>    trouble. (Vladimir Oltean)
>  - Fixed documentation and code/patch organization changes (Vladimir
>    Oltean).

First of all - could you please, please, please rev these patches more
than once a year? It's really hard to keep track of the context when
the previous version was sent in Jun 2021 :/

I disagree that queue mask belongs in ethtool. It's an attribute of 
a queue and should be attached to a queue. The DCBNL parallel is flawed
IMO because pause generation is Rx, not Tx. There is no Rx queue in
Linux, much less per-prio.
Vladimir Oltean May 21, 2022, 3:03 p.m. UTC | #2
Hi Jakub,

On Fri, May 20, 2022 at 03:34:13PM -0700, Jakub Kicinski wrote:
> On Thu, 19 May 2022 18:15:27 -0700 Vinicius Costa Gomes wrote:
> > Changes from v4:
> >  - Went back to exposing the per-queue frame preemption bits via
> >    ethtool-netlink only, via taprio/mqprio was seen as too much
> >    trouble. (Vladimir Oltean)
> >  - Fixed documentation and code/patch organization changes (Vladimir
> >    Oltean).
> 
> First of all - could you please, please, please rev these patches more
> than once a year? It's really hard to keep track of the context when
> previous version was sent in Jun 2021 :/

It would have been nice if Vinicius had posted these more than
once a year. But let's not throw stones at each other, I'm sure everyone
is doing their best ;) These are new specs, their usefulness in the Real
World is still being evaluated, and you shouldn't underestimate the
difficulty of exposing standard Linux interfaces for pre-standard or
first-generation hardware. Even the mistakes I may be making in my
interpretation of the spec below are in good faith (i.e. if I'm wrong
it's because I'm stupid, not because I'm interested in leaning the
implementation towards an interface that's more convenient for the
hardware I'm working with).



> I disagree that queue mask belongs in ethtool. It's an attribute of 
> a queue and should be attached to a queue.

Sure, you have very strong reasons to disagree with that statement, if
only the premise were true. But you have to understand that IEEE 802.1Q
does not talk about preemptible queues, but about preemptible priorities.
Here:

| 6.7.1 Support of the ISS by IEEE Std 802.3 (Ethernet)
| 
| For priority values that are identified in the frame preemption status table
| (6.7.2) as preemptible, frames that are selected for transmission shall be
| transmitted using the pMAC service instance, and for priority values that are
| identified in the frame preemption status table as express, frames that are
| selected for transmission shall be transmitted using the eMAC service instance.
| In all other respects, the Port behaves as if it is supported by a single MAC
| service interface. In particular, all frames received by the Port are treated
| as if they were received on a single MAC service interface regardless of
| whether they were received on the eMAC service interface or the pMAC service
| interface, except with respect to frame preemption.
| 
| 6.7.2 Frame preemption
| If the Port supports frame preemption, then a value of frame preemption status
| is assigned to each value of priority via a frame preemption status table. The
| possible values of frame preemption status are express or preemptible.
| The frame preemption status table can be changed by management as described in
| 12.30.1.1. The default value of frame preemption status is express for all
| priority values.

For context, I probably need to point out the distinction that the spec
makes between a priority and a traffic class.

A priority is a number assigned to a packet based on the VLAN PCP using
the rules in clause 6.9.3 Priority Code Point encoding. In Linux, it is
more or less equivalent to skb->priority.

A traffic class, on the other hand, is defined as basically synonymous
with a TX priority queue, as follows:

| 3.268 traffic class: A classification used to expedite transmission of frames
| generated by critical or time-sensitive services. Traffic classes are numbered
| from zero through N-1, where N is the number of outbound queues associated with
| a given Bridge Port, and 1 <= N <= 8, and each traffic class has a one-to-one
| correspondence with a specific outbound queue for that Port. Traffic class 0
| corresponds to nonexpedited traffic; nonzero traffic classes correspond to
| expedited classes of traffic. A fixed mapping determines, for a given priority
| associated with a frame and a given number of traffic classes, what traffic
| class will be assigned to the frame.

A priority is translated into a traffic class using Table 8-5:
Recommended priority to traffic class mappings, which in Linux would be
handled using the tc-mqprio "map".

But attention: a priority TX queue is not the same as a netdev TX queue,
but rather the same as a tc-mqprio traffic class (i.e. when you specify
"queues count@offset" to mqprio, from the Linux perspective there are
"count" queues, while from the 802.1Q perspective there is only the
"offset" queue, or TC). This is because we may have per-CPU queues, etc.

This is even spelled out in this note:

| NOTE 3 - A queue in this context is not necessarily a single FIFO data structure.
| A queue is a record of all frames of a given traffic class awaiting
| transmission on a given Bridge Port. The structure of this record is not
| specified. The transmission selection algorithm (8.6.8) determines which
| traffic class, among those classes with frames available for transmission,
| provides the next frame for transmission. The method of determining which frame
| within a traffic class is the next available frame is not specified beyond
| conforming to the frame ordering requirements of this subclause. This allows a
| variety of queue structures such as a single FIFO, or a set of FIFOs with one
| for each pairing of ingress and egress ports (i.e., Virtual Output Queuing), or
| a set of FIFOs with one for each VLAN or priority, or hierarchical structures.

I'm not sure how much of this was already clear and how much wasn't.
I apologize if I'm not bringing new info to the table. I just want to
point out what a "queue" is, and what a "priority" is.
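
To connect these terms to the Linux side: the prio -> tc and tc -> txq
mappings already live in the netdev, filled in by e.g. the tc-mqprio
"map" and "queues" parameters. A minimal sketch using existing netdev
helpers (the wrapper function itself is only illustrative):

static void show_prio_mapping(struct net_device *dev, u32 prio)
{
        u8 tc;
        u16 offset, count;

        if (!netdev_get_num_tc(dev))
                return;         /* no prio -> tc map installed */

        /* 802.1Q priority -> traffic class (the Table 8-5 equivalent) */
        tc = netdev_get_prio_tc_map(dev, prio);
        /* traffic class -> contiguous range of netdev TX queues */
        offset = dev->tc_to_txq[tc].offset;
        count = dev->tc_to_txq[tc].count;

        netdev_info(dev, "prio %u -> tc %u -> txq %u..%u\n",
                    prio, tc, offset, offset + count - 1);
}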



> The DCBNL parallel is flawed IMO because pause generation is Rx, not
> Tx. There is no Rx queue in Linux, much less per-prio.

First of all: we both know that PFC is not only about RX, right? :) Here:

| 8.6.8 Transmission selection
| In a port of a Bridge or station that supports PFC, a frame of priority
| n is not available for transmission if that priority is paused (i.e., if
| Priority_Paused[n] is TRUE (see 36.1.3.2) on that port.
| 
| NOTE 1 - Two or more priorities can be combined in a single queue. In
| this case if one or more of the priorities in the queue are paused, it
| is possible for frames in that queue not belonging to the paused
| priority to not be scheduled for transmission.
| 
| NOTE 2 - Mixing PFC and non-PFC priorities in the same queue results in
| non-PFC traffic being paused causing congestion spreading, and therefore
| is not recommended.

And that's kind of my whole point: PFC is per _priority_, not per
"queue"/"traffic class". And so is frame preemption (right below, same
clause). So the parallel isn't flawed at all. The dcbnl-pfc isn't in tc
for a reason, and that isn't because we don't have RX netdev queues...
And the reason why dcbnl-pfc isn't in tc is the same reason why frame
preemption shouldn't be in tc either, but in ethtool.

| In a port of a Bridge or station that supports frame preemption, a frame
| of priority n is not available for transmission if that priority is
| identified in the frame preemption status table (6.7.2) as preemptible
| and either the holdRequest object (12.30.1.5) is set to the value hold,
| or the transmission of a prior preemptible frame has yet to complete
| because it has been interrupted to allow the transmission of an express
| frame.

So since the managed objects for frame preemption are stipulated by IEEE
per priority:

| The framePreemptionStatusTable (6.7.2) consists of 8
| framePreemptionAdminStatus values (12.30.1.1.1), one per priority.

I think it is only reasonable for Linux to expose the same thing, and
let drivers do the priority to queue or traffic class remapping as they
see fit, when tc-mqprio or tc-taprio or other qdiscs that change this
mapping are installed (if their preemption hardware implementation is
per TC or queue rather than per priority). After all, you can have 2
priorities mapped to the same TC, but still have one express and one
preemptible. That is to say, you can implement preemption even in single
"queue" devices, and it even makes sense.
Jakub Kicinski May 23, 2022, 7:52 p.m. UTC | #3
On Sat, 21 May 2022 15:03:05 +0000 Vladimir Oltean wrote:
> > I disagree that queue mask belongs in ethtool. It's an attribute of 
> > a queue and should be attached to a queue.  
> 
> Sure, you have very strong reasons to disagree with that statement, if
> only the premise were true. But you have to understand that IEEE 802.1Q
> does not talk about preemptible queues, but about preemptible priorities.
> Here:
> 
> | 6.7.1 Support of the ISS by IEEE Std 802.3 (Ethernet)
> | 
> | For priority values that are identified in the frame preemption status table
> | (6.7.2) as preemptible, frames that are selected for transmission shall be
> | transmitted using the pMAC service instance, and for priority values that are
> | identified in the frame preemption status table as express, frames that are
> | selected for transmission shall be transmitted using the eMAC service instance.
> | In all other respects, the Port behaves as if it is supported by a single MAC
> | service interface. In particular, all frames received by the Port are treated
> | as if they were received on a single MAC service interface regardless of
> | whether they were received on the eMAC service interface or the pMAC service
> | interface, except with respect to frame preemption.
> | 
> | 6.7.2 Frame preemption
> | If the Port supports frame preemption, then a value of frame preemption status
> | is assigned to each value of priority via a frame preemption status table. The
> | possible values of frame preemption status are express or preemptible.
> | The frame preemption status table can be changed by management as described in
> | 12.30.1.1. The default value of frame preemption status is express for all
> | priority values.
> 
> For context, I probably need to point out the distinction that the spec
> makes between a priority and a traffic class.
> 
> A priority is a number assigned to a packet based on the VLAN PCP using
> the rules in clause 6.9.3 Priority Code Point encoding. In Linux, it is
> more or less equivalent to skb->priority.
> 
> A traffic class, on the other hand, is defined as basically synonimous
> with a TX priority queue, as follows:
> 
> | 3.268 traffic class: A classification used to expedite transmission of frames
> | generated by critical or time-sensitive services. Traffic classes are numbered
> | from zero through N-1, where N is the number of outbound queues associated with
> | a given Bridge Port, and 1 <= N <= 8, and each traffic class has a one-to-one
> | correspondence with a specific outbound queue for that Port. Traffic class 0
> | corresponds to nonexpedited traffic; nonzero traffic classes correspond to
> | expedited classes of traffic. A fixed mapping determines, for a given priority
> | associated with a frame and a given number of traffic classes, what traffic
> | class will be assigned to the frame.
> 
> A priority is translated into a traffic class using Table 8-5:
> Recommended priority to traffic class mappings, which in Linux would be
> handled using the tc-mqprio "map".
> 
> But attention, a priority TX queue is not the same as a netdev TX queue,
> but rather the same as a tc-mqprio traffic class (i.e. when you specify
> "queues count@offset" to mqprio, from Linux perspective there are "count"
> queues, from 802.1Q perspective there is only the "offset" queue (or TC).
> This is because we may have per-CPU queues, etc.
> 
> This is even spelled out in this note:
> 
> | NOTE 3 - A queue in this context is not necessarily a single FIFO data structure.
> | A queue is a record of all frames of a given traffic class awaiting
> | transmission on a given Bridge Port. The structure of this record is not
> | specified. The transmission selection algorithm (8.6.8) determines which
> | traffic class, among those classes with frames available for transmission,
> | provides the next frame for transmission. The method of determining which frame
> | within a traffic class is the next available frame is not specified beyond
> | conforming to the frame ordering requirements of this subclause. This allows a
> | variety of queue structures such as a single FIFO, or a set of FIFOs with one
> | for each pairing of ingress and egress ports (i.e., Virtual Output Queuing), or
> | a set of FIFOs with one for each VLAN or priority, or hierarchical structures.
> 
> I'm not sure how much of this was already clear and how much wasn't.
> I apologize if I'm not bringing new info to the table. I just want to
> point out what a "queue" is, and what a "priority" is.

Very useful, thanks!

> > The DCBNL parallel is flawed IMO because pause generation is Rx, not
> > Tx. There is no Rx queue in Linux, much less per-prio.  
> 
> First of all: we both know that PFC is not only about RX, right? :) Here:
> 
> | 8.6.8 Transmission selection
> | In a port of a Bridge or station that supports PFC, a frame of priority
> | n is not available for transmission if that priority is paused (i.e., if
> | Priority_Paused[n] is TRUE (see 36.1.3.2) on that port.
> | 
> | NOTE 1 - Two or more priorities can be combined in a single queue. In
> | this case if one or more of the priorities in the queue are paused, it
> | is possible for frames in that queue not belonging to the paused
> | priority to not be scheduled for transmission.
> | 
> | NOTE 2 - Mixing PFC and non-PFC priorities in the same queue results in
> | non-PFC traffic being paused causing congestion spreading, and therefore
> | is not recommended.
> 
> And that's kind of my whole point: PFC is per _priority_, not per
> "queue"/"traffic class". And so is frame preemption (right below, same
> clause). So the parallel isn't flawed at all. The dcbnl-pfc isn't in tc
> for a reason, and that isn't because we don't have RX netdev queues...
> And the reason why dcbnl-pfc isn't in tc is the same reason why ethtool
> frame preemption shouldn't, either.

My understanding is that DCBNL is not in ethtool because it was built
primarily for converged Ethernet. ethtool, being a netdev thing, is
largely confined to coarse interface configuration in such
environments; they certainly don't use TC to control RDMA queues.

To put it differently, DCBNL separates RoCE and storage queues from
netdev queues (the latter being lossy). It's Conway's law at work.

Frame preemption falls entirely into netdev land. We can use the right
interface rather than building a FW shim^W "generic" interface.

> | In a port of a Bridge or station that supports frame preemption, a frame
> | of priority n is not available for transmission if that priority is
> | identified in the frame preemption status table (6.7.2) as preemptible
> | and either the holdRequest object (12.30.1.5) is set to the value hold,
> | or the transmission of a prior preemptible frame has yet to complete
> | because it has been interrupted to allow the transmission of an express
> | frame.
> 
> So since the managed objects for frame preemption are stipulated by IEEE
> per priority:
> 
> | The framePreemptionStatusTable (6.7.2) consists of 8
> | framePreemptionAdminStatus values (12.30.1.1.1), one per priority.
> 
> I think it is only reasonable for Linux to expose the same thing, and
> let drivers do the priority to queue or traffic class remapping as they
> see fit, when tc-mqprio or tc-taprio or other qdiscs that change this
> mapping are installed (if their preemption hardware implementation is
> per TC or queue rather than per priority). After all, you can have 2
> priorities mapped to the same TC, but still have one express and one
> preemptible. That is to say, you can implement preemption even in single
> "queue" devices, and it even makes sense.

Honestly I feel like I'm missing a key detail because all you wrote
sounds like an argument _against_ exposing the queue mask in ethtool.
Neither the standard calls for it, nor is it convenient to the user
who sets the prio->tc and queue allocation in TC.

If we wanted to expose prio mask in ethtool, that's a different story.
Vladimir Oltean May 23, 2022, 8:32 p.m. UTC | #4
On Mon, May 23, 2022 at 12:52:38PM -0700, Jakub Kicinski wrote:
> > > The DCBNL parallel is flawed IMO because pause generation is Rx, not
> > > Tx. There is no Rx queue in Linux, much less per-prio.  
> > 
> > First of all: we both know that PFC is not only about RX, right? :) Here:
> > 
> > | 8.6.8 Transmission selection
> > | In a port of a Bridge or station that supports PFC, a frame of priority
> > | n is not available for transmission if that priority is paused (i.e., if
> > | Priority_Paused[n] is TRUE (see 36.1.3.2) on that port.
> > | 
> > | NOTE 1 - Two or more priorities can be combined in a single queue. In
> > | this case if one or more of the priorities in the queue are paused, it
> > | is possible for frames in that queue not belonging to the paused
> > | priority to not be scheduled for transmission.
> > | 
> > | NOTE 2 - Mixing PFC and non-PFC priorities in the same queue results in
> > | non-PFC traffic being paused causing congestion spreading, and therefore
> > | is not recommended.
> > 
> > And that's kind of my whole point: PFC is per _priority_, not per
> > "queue"/"traffic class". And so is frame preemption (right below, same
> > clause). So the parallel isn't flawed at all. The dcbnl-pfc isn't in tc
> > for a reason, and that isn't because we don't have RX netdev queues...
> > And the reason why dcbnl-pfc isn't in tc is the same reason why ethtool
> > frame preemption shouldn't, either.
> 
> My understanding is that DCBNL is not in ethtool is that it was built
> primarily for converged Ethernet. ethtool being a netdev thing it's
> largely confined to coarse interface configuration in such
> environments, they certainly don't use TC to control RDMA queues.
> 
> To put it differently DCBNL separates RoCE and storage queues from
> netdev queues (latter being lossy). It's Conway's law at work.
> 
> Frame preemption falls entirely into netdev land. We can use the right
> interface rather than building a FW shim^W "generic" interface.

Not sure where you're aiming with this, sorry. Why dcbnl is not
integrated in ethtool is a bit beside the point. What was relevant about
PFC as an analogy was that it's something that is configured per priority
[ and not per queue ] and does not belong to the qdisc for that reason.

> > | In a port of a Bridge or station that supports frame preemption, a frame
> > | of priority n is not available for transmission if that priority is
> > | identified in the frame preemption status table (6.7.2) as preemptible
> > | and either the holdRequest object (12.30.1.5) is set to the value hold,
> > | or the transmission of a prior preemptible frame has yet to complete
> > | because it has been interrupted to allow the transmission of an express
> > | frame.
> > 
> > So since the managed objects for frame preemption are stipulated by IEEE
> > per priority:
> > 
> > | The framePreemptionStatusTable (6.7.2) consists of 8
> > | framePreemptionAdminStatus values (12.30.1.1.1), one per priority.
> > 
> > I think it is only reasonable for Linux to expose the same thing, and
> > let drivers do the priority to queue or traffic class remapping as they
> > see fit, when tc-mqprio or tc-taprio or other qdiscs that change this
> > mapping are installed (if their preemption hardware implementation is
> > per TC or queue rather than per priority). After all, you can have 2
> > priorities mapped to the same TC, but still have one express and one
> > preemptible. That is to say, you can implement preemption even in single
> > "queue" devices, and it even makes sense.
> 
> Honestly I feel like I'm missing a key detail because all you wrote
> sounds like an argument _against_ exposing the queue mask in ethtool.

Yeah, I guess the key detail that you're missing is that there's no such
thing as "preemptible queue mask" in 802.1Q. My feeling is that both
Vinicius and myself were confused in different ways by some spec
definitions and had slightly different things in mind, and we've
essentially ended up debating where a non-standard thing should go.

In my case, I said in my reply to the previous patch set that a priority
is essentially synonymous with a traffic class (which it isn't, as per
the definitions above), so I used the "traffic class" term incorrectly
and didn't capitalize the "priority" word, which I should have.
https://patchwork.kernel.org/project/netdevbpf/patch/20210626003314.3159402-3-vinicius.gomes@intel.com/#24812068

In Vinicius' case, part of the confusion might come from the fact that
his hardware really has preemption configurable per queue, and he
mistook it for the standard itself.

> Neither the standard calls for it, nor is it convenient to the user
> who sets the prio->tc and queue allocation in TC.
> 
> If we wanted to expose prio mask in ethtool, that's a different story.

Re-reading what I've said, I can't say "I was right all along"
(not by a long shot, sorry for my part in the confusion), but I guess
the conclusion is that:

(a) "preemptable queues" needs to become "preemptable priorities" in the
    UAPI. The question becomes how to expose the mask of preemptable
    priorities. A simple u8 bit mask where "BIT(i) == 1" means "prio i
    is preemptable", or with a nested netlink attribute scheme similar
    to DCB_PFC_UP_ATTR_0 -> DCB_PFC_UP_ATTR_7?

(b) keeping the "preemptable priorities" away from tc-qdisc is ok

(c) non-standard hardware should deal with prio <-> queue mapping by
    itself if its queues are what are preemptable
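
As a purely illustrative sketch of the two encodings from (a) (the
attribute names below are made up and are not the ones from these
patches):

/* Option 1: a single u8 bitmask, BIT(i) set => priority i is preemptable. */
enum {
        ETHTOOL_A_PREEMPT_PREEMPTIBLE_PRIOS,    /* u8 bitmask */
};

/* Option 2: one attribute per priority, in the spirit of dcbnl's
 * DCB_PFC_UP_ATTR_0 .. DCB_PFC_UP_ATTR_7.
 */
enum {
        ETHTOOL_A_PREEMPT_PRIO_0,       /* u8: 0 = express, 1 = preemptable */
        /* ... */
        ETHTOOL_A_PREEMPT_PRIO_7,       /* u8 */
};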
Jakub Kicinski May 23, 2022, 9:31 p.m. UTC | #5
On Mon, 23 May 2022 20:32:15 +0000 Vladimir Oltean wrote:
> > > | In a port of a Bridge or station that supports frame preemption, a frame
> > > | of priority n is not available for transmission if that priority is
> > > | identified in the frame preemption status table (6.7.2) as preemptible
> > > | and either the holdRequest object (12.30.1.5) is set to the value hold,
> > > | or the transmission of a prior preemptible frame has yet to complete
> > > | because it has been interrupted to allow the transmission of an express
> > > | frame.
> > > 
> > > So since the managed objects for frame preemption are stipulated by IEEE
> > > per priority:
> > > 
> > > | The framePreemptionStatusTable (6.7.2) consists of 8
> > > | framePreemptionAdminStatus values (12.30.1.1.1), one per priority.
> > > 
> > > I think it is only reasonable for Linux to expose the same thing, and
> > > let drivers do the priority to queue or traffic class remapping as they
> > > see fit, when tc-mqprio or tc-taprio or other qdiscs that change this
> > > mapping are installed (if their preemption hardware implementation is
> > > per TC or queue rather than per priority). After all, you can have 2
> > > priorities mapped to the same TC, but still have one express and one
> > > preemptible. That is to say, you can implement preemption even in single
> > > "queue" devices, and it even makes sense.  
> > 
> > Honestly I feel like I'm missing a key detail because all you wrote
> > sounds like an argument _against_ exposing the queue mask in ethtool.  
> 
> Yeah, I guess the key detail that you're missing is that there's no such
> thing as "preemptible queue mask" in 802.1Q. My feeling is that both
> Vinicius and myself were confused in different ways by some spec
> definitions and had slightly different things in mind, and we've
> essentially ended up debating where a non-standard thing should go.
> 
> In my case, I said in my reply to the previous patch set that a priority
> is essentially synonymous with a traffic class (which it isn't, as per
> the definitions above), so I used the "traffic class" term incorrectly
> and didn't capitalize the "priority" word, which I should have.
> https://patchwork.kernel.org/project/netdevbpf/patch/20210626003314.3159402-3-vinicius.gomes@intel.com/#24812068
> 
> In Vinicius' case, part of the confusion might come from the fact that
> his hardware really has preemption configurable per queue, and he
> mistook it for the standard itself.
> 
> > Neither the standard calls for it, nor is it convenient to the user
> > who sets the prio->tc and queue allocation in TC.
> > 
> > If we wanted to expose prio mask in ethtool, that's a different story.  
> 
> Re-reading what I've said, I can't say "I was right all along"
> (not by a long shot, sorry for my part in the confusion),

Sorry, I admit I did not go back to the archives to re-read your
feedback today. I'm purely reacting to the fact that the "preemptible
queue mask" attribute which I have successfully fought off in the
past have now returned.

Let me also spell out the source of my objection - high speed NICs
have a multitude of queues, queue groups and sub-interfaces. An ethtool
uAPI which uses a zero-based integer ID will lead to confusion and lack
of portability because users will not know the mapping and vendors
will invent whatever fits their HW best.

> but I guess the conclusion is that:
> 
> (a) "preemptable queues" needs to become "preemptable priorities" in the
>     UAPI. The question becomes how to expose the mask of preemptable
>     priorities. A simple u8 bit mask where "BIT(i) == 1" means "prio i
>     is preemptable", or with a nested netlink attribute scheme similar
>     to DCB_PFC_UP_ATTR_0 -> DCB_PFC_UP_ATTR_7?

No preference there; we can also put it in DCBnl if it fits better.

> (b) keeping the "preemptable priorities" away from tc-qdisc is ok

Ack.

> (c) non-standard hardware should deal with prio <-> queue mapping by
>     itself if its queues are what are preemptable

I'd prefer if the core had helpers to do the mapping for drivers, 
but in principle yes - make the preemptible queues an implementation
detail if possible.
Vladimir Oltean May 23, 2022, 10:49 p.m. UTC | #6
On Mon, May 23, 2022 at 02:31:16PM -0700, Jakub Kicinski wrote:
> > > If we wanted to expose prio mask in ethtool, that's a different story.  
> > 
> > Re-reading what I've said, I can't say "I was right all along"
> > (not by a long shot, sorry for my part in the confusion),
> 
> Sorry, I admit I did not go back to the archives to re-read your
> feedback today. I'm purely reacting to the fact that the "preemptible
> queue mask" attribute which I have successfully fought off in the
> past have now returned.
> 
> Let me also spell out the source of my objection - high speed NICs
> have multitude of queues, queue groups and sub-interfaces. ethtool
> uAPI which uses a zero-based integer ID will lead to confusion and lack
> of portability because users will not know the mapping and vendors
> will invent whatever fits their HW best.

I'm re-reading even further back and noticing that I really did not use
the "traffic class" term with its correct meaning. I really meant
"priority" here too, in Dec 2020:
https://patchwork.kernel.org/project/netdevbpf/cover/20201202045325.3254757-1-vinicius.gomes@intel.com/#23827347

I see you were opposed to the "preemptable queue mask" idea, and so was
I, but apparently the way in which I formulated this was not quite clear.

> > but I guess the conclusion is that:
> > 
> > (a) "preemptable queues" needs to become "preemptable priorities" in the
> >     UAPI. The question becomes how to expose the mask of preemptable
> >     priorities. A simple u8 bit mask where "BIT(i) == 1" means "prio i
> >     is preemptable", or with a nested netlink attribute scheme similar
> >     to DCB_PFC_UP_ATTR_0 -> DCB_PFC_UP_ATTR_7?
> 
> No preference there, we can also put it in DCBnl, if it fits better.

TBH I don't think I understand what exactly belongs in dcbnl and what
doesn't. My running hypothesis so far was that it's the stuff negotiable
through the DCBX protocol, documented as 802.1Q clause 38 to be
(a) Enhanced Transmission Selection (ETS)
(b) Priority-based Flow Control (PFC)
(c) Application Priority TLV
(d) Application VLAN TLV

but
(1) Frame Preemption isn't negotiated through DCBX, so we should be safe there
(2) I never quite understood why the existence of the DCBX protocol or
    any other protocol would mandate what the kernel interfaces should
    look like. Following this model results in absurdities - unless I'm
    misunderstanding something, an extreme case of this seems to be ETS
    itself. As per the spec, the ETS parameters are numTrafficClassesSupported,
    TCPriorityAssignment and TCBandwidth. What's funny, though, is that
    coincidentally they aren't ETS-specific information, and we seem to
    be able to set the number of TCs of a port both with DCB_CMD_SNUMTCS
    and with tc-mqprio. Same with the priority -> tc map (struct ieee_ets ::
    prio_tc), not to mention shapers per traffic class which are also in
    tc-mqprio, etc.

My instinct so far was to stay away from adding new code to dcbnl and I
think I will continue to do that going forward, thank you.

> > (b) keeping the "preemptable priorities" away from tc-qdisc is ok
> 
> Ack.
> 
> > (c) non-standard hardware should deal with prio <-> queue mapping by
> >     itself if its queues are what are preemptable
> 
> I'd prefer if the core had helpers to do the mapping for drivers, 
> but in principle yes - make the preemptible queues an implementation
> detail if possible.

Yeah, those are details already.
Vladimir Oltean May 23, 2022, 11:33 p.m. UTC | #7
On Mon, May 23, 2022 at 12:52:38PM -0700, Jakub Kicinski wrote:
> My understanding is that DCBNL is not in ethtool is that it was built
> primarily for converged Ethernet. ethtool being a netdev thing it's
> largely confined to coarse interface configuration in such
> environments, they certainly don't use TC to control RDMA queues.
> 
> To put it differently DCBNL separates RoCE and storage queues from
> netdev queues (latter being lossy). It's Conway's law at work.

I had to look up Conway's law, now I get it. Beautiful euphemism, thank you.