Message ID | 20200515114014.3135-1-vladbu@mellanox.com |
---|---|
Headers | show |
Series | Implement classifier-action terse dump mode | expand |
On Fri, 15 May 2020 14:40:10 +0300 Vlad Buslov wrote: > Output rate of current upstream kernel TC filter dump implementation if > relatively low (~100k rules/sec depending on configuration). This > constraint impacts performance of software switch implementation that > rely on TC for their datapath implementation and periodically call TC > filter dump to update rules stats. Moreover, TC filter dump output a lot > of static data that don't change during the filter lifecycle (filter > key, specific action details, etc.) which constitutes significant > portion of payload on resulting netlink packets and increases amount of > syscalls necessary to dump all filters on particular Qdisc. In order to > significantly improve filter dump rate this patch sets implement new > mode of TC filter dump operation named "terse dump" mode. In this mode > only parameters necessary to identify the filter (handle, action cookie, > etc.) and data that can change during filter lifecycle (filter flags, > action stats, etc.) are preserved in dump output while everything else > is omitted. Please keep the review tags you already got when making minor changes. Reviewed-by: Jakub Kicinski <kuba@kernel.org>
From: Vlad Buslov <vladbu@mellanox.com> Date: Fri, 15 May 2020 14:40:10 +0300 > Output rate of current upstream kernel TC filter dump implementation if > relatively low (~100k rules/sec depending on configuration). This > constraint impacts performance of software switch implementation that > rely on TC for their datapath implementation and periodically call TC > filter dump to update rules stats. Moreover, TC filter dump output a lot > of static data that don't change during the filter lifecycle (filter > key, specific action details, etc.) which constitutes significant > portion of payload on resulting netlink packets and increases amount of > syscalls necessary to dump all filters on particular Qdisc. In order to > significantly improve filter dump rate this patch sets implement new > mode of TC filter dump operation named "terse dump" mode. In this mode > only parameters necessary to identify the filter (handle, action cookie, > etc.) and data that can change during filter lifecycle (filter flags, > action stats, etc.) are preserved in dump output while everything else > is omitted. > > Userspace API is implemented using new TCA_DUMP_FLAGS tlv with only > available flag value TCA_DUMP_FLAGS_TERSE. Internally, new API requires > individual classifier support (new tcf_proto_ops->terse_dump() > callback). Support for action terse dump is implemented in act API and > don't require changing individual action implementations. ... This looks fine, so series applied. But really if people just want an efficient stats dump there is probably a better way to efficiently encode just the IDs and STATs. Maybe even put the stats in pages that userland can mmap() and avoid all of this system call overhead and locking altogether.
On Fri, May 15, 2020 at 4:40 AM Vlad Buslov <vladbu@mellanox.com> wrote: > > Output rate of current upstream kernel TC filter dump implementation if > relatively low (~100k rules/sec depending on configuration). This > constraint impacts performance of software switch implementation that > rely on TC for their datapath implementation and periodically call TC > filter dump to update rules stats. Moreover, TC filter dump output a lot > of static data that don't change during the filter lifecycle (filter > key, specific action details, etc.) which constitutes significant > portion of payload on resulting netlink packets and increases amount of > syscalls necessary to dump all filters on particular Qdisc. In order to > significantly improve filter dump rate this patch sets implement new > mode of TC filter dump operation named "terse dump" mode. In this mode > only parameters necessary to identify the filter (handle, action cookie, > etc.) and data that can change during filter lifecycle (filter flags, > action stats, etc.) are preserved in dump output while everything else > is omitted. > > Userspace API is implemented using new TCA_DUMP_FLAGS tlv with only > available flag value TCA_DUMP_FLAGS_TERSE. Internally, new API requires > individual classifier support (new tcf_proto_ops->terse_dump() > callback). Support for action terse dump is implemented in act API and > don't require changing individual action implementations. Sorry for being late. Why terse dump needs a new ops if it only dumps a subset of the regular dump? That is, why not just pass a boolean flag to regular ->dump() implementation? I guess that might break user-space ABI? At least some netlink attributes are not always dumped anyway, so it does not look like a problem? Thanks.
On Sun 17 May 2020 at 22:13, Cong Wang <xiyou.wangcong@gmail.com> wrote: > On Fri, May 15, 2020 at 4:40 AM Vlad Buslov <vladbu@mellanox.com> wrote: >> >> Output rate of current upstream kernel TC filter dump implementation if >> relatively low (~100k rules/sec depending on configuration). This >> constraint impacts performance of software switch implementation that >> rely on TC for their datapath implementation and periodically call TC >> filter dump to update rules stats. Moreover, TC filter dump output a lot >> of static data that don't change during the filter lifecycle (filter >> key, specific action details, etc.) which constitutes significant >> portion of payload on resulting netlink packets and increases amount of >> syscalls necessary to dump all filters on particular Qdisc. In order to >> significantly improve filter dump rate this patch sets implement new >> mode of TC filter dump operation named "terse dump" mode. In this mode >> only parameters necessary to identify the filter (handle, action cookie, >> etc.) and data that can change during filter lifecycle (filter flags, >> action stats, etc.) are preserved in dump output while everything else >> is omitted. >> >> Userspace API is implemented using new TCA_DUMP_FLAGS tlv with only >> available flag value TCA_DUMP_FLAGS_TERSE. Internally, new API requires >> individual classifier support (new tcf_proto_ops->terse_dump() >> callback). Support for action terse dump is implemented in act API and >> don't require changing individual action implementations. > > Sorry for being late. > > Why terse dump needs a new ops if it only dumps a subset of the > regular dump? That is, why not just pass a boolean flag to regular > ->dump() implementation? > > I guess that might break user-space ABI? At least some netlink > attributes are not always dumped anyway, so it does not look like > a problem? > > Thanks. Hi Cong, I considered adding a flag to ->dump() callback but decided against it for following reasons: - It complicates fl_dump() code by adding additional conditionals. Not a big problem but it seemed better for me to have a standalone callback because with combined implementation it is even hard to deduce what does terse dump actually output. - My initial implementation just called regular dump for classifiers that don't support terse dump, but in internal review Jiri insisted that cls API should fail if it can't satisfy user's request and having dedicated callback allows implementation to return an error if classifier doesn't define ->terse_dump(). With flag approach it would be not trivial to determine if implementation actually uses the flag. I guess I could have added new tcf_proto_ops->flags value to designate terse dump support, but checking for dedicated callback existence seemed like obvious approach. Regards, Vlad
On Fri 15 May 2020 at 20:04, Jakub Kicinski <kuba@kernel.org> wrote: > On Fri, 15 May 2020 14:40:10 +0300 Vlad Buslov wrote: >> Output rate of current upstream kernel TC filter dump implementation if >> relatively low (~100k rules/sec depending on configuration). This >> constraint impacts performance of software switch implementation that >> rely on TC for their datapath implementation and periodically call TC >> filter dump to update rules stats. Moreover, TC filter dump output a lot >> of static data that don't change during the filter lifecycle (filter >> key, specific action details, etc.) which constitutes significant >> portion of payload on resulting netlink packets and increases amount of >> syscalls necessary to dump all filters on particular Qdisc. In order to >> significantly improve filter dump rate this patch sets implement new >> mode of TC filter dump operation named "terse dump" mode. In this mode >> only parameters necessary to identify the filter (handle, action cookie, >> etc.) and data that can change during filter lifecycle (filter flags, >> action stats, etc.) are preserved in dump output while everything else >> is omitted. > > Please keep the review tags you already got when making minor changes. Thanks. I generally hesitant to retain other people's tags when making any changes to the code. I'll retain your tags when making trivial changes from now on. > > Reviewed-by: Jakub Kicinski <kuba@kernel.org>
On Fri 15 May 2020 at 20:25, David Miller <davem@davemloft.net> wrote: > From: Vlad Buslov <vladbu@mellanox.com> > Date: Fri, 15 May 2020 14:40:10 +0300 > >> Output rate of current upstream kernel TC filter dump implementation if >> relatively low (~100k rules/sec depending on configuration). This >> constraint impacts performance of software switch implementation that >> rely on TC for their datapath implementation and periodically call TC >> filter dump to update rules stats. Moreover, TC filter dump output a lot >> of static data that don't change during the filter lifecycle (filter >> key, specific action details, etc.) which constitutes significant >> portion of payload on resulting netlink packets and increases amount of >> syscalls necessary to dump all filters on particular Qdisc. In order to >> significantly improve filter dump rate this patch sets implement new >> mode of TC filter dump operation named "terse dump" mode. In this mode >> only parameters necessary to identify the filter (handle, action cookie, >> etc.) and data that can change during filter lifecycle (filter flags, >> action stats, etc.) are preserved in dump output while everything else >> is omitted. >> >> Userspace API is implemented using new TCA_DUMP_FLAGS tlv with only >> available flag value TCA_DUMP_FLAGS_TERSE. Internally, new API requires >> individual classifier support (new tcf_proto_ops->terse_dump() >> callback). Support for action terse dump is implemented in act API and >> don't require changing individual action implementations. > ... > > This looks fine, so series applied. > > But really if people just want an efficient stats dump there is probably > a better way to efficiently encode just the IDs and STATs. Maybe even > put the stats in pages that userland can mmap() and avoid all of this > system call overhead and locking altogether. Thanks! Adding such API will be my next step, if terse dump performance proves insufficient.
On 15/05/2020 12:40, Vlad Buslov wrote: > In order to > significantly improve filter dump rate this patch sets implement new > mode of TC filter dump operation named "terse dump" mode. In this mode > only parameters necessary to identify the filter (handle, action cookie, > etc.) and data that can change during filter lifecycle (filter flags, > action stats, etc.) are preserved in dump output while everything else > is omitted. I realise I'm a bit late, but isn't this the kind of policy that shouldn't be hard-coded in the kernel? I.e. if next year it turns out that some user needs one parameter that's been omitted here, but not the whole dump, are they going to want to add another mode to the uapi? Should this not instead have been done as a set of flags to specify which pieces of information the caller wanted in the dump, rather than a mode flag selecting a pre-defined set? -ed
On Sun, May 17, 2020 at 11:44 PM Vlad Buslov <vladbu@mellanox.com> wrote: > > > On Sun 17 May 2020 at 22:13, Cong Wang <xiyou.wangcong@gmail.com> wrote: > > On Fri, May 15, 2020 at 4:40 AM Vlad Buslov <vladbu@mellanox.com> wrote: > >> > >> Output rate of current upstream kernel TC filter dump implementation if > >> relatively low (~100k rules/sec depending on configuration). This > >> constraint impacts performance of software switch implementation that > >> rely on TC for their datapath implementation and periodically call TC > >> filter dump to update rules stats. Moreover, TC filter dump output a lot > >> of static data that don't change during the filter lifecycle (filter > >> key, specific action details, etc.) which constitutes significant > >> portion of payload on resulting netlink packets and increases amount of > >> syscalls necessary to dump all filters on particular Qdisc. In order to > >> significantly improve filter dump rate this patch sets implement new > >> mode of TC filter dump operation named "terse dump" mode. In this mode > >> only parameters necessary to identify the filter (handle, action cookie, > >> etc.) and data that can change during filter lifecycle (filter flags, > >> action stats, etc.) are preserved in dump output while everything else > >> is omitted. > >> > >> Userspace API is implemented using new TCA_DUMP_FLAGS tlv with only > >> available flag value TCA_DUMP_FLAGS_TERSE. Internally, new API requires > >> individual classifier support (new tcf_proto_ops->terse_dump() > >> callback). Support for action terse dump is implemented in act API and > >> don't require changing individual action implementations. > > > > Sorry for being late. > > > > Why terse dump needs a new ops if it only dumps a subset of the > > regular dump? That is, why not just pass a boolean flag to regular > > ->dump() implementation? > > > > I guess that might break user-space ABI? At least some netlink > > attributes are not always dumped anyway, so it does not look like > > a problem? > > > > Thanks. > > Hi Cong, > > I considered adding a flag to ->dump() callback but decided against it > for following reasons: > > - It complicates fl_dump() code by adding additional conditionals. Not a > big problem but it seemed better for me to have a standalone callback > because with combined implementation it is even hard to deduce what > does terse dump actually output. This is not a problem, at least you can add a big if in fl_dump(), something like: if (terse) { // do terse dump return 0; } // normal dump > > - My initial implementation just called regular dump for classifiers > that don't support terse dump, but in internal review Jiri insisted > that cls API should fail if it can't satisfy user's request and having > dedicated callback allows implementation to return an error if > classifier doesn't define ->terse_dump(). With flag approach it would > be not trivial to determine if implementation actually uses the flag. Hmm? For those not support terse dump, we can just do: if (terse) return -EOPNOTSUPP; // normal dump goes here You just have to pass 'terse' flag to all implementations and let them to decide whether to support it or not. > I guess I could have added new tcf_proto_ops->flags value to designate > terse dump support, but checking for dedicated callback existence > seemed like obvious approach. This does not look necessary, as long as we can just pass the flag down to each ->dump(). Thanks.
On Mon, May 18, 2020 at 8:38 AM Edward Cree <ecree@solarflare.com> wrote: > > On 15/05/2020 12:40, Vlad Buslov wrote: > > In order to > > significantly improve filter dump rate this patch sets implement new > > mode of TC filter dump operation named "terse dump" mode. In this mode > > only parameters necessary to identify the filter (handle, action cookie, > > etc.) and data that can change during filter lifecycle (filter flags, > > action stats, etc.) are preserved in dump output while everything else > > is omitted. > I realise I'm a bit late, but isn't this the kind of policy that shouldn't > be hard-coded in the kernel? I.e. if next year it turns out that some > user needs one parameter that's been omitted here, but not the whole dump, > are they going to want to add another mode to the uapi? > Should this not instead have been done as a set of flags to specify which > pieces of information the caller wanted in the dump, rather than a mode > flag selecting a pre-defined set? Excellent point! I agree, this is more elegant, although potentially needs more work. I am not sure whether we can simply pass those flags to cb->args[], if not, that will need more work at netlink layer. Thanks.
On Mon 18 May 2020 at 18:37, Edward Cree <ecree@solarflare.com> wrote: > On 15/05/2020 12:40, Vlad Buslov wrote: >> In order to >> significantly improve filter dump rate this patch sets implement new >> mode of TC filter dump operation named "terse dump" mode. In this mode >> only parameters necessary to identify the filter (handle, action cookie, >> etc.) and data that can change during filter lifecycle (filter flags, >> action stats, etc.) are preserved in dump output while everything else >> is omitted. > I realise I'm a bit late, but isn't this the kind of policy that shouldn't > be hard-coded in the kernel? I.e. if next year it turns out that some > user needs one parameter that's been omitted here, but not the whole dump, > are they going to want to add another mode to the uapi? Why not just extend terse dump? I won't break user land unless you are removing something from it. > Should this not instead have been done as a set of flags to specify which > pieces of information the caller wanted in the dump, rather than a mode > flag selecting a pre-defined set? > > -ed I considered that approach initially but decided against it for following reasons: - Generic data is covered by current terse dump implementation. Everything else will be act or cls specific which would result long list of flag values like: TCA_DUMP_FLOWER_KEY_ETH_DST, TCA_DUMP_FLOWER_KEY_ETH_DST, TCA_DUMP_FLOWER_KEY_VLAN_ID, ..., TCA_DUMP_TUNNEL_KEY_ENC_KEY_ID, TCA_DUMP_TUNNEL_KEY_ENC_TOS. All of these would require a lot of dedicated logic in act and cls dump callbacks. Also, it would be quite a challenge to test all possible combinations. - It is hard to come up with proper validation for such implementation. In case of terse dump I just return an error if classifier doesn't implement the callback (and since current implementation only outputs generic action info, it doesn't even require support from action-specific dump callbacks). But, for example, how do we validate a case where user sets some flower and tunnel_key act dump flags from previous paragraph, but Qdisc contains some other classifier? Or flower classifier points to other types of actions? Or when flower classifier has and tunnel_key actions but also mirred? Should the implementation return an error on encountering any classifier or action that doesn't have any flags set for its type or just print all data like regular dump? What if user asks to dump some specific option that wasn't set for particular filter of action instance? Overall, the more I think about such implementation the more it looks like a mess to me.
On Mon 18 May 2020 at 21:46, Cong Wang <xiyou.wangcong@gmail.com> wrote: > On Sun, May 17, 2020 at 11:44 PM Vlad Buslov <vladbu@mellanox.com> wrote: >> >> >> On Sun 17 May 2020 at 22:13, Cong Wang <xiyou.wangcong@gmail.com> wrote: >> > On Fri, May 15, 2020 at 4:40 AM Vlad Buslov <vladbu@mellanox.com> wrote: >> >> >> >> Output rate of current upstream kernel TC filter dump implementation if >> >> relatively low (~100k rules/sec depending on configuration). This >> >> constraint impacts performance of software switch implementation that >> >> rely on TC for their datapath implementation and periodically call TC >> >> filter dump to update rules stats. Moreover, TC filter dump output a lot >> >> of static data that don't change during the filter lifecycle (filter >> >> key, specific action details, etc.) which constitutes significant >> >> portion of payload on resulting netlink packets and increases amount of >> >> syscalls necessary to dump all filters on particular Qdisc. In order to >> >> significantly improve filter dump rate this patch sets implement new >> >> mode of TC filter dump operation named "terse dump" mode. In this mode >> >> only parameters necessary to identify the filter (handle, action cookie, >> >> etc.) and data that can change during filter lifecycle (filter flags, >> >> action stats, etc.) are preserved in dump output while everything else >> >> is omitted. >> >> >> >> Userspace API is implemented using new TCA_DUMP_FLAGS tlv with only >> >> available flag value TCA_DUMP_FLAGS_TERSE. Internally, new API requires >> >> individual classifier support (new tcf_proto_ops->terse_dump() >> >> callback). Support for action terse dump is implemented in act API and >> >> don't require changing individual action implementations. >> > >> > Sorry for being late. >> > >> > Why terse dump needs a new ops if it only dumps a subset of the >> > regular dump? That is, why not just pass a boolean flag to regular >> > ->dump() implementation? >> > >> > I guess that might break user-space ABI? At least some netlink >> > attributes are not always dumped anyway, so it does not look like >> > a problem? >> > >> > Thanks. >> >> Hi Cong, >> >> I considered adding a flag to ->dump() callback but decided against it >> for following reasons: >> >> - It complicates fl_dump() code by adding additional conditionals. Not a >> big problem but it seemed better for me to have a standalone callback >> because with combined implementation it is even hard to deduce what >> does terse dump actually output. > > This is not a problem, at least you can add a big if in fl_dump(), > something like: > > if (terse) { > // do terse dump > return 0; > } > // normal dump That is what I was trying to prevent with my implementation: having big "superfunctions" that implement multiple things with branching. Why not just have dedicated callbacks that do exactly one thing? > >> >> - My initial implementation just called regular dump for classifiers >> that don't support terse dump, but in internal review Jiri insisted >> that cls API should fail if it can't satisfy user's request and having >> dedicated callback allows implementation to return an error if >> classifier doesn't define ->terse_dump(). With flag approach it would >> be not trivial to determine if implementation actually uses the flag. > > Hmm? For those not support terse dump, we can just do: > > if (terse) > return -EOPNOTSUPP; > // normal dump goes here > > You just have to pass 'terse' flag to all implementations and let them > to decide whether to support it or not. But why duplicate the same code to all existing cls dump implementations instead of having such check nicely implemented in cls API (via callback existence or a flag)? > > >> I guess I could have added new tcf_proto_ops->flags value to designate >> terse dump support, but checking for dedicated callback existence >> seemed like obvious approach. > > This does not look necessary, as long as we can just pass the flag > down to each ->dump(). > > Thanks.
On 19/05/2020 10:04, Vlad Buslov wrote: > On Mon 18 May 2020 at 18:37, Edward Cree <ecree@solarflare.com> wrote: >> I.e. if next year it turns out that some >> user needs one parameter that's been omitted here, but not the whole dump, >> are they going to want to add another mode to the uapi? > Why not just extend terse dump? I won't break user land unless you are > removing something from it. But then all terse dump users pay the performance cost for thatone app's extra need. > - Generic data is covered by current terse dump implementation. > Everything else will be act or cls specific Fair point. I don't suppose something something BPF mumble solve this? I haven't been following the BPF dumping work in detail but it sounds like it might be a cheap way to get the 'more performant next step' that was mentioned in the subthread with David. Just a thought. -ed
On Tue 19 May 2020 at 17:30, Edward Cree <ecree@solarflare.com> wrote: > On 19/05/2020 10:04, Vlad Buslov wrote: >> On Mon 18 May 2020 at 18:37, Edward Cree <ecree@solarflare.com> wrote: >>> I.e. if next year it turns out that some >>> user needs one parameter that's been omitted here, but not the whole dump, >>> are they going to want to add another mode to the uapi? >> Why not just extend terse dump? I won't break user land unless you are >> removing something from it. > But then all terse dump users pay the performance cost for thatone > app's extra need. Yes. However reducing amount of data per filter is only part of optimization. Skipping fl_dump_key() in flower and completely omitting calling any of the tc_action_ops callbacks is also a major improvement. So as long as outputting new parameter doesn't require calling one of those (and it shouldn't, otherwise the parameter represent something very cls or action type specific) it shouldn't impact performance significantly. > >> - Generic data is covered by current terse dump implementation. >> Everything else will be act or cls specific > Fair point. > I don't suppose something something BPF mumble solve this? I haven't > been following the BPF dumping work in detail but it sounds like it > might be a cheap way to get the 'more performant next step' that > was mentioned in the subthread with David. Just a thought. I'm very interested in ideas how to use BPF to improve dump performance so any comments/suggestions from resident BPF experts are welcome. Don't think it will as fast as what David suggested though.
On Tue, May 19, 2020 at 2:10 AM Vlad Buslov <vladbu@mellanox.com> wrote: > > > On Mon 18 May 2020 at 21:46, Cong Wang <xiyou.wangcong@gmail.com> wrote: > > On Sun, May 17, 2020 at 11:44 PM Vlad Buslov <vladbu@mellanox.com> wrote: > >> > >> > >> On Sun 17 May 2020 at 22:13, Cong Wang <xiyou.wangcong@gmail.com> wrote: > >> > On Fri, May 15, 2020 at 4:40 AM Vlad Buslov <vladbu@mellanox.com> wrote: > >> >> > >> >> Output rate of current upstream kernel TC filter dump implementation if > >> >> relatively low (~100k rules/sec depending on configuration). This > >> >> constraint impacts performance of software switch implementation that > >> >> rely on TC for their datapath implementation and periodically call TC > >> >> filter dump to update rules stats. Moreover, TC filter dump output a lot > >> >> of static data that don't change during the filter lifecycle (filter > >> >> key, specific action details, etc.) which constitutes significant > >> >> portion of payload on resulting netlink packets and increases amount of > >> >> syscalls necessary to dump all filters on particular Qdisc. In order to > >> >> significantly improve filter dump rate this patch sets implement new > >> >> mode of TC filter dump operation named "terse dump" mode. In this mode > >> >> only parameters necessary to identify the filter (handle, action cookie, > >> >> etc.) and data that can change during filter lifecycle (filter flags, > >> >> action stats, etc.) are preserved in dump output while everything else > >> >> is omitted. > >> >> > >> >> Userspace API is implemented using new TCA_DUMP_FLAGS tlv with only > >> >> available flag value TCA_DUMP_FLAGS_TERSE. Internally, new API requires > >> >> individual classifier support (new tcf_proto_ops->terse_dump() > >> >> callback). Support for action terse dump is implemented in act API and > >> >> don't require changing individual action implementations. > >> > > >> > Sorry for being late. > >> > > >> > Why terse dump needs a new ops if it only dumps a subset of the > >> > regular dump? That is, why not just pass a boolean flag to regular > >> > ->dump() implementation? > >> > > >> > I guess that might break user-space ABI? At least some netlink > >> > attributes are not always dumped anyway, so it does not look like > >> > a problem? > >> > > >> > Thanks. > >> > >> Hi Cong, > >> > >> I considered adding a flag to ->dump() callback but decided against it > >> for following reasons: > >> > >> - It complicates fl_dump() code by adding additional conditionals. Not a > >> big problem but it seemed better for me to have a standalone callback > >> because with combined implementation it is even hard to deduce what > >> does terse dump actually output. > > > > This is not a problem, at least you can add a big if in fl_dump(), > > something like: > > > > if (terse) { > > // do terse dump > > return 0; > > } > > // normal dump > > That is what I was trying to prevent with my implementation: having big > "superfunctions" that implement multiple things with branching. Why not > just have dedicated callbacks that do exactly one thing? 1. Saving one unnecessary ops. 2. Easier to trace the code: all dumpings are in one implementation. > > > > >> > >> - My initial implementation just called regular dump for classifiers > >> that don't support terse dump, but in internal review Jiri insisted > >> that cls API should fail if it can't satisfy user's request and having > >> dedicated callback allows implementation to return an error if > >> classifier doesn't define ->terse_dump(). With flag approach it would > >> be not trivial to determine if implementation actually uses the flag. > > > > Hmm? For those not support terse dump, we can just do: > > > > if (terse) > > return -EOPNOTSUPP; > > // normal dump goes here > > > > You just have to pass 'terse' flag to all implementations and let them > > to decide whether to support it or not. > > But why duplicate the same code to all existing cls dump implementations > instead of having such check nicely implemented in cls API (via callback > existence or a flag)? I am confused, your fl_terse_dump() also has duplication with fl_dump()... Thanks.
On Tue, May 19, 2020 at 2:04 AM Vlad Buslov <vladbu@mellanox.com> wrote: > I considered that approach initially but decided against it for > following reasons: > > - Generic data is covered by current terse dump implementation. > Everything else will be act or cls specific which would result long > list of flag values like: TCA_DUMP_FLOWER_KEY_ETH_DST, > TCA_DUMP_FLOWER_KEY_ETH_DST, TCA_DUMP_FLOWER_KEY_VLAN_ID, ..., > TCA_DUMP_TUNNEL_KEY_ENC_KEY_ID, TCA_DUMP_TUNNEL_KEY_ENC_TOS. All of > these would require a lot of dedicated logic in act and cls dump > callbacks. Also, it would be quite a challenge to test all possible > combinations. Well, if you consider netlink dump as a database query, what Edward proposed is merely "select COLUMN1 COLUMN2 from cls_db" rather than "select * from cls_db". No one said it is easy to implement, it is just more elegant than you select a hardcoded set of columns for the user. Think about it, what if another user wants a less terse dump but still not a full dump? Would you implement ops->terse_dump2()? Or what if people still think your terse dump is not as terse as she wants? ops->mini_dump()? How many ops's we would end having? > > - It is hard to come up with proper validation for such implementation. > In case of terse dump I just return an error if classifier doesn't > implement the callback (and since current implementation only outputs > generic action info, it doesn't even require support from > action-specific dump callbacks). But, for example, how do we validate > a case where user sets some flower and tunnel_key act dump flags from > previous paragraph, but Qdisc contains some other classifier? Or > flower classifier points to other types of actions? Or when flower > classifier has and tunnel_key actions but also mirred? Should the Each action should be able to dump selectively too. If you think it as a database, it is just a different table with different schemas. > implementation return an error on encountering any classifier or > action that doesn't have any flags set for its type or just print all > data like regular dump? What if user asks to dump some specific option > that wasn't set for particular filter of action instance? Undefined or error. > > Overall, the more I think about such implementation the more it looks > like a mess to me. This is what I think about your current implementation. You know once we add this we can't remove it any longer, right? This is why we should make it right and better in the first place, not after carrying it for even one release. Thanks.
On Tue 19 May 2020 at 21:58, Cong Wang <xiyou.wangcong@gmail.com> wrote: > On Tue, May 19, 2020 at 2:04 AM Vlad Buslov <vladbu@mellanox.com> wrote: >> I considered that approach initially but decided against it for >> following reasons: >> >> - Generic data is covered by current terse dump implementation. >> Everything else will be act or cls specific which would result long >> list of flag values like: TCA_DUMP_FLOWER_KEY_ETH_DST, >> TCA_DUMP_FLOWER_KEY_ETH_DST, TCA_DUMP_FLOWER_KEY_VLAN_ID, ..., >> TCA_DUMP_TUNNEL_KEY_ENC_KEY_ID, TCA_DUMP_TUNNEL_KEY_ENC_TOS. All of >> these would require a lot of dedicated logic in act and cls dump >> callbacks. Also, it would be quite a challenge to test all possible >> combinations. > > Well, if you consider netlink dump as a database query, what Edward > proposed is merely "select COLUMN1 COLUMN2 from cls_db" rather > than "select * from cls_db". > > No one said it is easy to implement, it is just more elegant than you > select a hardcoded set of columns for the user. As I explained to Edward, having denser netlink packets with more filters per packet is only part of optimization. Another part is not executing some code at all. Consider fl_dump_key() which is 200 lines function with bunch of conditionals like that: static int fl_dump_key(struct sk_buff *skb, struct net *net, struct fl_flow_key *key, struct fl_flow_key *mask) { if (mask->meta.ingress_ifindex) { struct net_device *dev; dev = __dev_get_by_index(net, key->meta.ingress_ifindex); if (dev && nla_put_string(skb, TCA_FLOWER_INDEV, dev->name)) goto nla_put_failure; } if (fl_dump_key_val(skb, key->eth.dst, TCA_FLOWER_KEY_ETH_DST, mask->eth.dst, TCA_FLOWER_KEY_ETH_DST_MASK, sizeof(key->eth.dst)) || fl_dump_key_val(skb, key->eth.src, TCA_FLOWER_KEY_ETH_SRC, mask->eth.src, TCA_FLOWER_KEY_ETH_SRC_MASK, sizeof(key->eth.src)) || fl_dump_key_val(skb, &key->basic.n_proto, TCA_FLOWER_KEY_ETH_TYPE, &mask->basic.n_proto, TCA_FLOWER_UNSPEC, sizeof(key->basic.n_proto))) goto nla_put_failure; if (fl_dump_key_mpls(skb, &key->mpls, &mask->mpls)) goto nla_put_failure; if (fl_dump_key_vlan(skb, TCA_FLOWER_KEY_VLAN_ID, TCA_FLOWER_KEY_VLAN_PRIO, &key->vlan, &mask->vlan)) goto nla_put_failure; ... Now imagine all of these are extended with additional if (flags & TCA_DUMP_XXX). All gains from not outputting some other minor stuff into netlink packet will be negated by it. > > Think about it, what if another user wants a less terse dump but still > not a full dump? Would you implement ops->terse_dump2()? Or > what if people still think your terse dump is not as terse as she wants? > ops->mini_dump()? How many ops's we would end having? User can discard whatever he doesn't need in user land code. The goal of this change is performance optimization, not designing a generic kernel-space data filtering mechanism. > > >> >> - It is hard to come up with proper validation for such implementation. >> In case of terse dump I just return an error if classifier doesn't >> implement the callback (and since current implementation only outputs >> generic action info, it doesn't even require support from >> action-specific dump callbacks). But, for example, how do we validate >> a case where user sets some flower and tunnel_key act dump flags from >> previous paragraph, but Qdisc contains some other classifier? Or >> flower classifier points to other types of actions? Or when flower >> classifier has and tunnel_key actions but also mirred? Should the > > Each action should be able to dump selectively too. If you think it > as a database, it is just a different table with different schemas. How is designing custom SQL-like query language (according to your example at the beginning of the mail) for filter dump is going to improve performance? If there is a way to do it in fast a generic manner with BPF, then I'm very interested to hear the details. But adding hundred more hardcoded conditionals is just not a solution considering main motivations for this change is performance.
On Tue 19 May 2020 at 21:39, Cong Wang <xiyou.wangcong@gmail.com> wrote: > On Tue, May 19, 2020 at 2:10 AM Vlad Buslov <vladbu@mellanox.com> wrote: >> >> >> On Mon 18 May 2020 at 21:46, Cong Wang <xiyou.wangcong@gmail.com> wrote: >> > On Sun, May 17, 2020 at 11:44 PM Vlad Buslov <vladbu@mellanox.com> wrote: >> >> >> >> >> >> On Sun 17 May 2020 at 22:13, Cong Wang <xiyou.wangcong@gmail.com> wrote: >> >> > On Fri, May 15, 2020 at 4:40 AM Vlad Buslov <vladbu@mellanox.com> wrote: >> >> >> >> >> >> Output rate of current upstream kernel TC filter dump implementation if >> >> >> relatively low (~100k rules/sec depending on configuration). This >> >> >> constraint impacts performance of software switch implementation that >> >> >> rely on TC for their datapath implementation and periodically call TC >> >> >> filter dump to update rules stats. Moreover, TC filter dump output a lot >> >> >> of static data that don't change during the filter lifecycle (filter >> >> >> key, specific action details, etc.) which constitutes significant >> >> >> portion of payload on resulting netlink packets and increases amount of >> >> >> syscalls necessary to dump all filters on particular Qdisc. In order to >> >> >> significantly improve filter dump rate this patch sets implement new >> >> >> mode of TC filter dump operation named "terse dump" mode. In this mode >> >> >> only parameters necessary to identify the filter (handle, action cookie, >> >> >> etc.) and data that can change during filter lifecycle (filter flags, >> >> >> action stats, etc.) are preserved in dump output while everything else >> >> >> is omitted. >> >> >> >> >> >> Userspace API is implemented using new TCA_DUMP_FLAGS tlv with only >> >> >> available flag value TCA_DUMP_FLAGS_TERSE. Internally, new API requires >> >> >> individual classifier support (new tcf_proto_ops->terse_dump() >> >> >> callback). Support for action terse dump is implemented in act API and >> >> >> don't require changing individual action implementations. >> >> > >> >> > Sorry for being late. >> >> > >> >> > Why terse dump needs a new ops if it only dumps a subset of the >> >> > regular dump? That is, why not just pass a boolean flag to regular >> >> > ->dump() implementation? >> >> > >> >> > I guess that might break user-space ABI? At least some netlink >> >> > attributes are not always dumped anyway, so it does not look like >> >> > a problem? >> >> > >> >> > Thanks. >> >> >> >> Hi Cong, >> >> >> >> I considered adding a flag to ->dump() callback but decided against it >> >> for following reasons: >> >> >> >> - It complicates fl_dump() code by adding additional conditionals. Not a >> >> big problem but it seemed better for me to have a standalone callback >> >> because with combined implementation it is even hard to deduce what >> >> does terse dump actually output. >> > >> > This is not a problem, at least you can add a big if in fl_dump(), >> > something like: >> > >> > if (terse) { >> > // do terse dump >> > return 0; >> > } >> > // normal dump >> >> That is what I was trying to prevent with my implementation: having big >> "superfunctions" that implement multiple things with branching. Why not >> just have dedicated callbacks that do exactly one thing? > > 1. Saving one unnecessary ops. > 2. Easier to trace the code: all dumpings are in one implementation. Okay, I see your point. > >> >> > >> >> >> >> - My initial implementation just called regular dump for classifiers >> >> that don't support terse dump, but in internal review Jiri insisted >> >> that cls API should fail if it can't satisfy user's request and having >> >> dedicated callback allows implementation to return an error if >> >> classifier doesn't define ->terse_dump(). With flag approach it would >> >> be not trivial to determine if implementation actually uses the flag. >> > >> > Hmm? For those not support terse dump, we can just do: >> > >> > if (terse) >> > return -EOPNOTSUPP; >> > // normal dump goes here >> > >> > You just have to pass 'terse' flag to all implementations and let them >> > to decide whether to support it or not. >> >> But why duplicate the same code to all existing cls dump implementations >> instead of having such check nicely implemented in cls API (via callback >> existence or a flag)? > > I am confused, your fl_terse_dump() also has duplication with fl_dump()... > > Thanks. I meant duplicating the "if terse not supported return -EOPNOTSUPP" in dump callback of every classifier implementation. With current implementation cls API handles such case by checking whether classifier implementation has ->terse_dump() defined and returns error otherwise. This can also be achieved by having a new classifier flag, in case we decide to proceed with folding both dump and terse_dump into single ->dump(bool terse) callback.
Hi Edward, Cong, On Mon 18 May 2020 at 18:37, Edward Cree <ecree@solarflare.com> wrote: > On 15/05/2020 12:40, Vlad Buslov wrote: >> In order to >> significantly improve filter dump rate this patch sets implement new >> mode of TC filter dump operation named "terse dump" mode. In this mode >> only parameters necessary to identify the filter (handle, action cookie, >> etc.) and data that can change during filter lifecycle (filter flags, >> action stats, etc.) are preserved in dump output while everything else >> is omitted. > I realise I'm a bit late, but isn't this the kind of policy that shouldn't > be hard-coded in the kernel? I.e. if next year it turns out that some > user needs one parameter that's been omitted here, but not the whole dump, > are they going to want to add another mode to the uapi? > Should this not instead have been done as a set of flags to specify which > pieces of information the caller wanted in the dump, rather than a mode > flag selecting a pre-defined set? > > -ed I've been thinking some more about this. While the idea of making fine-grained dump where user controls exact contents field-by-field is unfeasible due to performance considerations, we can try to come up with something more coarse-grained but not fully hardcoded (like current terse dump implementation). Something like having a set of flags that allows to skip output of groups of attributes. For example, CLS_SKIP_KEY flag would skip the whole expensive classifier key dump without having to go through all 200 lines of conditionals in fl_dump_key() while ACT_SKIP_OPTIONS would skip outputting TCA_OPTIONS compound attribute (and expensive call to tc_action_ops->dump()). This approach would also leave the door open for further more fine-grained flags, if the need arises. For example, new flags CLS_SKIP_KEY_{L2,L3,L4} can be introduced to more precisely control which parts of cls key should be skipped. The main drawback of such approach is that it is impossible to come up with universal set of flags that would be applicable for all classifiers. Key (in some form) is applicable to most classifiers, but it still doesn't make sense for matchall or bpf. Some classifiers have 'flags', some don't. Hardware-offloaded classifiers have in_hw_count. Considering this, initial set of flags will be somewhat flower-centric. What do you think? Regards, Vlad
On Thu, 21 May 2020 17:36:12 +0300 Vlad Buslov wrote: > Hi Edward, Cong, > > On Mon 18 May 2020 at 18:37, Edward Cree <ecree@solarflare.com> wrote: > > On 15/05/2020 12:40, Vlad Buslov wrote: > >> In order to > >> significantly improve filter dump rate this patch sets implement new > >> mode of TC filter dump operation named "terse dump" mode. In this mode > >> only parameters necessary to identify the filter (handle, action cookie, > >> etc.) and data that can change during filter lifecycle (filter flags, > >> action stats, etc.) are preserved in dump output while everything else > >> is omitted. > > I realise I'm a bit late, but isn't this the kind of policy that shouldn't > > be hard-coded in the kernel? I.e. if next year it turns out that some > > user needs one parameter that's been omitted here, but not the whole dump, > > are they going to want to add another mode to the uapi? > > Should this not instead have been done as a set of flags to specify which > > pieces of information the caller wanted in the dump, rather than a mode > > flag selecting a pre-defined set? > > > > -ed > > I've been thinking some more about this. While the idea of making > fine-grained dump where user controls exact contents field-by-field is > unfeasible due to performance considerations, we can try to come up with > something more coarse-grained but not fully hardcoded (like current terse > dump implementation). Something like having a set of flags that allows > to skip output of groups of attributes. > > For example, CLS_SKIP_KEY flag would skip the whole expensive classifier > key dump without having to go through all 200 lines of conditionals in Do you really need to dump classifiers? If you care about stats the actions could be sufficient if the offload code was fixed appropriately... Sorry I had to say that. > fl_dump_key() while ACT_SKIP_OPTIONS would skip outputting TCA_OPTIONS > compound attribute (and expensive call to tc_action_ops->dump()). This > approach would also leave the door open for further more fine-grained > flags, if the need arises. For example, new flags > CLS_SKIP_KEY_{L2,L3,L4} can be introduced to more precisely control > which parts of cls key should be skipped. L2, L3, etc. will be meaningless for a lot of classifiers. > The main drawback of such approach is that it is impossible to come up > with universal set of flags that would be applicable for all > classifiers. Key (in some form) is applicable to most classifiers, but > it still doesn't make sense for matchall or bpf. Some classifiers have > 'flags', some don't. Hardware-offloaded classifiers have in_hw_count. > Considering this, initial set of flags will be somewhat flower-centric. > > What do you think? Simplest heuristic is to dump everything that can't get changed without a notification. Which I think you're quite close to already..
On Thu 21 May 2020 at 20:02, Jakub Kicinski <kuba@kernel.org> wrote: > On Thu, 21 May 2020 17:36:12 +0300 Vlad Buslov wrote: >> Hi Edward, Cong, >> >> On Mon 18 May 2020 at 18:37, Edward Cree <ecree@solarflare.com> wrote: >> > On 15/05/2020 12:40, Vlad Buslov wrote: >> >> In order to >> >> significantly improve filter dump rate this patch sets implement new >> >> mode of TC filter dump operation named "terse dump" mode. In this mode >> >> only parameters necessary to identify the filter (handle, action cookie, >> >> etc.) and data that can change during filter lifecycle (filter flags, >> >> action stats, etc.) are preserved in dump output while everything else >> >> is omitted. >> > I realise I'm a bit late, but isn't this the kind of policy that shouldn't >> > be hard-coded in the kernel? I.e. if next year it turns out that some >> > user needs one parameter that's been omitted here, but not the whole dump, >> > are they going to want to add another mode to the uapi? >> > Should this not instead have been done as a set of flags to specify which >> > pieces of information the caller wanted in the dump, rather than a mode >> > flag selecting a pre-defined set? >> > >> > -ed >> >> I've been thinking some more about this. While the idea of making >> fine-grained dump where user controls exact contents field-by-field is >> unfeasible due to performance considerations, we can try to come up with >> something more coarse-grained but not fully hardcoded (like current terse >> dump implementation). Something like having a set of flags that allows >> to skip output of groups of attributes. >> >> For example, CLS_SKIP_KEY flag would skip the whole expensive classifier >> key dump without having to go through all 200 lines of conditionals in > > Do you really need to dump classifiers? If you care about stats > the actions could be sufficient if the offload code was fixed > appropriately... Sorry I had to say that. Technically I need neither classifier nor action. All I need is cookie and stats of single terminating action attached to filter (redirect, drop, etc.). This can be achieved by making terse dump output that data for last extension on filter. However, when I discussed my initial terse dump idea with Jamal he asked me not to ossify such behavior to allow for implementation of offloaded shared actions in future. Speaking about shared action offload, I remember looking at some RFC patches by Edward implementing such functionality and allowing action stats update directly from act, as opposed to current design that relies on classifier to update action stats from hardware. Is that what you mean by appropriately fixing offload code? With such implementation, just dumping relevant action types would also work. My only concern is that the only way to dump actions is per-namespace as opposed to per-Qdisc of filters, which would clash with any other cls/act users on same machine/hypervisor. [...]
On Wed, May 20, 2020 at 12:24 AM Vlad Buslov <vladbu@mellanox.com> wrote: > > > On Tue 19 May 2020 at 21:58, Cong Wang <xiyou.wangcong@gmail.com> wrote: > > On Tue, May 19, 2020 at 2:04 AM Vlad Buslov <vladbu@mellanox.com> wrote: > >> I considered that approach initially but decided against it for > >> following reasons: > >> > >> - Generic data is covered by current terse dump implementation. > >> Everything else will be act or cls specific which would result long > >> list of flag values like: TCA_DUMP_FLOWER_KEY_ETH_DST, > >> TCA_DUMP_FLOWER_KEY_ETH_DST, TCA_DUMP_FLOWER_KEY_VLAN_ID, ..., > >> TCA_DUMP_TUNNEL_KEY_ENC_KEY_ID, TCA_DUMP_TUNNEL_KEY_ENC_TOS. All of > >> these would require a lot of dedicated logic in act and cls dump > >> callbacks. Also, it would be quite a challenge to test all possible > >> combinations. > > > > Well, if you consider netlink dump as a database query, what Edward > > proposed is merely "select COLUMN1 COLUMN2 from cls_db" rather > > than "select * from cls_db". > > > > No one said it is easy to implement, it is just more elegant than you > > select a hardcoded set of columns for the user. > > As I explained to Edward, having denser netlink packets with more > filters per packet is only part of optimization. Another part is not > executing some code at all. Consider fl_dump_key() which is 200 lines > function with bunch of conditionals like that: > > static int fl_dump_key(struct sk_buff *skb, struct net *net, > struct fl_flow_key *key, struct fl_flow_key *mask) > { > if (mask->meta.ingress_ifindex) { > struct net_device *dev; > > dev = __dev_get_by_index(net, key->meta.ingress_ifindex); > if (dev && nla_put_string(skb, TCA_FLOWER_INDEV, dev->name)) > goto nla_put_failure; > } > > if (fl_dump_key_val(skb, key->eth.dst, TCA_FLOWER_KEY_ETH_DST, > mask->eth.dst, TCA_FLOWER_KEY_ETH_DST_MASK, > sizeof(key->eth.dst)) || > fl_dump_key_val(skb, key->eth.src, TCA_FLOWER_KEY_ETH_SRC, > mask->eth.src, TCA_FLOWER_KEY_ETH_SRC_MASK, > sizeof(key->eth.src)) || > fl_dump_key_val(skb, &key->basic.n_proto, TCA_FLOWER_KEY_ETH_TYPE, > &mask->basic.n_proto, TCA_FLOWER_UNSPEC, > sizeof(key->basic.n_proto))) > goto nla_put_failure; > > if (fl_dump_key_mpls(skb, &key->mpls, &mask->mpls)) > goto nla_put_failure; > > if (fl_dump_key_vlan(skb, TCA_FLOWER_KEY_VLAN_ID, > TCA_FLOWER_KEY_VLAN_PRIO, &key->vlan, &mask->vlan)) > goto nla_put_failure; > ... > > > Now imagine all of these are extended with additional if (flags & > TCA_DUMP_XXX). All gains from not outputting some other minor stuff into > netlink packet will be negated by it. Interesting, are you saying a bit test is as expensive as appending an actual netlink attribution to the dumping? I am surprised. > > > > > > Think about it, what if another user wants a less terse dump but still > > not a full dump? Would you implement ops->terse_dump2()? Or > > what if people still think your terse dump is not as terse as she wants? > > ops->mini_dump()? How many ops's we would end having? > > User can discard whatever he doesn't need in user land code. The goal of > this change is performance optimization, not designing a generic > kernel-space data filtering mechanism. You optimize the performance by reducing the dump size, which is already effectively a data filtering. This doesn't have to be your goal, you are implementing it anyway. > > > > > > >> > >> - It is hard to come up with proper validation for such implementation. > >> In case of terse dump I just return an error if classifier doesn't > >> implement the callback (and since current implementation only outputs > >> generic action info, it doesn't even require support from > >> action-specific dump callbacks). But, for example, how do we validate > >> a case where user sets some flower and tunnel_key act dump flags from > >> previous paragraph, but Qdisc contains some other classifier? Or > >> flower classifier points to other types of actions? Or when flower > >> classifier has and tunnel_key actions but also mirred? Should the > > > > Each action should be able to dump selectively too. If you think it > > as a database, it is just a different table with different schemas. > > How is designing custom SQL-like query language (according to your > example at the beginning of the mail) for filter dump is going to > improve performance? If there is a way to do it in fast a generic manner > with BPF, then I'm very interested to hear the details. But adding > hundred more hardcoded conditionals is just not a solution considering > main motivations for this change is performance. I still wonder how a bit test is as expensive as you claim, it does not look like you actually measure it. This of course depends on the size of the dump, but if you look at other netlink dump in kernel, not just tc filters, we already dump a lot of attributes per record. Thanks.
On Thu, May 21, 2020 at 7:36 AM Vlad Buslov <vladbu@mellanox.com> wrote: > > Hi Edward, Cong, > > On Mon 18 May 2020 at 18:37, Edward Cree <ecree@solarflare.com> wrote: > > On 15/05/2020 12:40, Vlad Buslov wrote: > >> In order to > >> significantly improve filter dump rate this patch sets implement new > >> mode of TC filter dump operation named "terse dump" mode. In this mode > >> only parameters necessary to identify the filter (handle, action cookie, > >> etc.) and data that can change during filter lifecycle (filter flags, > >> action stats, etc.) are preserved in dump output while everything else > >> is omitted. > > I realise I'm a bit late, but isn't this the kind of policy that shouldn't > > be hard-coded in the kernel? I.e. if next year it turns out that some > > user needs one parameter that's been omitted here, but not the whole dump, > > are they going to want to add another mode to the uapi? > > Should this not instead have been done as a set of flags to specify which > > pieces of information the caller wanted in the dump, rather than a mode > > flag selecting a pre-defined set? > > > > -ed > > I've been thinking some more about this. While the idea of making > fine-grained dump where user controls exact contents field-by-field is > unfeasible due to performance considerations, we can try to come up with > something more coarse-grained but not fully hardcoded (like current terse > dump implementation). Something like having a set of flags that allows > to skip output of groups of attributes. > > For example, CLS_SKIP_KEY flag would skip the whole expensive classifier > key dump without having to go through all 200 lines of conditionals in > fl_dump_key() while ACT_SKIP_OPTIONS would skip outputting TCA_OPTIONS > compound attribute (and expensive call to tc_action_ops->dump()). This > approach would also leave the door open for further more fine-grained > flags, if the need arises. For example, new flags > CLS_SKIP_KEY_{L2,L3,L4} can be introduced to more precisely control > which parts of cls key should be skipped. > > The main drawback of such approach is that it is impossible to come up > with universal set of flags that would be applicable for all > classifiers. Key (in some form) is applicable to most classifiers, but > it still doesn't make sense for matchall or bpf. Some classifiers have > 'flags', some don't. Hardware-offloaded classifiers have in_hw_count. > Considering this, initial set of flags will be somewhat flower-centric. > > What do you think? This looks like a reverse filtering to me, so essentially the same. Please give me some time to think about this, it is definitely not easy. The only thing I worry is that once you add terse dump, we cannot simply remove it any more. (Otherwise I wouldn't even want to push you on this.) Thanks.
On 2020-05-22 12:16 p.m., Vlad Buslov wrote: > On Thu 21 May 2020 at 20:02, Jakub Kicinski <kuba@kernel.org> wrote: >> On Thu, 21 May 2020 17:36:12 +0300 Vlad Buslov wrote: >>> Hi Edward, Cong, >>> >> Do you really need to dump classifiers? If you care about stats >> the actions could be sufficient if the offload code was fixed >> appropriately... Sorry I had to say that. > > Technically I need neither classifier nor action. All I need is cookie > and stats of single terminating action attached to filter (redirect, > drop, etc.). This can be achieved by making terse dump output that data > for last extension on filter. However, when I discussed my initial terse > dump idea with Jamal he asked me not to ossify such behavior to allow > for implementation of offloaded shared actions in future. > Trying to recollect our discussion (please forgive me if i am rehashing). old skule hardware model typically is ACL style with one action - therefore concept of tying a counter with with a classifier is common. Other hardware i am familiar with tends to have a table of counters. More an array with indices. In the shared case using the same counter index from multiple tables implies it is shared. Note "old skule" does not have a concept of sharing. So i was more worried about assuming the "old skule" model at the expense of other hardware models. We should be able to dump different tables from hardware. Mostly these could be tables of actions. And counter tables look like a gact action. From s/w: If what is needed is to just dump explicit stats a gact action with a cookie and an index would suffice. i.e tc match foobar \ action continue cookie blah index 15 \ action ... action mirred redirect ... of course this now adds extra cycles in the s/w datapath but advantage is it means you can cheaply either get individual counters (get action gact index 15) or dump all gact actions and filter in user space for your cookie. Or introduce a dump cookie filter in the kernel (similar to the "time since" action dump filter). cheers, jamal > Speaking about shared action offload, I remember looking at some RFC > patches by Edward implementing such functionality and allowing action > stats update directly from act, as opposed to current design that relies > on classifier to update action stats from hardware. Is that what you > mean by appropriately fixing offload code? With such implementation, > just dumping relevant action types would also work. My only concern is > that the only way to dump actions is per-namespace as opposed to > per-Qdisc of filters, which would clash with any other cls/act users on > same machine/hypervisor. > > > [...] >
On Fri 22 May 2020 at 22:33, Cong Wang <xiyou.wangcong@gmail.com> wrote: > On Wed, May 20, 2020 at 12:24 AM Vlad Buslov <vladbu@mellanox.com> wrote: >> >> >> On Tue 19 May 2020 at 21:58, Cong Wang <xiyou.wangcong@gmail.com> wrote: >> > On Tue, May 19, 2020 at 2:04 AM Vlad Buslov <vladbu@mellanox.com> wrote: >> >> I considered that approach initially but decided against it for >> >> following reasons: >> >> >> >> - Generic data is covered by current terse dump implementation. >> >> Everything else will be act or cls specific which would result long >> >> list of flag values like: TCA_DUMP_FLOWER_KEY_ETH_DST, >> >> TCA_DUMP_FLOWER_KEY_ETH_DST, TCA_DUMP_FLOWER_KEY_VLAN_ID, ..., >> >> TCA_DUMP_TUNNEL_KEY_ENC_KEY_ID, TCA_DUMP_TUNNEL_KEY_ENC_TOS. All of >> >> these would require a lot of dedicated logic in act and cls dump >> >> callbacks. Also, it would be quite a challenge to test all possible >> >> combinations. >> > >> > Well, if you consider netlink dump as a database query, what Edward >> > proposed is merely "select COLUMN1 COLUMN2 from cls_db" rather >> > than "select * from cls_db". >> > >> > No one said it is easy to implement, it is just more elegant than you >> > select a hardcoded set of columns for the user. >> >> As I explained to Edward, having denser netlink packets with more >> filters per packet is only part of optimization. Another part is not >> executing some code at all. Consider fl_dump_key() which is 200 lines >> function with bunch of conditionals like that: >> >> static int fl_dump_key(struct sk_buff *skb, struct net *net, >> struct fl_flow_key *key, struct fl_flow_key *mask) >> { >> if (mask->meta.ingress_ifindex) { >> struct net_device *dev; >> >> dev = __dev_get_by_index(net, key->meta.ingress_ifindex); >> if (dev && nla_put_string(skb, TCA_FLOWER_INDEV, dev->name)) >> goto nla_put_failure; >> } >> >> if (fl_dump_key_val(skb, key->eth.dst, TCA_FLOWER_KEY_ETH_DST, >> mask->eth.dst, TCA_FLOWER_KEY_ETH_DST_MASK, >> sizeof(key->eth.dst)) || >> fl_dump_key_val(skb, key->eth.src, TCA_FLOWER_KEY_ETH_SRC, >> mask->eth.src, TCA_FLOWER_KEY_ETH_SRC_MASK, >> sizeof(key->eth.src)) || >> fl_dump_key_val(skb, &key->basic.n_proto, TCA_FLOWER_KEY_ETH_TYPE, >> &mask->basic.n_proto, TCA_FLOWER_UNSPEC, >> sizeof(key->basic.n_proto))) >> goto nla_put_failure; >> >> if (fl_dump_key_mpls(skb, &key->mpls, &mask->mpls)) >> goto nla_put_failure; >> >> if (fl_dump_key_vlan(skb, TCA_FLOWER_KEY_VLAN_ID, >> TCA_FLOWER_KEY_VLAN_PRIO, &key->vlan, &mask->vlan)) >> goto nla_put_failure; >> ... >> >> >> Now imagine all of these are extended with additional if (flags & >> TCA_DUMP_XXX). All gains from not outputting some other minor stuff into >> netlink packet will be negated by it. > > Interesting, are you saying a bit test is as expensive as appending > an actual netlink attribution to the dumping? I am surprised. It is not just adding a clause to all those conditionals. Some functions are not called at all with current terse dump design. In the case of fl_dump_key() it is just a bunch of conditionals (and maybe price of cache misses to access struct fl_flow_key in a first place). In case of tc_action_ops->dump() it is also obtaining a spinlock, some atomic ops, etc. But I agree, "negated" is too strong of a word, "significantly impacted" is more correct. > > >> >> >> > >> > Think about it, what if another user wants a less terse dump but still >> > not a full dump? Would you implement ops->terse_dump2()? Or >> > what if people still think your terse dump is not as terse as she wants? >> > ops->mini_dump()? How many ops's we would end having? >> >> User can discard whatever he doesn't need in user land code. The goal of >> this change is performance optimization, not designing a generic >> kernel-space data filtering mechanism. > > You optimize the performance by reducing the dump size, which is > already effectively a data filtering. This doesn't have to be your goal, > you are implementing it anyway. > > >> >> > >> > >> >> >> >> - It is hard to come up with proper validation for such implementation. >> >> In case of terse dump I just return an error if classifier doesn't >> >> implement the callback (and since current implementation only outputs >> >> generic action info, it doesn't even require support from >> >> action-specific dump callbacks). But, for example, how do we validate >> >> a case where user sets some flower and tunnel_key act dump flags from >> >> previous paragraph, but Qdisc contains some other classifier? Or >> >> flower classifier points to other types of actions? Or when flower >> >> classifier has and tunnel_key actions but also mirred? Should the >> > >> > Each action should be able to dump selectively too. If you think it >> > as a database, it is just a different table with different schemas. >> >> How is designing custom SQL-like query language (according to your >> example at the beginning of the mail) for filter dump is going to >> improve performance? If there is a way to do it in fast a generic manner >> with BPF, then I'm very interested to hear the details. But adding >> hundred more hardcoded conditionals is just not a solution considering >> main motivations for this change is performance. > > I still wonder how a bit test is as expensive as you claim, it does > not look like you actually measure it. This of course depends on the > size of the dump, but if you look at other netlink dump in kernel, > not just tc filters, we already dump a lot of attributes per record. > > Thanks. I agree that I didn't specify which parts of the change constitute what fraction of the dump rate increase. Lets stage a simple test to verify the cost of calling just two functions (fl_dump_key() and tc_act_ops->dump() callback) and instantly discarding their results from packet (patch attached). diff --git a/net/sched/act_api.c b/net/sched/act_api.c index 8ac7eb0a8309..267ee76d3ddb 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -771,6 +771,9 @@ tcf_action_dump_terse(struct sk_buff *skb, struct tc_action *a) { unsigned char *b = skb_tail_pointer(skb); struct tc_cookie *cookie; + unsigned char *c; + struct nlattr *nest; + int err; if (nla_put_string(skb, TCA_KIND, a->ops->kind)) goto nla_put_failure; @@ -787,6 +790,16 @@ tcf_action_dump_terse(struct sk_buff *skb, struct tc_action *a) } rcu_read_unlock(); + c = skb_tail_pointer(skb); + nest = nla_nest_start_noflag(skb, TCA_OPTIONS); + if (nest == NULL) + goto nla_put_failure; + err = tcf_action_dump_old(skb, a, 0, 0); + if (err > 0) { + nla_nest_end(skb, nest); + } + nlmsg_trim(skb, c); + return 0; nla_put_failure: diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c index 0c574700da75..1bc6294c5c9b 100644 --- a/net/sched/cls_flower.c +++ b/net/sched/cls_flower.c @@ -2771,8 +2771,10 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, void *fh, static int fl_terse_dump(struct net *net, struct tcf_proto *tp, void *fh, struct sk_buff *skb, struct tcmsg *t, bool rtnl_held) { + struct fl_flow_key *key, *mask; struct cls_fl_filter *f = fh; struct nlattr *nest; + unsigned char *b; bool skip_hw; if (!f) @@ -2786,8 +2788,15 @@ static int fl_terse_dump(struct net *net, struct tcf_proto *tp, void *fh, spin_lock(&tp->lock); + key = &f->key; + mask = &f->mask->key; skip_hw = tc_skip_hw(f->flags); + b = skb_tail_pointer(skb); + if (fl_dump_key(skb, net, key, mask)) + goto nla_put_failure_locked; + nlmsg_trim(skb, b); + if (f->flags && nla_put_u32(skb, TCA_FLOWER_FLAGS, f->flags)) goto nla_put_failure_locked; Result for terse dumping 1m simple rules (flower with L2 key + gact drop) on current net-next: $ time sudo tc -s filter show terse dev ens1f0 ingress >/dev/null real 0m3.445s user 0m2.087s sys 0m1.298s With patch applied: $ time sudo tc -s filter show terse dev ens1f0_0 ingress >/dev/null real 0m5.035s user 0m3.289s sys 0m1.687s As we can see this leads to 30% overhead in kernel space execution time. Now with more complex rules (flower 5tuple + act tunnel key + act mirred) on current net-next: $ time sudo tc -s filter show terse dev ens1f0 ingress >/dev/null real 0m4.052s user 0m2.065s sys 0m1.937s Same rules with patch applied: $ time sudo tc -s filter show terse dev ens1f0_0 ingress >/dev/null real 0m6.346s user 0m3.166s sys 0m3.108s With more complex rules performance impact on kernel space execution time get more severe (60%). Overall, this looks like significant degradation.