
[iwl-next,05/12] idpf: strictly assert cachelines of queue and queue vector structures

Message ID: 20240528134846.148890-6-aleksander.lobakin@intel.com
State: Changes Requested
Delegated to: Anthony Nguyen
Series: idpf: XDP chapter I: convert Rx to libeth

Commit Message

Alexander Lobakin May 28, 2024, 1:48 p.m. UTC
Now that the queue and queue vector structures are separated and laid
out optimally, group the fields as read-mostly, read-write, and cold
cachelines and add size assertions to make sure new features won't push
something out of its place and provoke a perf regression.
Despite looking innocent, this gives up to a 2% perf bump on Rx.
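
The pattern being applied is roughly the sketch below; the structure, field
names and asserted sizes are illustrative only (assuming a typical 64-bit
build) and are not taken from idpf, only the libeth helpers come from this
series.

	#include <linux/types.h>

	#include <net/libeth/cache.h>

	/* Illustrative queue: pointers set up at init time, counters bumped
	 * on every packet, and slowpath config, each in its own group.
	 */
	struct example_queue {
		libeth_cacheline_group(read_mostly,
			void *desc_ring;
			void __iomem *tail;
		);
		libeth_cacheline_group(read_write,
			u32 next_to_use;
			u32 next_to_clean;
		);
		libeth_cacheline_group(cold,
			u32 q_id;
			u32 size;
			dma_addr_t dma;
		);
	};
	/* Expected sizes of the read-mostly, read-write and cold groups on a
	 * 64-bit build; a new field that grows a group fails the build
	 * instead of silently pushing fields into the next cacheline.
	 */
	libeth_cacheline_set_assert(struct example_queue, 16, 8, 16);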

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/net/ethernet/intel/idpf/idpf_txrx.h | 443 +++++++++++---------
 1 file changed, 250 insertions(+), 193 deletions(-)

Comments

Jacob Keller May 29, 2024, 12:43 a.m. UTC | #1
On 5/28/2024 6:48 AM, Alexander Lobakin wrote:
> Now that the queue and queue vector structures are separated and laid
> out optimally, group the fields as read-mostly, read-write, and cold
> cachelines and add size assertions to make sure new features won't push
> something out of its place and provoke a perf regression.



> Despite looking innocent, this gives up to a 2% perf bump on Rx.
> 

Could you explain this a bit more for my education? This patch does
clearly change the layout from what it was before this patch, but the
commit message here claims it was already laid out optimally? I guess
that wasn't 100% true? Or do these group field macros also provide
further hints to the compiler about read_mostly or cold, etc?

> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---

Having the compiler assert some of this so that we can more easily spot
regressions in the layout is a big benefit.

Thanks!

Although I'm not an expert on cache-line layout, I didn't see anything
wrong here:

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

>  drivers/net/ethernet/intel/idpf/idpf_txrx.h | 443 +++++++++++---------
>  1 file changed, 250 insertions(+), 193 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.h b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
> index 9e4ba0aaf3ab..731e2a73def5 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_txrx.h
> +++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
> @@ -6,6 +6,7 @@
>  
>  #include <linux/dim.h>
>  
> +#include <net/libeth/cache.h>
>  #include <net/page_pool/helpers.h>
>  #include <net/tcp.h>
>  #include <net/netdev_queues.h>
> @@ -504,59 +505,70 @@ struct idpf_intr_reg {
>  
>  /**
>   * struct idpf_q_vector
> + * @read_mostly: CL group with rarely written hot fields

I wonder if there is a good way to format the doc here since we almost
want read_mostly to be some sort of header making it clear which fields
belong to it? I don't know how we'd achieve that with current kdoc though.
Alexander Lobakin June 12, 2024, 1:03 p.m. UTC | #2
From: Jacob Keller <jacob.e.keller@intel.com>
Date: Tue, 28 May 2024 17:43:34 -0700

> 
> 
> On 5/28/2024 6:48 AM, Alexander Lobakin wrote:
>> Now that the queue and queue vector structures are separated and laid
>> out optimally, group the fields as read-mostly, read-write, and cold
>> cachelines and add size assertions to make sure new features won't push
>> something out of its place and provoke a perf regression.
> 
> 
> 
>> Despite looking innocent, this gives up to a 2% perf bump on Rx.
>>
> 
> Could you explain this a bit more for my education? This patch does
> clearly change the layout from what it was before this patch, but the
> commit message here claims it was already laid out optimally? I guess
> that wasn't 100% true? Or do these group field macros also provide
> further hints to the compiler about read_mostly or cold, etc?

The queue structure split grouped the fields more optimally, but it didn't
place ro/rw/cold into separate cachelines. This commit performs that
separation via libeth_cacheline_group(). Doing both in one commit wouldn't
have looked atomic, especially given that the queue split is already big
enough.
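
As a rough illustration of why that separation pays off (field names
invented here; this assumes, as the commit message describes, that each
libeth_cacheline_group() lands in its own cacheline):

	#include <linux/types.h>

	#include <net/libeth/cache.h>

	/* Mixed layout: next_to_clean is written on every completion, so the
	 * store keeps dirtying the cacheline that also holds desc_ring,
	 * which the hot path only reads.
	 */
	struct mixed_layout {
		void *desc_ring;	/* read per packet */
		u32 next_to_clean;	/* written per packet */
	};

	/* Grouped layout: the written counter lives in the read_write group,
	 * so the read-mostly line stays clean.
	 */
	struct grouped_layout {
		libeth_cacheline_group(read_mostly,
			void *desc_ring;
		);
		libeth_cacheline_group(read_write,
			u32 next_to_clean;
		);
	};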

> 
>> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>> ---
> 
> Having the compiler assert some of this so that we can more easily spot
> regressions in the layout is a big benefit.

[...]

>> @@ -504,59 +505,70 @@ struct idpf_intr_reg {
>>  
>>  /**
>>   * struct idpf_q_vector
>> + * @read_mostly: CL group with rarely written hot fields
> 
> I wonder if there is a good way to format the doc here since we almost
> want read_mostly to be some sort of header making it clear which fields
> belong to it? I don't know how we'd achieve that with current kdoc though.

Since commit [0], struct groups need to be explicitly described in kdoc.
@read_mostly and friends are struct groups themselves, and in the first
patch, where I add these macros, I also teach the kdoc script to treat
them as struct groups, so they need to be described as well.
Given that one may use libeth_cacheline_group() to declare custom
groups, like

	libeth_cacheline_group(my_cl,
		fields
	);

it makes sense, as I'd like to know what this @my_cl is about. Here I use
the "default" CL names, so these kdoc entries look like Ctrl-{C,V}
explaining obvious things :D
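
For a custom group, a hypothetical declaration and its kdoc might look like
this (struct, fields and wording invented for illustration, not taken from
the series):

	#include <linux/types.h>

	#include <net/libeth/cache.h>

	/**
	 * struct demo_ring - queue with a custom cacheline group
	 * @my_cl: fields touched on every doorbell write
	 * @tail: doorbell register address
	 * @next_to_use: next free descriptor to post
	 * @q_id: absolute queue id, slowpath only
	 */
	struct demo_ring {
		libeth_cacheline_group(my_cl,
			void __iomem *tail;
			u32 next_to_use;
		);
		u32 q_id;
	};

Here, @my_cl gets its own description line, the same way @read_mostly and
friends do in the patch.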

[0]
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=5f8e4007c10d

Thanks,
Olek
Alexander Lobakin June 12, 2024, 1:08 p.m. UTC | #3
From: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Wed, 12 Jun 2024 15:03:07 +0200

> From: Jacob Keller <jacob.e.keller@intel.com>
> Date: Tue, 28 May 2024 17:43:34 -0700
> 
>>
>>
>> On 5/28/2024 6:48 AM, Alexander Lobakin wrote:
>>> Now that the queue and queue vector structures are separated and laid
>>> out optimally, group the fields as read-mostly, read-write, and cold
>>> cachelines and add size assertions to make sure new features won't push
>>> something out of its place and provoke a perf regression.
>>
>>
>>
>>> Despite looking innocent, this gives up to a 2% perf bump on Rx.
>>>
>>
>> Could you explain this a bit more for my education? This patch does
>> clearly change the layout from what it was before this patch, but the
>> commit message here claims it was already laid out optimally? I guess
>> that wasn't 100% true? Or do these group field macros also provide
>> further hints to the compiler about read_mostly or cold, etc?
> 
> The queue structure split grouped the fields more optimally, but it didn't
> place ro/rw/cold into separate cachelines. This commit performs that
> separation via libeth_cacheline_group(). Doing both in one commit wouldn't
> have looked atomic, especially given that the queue split is already big
> enough.
> 
>>
>>> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
>>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>>> ---
>>
>> Having the compiler assert some of this so that we can more easily spot
>> regressions in the layout is a big benefit.
> 
> [...]
> 
>>> @@ -504,59 +505,70 @@ struct idpf_intr_reg {
>>>  
>>>  /**
>>>   * struct idpf_q_vector
>>> + * @read_mostly: CL group with rarely written hot fields
>>
>> I wonder if there is a good way to format the doc here since we almost
>> want read_mostly to be some sort of header making it clear which fields
>> belong to it? I don't know how we'd achieve that with current kdoc though.
> 
> Since commit [0], struct groups need to be explicitly described in kdoc.
> @read_mostly and friends are struct groups themselves, and in the first
> patch, where I add these macros, I also teach the kdoc script to treat
> them as struct groups, so they need to be described as well.
> Given that one may use libeth_cacheline_group() to declare custom
> groups, like
> 
> 	libeth_cacheline_group(my_cl,
> 		fields
> 	);
> 
> it makes sense, as I'd like to know what this @my_cl is about. Here I use
> the "default" CL names, so these kdoc entries look like Ctrl-{C,V}
> explaining obvious things :D

Sorry, I misread your comment =\
I think having it the way it is right now is enough, as you anyway
have something like:

* @read_mostly: read-mostly hotpath fields
* @rm_field1: first read-mostly field
* @rm_field2: second read-mostly field
* @read_write: read-write hotpath fields
* @rw_field1: first read-write field
...

I mean, they are already sort of headers, aren't they? By looking at where
the next group is described, you can tell which fields belong to this
one, given that the fields must be described in the same order as
they're defined in the structure.

Perhaps we could do

* @read_mostly: read-mostly hotpath fields
*  @rm_field1: first read-mostly field
* @read_write: read-write hotpath fields

i.e. indent the "child" fields, but I'd say it doesn't look good.

> 
> [0]
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=5f8e4007c10d
> 
> Thanks,
> Olek

Thanks,
Olek
Jacob Keller June 12, 2024, 10:40 p.m. UTC | #4
On 6/12/2024 6:03 AM, Alexander Lobakin wrote:
> From: Jacob Keller <jacob.e.keller@intel.com>
> Date: Tue, 28 May 2024 17:43:34 -0700
> 
>>
>>
>> On 5/28/2024 6:48 AM, Alexander Lobakin wrote:
>>> Now that the queue and queue vector structures are separated and laid
>>> out optimally, group the fields as read-mostly, read-write, and cold
>>> cachelines and add size assertions to make sure new features won't push
>>> something out of its place and provoke a perf regression.
>>
>>
>>
>>> Despite looking innocent, this gives up to a 2% perf bump on Rx.
>>>
>>
>> Could you explain this a bit more for my education? This patch does
>> clearly change the layout from what it was before this patch, but the
>> commit message here claims it was already laid out optimally? I guess
>> that wasn't 100% true? Or do these group field macros also provide
>> further hints to the compiler about read_mostly or cold, etc?
> 
> The queue structure split grouped the fields more optimally, but it didn't
> place ro/rw/cold into separate cachelines. This commit performs that
> separation via libeth_cacheline_group(). Doing both in one commit wouldn't
> have looked atomic, especially given that the queue split is already big
> enough.
> 

Makes sense, thanks for the clarification!

>>
>>> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
>>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>>> ---
>>
>> Having the compiler assert some of this so that we can more easily spot
>> regressions in the layout is a big benefit.
> 
> [...]
> 
>>> @@ -504,59 +505,70 @@ struct idpf_intr_reg {
>>>  
>>>  /**
>>>   * struct idpf_q_vector
>>> + * @read_mostly: CL group with rarely written hot fields
>>
>> I wonder if there is a good way to format the doc here since we almost
>> want read_mostly to be some sort of header making it clear which fields
>> belong to it? I don't know how we'd achieve that with current kdoc though.
> 
> Since commit [0], struct groups need to be explicitly described in kdoc.
> @read_mostly and friends are struct groups themselves, and in the first
> patch, where I add these macros, I also teach the kdoc script to treat
> them as struct groups, so they need to be described as well.
> Given that one may use libeth_cacheline_group() to declare custom
> groups, like
> 
> 	libeth_cacheline_group(my_cl,
> 		fields
> 	);
> 
> it makes sense, as I'd like to know what this @my_cl is about. Here I use
> the "default" CL names, so these kdoc entries look like Ctrl-{C,V}
> explaining obvious things :D
> 
> [0]
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=5f8e4007c10d
> 

Yeah, I am not asking about "don't document it", but wondering if the doc
format itself could be extended so that it is clear from the kdoc which
fields belong to which group.

Anyways, I think the patch is fine as-is.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Jacob Keller June 12, 2024, 10:42 p.m. UTC | #5
On 6/12/2024 6:08 AM, Alexander Lobakin wrote:
> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> Date: Wed, 12 Jun 2024 15:03:07 +0200
> 
>> From: Jacob Keller <jacob.e.keller@intel.com>
>> Date: Tue, 28 May 2024 17:43:34 -0700
>>
>>>
>>>
>>> On 5/28/2024 6:48 AM, Alexander Lobakin wrote:
>>>> Now that the queue and queue vector structures are separated and laid
>>>> out optimally, group the fields as read-mostly, read-write, and cold
>>>> cachelines and add size assertions to make sure new features won't push
>>>> something out of its place and provoke a perf regression.
>>>
>>>
>>>
>>>> Despite looking innocent, this gives up to a 2% perf bump on Rx.
>>>>
>>>
>>> Could you explain this a bit more for my education? This patch does
>>> clearly change the layout from what it was before this patch, but the
>>> commit message here claims it was already laid out optimally? I guess
>>> that wasn't 100% true? Or do these group field macros also provide
>>> further hints to the compiler about read_mostly or cold, etc?
>>
>> The queue structure split grouped the fields more optimally, but it didn't
>> place ro/rw/cold into separate cachelines. This commit performs that
>> separation via libeth_cacheline_group(). Doing both in one commit wouldn't
>> have looked atomic, especially given that the queue split is already big
>> enough.
>>
>>>
>>>> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
>>>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>>>> ---
>>>
>>> Having the compiler assert some of this so that we can more easily spot
>>> regressions in the layout is a big benefit.
>>
>> [...]
>>
>>>> @@ -504,59 +505,70 @@ struct idpf_intr_reg {
>>>>  
>>>>  /**
>>>>   * struct idpf_q_vector
>>>> + * @read_mostly: CL group with rarely written hot fields
>>>
>>> I wonder if there is a good way to format the doc here since we almost
>>> want read_mostly to be some sort of header making it clear which fields
>>> belong to it? I don't know how we'd achieve that with current kdoc though.
>>
>> Since commit [0], struct groups need to be explicitly described in kdoc.
>> @read_mostly and friends are struct groups themselves, and in the first
>> patch, where I add these macros, I also teach the kdoc script to treat
>> them as struct groups, so they need to be described as well.
>> Given that one may use libeth_cacheline_group() to declare custom
>> groups, like
>>
>> 	libeth_cacheline_group(my_cl,
>> 		fields
>> 	);
>>
>> it makes sense, as I'd like to know what this @my_cl is about. Here I use
>> the "default" CL names, so these kdoc entries look like Ctrl-{C,V}
>> explaining obvious things :D
> 
> Sorry, I misread your comment =\
> I think having it the way it is right now is enough, as you anyway
> have something like:
> 
> * @read_mostly: read-mostly hotpath fields
> * @rm_field1: first read-mostly field
> * @rm_field2: second read-mostly field
> * @read_write: read-write hotpath fields
> * @rw_field1: first read-write field
> ...
> 
> I mean, they are already sort of headers, aren't they? By looking at where
> the next group is described, you can tell which fields belong to this
> one, given that the fields must be described in the same order as
> they're defined in the structure.
> 
> Perhaps we could do
> 
> * @read_mostly: read-mostly hotpath fields
> *  @rm_field1: first read-mostly field
> * @read_write: read-write hotpath fields
> 
> i.e. indent the "child" fields, but I'd say it doesn't look good.
>

I was thinking of something like putting a blank line between groups, but
yeah, I think it's not really a big deal. It's more that "@read_mostly"
looks like a field name when in reality it's more like a group of fields.

Patch

diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.h b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
index 9e4ba0aaf3ab..731e2a73def5 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.h
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
@@ -6,6 +6,7 @@ 
 
 #include <linux/dim.h>
 
+#include <net/libeth/cache.h>
 #include <net/page_pool/helpers.h>
 #include <net/tcp.h>
 #include <net/netdev_queues.h>
@@ -504,59 +505,70 @@  struct idpf_intr_reg {
 
 /**
  * struct idpf_q_vector
+ * @read_mostly: CL group with rarely written hot fields
  * @vport: Vport back pointer
- * @napi: napi handler
- * @v_idx: Vector index
- * @intr_reg: See struct idpf_intr_reg
+ * @num_rxq: Number of RX queues
  * @num_txq: Number of TX queues
+ * @num_bufq: Number of buffer queues
  * @num_complq: number of completion queues
+ * @rx: Array of RX queues to service
  * @tx: Array of TX queues to service
+ * @bufq: Array of buffer queues to service
  * @complq: array of completion queues
+ * @intr_reg: See struct idpf_intr_reg
+ * @read_write: CL group with both read/write hot fields
+ * @napi: napi handler
+ * @total_events: Number of interrupts processed
  * @tx_dim: Data for TX net_dim algorithm
  * @tx_itr_value: TX interrupt throttling rate
  * @tx_intr_mode: Dynamic ITR or not
  * @tx_itr_idx: TX ITR index
- * @num_rxq: Number of RX queues
- * @rx: Array of RX queues to service
  * @rx_dim: Data for RX net_dim algorithm
  * @rx_itr_value: RX interrupt throttling rate
  * @rx_intr_mode: Dynamic ITR or not
  * @rx_itr_idx: RX ITR index
- * @num_bufq: Number of buffer queues
- * @bufq: Array of buffer queues to service
- * @total_events: Number of interrupts processed
+ * @cold: CL group with fields not touched on hotpath
+ * @v_idx: Vector index
  * @affinity_mask: CPU affinity mask
  */
 struct idpf_q_vector {
-	struct idpf_vport *vport;
-	struct napi_struct napi;
-	u16 v_idx;
-	struct idpf_intr_reg intr_reg;
-
-	u16 num_txq;
-	u16 num_complq;
-	struct idpf_tx_queue **tx;
-	struct idpf_compl_queue **complq;
-
-	struct dim tx_dim;
-	u16 tx_itr_value;
-	bool tx_intr_mode;
-	u32 tx_itr_idx;
-
-	u16 num_rxq;
-	struct idpf_rx_queue **rx;
-	struct dim rx_dim;
-	u16 rx_itr_value;
-	bool rx_intr_mode;
-	u32 rx_itr_idx;
-
-	u16 num_bufq;
-	struct idpf_buf_queue **bufq;
-
-	u16 total_events;
-
-	cpumask_var_t affinity_mask;
+	libeth_cacheline_group(read_mostly,
+		struct idpf_vport *vport;
+
+		u16 num_rxq;
+		u16 num_txq;
+		u16 num_bufq;
+		u16 num_complq;
+		struct idpf_rx_queue **rx;
+		struct idpf_tx_queue **tx;
+		struct idpf_buf_queue **bufq;
+		struct idpf_compl_queue **complq;
+
+		struct idpf_intr_reg intr_reg;
+	);
+	libeth_cacheline_group(read_write,
+		struct napi_struct napi;
+		u16 total_events;
+
+		struct dim tx_dim;
+		u16 tx_itr_value;
+		bool tx_intr_mode;
+		u32 tx_itr_idx;
+
+		struct dim rx_dim;
+		u16 rx_itr_value;
+		bool rx_intr_mode;
+		u32 rx_itr_idx;
+	);
+	libeth_cacheline_group(cold,
+		u16 v_idx;
+
+		cpumask_var_t affinity_mask;
+	);
 };
+libeth_cacheline_set_assert(struct idpf_q_vector, 104,
+			    424 + 2 * sizeof(struct dim),
+			    8 + sizeof(cpumask_var_t));
 
 struct idpf_rx_queue_stats {
 	u64_stats_t packets;
@@ -610,6 +622,7 @@  struct idpf_txq_stash {
 
 /**
  * struct idpf_rx_queue - software structure representing a receive queue
+ * @read_mostly: CL group with rarely written hot fields
  * @rx: universal receive descriptor array
  * @single_buf: buffer descriptor array in singleq
  * @desc_ring: virtual descriptor ring address
@@ -623,14 +636,16 @@  struct idpf_txq_stash {
  * @idx: For RX queue, it is used to index to total RX queue across groups and
  *	 used for skb reporting.
  * @desc_count: Number of descriptors
+ * @rxdids: Supported RX descriptor ids
+ * @rx_ptype_lkup: LUT of Rx ptypes
+ * @read_write: CL group with both read/write hot fields
  * @next_to_use: Next descriptor to use
  * @next_to_clean: Next descriptor to clean
  * @next_to_alloc: RX buffer to allocate at
- * @rxdids: Supported RX descriptor ids
- * @rx_ptype_lkup: LUT of Rx ptypes
  * @skb: Pointer to the skb
  * @stats_sync: See struct u64_stats_sync
  * @q_stats: See union idpf_rx_queue_stats
+ * @cold: CL group with fields not touched on hotpath
  * @q_id: Queue id
  * @size: Length of descriptor ring in bytes
  * @dma: Physical address of ring
@@ -641,55 +656,63 @@  struct idpf_txq_stash {
  * @rx_max_pkt_size: RX max packet size
  */
 struct idpf_rx_queue {
-	union {
-		union virtchnl2_rx_desc *rx;
-		struct virtchnl2_singleq_rx_buf_desc *single_buf;
+	libeth_cacheline_group(read_mostly,
+		union {
+			union virtchnl2_rx_desc *rx;
+			struct virtchnl2_singleq_rx_buf_desc *single_buf;
 
-		void *desc_ring;
-	};
-	union {
-		struct {
-			struct idpf_bufq_set *bufq_sets;
-			struct napi_struct *napi;
+			void *desc_ring;
 		};
-		struct {
-			struct idpf_rx_buf *rx_buf;
-			struct page_pool *pp;
+		union {
+			struct {
+				struct idpf_bufq_set *bufq_sets;
+				struct napi_struct *napi;
+			};
+			struct {
+				struct idpf_rx_buf *rx_buf;
+				struct page_pool *pp;
+			};
 		};
-	};
-	struct net_device *netdev;
-	void __iomem *tail;
-
-	DECLARE_BITMAP(flags, __IDPF_Q_FLAGS_NBITS);
-	u16 idx;
-	u16 desc_count;
-	u16 next_to_use;
-	u16 next_to_clean;
-	u16 next_to_alloc;
-
-	u32 rxdids;
-
-	const struct idpf_rx_ptype_decoded *rx_ptype_lkup;
-	struct sk_buff *skb;
-
-	struct u64_stats_sync stats_sync;
-	struct idpf_rx_queue_stats q_stats;
-
-	/* Slowpath */
-	u32 q_id;
-	u32 size;
-	dma_addr_t dma;
-
-	struct idpf_q_vector *q_vector;
-
-	u16 rx_buffer_low_watermark;
-	u16 rx_hbuf_size;
-	u16 rx_buf_size;
-	u16 rx_max_pkt_size;
-} ____cacheline_aligned;
+		struct net_device *netdev;
+		void __iomem *tail;
+
+		DECLARE_BITMAP(flags, __IDPF_Q_FLAGS_NBITS);
+		u16 idx;
+		u16 desc_count;
+
+		u32 rxdids;
+		const struct idpf_rx_ptype_decoded *rx_ptype_lkup;
+	);
+	libeth_cacheline_group(read_write,
+		u16 next_to_use;
+		u16 next_to_clean;
+		u16 next_to_alloc;
+
+		struct sk_buff *skb;
+
+		struct u64_stats_sync stats_sync;
+		struct idpf_rx_queue_stats q_stats;
+	);
+	libeth_cacheline_group(cold,
+		u32 q_id;
+		u32 size;
+		dma_addr_t dma;
+
+		struct idpf_q_vector *q_vector;
+
+		u16 rx_buffer_low_watermark;
+		u16 rx_hbuf_size;
+		u16 rx_buf_size;
+		u16 rx_max_pkt_size;
+	);
+};
+libeth_cacheline_set_assert(struct idpf_rx_queue, 64,
+			    72 + sizeof(struct u64_stats_sync),
+			    32);
 
 /**
  * struct idpf_tx_queue - software structure representing a transmit queue
+ * @read_mostly: CL group with rarely written hot fields
  * @base_tx: base Tx descriptor array
  * @base_ctx: base Tx context descriptor array
  * @flex_tx: flex Tx descriptor array
@@ -703,22 +726,7 @@  struct idpf_rx_queue {
  * @idx: For TX queue, it is used as index to map between TX queue group and
  *	 hot path TX pointers stored in vport. Used in both singleq/splitq.
  * @desc_count: Number of descriptors
- * @next_to_use: Next descriptor to use
- * @next_to_clean: Next descriptor to clean
- * @netdev: &net_device corresponding to this queue
- * @cleaned_bytes: Splitq only, TXQ only: When a TX completion is received on
- *		   the TX completion queue, it can be for any TXQ associated
- *		   with that completion queue. This means we can clean up to
- *		   N TXQs during a single call to clean the completion queue.
- *		   cleaned_bytes|pkts tracks the clean stats per TXQ during
- *		   that single call to clean the completion queue. By doing so,
- *		   we can update BQL with aggregate cleaned stats for each TXQ
- *		   only once at the end of the cleaning routine.
- * @clean_budget: singleq only, queue cleaning budget
- * @cleaned_pkts: Number of packets cleaned for the above said case
- * @tx_max_bufs: Max buffers that can be transmitted with scatter-gather
  * @tx_min_pkt_len: Min supported packet length
- * @compl_tag_bufid_m: Completion tag buffer id mask
  * @compl_tag_gen_s: Completion tag generation bit
  *	The format of the completion tag will change based on the TXQ
  *	descriptor ring size so that we can maintain roughly the same level
@@ -739,68 +747,92 @@  struct idpf_rx_queue {
  *	--------------------------------
  *
  *	This gives us 8*8160 = 65280 possible unique values.
+ * @netdev: &net_device corresponding to this queue
+ * @read_write: CL group with both read/write hot fields
+ * @next_to_use: Next descriptor to use
+ * @next_to_clean: Next descriptor to clean
+ * @cleaned_bytes: Splitq only, TXQ only: When a TX completion is received on
+ *		   the TX completion queue, it can be for any TXQ associated
+ *		   with that completion queue. This means we can clean up to
+ *		   N TXQs during a single call to clean the completion queue.
+ *		   cleaned_bytes|pkts tracks the clean stats per TXQ during
+ *		   that single call to clean the completion queue. By doing so,
+ *		   we can update BQL with aggregate cleaned stats for each TXQ
+ *		   only once at the end of the cleaning routine.
+ * @clean_budget: singleq only, queue cleaning budget
+ * @cleaned_pkts: Number of packets cleaned for the above said case
+ * @tx_max_bufs: Max buffers that can be transmitted with scatter-gather
+ * @stash: Tx buffer stash for Flow-based scheduling mode
+ * @compl_tag_bufid_m: Completion tag buffer id mask
  * @compl_tag_cur_gen: Used to keep track of current completion tag generation
  * @compl_tag_gen_max: To determine when compl_tag_cur_gen should be reset
- * @stash: Tx buffer stash for Flow-based scheduling mode
  * @stats_sync: See struct u64_stats_sync
  * @q_stats: See union idpf_tx_queue_stats
+ * @cold: CL group with fields not touched on hotpath
  * @q_id: Queue id
  * @size: Length of descriptor ring in bytes
  * @dma: Physical address of ring
  * @q_vector: Backreference to associated vector
  */
 struct idpf_tx_queue {
-	union {
-		struct idpf_base_tx_desc *base_tx;
-		struct idpf_base_tx_ctx_desc *base_ctx;
-		union idpf_tx_flex_desc *flex_tx;
-		struct idpf_flex_tx_ctx_desc *flex_ctx;
-
-		void *desc_ring;
-	};
-	struct idpf_tx_buf *tx_buf;
-	struct idpf_txq_group *txq_grp;
-	struct device *dev;
-	void __iomem *tail;
-
-	DECLARE_BITMAP(flags, __IDPF_Q_FLAGS_NBITS);
-	u16 idx;
-	u16 desc_count;
-	u16 next_to_use;
-	u16 next_to_clean;
-
-	struct net_device *netdev;
-
-	union {
-		u32 cleaned_bytes;
-		u32 clean_budget;
-	};
-	u16 cleaned_pkts;
-
-	u16 tx_max_bufs;
-	u16 tx_min_pkt_len;
-
-	u16 compl_tag_bufid_m;
-	u16 compl_tag_gen_s;
-
-	u16 compl_tag_cur_gen;
-	u16 compl_tag_gen_max;
+	libeth_cacheline_group(read_mostly,
+		union {
+			struct idpf_base_tx_desc *base_tx;
+			struct idpf_base_tx_ctx_desc *base_ctx;
+			union idpf_tx_flex_desc *flex_tx;
+			struct idpf_flex_tx_ctx_desc *flex_ctx;
+
+			void *desc_ring;
+		};
+		struct idpf_tx_buf *tx_buf;
+		struct idpf_txq_group *txq_grp;
+		struct device *dev;
+		void __iomem *tail;
+
+		DECLARE_BITMAP(flags, __IDPF_Q_FLAGS_NBITS);
+		u16 idx;
+		u16 desc_count;
+
+		u16 tx_min_pkt_len;
+		u16 compl_tag_gen_s;
+
+		struct net_device *netdev;
+	);
+	libeth_cacheline_group(read_write,
+		u16 next_to_use;
+		u16 next_to_clean;
+
+		union {
+			u32 cleaned_bytes;
+			u32 clean_budget;
+		};
+		u16 cleaned_pkts;
 
-	struct idpf_txq_stash *stash;
+		u16 tx_max_bufs;
+		struct idpf_txq_stash *stash;
 
-	struct u64_stats_sync stats_sync;
-	struct idpf_tx_queue_stats q_stats;
+		u16 compl_tag_bufid_m;
+		u16 compl_tag_cur_gen;
+		u16 compl_tag_gen_max;
 
-	/* Slowpath */
-	u32 q_id;
-	u32 size;
-	dma_addr_t dma;
+		struct u64_stats_sync stats_sync;
+		struct idpf_tx_queue_stats q_stats;
+	);
+	libeth_cacheline_group(cold,
+		u32 q_id;
+		u32 size;
+		dma_addr_t dma;
 
-	struct idpf_q_vector *q_vector;
-} ____cacheline_aligned;
+		struct idpf_q_vector *q_vector;
+	);
+};
+libeth_cacheline_set_assert(struct idpf_tx_queue, 64,
+			    88 + sizeof(struct u64_stats_sync),
+			    24);
 
 /**
  * struct idpf_buf_queue - software structure representing a buffer queue
+ * @read_mostly: CL group with rarely written hot fields
  * @split_buf: buffer descriptor array
  * @rx_buf: Struct with RX buffer related members
  * @rx_buf.buf: See struct idpf_rx_buf
@@ -810,9 +842,11 @@  struct idpf_tx_queue {
  * @tail: Tail offset
  * @flags: See enum idpf_queue_flags_t
  * @desc_count: Number of descriptors
+ * @read_write: CL group with both read/write hot fields
  * @next_to_use: Next descriptor to use
  * @next_to_clean: Next descriptor to clean
  * @next_to_alloc: RX buffer to allocate at
+ * @cold: CL group with fields not touched on hotpath
  * @q_id: Queue id
  * @size: Length of descriptor ring in bytes
  * @dma: Physical address of ring
@@ -822,79 +856,95 @@  struct idpf_tx_queue {
  * @rx_buf_size: Buffer size
  */
 struct idpf_buf_queue {
-	struct virtchnl2_splitq_rx_buf_desc *split_buf;
-	struct {
-		struct idpf_rx_buf *buf;
-		dma_addr_t hdr_buf_pa;
-		void *hdr_buf_va;
-	} rx_buf;
-	struct page_pool *pp;
-	void __iomem *tail;
-
-	DECLARE_BITMAP(flags, __IDPF_Q_FLAGS_NBITS);
-	u16 desc_count;
-	u16 next_to_use;
-	u16 next_to_clean;
-	u16 next_to_alloc;
-
-	/* Slowpath */
-	u32 q_id;
-	u32 size;
-	dma_addr_t dma;
-
-	struct idpf_q_vector *q_vector;
-
-	u16 rx_buffer_low_watermark;
-	u16 rx_hbuf_size;
-	u16 rx_buf_size;
-} ____cacheline_aligned;
+	libeth_cacheline_group(read_mostly,
+		struct virtchnl2_splitq_rx_buf_desc *split_buf;
+		struct {
+			struct idpf_rx_buf *buf;
+			dma_addr_t hdr_buf_pa;
+			void *hdr_buf_va;
+		} rx_buf;
+		struct page_pool *pp;
+		void __iomem *tail;
+
+		DECLARE_BITMAP(flags, __IDPF_Q_FLAGS_NBITS);
+		u32 desc_count;
+	);
+	libeth_cacheline_group(read_write,
+		u32 next_to_use;
+		u32 next_to_clean;
+		u32 next_to_alloc;
+	);
+	libeth_cacheline_group(cold,
+		u32 q_id;
+		u32 size;
+		dma_addr_t dma;
+
+		struct idpf_q_vector *q_vector;
+
+		u16 rx_buffer_low_watermark;
+		u16 rx_hbuf_size;
+		u16 rx_buf_size;
+	);
+};
+libeth_cacheline_set_assert(struct idpf_buf_queue, 64, 16, 32);
 
 /**
  * struct idpf_compl_queue - software structure representing a completion queue
+ * @read_mostly: CL group with rarely written hot fields
  * @comp: completion descriptor array
  * @txq_grp: See struct idpf_txq_group
  * @flags: See enum idpf_queue_flags_t
  * @desc_count: Number of descriptors
+ * @clean_budget: queue cleaning budget
+ * @netdev: &net_device corresponding to this queue
+ * @read_write: CL group with both read/write hot fields
  * @next_to_use: Next descriptor to use. Relevant in both split & single txq
  *		 and bufq.
  * @next_to_clean: Next descriptor to clean
- * @netdev: &net_device corresponding to this queue
- * @clean_budget: queue cleaning budget
  * @num_completions: Only relevant for TX completion queue. It tracks the
  *		     number of completions received to compare against the
  *		     number of completions pending, as accumulated by the
  *		     TX queues.
+ * @cold: CL group with fields not touched on hotpath
  * @q_id: Queue id
  * @size: Length of descriptor ring in bytes
  * @dma: Physical address of ring
  * @q_vector: Backreference to associated vector
  */
 struct idpf_compl_queue {
-	struct idpf_splitq_tx_compl_desc *comp;
-	struct idpf_txq_group *txq_grp;
-
-	DECLARE_BITMAP(flags, __IDPF_Q_FLAGS_NBITS);
-	u16 desc_count;
-	u16 next_to_use;
-	u16 next_to_clean;
-
-	struct net_device *netdev;
-	u32 clean_budget;
-	u32 num_completions;
+	libeth_cacheline_group(read_mostly,
+		struct idpf_splitq_tx_compl_desc *comp;
+		struct idpf_txq_group *txq_grp;
 
-	/* Slowpath */
-	u32 q_id;
-	u32 size;
-	dma_addr_t dma;
+		DECLARE_BITMAP(flags, __IDPF_Q_FLAGS_NBITS);
+		u32 desc_count;
 
-	struct idpf_q_vector *q_vector;
-} ____cacheline_aligned;
+		u32 clean_budget;
+		struct net_device *netdev;
+	);
+	libeth_cacheline_group(read_write,
+		u32 next_to_use;
+		u32 next_to_clean;
+
+		u32 num_completions;
+	);
+	libeth_cacheline_group(cold,
+		u32 q_id;
+		u32 size;
+		dma_addr_t dma;
+
+		struct idpf_q_vector *q_vector;
+	);
+};
+libeth_cacheline_set_assert(struct idpf_compl_queue, 40, 16, 24);
 
 /**
  * struct idpf_sw_queue
+ * @read_mostly: CL group with rarely written hot fields
  * @ring: Pointer to the ring
  * @flags: See enum idpf_queue_flags_t
  * @desc_count: Descriptor count
+ * @read_write: CL group with both read/write hot fields
  * @next_to_use: Buffer to allocate at
  * @next_to_clean: Next descriptor to clean
  *
@@ -903,13 +953,20 @@  struct idpf_compl_queue {
  * lockless buffer management system and are strictly software only constructs.
  */
 struct idpf_sw_queue {
-	u32 *ring;
-
-	DECLARE_BITMAP(flags, __IDPF_Q_FLAGS_NBITS);
-	u16 desc_count;
-	u16 next_to_use;
-	u16 next_to_clean;
-} ____cacheline_aligned;
+	libeth_cacheline_group(read_mostly,
+		u32 *ring;
+
+		DECLARE_BITMAP(flags, __IDPF_Q_FLAGS_NBITS);
+		u32 desc_count;
+	);
+	libeth_cacheline_group(read_write,
+		u32 next_to_use;
+		u32 next_to_clean;
+	);
+};
+libeth_cacheline_group_assert(struct idpf_sw_queue, read_mostly, 24);
+libeth_cacheline_group_assert(struct idpf_sw_queue, read_write, 8);
+libeth_cacheline_struct_assert(struct idpf_sw_queue, 24, 8);
 
 /**
  * struct idpf_rxq_set