
[net-next,01/16] qlge: Remove irq_cnt

Message ID 20190617074858.32467-1-bpoirier@suse.com
State Changes Requested
Delegated to: David Miller
Series [net-next,01/16] qlge: Remove irq_cnt

Commit Message

Benjamin Poirier June 17, 2019, 7:48 a.m. UTC
qlge uses an irq enable/disable refcounting scheme that is:
* poorly implemented
	Uses a spin_lock to protect accesses to the irq_cnt atomic variable
* buggy
	Breaks when there is not a 1:1 sequence of irq - napi_poll, such as
	when using SO_BUSY_POLL.
* unnecessary
	The purpose of irq_cnt is to reduce irq control writes when
	multiple work items result from one irq: the irq is re-enabled
	after all work is done.
	Analysis of the irq handler shows that there is only one case where
	there might be two workers scheduled at once, and those have
	separate irq masking bits.

Therefore, remove irq_cnt.
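
For illustration, the removed scheme in condensed form (paraphrased from
the ql_enable/ql_disable_completion_interrupt hunks below; locking and
the MSI-X special cases are elided):

	/* disable, called from the irq handler */
	if (!atomic_read(&ctx->irq_cnt))
		ql_write32(qdev, INTR_EN, ctx->intr_dis_mask);
	atomic_inc(&ctx->irq_cnt);

	/* enable, called when napi polling is done */
	if (atomic_dec_and_test(&ctx->irq_cnt))
		ql_write32(qdev, INTR_EN, ctx->intr_en_mask);

With SO_BUSY_POLL, napi_poll can run without a preceding irq, so the
decrement has no matching increment; irq_cnt drifts and the register
writes stop tracking the real masking state of the hardware.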

Additionally, we get a performance improvement:
perf stat -e cycles -a -r5 super_netperf 100 -H 192.168.33.1 -t TCP_RR

Before:
628560
628056
622103
622744
627202
[...]
   268,803,947,669      cycles                 ( +-  0.09% )

After:
636300
634106
634984
638555
634188
[...]
   259,237,291,449      cycles                 ( +-  0.19% )

Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
---
 drivers/net/ethernet/qlogic/qlge/qlge.h      |  7 --
 drivers/net/ethernet/qlogic/qlge/qlge_main.c | 98 ++++++--------------
 drivers/net/ethernet/qlogic/qlge/qlge_mpi.c  |  1 -
 3 files changed, 27 insertions(+), 79 deletions(-)

Comments

Manish Chopra June 17, 2019, 4:49 p.m. UTC | #1
> -----Original Message-----
> From: Benjamin Poirier <bpoirier@suse.com>
> Sent: Monday, June 17, 2019 1:19 PM
> To: Manish Chopra <manishc@marvell.com>; GR-Linux-NIC-Dev <GR-Linux-
> NIC-Dev@marvell.com>; netdev@vger.kernel.org
> Subject: [PATCH net-next 01/16] qlge: Remove irq_cnt
> 
> [...]

Hello Benjamin, 

Just FYI. I am OOO for a week, so reviewing and testing these patches will take time.

Thanks,
Manish
Manish Chopra June 26, 2019, 8:59 a.m. UTC | #2
> -----Original Message-----
> From: Benjamin Poirier <bpoirier@suse.com>
> Sent: Monday, June 17, 2019 1:19 PM
> To: Manish Chopra <manishc@marvell.com>; GR-Linux-NIC-Dev <GR-Linux-
> NIC-Dev@marvell.com>; netdev@vger.kernel.org
> Subject: [PATCH net-next 01/16] qlge: Remove irq_cnt
> 
> qlge uses an irq enable/disable refcounting scheme that is:
> * poorly implemented
> 	Uses a spin_lock to protect accesses to the irq_cnt atomic variable
> * buggy
> 	Breaks when there is not a 1:1 sequence of irq - napi_poll, such as
> 	when using SO_BUSY_POLL.
> * unnecessary
> 	The purpose of irq_cnt is to reduce irq control writes when
> 	multiple work items result from one irq: the irq is re-enabled
> 	after all work is done.
> 	Analysis of the irq handler shows that there is only one case where
> 	there might be two workers scheduled at once, and those have
> 	separate irq masking bits.

I believe you are talking here about the ASIC error recovery worker and the MPI worker.
Which separate IRQ masking bits are you referring to here?

>  static int ql_validate_flash(struct ql_adapter *qdev, u32 size, const char *str)
> @@ -2500,21 +2451,22 @@ static irqreturn_t qlge_isr(int irq, void *dev_id)
>  	u32 var;
>  	int work_done = 0;
> 
> -	spin_lock(&qdev->hw_lock);
> -	if (atomic_read(&qdev->intr_context[0].irq_cnt)) {
> -		netif_printk(qdev, intr, KERN_DEBUG, qdev->ndev,
> -			     "Shared Interrupt, Not ours!\n");
> -		spin_unlock(&qdev->hw_lock);
> -		return IRQ_NONE;
> -	}
> -	spin_unlock(&qdev->hw_lock);
> +	/* Experience shows that when using INTx interrupts, the device does
> +	 * not always auto-mask the interrupt.
> +	 * When using MSI mode, the interrupt must be explicitly disabled
> +	 * (even though it is auto-masked), otherwise a later command to
> +	 * enable it is not effective.
> +	 */
> +	if (!test_bit(QL_MSIX_ENABLED, &qdev->flags))
> +		ql_disable_completion_interrupt(qdev, 0);

The current code disables the completion interrupt in the case of the MSI-X zeroth vector.
This change will break that behavior. Shouldn't the zeroth vector in the MSI-X case also disable the completion interrupt?
If not, please explain why?

> 
> -	var = ql_disable_completion_interrupt(qdev, intr_context->intr);
> +	var = ql_read32(qdev, STS);
> 
>  	/*
>  	 * Check for fatal error.
>  	 */
>  	if (var & STS_FE) {
> +		ql_disable_completion_interrupt(qdev, 0);

Why does it need to be done again here, if before this it already disables the completion interrupt for the INTx case and the MSI-X zeroth vector case?

>  		ql_queue_asic_error(qdev);
>  		netdev_err(qdev->ndev, "Got fatal error, STS = %x.\n", var);
>  		var = ql_read32(qdev, ERR_STS);
> @@ -2534,7 +2486,6 @@ static irqreturn_t qlge_isr(int irq, void *dev_id)
>  		 */
>  		netif_err(qdev, intr, qdev->ndev,
>  			  "Got MPI processor interrupt.\n");
> -		ql_disable_completion_interrupt(qdev, intr_context->intr);

Why is disabling the interrupt not required here? While in the fatal error case above, ql_disable_completion_interrupt() is being called.
Also, in the case of the MSI-X zeroth vector it will not disable the completion interrupt, but at the end of qlge_isr() it will enable the completion interrupt.
It seems like disabling and enabling might not be in sync in the case of the MSI-X zeroth vector.
Benjamin Poirier June 26, 2019, 11:36 a.m. UTC | #3
On 2019/06/26 08:59, Manish Chopra wrote:
> > -----Original Message-----
> > From: Benjamin Poirier <bpoirier@suse.com>
> > Sent: Monday, June 17, 2019 1:19 PM
> > To: Manish Chopra <manishc@marvell.com>; GR-Linux-NIC-Dev <GR-Linux-
> > NIC-Dev@marvell.com>; netdev@vger.kernel.org
> > Subject: [PATCH net-next 01/16] qlge: Remove irq_cnt
> > 
> > qlge uses an irq enable/disable refcounting scheme that is:
> > * poorly implemented
> > 	Uses a spin_lock to protect accesses to the irq_cnt atomic variable
> > * buggy
> > 	Breaks when there is not a 1:1 sequence of irq - napi_poll, such as
> > 	when using SO_BUSY_POLL.
> > * unnecessary
> > 	The purpose of irq_cnt is to reduce irq control writes when
> > 	multiple work items result from one irq: the irq is re-enabled
> > 	after all work is done.
> > 	Analysis of the irq handler shows that there is only one case where
> > 	there might be two workers scheduled at once, and those have
> > 	separate irq masking bits.
> 
> I believe you are talking here about the ASIC error recovery worker and the MPI worker.
> Which separate IRQ masking bits are you referring to here?

INTR_EN with intr_dis_mask for completion interrupts
INTR_MASK bit INTR_MASK_PI for mpi interrupts
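
Concretely, the two writes (as they appear in this patch) touch disjoint
masking mechanisms:

	/* completion interrupts: per-context value/mask in INTR_EN */
	ql_write32(qdev, INTR_EN, ctx->intr_dis_mask);

	/* MPI interrupt: INTR_MASK_PI bit in INTR_MASK */
	ql_write32(qdev, INTR_MASK, (INTR_MASK_PI << 16));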

> >  static int ql_validate_flash(struct ql_adapter *qdev, u32 size, const char *str)
> > @@ -2500,21 +2451,22 @@ static irqreturn_t qlge_isr(int irq, void *dev_id)
> >  	u32 var;
> >  	int work_done = 0;
> > 
> > -	spin_lock(&qdev->hw_lock);
> > -	if (atomic_read(&qdev->intr_context[0].irq_cnt)) {
> > -		netif_printk(qdev, intr, KERN_DEBUG, qdev->ndev,
> > -			     "Shared Interrupt, Not ours!\n");
> > -		spin_unlock(&qdev->hw_lock);
> > -		return IRQ_NONE;
> > -	}
> > -	spin_unlock(&qdev->hw_lock);
> > +	/* Experience shows that when using INTx interrupts, the device does
> > +	 * not always auto-mask the interrupt.
> > +	 * When using MSI mode, the interrupt must be explicitly disabled
> > +	 * (even though it is auto-masked), otherwise a later command to
> > +	 * enable it is not effective.
> > +	 */
> > +	if (!test_bit(QL_MSIX_ENABLED, &qdev->flags))
> > +		ql_disable_completion_interrupt(qdev, 0);
> 
> The current code disables the completion interrupt in the case of the MSI-X zeroth vector.
> This change will break that behavior. Shouldn't the zeroth vector in the MSI-X case also disable the completion interrupt?
> If not, please explain why?

In msix mode there's no need to explicitly disable completion
interrupts, they are reliably auto-masked, according to my observations.
I tested this on two QLE8142 adapters.

Do you have reason to believe this might not always be the case?

> 
> > 
> > -	var = ql_disable_completion_interrupt(qdev, intr_context->intr);
> > +	var = ql_read32(qdev, STS);
> > 
> >  	/*
> >  	 * Check for fatal error.
> >  	 */
> >  	if (var & STS_FE) {
> > +		ql_disable_completion_interrupt(qdev, 0);
> 
> Why does it need to be done again here, if before this it already disables the completion interrupt for the INTx case and the MSI-X zeroth vector case?

I couldn't test this code path, so I preserved the original behavior.

> 
> >  		ql_queue_asic_error(qdev);
> >  		netdev_err(qdev->ndev, "Got fatal error, STS = %x.\n", var);
> >  		var = ql_read32(qdev, ERR_STS);
> > @@ -2534,7 +2486,6 @@ static irqreturn_t qlge_isr(int irq, void *dev_id)
> >  		 */
> >  		netif_err(qdev, intr, qdev->ndev,
> >  			  "Got MPI processor interrupt.\n");
> > -		ql_disable_completion_interrupt(qdev, intr_context->intr);
> 
> Why is disabling the interrupt not required here?

The interrupt source _is_ masked:
		ql_write32(qdev, INTR_MASK, (INTR_MASK_PI << 16));

> While in the fatal error case above, ql_disable_completion_interrupt() is being called.
> Also, in the case of the MSI-X zeroth vector it will not disable the completion interrupt, but at the end of qlge_isr() it will enable the completion interrupt.
> It seems like disabling and enabling might not be in sync in the case of the MSI-X zeroth vector.

I guess you forgot to consider that completion interrupts are
auto-masked in msix mode.
Manish Chopra June 26, 2019, 1:21 p.m. UTC | #4
> In msix mode there's no need to explicitly disable completion interrupts, they
> are reliably auto-masked, according to my observations.
> I tested this on two QLE8142 adapters.
> 
> Do you have reason to believe this might not always be the case?

How did you check the auto-masking of MSI-X interrupts?
I was just wondering about the below comment in ql_disable_completion_interrupt(), where for MSI-X it does disable the completion interrupt for the zeroth vector.
It seems the zeroth vector in MSI-X is a special case particular to this device.

        /* HW disables for us if we're MSIX multi interrupts and
         * it's not the default (zeroeth) interrupt.
         */
        if (likely(test_bit(QL_MSIX_ENABLED, &qdev->flags) && intr))
                return 0;
Benjamin Poirier July 5, 2019, 8:33 a.m. UTC | #5
On 2019/06/26 13:21, Manish Chopra wrote:
> > In msix mode there's no need to explicitly disable completion interrupts, they
> > are reliably auto-masked, according to my observations.
> > I tested this on two QLE8142 adapters.
> > 
> > Do you have reason to believe this might not always be the case?
> 
> How did you check the auto-masking of MSI-X interrupts?
> I was just wondering about the below comment in ql_disable_completion_interrupt(), where for MSI-X it does disable the completion interrupt for the zeroth vector.
> It seems the zeroth vector in MSI-X is a special case particular to this device.
> 
>         /* HW disables for us if we're MSIX multi interrupts and
>          * it's not the default (zeroeth) interrupt.
>          */
>         if (likely(test_bit(QL_MSIX_ENABLED, &qdev->flags) && intr))
>                 return 0;
> 

I checked again and arrived at the same conclusion: in msix mode,
completion interrupts are masked automatically and the adapter does not
raise interrupts until they are enabled at the end of napi polling. That
includes queue 0.

I checked by adding some tracepoints and sending traffic using pktgen.
All udp traffic goes to queue 0 with qlge. Over a 100s interval I got
2970339 q0 interrupts. In all cases, INTR_EN_EN was unset for q0.
Moreover, there were no interrupts that were raised while we were sure
that interrupts were expected to be disabled. I also tested with icmp
and multiple streams of tcp traffic and got similar results.

The driver patch for tracing as well as the analysis script are at the
bottom of this mail. I use them like so:
root@dtest:~# trace-cmd record -C global -b 1000000 -s 1000000 -e qlge:compirq_* -f "intr == 0" -e qlge:q0_intr sleep 100
[...]
root@dtest:~# trace-cmd report -l | ./report.awk | awk '{print $1}' | sort | uniq -c

It took me a few days to reply because while doing that testing I
actually found another problem, one that predates this patch set. In
INTx mode, ql_disable_completion_interrupt() does not immediately
prevent the adapter from raising interrupts. Redoing a test similar to
the previous one while forcing INTx mode via qlge_irq_type=2, I get
something like this:
4966280 0x00004300
   6565 0x0000c300
 137749 def_bad
   7094 ISR1_0

First, we can see what I already wrote in this patch:
+	/* Experience shows that when using INTx interrupts, the device does
+	 * not always auto-mask the interrupt.
(The 0x0000c300 values include INTR_EN_EN)
Second, we can see 137749 instances of interrupts while we were
expecting interrupt generation to be disabled.

If I disable interrupts using INTR_EN_EI instead, I get something like
this:
4672919 0x00004300
     75 0x0000c300
      2 ISR1_0

I'll be including a patch for this in the next iteration of this
patchset.
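
Concretely, I expect the disable to look something like the sketch
below, mirroring the existing ql_disable_interrupts(); the exact form
may still change for v2:

	static void ql_disable_completion_interrupt(struct ql_adapter *qdev,
						    u32 intr)
	{
		/* Mask via the global INTR_EN_EI bit (written to the mask
		 * half of the register), like ql_disable_interrupts() does,
		 * instead of the per-context intr_dis_mask.
		 */
		ql_write32(qdev, INTR_EN, (INTR_EN_EI << 16));
	}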

==== report.awk ====
#!/usr/bin/awk -f

BEGIN {
	enabled = -1;
}

/compirq_enable_b/ {
	enabled = 1;
	next;
}

/compirq_enable_a/ {
	enabled = 2;
	next;
}

/q0_intr/ {
	# INTR_EN
	print $10;

	if ($14 == "0x00000000") {
		print "ISR1_0";
	}

	if (enabled == 0) {
		printf "def_bad "
		print $3;
	} else if (enabled == 1) {
		printf "maybe_bad "
		print $3;
	}
	# at this point we expect the irq to be masked, either automatically
	# or explicitly
	enabled = 0;
	next;
}

==== driver patch ====

diff --git a/drivers/net/ethernet/qlogic/qlge/qlge_main.c b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
index 9a99e0938f08..ab306963eef1 100644
--- a/drivers/net/ethernet/qlogic/qlge/qlge_main.c
+++ b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
@@ -43,6 +43,9 @@
 
 #include "qlge.h"
 
+#define CREATE_TRACE_POINTS
+#include "qlge_trace.h"
+
 char qlge_driver_name[] = DRV_NAME;
 const char qlge_driver_version[] = DRV_VERSION;
 
@@ -641,16 +644,20 @@ u32 ql_enable_completion_interrupt(struct ql_adapter *qdev, u32 intr)
 		/* Always enable if we're MSIX multi interrupts and
 		 * it's not the default (zeroeth) interrupt.
 		 */
+		trace_compirq_enable_b(qdev, intr);
 		ql_write32(qdev, INTR_EN,
 			   ctx->intr_en_mask);
+		trace_compirq_enable_a(qdev, intr);
 		var = ql_read32(qdev, STS);
 		return var;
 	}
 
 	spin_lock_irqsave(&qdev->hw_lock, hw_flags);
 	if (atomic_dec_and_test(&ctx->irq_cnt)) {
+		trace_compirq_enable_b(qdev, intr);
 		ql_write32(qdev, INTR_EN,
 			   ctx->intr_en_mask);
+		trace_compirq_enable_a(qdev, intr);
 		var = ql_read32(qdev, STS);
 	}
 	spin_unlock_irqrestore(&qdev->hw_lock, hw_flags);
@@ -671,8 +678,10 @@ static u32 ql_disable_completion_interrupt(struct ql_adapter *qdev, u32 intr)
 	ctx = qdev->intr_context + intr;
 	spin_lock(&qdev->hw_lock);
 	if (!atomic_read(&ctx->irq_cnt)) {
+		trace_compirq_disable_b(qdev, intr);
 		ql_write32(qdev, INTR_EN,
 		ctx->intr_dis_mask);
+		trace_compirq_disable_a(qdev, intr);
 		var = ql_read32(qdev, STS);
 	}
 	atomic_inc(&ctx->irq_cnt);
@@ -2484,6 +2493,7 @@ static irqreturn_t qlge_msix_rx_isr(int irq, void *dev_id)
 {
 	struct rx_ring *rx_ring = dev_id;
 	napi_schedule(&rx_ring->napi);
+	trace_napi_schedule(&rx_ring->napi);
 	return IRQ_HANDLED;
 }
 
@@ -2500,6 +2510,8 @@ static irqreturn_t qlge_isr(int irq, void *dev_id)
 	u32 var;
 	int work_done = 0;
 
+	trace_q0_intr(qdev, ql_read32(qdev, STS), ql_read32(qdev, ISR1));
+
 	spin_lock(&qdev->hw_lock);
 	if (atomic_read(&qdev->intr_context[0].irq_cnt)) {
 		netif_printk(qdev, intr, KERN_DEBUG, qdev->ndev,
@@ -2552,6 +2564,7 @@ static irqreturn_t qlge_isr(int irq, void *dev_id)
 			   "Waking handler for rx_ring[0].\n");
 		ql_disable_completion_interrupt(qdev, intr_context->intr);
 		napi_schedule(&rx_ring->napi);
+		trace_napi_schedule(&rx_ring->napi);
 		work_done++;
 	}
 	ql_enable_completion_interrupt(qdev, intr_context->intr);
diff --git a/drivers/net/ethernet/qlogic/qlge/qlge_trace.h b/drivers/net/ethernet/qlogic/qlge/qlge_trace.h
new file mode 100644
index 000000000000..f199c6eb785c
--- /dev/null
+++ b/drivers/net/ethernet/qlogic/qlge/qlge_trace.h
@@ -0,0 +1,98 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#if !defined(_QLGE_TRACE_H_) || defined(TRACE_HEADER_MULTI_READ)
+#define _QLGE_TRACE_H_
+
+#include <linux/tracepoint.h>
+
+#include "qlge.h"
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM qlge
+#define TRACE_INCLUDE_FILE qlge_trace
+
+#define NO_DEV "(no_device)"
+
+TRACE_EVENT(napi_schedule,
+	    TP_PROTO(struct napi_struct *napi),
+	    TP_ARGS(napi),
+
+	    TP_STRUCT__entry(
+			     __field(struct napi_struct *, napi)
+			     __string(dev_name,
+				      napi->dev ? napi->dev->name : NO_DEV)
+			     __field(unsigned long, napi_state)
+			     ),
+
+	    TP_fast_assign(
+			   __entry->napi = napi;
+			   __assign_str(dev_name,
+					napi->dev ? napi->dev->name : NO_DEV);
+			   __entry->napi_state = READ_ONCE(napi->state);
+			   ),
+
+	    TP_printk("napi schedule on napi struct %p for device %s napi 0x%02lx\n",
+		      __entry->napi, __get_str(dev_name), __entry->napi_state)
+);
+
+DECLARE_EVENT_CLASS(compirq_template,
+		    TP_PROTO(struct ql_adapter *qdev, u32 intr),
+		    TP_ARGS(qdev, intr),
+
+		    TP_STRUCT__entry(
+				     __string(dev_name, qdev->ndev->name)
+				     __field(unsigned int, intr)
+				     ),
+
+		    TP_fast_assign(
+				   __assign_str(dev_name, qdev->ndev->name);
+				   __entry->intr = intr;
+				   ),
+
+		    TP_printk("completion irq toggle device %s intr %d\n",
+			      __get_str(dev_name), __entry->intr)
+);
+
+DEFINE_EVENT(compirq_template, compirq_enable_b,
+	     TP_PROTO(struct ql_adapter *qdev, u32 intr),
+	     TP_ARGS(qdev, intr));
+DEFINE_EVENT(compirq_template, compirq_enable_a,
+	     TP_PROTO(struct ql_adapter *qdev, u32 intr),
+	     TP_ARGS(qdev, intr));
+
+DEFINE_EVENT(compirq_template, compirq_disable_b,
+	     TP_PROTO(struct ql_adapter *qdev, u32 intr),
+	     TP_ARGS(qdev, intr));
+DEFINE_EVENT(compirq_template, compirq_disable_a,
+	     TP_PROTO(struct ql_adapter *qdev, u32 intr),
+	     TP_ARGS(qdev, intr));
+
+TRACE_EVENT(q0_intr,
+	    TP_PROTO(struct ql_adapter *qdev, u32 sts, u32 isr1),
+	    TP_ARGS(qdev, sts, isr1),
+
+	    TP_STRUCT__entry(
+		    __string(dev_name, qdev->ndev->name)
+		    __field(unsigned int, intr_en)
+		    __field(unsigned int, sts)
+		    __field(unsigned int, isr1)
+		    ),
+
+	    TP_fast_assign(
+		    __assign_str(dev_name, qdev->ndev->name);
+		    ql_write32(qdev, INTR_EN, qdev->intr_context[0].intr_read_mask);
+		    __entry->intr_en = ql_read32(qdev, INTR_EN);
+		    __entry->sts = sts;
+		    __entry->isr1 = isr1;
+		    ),
+
+	    TP_printk("interrupt for dev %s INTR_EN 0x%08x STS 0x%08x ISR1 0x%08x\n",
+		      __get_str(dev_name), __entry->intr_en, __entry->sts,
+		      __entry->isr1)
+);
+
+#endif /* _QLGE_TRACE_H_ */
+
+/* This part must be outside protection */
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH ../../drivers/net/ethernet/qlogic/qlge
+#include <trace/define_trace.h>
Benjamin Poirier July 15, 2019, 1:40 a.m. UTC | #6
On 2019/06/17 16:48, Benjamin Poirier wrote:
> [...]

David, Greg,

Before I send v2 of this patchset, how about moving this driver out to
staging? The hardware has been declared EOL by the vendor but what's
more relevant to the Linux kernel is that the quality of this driver is
not on par with many other mainline drivers, IMO. Over in staging, the
code might benefit from the attention of interested parties (drive-by
fixers) or fade away into obscurity.

Now that I check, the code has >500 basic checkpatch issues. While
working on the rx processing code for the current patchset, I noticed
the following more structural issues:
* commit 7c734359d350 ("qlge: Size RX buffers based on MTU.",
  v2.6.33-rc1) introduced dead code in the receive routines, which
  should be rewritten anyways by the admission of the author himself
  (see the comment above ql_build_rx_skb())
* truesize accounting is incorrect (ex: a 9000B frame has truesize 10280
  while containing two frags of order-1 allocations, ie. >16K)
* in some cases the driver allocates skbs with head room but only puts
  data in the frags
* there is an inordinate amount of disparate debugging code, most of
  which is of questionable value. In particular, qlge_dbg.c has hundreds
  of lines of code bitrotting away in ifdef land (doesn't compile since
  commit 18c49b91777c ("qlge: do vlan cleanup", v3.1-rc1), 8 years ago).
* triggering an ethtool regdump will instead hexdump a 176k struct to
  dmesg depending on some module parameters
* many structures are defined full of holes, as we found in this
  thread already; are used inefficiently (struct rx_ring is used for rx
  and tx completions, with some members relevant to one case only); are
  initialized redundantly (ex. memset 0 after alloc_etherdev). The
  driver also has a habit of using runtime checks where compile time
  checks are possible.
* in terms of namespace, the driver uses either qlge_, ql_ (used by
  other qlogic drivers, with clashes, ex: ql_sem_spinlock) or nothing (with
  clashes, ex: struct ob_mac_iocb_req)

I'm guessing other parts of the driver have more issues of the same
nature. The driver also has many smaller issues like deprecated apis,
duplicate or useless comments, weird code structure (some "while" loops
that could be rewritten with simple "for", ex. ql_wait_reg_rdy()) and
weird line wrapping (all over, ex. the ql_set_routing_reg() calls in
qlge_set_multicast_list()).

I'm aware of some precedents for code moving from mainline to staging:
1bb8155080c6 ncpfs: move net/ncpfs to drivers/staging/ncpfs (v4.16-rc1)
6c391ff758eb irda: move drivers/net/irda to drivers/staging/irda/drivers
(v4.14-rc1)

What do you think is the best course of action in this case?

Thanks,
-Benjamin
Greg Kroah-Hartman July 15, 2019, 9:48 a.m. UTC | #7
On Mon, Jul 15, 2019 at 10:40:16AM +0900, Benjamin Poirier wrote:
> On 2019/06/17 16:48, Benjamin Poirier wrote:
> > [...]
> 
> David, Greg,
> 
> Before I send v2 of this patchset, how about moving this driver out to
> staging? The hardware has been declared EOL by the vendor but what's
> more relevant to the Linux kernel is that the quality of this driver is
> not on par with many other mainline drivers, IMO. Over in staging, the
> code might benefit from the attention of interested parties (drive-by
> fixers) or fade away into obscurity.
> 
> Now that I check, the code has >500 basic checkpatch issues. While
> working on the rx processing code for the current patchset, I noticed
> the following more structural issues:
> * commit 7c734359d350 ("qlge: Size RX buffers based on MTU.",
>   v2.6.33-rc1) introduced dead code in the receive routines, which
>   should be rewritten anyways by the admission of the author himself
>   (see the comment above ql_build_rx_skb())
> * truesize accounting is incorrect (ex: a 9000B frame has truesize 10280
>   while containing two frags of order-1 allocations, ie. >16K)
> * in some cases the driver allocates skbs with head room but only puts
>   data in the frags
> * there is an inordinate amount of disparate debugging code, most of
>   which is of questionable value. In particular, qlge_dbg.c has hundreds
>   of lines of code bitrotting away in ifdef land (doesn't compile since
>   commit 18c49b91777c ("qlge: do vlan cleanup", v3.1-rc1), 8 years ago).
> * triggering an ethtool regdump will instead hexdump a 176k struct to
>   dmesg depending on some module parameters
> * many structures are defined full of holes, as we found in this
>   thread already; are used inefficiently (struct rx_ring is used for rx
>   and tx completions, with some members relevant to one case only); are
>   initialized redundantly (ex. memset 0 after alloc_etherdev). The
>   driver also has a habit of using runtime checks where compile time
>   checks are possible.
> * in terms of namespace, the driver uses either qlge_, ql_ (used by
>   other qlogic drivers, with clashes, ex: ql_sem_spinlock) or nothing (with
>   clashes, ex: struct ob_mac_iocb_req)
> 
> I'm guessing other parts of the driver have more issues of the same
> nature. The driver also has many smaller issues like deprecated apis,
> duplicate or useless comments, weird code structure (some "while" loops
> that could be rewritten with simple "for", ex. ql_wait_reg_rdy()) and
> weird line wrapping (all over, ex. the ql_set_routing_reg() calls in
> qlge_set_multicast_list()).
> 
> I'm aware of some precedents for code moving from mainline to staging:
> 1bb8155080c6 ncpfs: move net/ncpfs to drivers/staging/ncpfs (v4.16-rc1)
> 6c391ff758eb irda: move drivers/net/irda to drivers/staging/irda/drivers
> (v4.14-rc1)
> 
> What do you think is the best course of action in this case?

Feel free to move it to staging if you want it removed from the tree in
a few releases if no one is willing to do the work to keep it alive.  If
someone comes along later, it's a trivial revert to add it back.

So send a patch moving it, with all of the information you listed above
in a TODO file for the directory, and I'll be glad to take it.

thanks,

greg k-h

Patch

diff --git a/drivers/net/ethernet/qlogic/qlge/qlge.h b/drivers/net/ethernet/qlogic/qlge/qlge.h
index ad7c5eb8a3b6..5d9a36deda08 100644
--- a/drivers/net/ethernet/qlogic/qlge/qlge.h
+++ b/drivers/net/ethernet/qlogic/qlge/qlge.h
@@ -1982,11 +1982,6 @@  struct intr_context {
 	u32 intr_dis_mask;	/* value/mask used to disable this intr */
 	u32 intr_read_mask;	/* value/mask used to read this intr */
 	char name[IFNAMSIZ * 2];
-	atomic_t irq_cnt;	/* irq_cnt is used in single vector
-				 * environment.  It's incremented for each
-				 * irq handler that is scheduled.  When each
-				 * handler finishes it decrements irq_cnt and
-				 * enables interrupts if it's zero. */
 	irq_handler_t handler;
 };
 
@@ -2074,7 +2069,6 @@  struct ql_adapter {
 	u32 port;		/* Port number this adapter */
 
 	spinlock_t adapter_lock;
-	spinlock_t hw_lock;
 	spinlock_t stats_lock;
 
 	/* PCI Bus Relative Register Addresses */
@@ -2235,7 +2229,6 @@  void ql_mpi_reset_work(struct work_struct *work);
 void ql_mpi_core_to_log(struct work_struct *work);
 int ql_wait_reg_rdy(struct ql_adapter *qdev, u32 reg, u32 bit, u32 ebit);
 void ql_queue_asic_error(struct ql_adapter *qdev);
-u32 ql_enable_completion_interrupt(struct ql_adapter *qdev, u32 intr);
 void ql_set_ethtool_ops(struct net_device *ndev);
 int ql_read_xgmac_reg64(struct ql_adapter *qdev, u32 reg, u64 *data);
 void ql_mpi_idc_work(struct work_struct *work);
diff --git a/drivers/net/ethernet/qlogic/qlge/qlge_main.c b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
index 6cae33072496..0bfbe11db795 100644
--- a/drivers/net/ethernet/qlogic/qlge/qlge_main.c
+++ b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
@@ -625,75 +625,26 @@  static void ql_disable_interrupts(struct ql_adapter *qdev)
 	ql_write32(qdev, INTR_EN, (INTR_EN_EI << 16));
 }
 
-/* If we're running with multiple MSI-X vectors then we enable on the fly.
- * Otherwise, we may have multiple outstanding workers and don't want to
- * enable until the last one finishes. In this case, the irq_cnt gets
- * incremented every time we queue a worker and decremented every time
- * a worker finishes.  Once it hits zero we enable the interrupt.
- */
-u32 ql_enable_completion_interrupt(struct ql_adapter *qdev, u32 intr)
+static void ql_enable_completion_interrupt(struct ql_adapter *qdev, u32 intr)
 {
-	u32 var = 0;
-	unsigned long hw_flags = 0;
-	struct intr_context *ctx = qdev->intr_context + intr;
-
-	if (likely(test_bit(QL_MSIX_ENABLED, &qdev->flags) && intr)) {
-		/* Always enable if we're MSIX multi interrupts and
-		 * it's not the default (zeroeth) interrupt.
-		 */
-		ql_write32(qdev, INTR_EN,
-			   ctx->intr_en_mask);
-		var = ql_read32(qdev, STS);
-		return var;
-	}
+	struct intr_context *ctx = &qdev->intr_context[intr];
 
-	spin_lock_irqsave(&qdev->hw_lock, hw_flags);
-	if (atomic_dec_and_test(&ctx->irq_cnt)) {
-		ql_write32(qdev, INTR_EN,
-			   ctx->intr_en_mask);
-		var = ql_read32(qdev, STS);
-	}
-	spin_unlock_irqrestore(&qdev->hw_lock, hw_flags);
-	return var;
+	ql_write32(qdev, INTR_EN, ctx->intr_en_mask);
 }
 
-static u32 ql_disable_completion_interrupt(struct ql_adapter *qdev, u32 intr)
+static void ql_disable_completion_interrupt(struct ql_adapter *qdev, u32 intr)
 {
-	u32 var = 0;
-	struct intr_context *ctx;
+	struct intr_context *ctx = &qdev->intr_context[intr];
 
-	/* HW disables for us if we're MSIX multi interrupts and
-	 * it's not the default (zeroeth) interrupt.
-	 */
-	if (likely(test_bit(QL_MSIX_ENABLED, &qdev->flags) && intr))
-		return 0;
-
-	ctx = qdev->intr_context + intr;
-	spin_lock(&qdev->hw_lock);
-	if (!atomic_read(&ctx->irq_cnt)) {
-		ql_write32(qdev, INTR_EN,
-		ctx->intr_dis_mask);
-		var = ql_read32(qdev, STS);
-	}
-	atomic_inc(&ctx->irq_cnt);
-	spin_unlock(&qdev->hw_lock);
-	return var;
+	ql_write32(qdev, INTR_EN, ctx->intr_dis_mask);
 }
 
 static void ql_enable_all_completion_interrupts(struct ql_adapter *qdev)
 {
 	int i;
-	for (i = 0; i < qdev->intr_count; i++) {
-		/* The enable call does a atomic_dec_and_test
-		 * and enables only if the result is zero.
-		 * So we precharge it here.
-		 */
-		if (unlikely(!test_bit(QL_MSIX_ENABLED, &qdev->flags) ||
-			i == 0))
-			atomic_set(&qdev->intr_context[i].irq_cnt, 1);
-		ql_enable_completion_interrupt(qdev, i);
-	}
 
+	for (i = 0; i < qdev->intr_count; i++)
+		ql_enable_completion_interrupt(qdev, i);
 }
 
 static int ql_validate_flash(struct ql_adapter *qdev, u32 size, const char *str)
@@ -2500,21 +2451,22 @@  static irqreturn_t qlge_isr(int irq, void *dev_id)
 	u32 var;
 	int work_done = 0;
 
-	spin_lock(&qdev->hw_lock);
-	if (atomic_read(&qdev->intr_context[0].irq_cnt)) {
-		netif_printk(qdev, intr, KERN_DEBUG, qdev->ndev,
-			     "Shared Interrupt, Not ours!\n");
-		spin_unlock(&qdev->hw_lock);
-		return IRQ_NONE;
-	}
-	spin_unlock(&qdev->hw_lock);
+	/* Experience shows that when using INTx interrupts, the device does
+	 * not always auto-mask the interrupt.
+	 * When using MSI mode, the interrupt must be explicitly disabled
+	 * (even though it is auto-masked), otherwise a later command to
+	 * enable it is not effective.
+	 */
+	if (!test_bit(QL_MSIX_ENABLED, &qdev->flags))
+		ql_disable_completion_interrupt(qdev, 0);
 
-	var = ql_disable_completion_interrupt(qdev, intr_context->intr);
+	var = ql_read32(qdev, STS);
 
 	/*
 	 * Check for fatal error.
 	 */
 	if (var & STS_FE) {
+		ql_disable_completion_interrupt(qdev, 0);
 		ql_queue_asic_error(qdev);
 		netdev_err(qdev->ndev, "Got fatal error, STS = %x.\n", var);
 		var = ql_read32(qdev, ERR_STS);
@@ -2534,7 +2486,6 @@  static irqreturn_t qlge_isr(int irq, void *dev_id)
 		 */
 		netif_err(qdev, intr, qdev->ndev,
 			  "Got MPI processor interrupt.\n");
-		ql_disable_completion_interrupt(qdev, intr_context->intr);
 		ql_write32(qdev, INTR_MASK, (INTR_MASK_PI << 16));
 		queue_delayed_work_on(smp_processor_id(),
 				qdev->workqueue, &qdev->mpi_work, 0);
@@ -2550,11 +2501,18 @@  static irqreturn_t qlge_isr(int irq, void *dev_id)
 	if (var & intr_context->irq_mask) {
 		netif_info(qdev, intr, qdev->ndev,
 			   "Waking handler for rx_ring[0].\n");
-		ql_disable_completion_interrupt(qdev, intr_context->intr);
 		napi_schedule(&rx_ring->napi);
 		work_done++;
+	} else {
+		/* Experience shows that the device sometimes signals an
+		 * interrupt but no work is scheduled from this function.
+		 * Nevertheless, the interrupt is auto-masked. Therefore, we
+		 * systematically re-enable the interrupt if we didn't
+		 * schedule napi.
+		 */
+		ql_enable_completion_interrupt(qdev, 0);
 	}
-	ql_enable_completion_interrupt(qdev, intr_context->intr);
+
 	return work_done ? IRQ_HANDLED : IRQ_NONE;
 }
 
@@ -3557,7 +3515,6 @@  static int ql_request_irq(struct ql_adapter *qdev)
 	ql_resolve_queues_to_irqs(qdev);
 
 	for (i = 0; i < qdev->intr_count; i++, intr_context++) {
-		atomic_set(&intr_context->irq_cnt, 0);
 		if (test_bit(QL_MSIX_ENABLED, &qdev->flags)) {
 			status = request_irq(qdev->msi_x_entry[i].vector,
 					     intr_context->handler,
@@ -4642,7 +4599,6 @@  static int ql_init_device(struct pci_dev *pdev, struct net_device *ndev,
 		goto err_out2;
 	}
 	qdev->msg_enable = netif_msg_init(debug, default_msg);
-	spin_lock_init(&qdev->hw_lock);
 	spin_lock_init(&qdev->stats_lock);
 
 	if (qlge_mpi_coredump) {
diff --git a/drivers/net/ethernet/qlogic/qlge/qlge_mpi.c b/drivers/net/ethernet/qlogic/qlge/qlge_mpi.c
index 957c72985a06..9e422bbbb6ab 100644
--- a/drivers/net/ethernet/qlogic/qlge/qlge_mpi.c
+++ b/drivers/net/ethernet/qlogic/qlge/qlge_mpi.c
@@ -1257,7 +1257,6 @@  void ql_mpi_work(struct work_struct *work)
 	/* End polled mode for MPI */
 	ql_write32(qdev, INTR_MASK, (INTR_MASK_PI << 16) | INTR_MASK_PI);
 	mutex_unlock(&qdev->mpi_mutex);
-	ql_enable_completion_interrupt(qdev, 0);
 }
 
 void ql_mpi_reset_work(struct work_struct *work)