[bisected,regression] e1000e: "Detected Hardware Unit Hang"

Message ID	5229621.KczjbIR22Q@storm
State	RFC, archived
Delegated to:	David Miller
Headers
Return-Path: <netdev-owner@vger.kernel.org>
From: Thomas Jarosch <thomas.jarosch@intra2net.com>
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: 'Linux Netdev List' <netdev@vger.kernel.org>, Eric Dumazet <edumazet@google.com>, Jeff Kirsher <jeffrey.t.kirsher@intel.com>, e1000-devel <e1000-devel@lists.sourceforge.net>
Subject: Re: [bisected regression] e1000e: "Detected Hardware Unit Hang"
Date: Thu, 15 Jan 2015 11:11:09 +0100
Message-ID: <5229621.KczjbIR22Q@storm>
Organization: Intra2net AG
User-Agent: KMail/4.14.3 (Linux/3.17.8-200.fc20.x86_64; KDE/4.14.3; x86_64; ; )
In-Reply-To: <1421256052.11734.22.camel@edumazet-glaptop2.roam.corp.google.com>
References: <1719052.SGOfRAJhfQ@storm> <1421256052.11734.22.camel@edumazet-glaptop2.roam.corp.google.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 7Bit
Content-Type: text/plain; charset="us-ascii"
Sender: netdev-owner@vger.kernel.org
Precedence: bulk

Thomas Jarosch Jan. 15, 2015, 10:11 a.m. UTC

On Wednesday, 14. January 2015 09:20:52 Eric Dumazet wrote:
> I would try to use lower data per txd. I am not sure 24KB is really
> supported.
> 
> ( check commit d821a4c4d11ad160925dab2bb009b8444beff484 for details)
> 
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> index e14fd85f64eb..8d973f7edfbd 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -3897,7 +3897,7 @@ void e1000e_reset(struct e1000_adapter *adapter)
>  	 * limit of 24KB due to receive synchronization limitations.
>  	 */
>  	adapter->tx_fifo_limit = min_t(u32, ((er32(PBA) >> 16) << 10) - 96,
> -				       24 << 10);
> +				       8 << 10);
> 
>  	/* Disable Adaptive Interrupt Moderation if 2 full packets cannot
>  	 * fit in receive buffer.

Thanks for checking!

I just tried that change on top of git f800c25 (git HEAD), same problem. 
Let's see what the Intel wizards come up with.

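What "works" is to decrease the page size in git HEAD, too:

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 85ab7d7..9f0ef97 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2108,7 +2108,7 @@ static inline void __skb_queue_purge(struct sk_buff_head *list)
                kfree_skb(skb);
 }
 
-#define NETDEV_FRAG_PAGE_MAX_ORDER get_order(32768)
+#define NETDEV_FRAG_PAGE_MAX_ORDER get_order(4096)
 #define NETDEV_FRAG_PAGE_MAX_SIZE  (PAGE_SIZE << NETDEV_FRAG_PAGE_MAX_ORDER)
 #define NETDEV_PAGECNT_MAX_BIAS           NETDEV_FRAG_PAGE_MAX_SIZE
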
When I try a page size of 8192, it starts failing again. I'll now run
a stress test with 4096 to see if the problem is really gone
or just happens more rarely.

Cheers,
Thomas

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
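
As a rough illustration of the arithmetic in the e1000e_reset() hunk quoted at the top of this message, the expression can be pulled out into a small userspace function. This is only a sketch: tx_fifo_limit() is a hypothetical helper, pba_reg stands in for whatever er32(PBA) returns on real hardware, and the register value used in the note below is made up for the example, not read from any adapter in this thread.

```c
#include <stdint.h>

/*
 * Mirrors the expression from the quoted hunk:
 *   min_t(u32, ((er32(PBA) >> 16) << 10) - 96, cap)
 * pba_reg is a stand-in for the PBA register value (assumption);
 * the shifts treat its upper 16 bits as a size in KB.
 */
static uint32_t tx_fifo_limit(uint32_t pba_reg, uint32_t cap_bytes)
{
    uint32_t fifo_bytes = ((pba_reg >> 16) << 10) - 96;

    return fifo_bytes < cap_bytes ? fifo_bytes : cap_bytes;
}
```

With a made-up register value of 0x0014000C (upper half 20, i.e. 20 KB), tx_fifo_limit(0x0014000C, 24 << 10) gives 20384 and tx_fifo_limit(0x0014000C, 8 << 10) gives 8192. In that hypothetical case the 8 KB cap from the test patch becomes the binding limit, whereas the original 24 KB cap does not.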

Eric Dumazet Jan. 15, 2015, 2:43 p.m. UTC | #1

On Thu, 2015-01-15 at 11:11 +0100, Thomas Jarosch wrote:
> On Wednesday, 14. January 2015 09:20:52 Eric Dumazet wrote:
> > I would try to use lower data per txd. I am not sure 24KB is really
> > supported.
> > 
> > ( check commit d821a4c4d11ad160925dab2bb009b8444beff484 for details)
> > 
> > diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> > index e14fd85f64eb..8d973f7edfbd 100644
> > --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> > +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> > @@ -3897,7 +3897,7 @@ void e1000e_reset(struct e1000_adapter *adapter)
> >  	 * limit of 24KB due to receive synchronization limitations.
> >  	 */
> >  	adapter->tx_fifo_limit = min_t(u32, ((er32(PBA) >> 16) << 10) - 96,
> > -				       24 << 10);
> > +				       8 << 10);
> > 
> >  	/* Disable Adaptive Interrupt Moderation if 2 full packets cannot
> >  	 * fit in receive buffer.
> 
> Thanks for checking!
> 
> I just tried that change on top of git f800c25 (git HEAD), same problem. 
> Let's see what the Intel wizards come up with.
> 
> What "works" is to decrease the page size in git HEAD, too:
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 85ab7d7..9f0ef97 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -2108,7 +2108,7 @@ static inline void __skb_queue_purge(struct sk_buff_head *list)
>                 kfree_skb(skb);
>  }
>  
> -#define NETDEV_FRAG_PAGE_MAX_ORDER get_order(32768)
> +#define NETDEV_FRAG_PAGE_MAX_ORDER get_order(4096)
>  #define NETDEV_FRAG_PAGE_MAX_SIZE  (PAGE_SIZE << NETDEV_FRAG_PAGE_MAX_ORDER)
>  #define NETDEV_PAGECNT_MAX_BIAS           NETDEV_FRAG_PAGE_MAX_SIZE
> 
> 
> 
> When I try a page size of 8192, it starts failing again. I'll now run
> a stress test with 4096 to see if the problem is really gone
> or just happens more rarely.

Sure, you basically reverted my patch.

You are not the first to report a problem caused by this patch.

This patch is known to have uncovered some driver bugs.

We are not going to revert it. We are going to fix the real bugs.

Thanks

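For reference, the effect of the NETDEV_FRAG_PAGE_MAX_ORDER change quoted above can be worked through in a few lines of C. This is a userspace sketch, not kernel code: sketch_get_order() is a hypothetical stand-in for the kernel's get_order(), and a 4 KiB PAGE_SIZE is assumed.

```c
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12                        /* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/* Smallest order such that (page size << order) covers `size` bytes. */
static int sketch_get_order(unsigned long size)
{
    int order = 0;

    while ((SKETCH_PAGE_SIZE << order) < size)
        order++;
    return order;
}

/* Same shape as NETDEV_FRAG_PAGE_MAX_SIZE in the quoted diff. */
static unsigned long sketch_frag_page_max_size(unsigned long size)
{
    return SKETCH_PAGE_SIZE << sketch_get_order(size);
}
```

Under the 4 KiB page assumption, 32768 maps to order 3 (eight contiguous pages), 8192 to order 1 and 4096 to order 0, so the workaround limits page fragments to single pages.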


Thomas Jarosch Jan. 15, 2015, 2:58 p.m. UTC | #2

On Thursday, 15. January 2015 06:43:29 Eric Dumazet wrote:
> > -#define NETDEV_FRAG_PAGE_MAX_ORDER get_order(32768)
> > +#define NETDEV_FRAG_PAGE_MAX_ORDER get_order(4096)
> > 
> >  #define NETDEV_FRAG_PAGE_MAX_SIZE  (PAGE_SIZE << NETDEV_FRAG_PAGE_MAX_ORDER)
> >  #define NETDEV_PAGECNT_MAX_BIAS           NETDEV_FRAG_PAGE_MAX_SIZE
> > 
> > When I try a page size of 8192, it starts failing again. I'll now run
> > a stress test with 4096 to see if the problem is really gone
> > or just happens more rarely.
> 
> Sure, you basically reverted my patch.
> 
> You are not the first to report a problem caused by this patch.
> 
> This patch is known to have uncovered some driver bugs.
> 
> We are not going to revert it. We are going to fix the real bugs.
> 
> Thanks

A colleague mentioned to me he saw the "Hardware Unit Hang" message every 
few days even running on kernel 3.4 (without your patch). Basically I'm 
testing now if that's still the case with 3.19-rc4+ or not.

I'm all for fixing the root cause. I'm just interested if the e1000e
hang can even be triggered when using a max frag page size of 4096.
So far it transferred 751.6 GiB without a hiccup.

Cheers,
Thomas


Kirsher, Jeffrey T Jan. 15, 2015, 2:59 p.m. UTC | #3

On Thu, 2015-01-15 at 06:43 -0800, Eric Dumazet wrote:
> On Thu, 2015-01-15 at 11:11 +0100, Thomas Jarosch wrote:
> > On Wednesday, 14. January 2015 09:20:52 Eric Dumazet wrote:
> > > I would try to use lower data per txd. I am not sure 24KB is really
> > > supported.
> > > 
> > > ( check commit d821a4c4d11ad160925dab2bb009b8444beff484 for details)
> > > 
> > > diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> > > index e14fd85f64eb..8d973f7edfbd 100644
> > > --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> > > +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> > > @@ -3897,7 +3897,7 @@ void e1000e_reset(struct e1000_adapter *adapter)
> > >  	 * limit of 24KB due to receive synchronization limitations.
> > >  	 */
> > >  	adapter->tx_fifo_limit = min_t(u32, ((er32(PBA) >> 16) << 10) - 96,
> > > -				       24 << 10);
> > > +				       8 << 10);
> > > 
> > >  	/* Disable Adaptive Interrupt Moderation if 2 full packets cannot
> > >  	 * fit in receive buffer.
> > 
> > Thanks for checking!
> > 
> > I just tried that change on top of git f800c25 (git HEAD), same problem. 
> > Let's see what the Intel wizards come up with.
> > 
> > What "works" is to decrease the page size in git HEAD, too:
> > 
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 85ab7d7..9f0ef97 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -2108,7 +2108,7 @@ static inline void __skb_queue_purge(struct sk_buff_head *list)
> >                 kfree_skb(skb);
> >  }
> >  
> > -#define NETDEV_FRAG_PAGE_MAX_ORDER get_order(32768)
> > +#define NETDEV_FRAG_PAGE_MAX_ORDER get_order(4096)
> >  #define NETDEV_FRAG_PAGE_MAX_SIZE  (PAGE_SIZE << NETDEV_FRAG_PAGE_MAX_ORDER)
> >  #define NETDEV_PAGECNT_MAX_BIAS           NETDEV_FRAG_PAGE_MAX_SIZE
> > 
> > 
> > 
> > When I try a page size of 8192, it starts failing again. I'll now run
> > a stress test with 4096 to see if the problem is really gone
> > or just happens more rarely.
> 
> Sure, you basically reverted my patch.
> 
> You are not the first to report a problem caused by this patch.
> 
> This patch is known to have uncovered some driver bugs.
> 
> We are not going to revert it. We are going to fix the real bugs.
> 
> Thanks
> 
> 

Agreed, we are looking into the issue, Thomas.

Eric Dumazet Jan. 15, 2015, 3:25 p.m. UTC | #4

On Thu, 2015-01-15 at 15:58 +0100, Thomas Jarosch wrote:

> A colleague mentioned to me he saw the "Hardware Unit Hang" message every 
> few days even running on kernel 3.4 (without your patch). Basically I'm 
> testing now if that's still the case with 3.19-rc4+ or not.
> 
> I'm all for fixing the root cause. I'm just interested if the e1000e
> hang can even be triggered when using a max frag page size of 4096.
> So far it transferred 751.6 GiB without a hiccup.

You said it was a forwarding setup.

1) Which NIC is receiving the traffic?
2) What happens if you disable GRO on it?



Thomas Jarosch Jan. 15, 2015, 3:48 p.m. UTC | #5

On Thursday, 15. January 2015 07:25:32 Eric Dumazet wrote:
> On Thu, 2015-01-15 at 15:58 +0100, Thomas Jarosch wrote:
> > A colleague mentioned to me he saw the "Hardware Unit Hang" message
> > every
> > few days even running on kernel 3.4 (without your patch). Basically I'm
> > testing now if that's still the case with 3.19-rc4+ or not.
> > 
> > I'm all for fixing the root cause. I'm just interested if the e1000e
> > hang can even be triggered when using a max frag page size of 4096.
> > So far it transferred 751.6 GiB without a hiccup.
> 
> You told it was forwarding setup.
> 
> 1) What is the NIC receiving traffic.
> 2) What happens if you disable GRO on it ?

The setup is like this:

Win7 notebook (client)
    -> "private LAN" eth0 (e1000e)
        -> "external traffic" eth1 (r8169)

            -> local HTTP server in the intranet
               (2x e1000e using bonding)


Disabling gro on eth1 (r8169) seems to make eth0 (e1000e) stable.
As it usually hangs within seconds, it already transferred 28 GiB right now.

When I switch gro back on, it takes around three seconds until the hang.

Does that point into the right / any direction?

Thomas


Thomas Jarosch Jan. 19, 2015, 4:49 p.m. UTC | #6

On Thursday, 15. January 2015 07:25:32 Eric Dumazet wrote:
> On Thu, 2015-01-15 at 15:58 +0100, Thomas Jarosch wrote:
> > A colleague mentioned to me he saw the "Hardware Unit Hang" message
> > every
> > few days even running on kernel 3.4 (without your patch). Basically I'm
> > testing now if that's still the case with 3.19-rc4+ or not.
> > 
> > I'm all for fixing the root cause. I'm just interested if the e1000e
> > hang can even be triggered when using a max frag page size of 4096.
> > So far it transferred 751.6 GiB without a hiccup.
> 
> You told it was forwarding setup.
> 
> 1) What is the NIC receiving traffic.
> 2) What happens if you disable GRO on it ?

one more interesting thing happened: On one production machine,
again an Intel DH61CR board, the issue was triggered even with TSO disabled.
My colleague tried to disable GRO + GSO on the e1000e adapter, too,
though not on the other interfaces.

It's strange that the issue appears with TSO disabled;
disabling TSO worked for three other production-level machines.

We've emergency-installed the "4096" max frag page size workaround
for now as fifty people were a bit unhappy without network access... :D

Cheers,
Thomas


Thomas Jarosch Feb. 11, 2015, 11:23 a.m. UTC | #7

Hi Jeff,

On Thursday, 15. January 2015 06:59:13 Jeff Kirsher wrote:
> > Sure, you basically reverted my patch.
> > 
> > You are not the first to report a problem caused by this patch.
> > 
> > This patch is known to have uncovered some driver bugs.
> > 
> > We are not going to revert it. We are going to fix the real bugs.
> > 
> > Thanks
> 
> Agreed, we are looking into issue Thomas.

any news from the Intel labs what might be going on here?

We started seeing those hangs on "MSI B85M ECO" boards, too,
though it's way more sporadic there.

Thanks,
Thomas


Kirsher, Jeffrey T Feb. 11, 2015, 11:34 a.m. UTC | #8

On Wed, 2015-02-11 at 12:23 +0100, Thomas Jarosch wrote:
> Hi Jeff,
> 
> On Thursday, 15. January 2015 06:59:13 Jeff Kirsher wrote:
> > > Sure, you basically reverted my patch.
> > > 
> > > You are not the first to report a problem caused by this patch.
> > > 
> > > This patch is known to have uncovered some driver bugs.
> > > 
> > > We are not going to revert it. We are going to fix the real bugs.
> > > 
> > > Thanks
> > 
> > Agreed, we are looking into issue Thomas.
> 
> any news from the Intel labs what might be going on here?
> 
> We started seeing those hangs on "MSI B85M ECO" boards, too,
> though it's way more sporadic there.

I have not heard anything, so I have added Aaron Brown to see if he has
any additional information.

Thomas Jarosch Feb. 13, 2015, 4:14 p.m. UTC | #9

Hi Aaron,

On Thursday, 12. February 2015 23:28:27 Brown, Aaron F wrote:
> I do not have any real info.  I had been asked to try and reproduce some
> unit hangs (maybe for this) recently and did not succeed in producing
> them on the parts I have.  Reading through the thread I see this is
> showing up in a NAT environment.  The port that is getting the unit hang
> in the NAT system?

yes, the e1000e NIC is serving the NATed Windows client.

The setup was outlined here:

    http://marc.info/?l=linux-netdev&m=142133691713824&w=2

> I will make some attempts at replicating this with the port in a NAT and
> or forwarding role.  Has a bug been opened for this?  Or has information
> for this specific unit hang been entered into one of the other unit hang
> bugs opened against e1000e?

I didn't do anything(tm). This report sounds like the same issue:

    http://ehc.ac/p/e1000/bugs/378/

Oliver Wagner wrote the problem started to appear
after updating from kernel 3.5 to 3.8.0.35 (new frag size code).

I just noticed now he wrote he has two identical boxes:

---------------------------------------------------
- Box with symptoms: Router/Firewall, packet forwarding
  between different VLANs on eth0 and eth1
- Box without symptoms: Fileserver, eth0/eth1 bonded
  (VLANs used, but no forwarding)
---------------------------------------------------

So it looks like it's related to forwarding somehow,
I've had the same experience IIRC.

Cheers,
Thomas


Brown, Aaron F Feb. 21, 2015, 1:59 a.m. UTC | #10

> -----Original Message-----
> From: Thomas Jarosch [mailto:thomas.jarosch@intra2net.com]
> Sent: Friday, February 13, 2015 8:15 AM
> To: Brown, Aaron F
> Cc: Kirsher, Jeffrey T; 'Linux Netdev List'; Eric Dumazet; e1000-devel
> Subject: Re: [bisected regression] e1000e: "Detected Hardware Unit Hang"
> 
> Hi Aaron,
> 
> On Thursday, 12. February 2015 23:28:27 Brown, Aaron F wrote:
> > I do not have any real info.  I had been asked to try and reproduce some
> > unit hangs (maybe for this) recently and did not succeed in producing
> > them on the parts I have.  Reading through the thread I see this is
> > showing up in a NAT environment.  The port that is getting the unit hang
> > in the NAT system?
> 
> yes, the e1000e NIC is serving the NATed Windows client.
> 
> The setup was outlined here:
> 
>     http://marc.info/?l=linux-netdev&m=142133691713824&w=2
> 
> > I will make some attempts at replicating this with the port in a NAT and
> > or forwarding role.  Has a bug been opened for this?  Or has information
> > for this specific unit hang been entered into one of the other unit hang
> > bugs opened against e1000e?
> 
> I didn't do anything(tm). This report sounds like the same issue:
> 
>     http://ehc.ac/p/e1000/bugs/378/
> 
> Oliver Wagner wrote the problem started to appear
> after updating from kernel 3.5 to 3.8.0.35 (new frag size code).
> 
> I just noticed now he wrote he has two identical boxes:
> 
> ---------------------------------------------------
> - Box with symptoms: Router/Firewall, packet forwarding
>   between different VLANs on eth0 and eth1
> - Box without symptoms: Fileserver, eth0/eth1 bonded
>   (VLANs used, but no forwarding)
> ---------------------------------------------------
> 
> So it looks like it's related to forwarding somehow,
> I've made the same experience IIRC.


Thanks, that (and the multiple bug write-ups on sourceforge) gave me more than enough to go on.  I was able to replicate it on a handful of systems in my lab.  On affected systems, setting up a NAT and stressing the interfaces with even moderate traffic levels triggers it pretty quickly.  It appears that the NAT part is unnecessary: just setting the systems up as a software router and running some traffic across it also triggers it, giving the same apparent behavior (tx hang, watchdog timeout trace, port reset).

And with an internal reproduction of the issue I have created an internal bug report, described my set of reproductions, referenced the similar external ones and assigned it to our current e1000e developer.

Thanks again,
Aaron

> 
> Cheers,
> Thomas

Thomas Jarosch March 23, 2015, 1:58 p.m. UTC | #11

Hi Aaron,

On Saturday, 21. February 2015 01:59:35 Brown, Aaron F wrote:
> Thanks, that (and the multiple bug write-ups on sourceforge) gave me more
> than enough to go on.  I was able to replicate it on a handful of systems
> in my lab.  On effected systems setting up a NAT and stressing the
> interfaces with even moderate traffic levels triggers it pretty quickly. 
> It appears that the NAT part is unnecessary, just setting the systems up
> as a software router and running some traffic across it also triggers it
> giving the same apparent behavior (tx hang, watchdog timeout trace, port
> reset.)
> 
> And with an internal reproduction of the issue I have created an internal
> bug report, described my set of reproductions, referenced the similar
> external ones and assigned it to our current e1000e developer.

just wanted to quickly check if there has been any progress
since the internal bug report has been filed?

Cheers,
Thomas


Thomas Jarosch May 27, 2015, 4 p.m. UTC | #12

Hi Aaron,

On Monday, 23. March 2015 22:37:08 Brown, Aaron F wrote:
> > > And with an internal reproduction of the issue I have created an internal
> > > bug report, described my set of reproductions, referenced the similar
> > > external ones and assigned it to our current e1000e developer.
> > 
> > just wanted to quickly check if there has been any progress
> > since the internal bug report has been filed?
> 
> No, no updates beyond a bit of investigation.

any news on this from the Intel labs?

Another two months passed ;) It would be nice to get rid
of the workaround that limits the max fragment size to 4096.

Thanks,
Thomas


Brown, Aaron F May 30, 2015, 1:18 a.m. UTC | #13

> From: Thomas Jarosch [mailto:thomas.jarosch@intra2net.com]
> Sent: Wednesday, May 27, 2015 9:01 AM
> To: Brown, Aaron F
> Cc: Kirsher, Jeffrey T; 'Linux Netdev List'; Eric Dumazet; e1000-devel
> Subject: Re: RE: [bisected regression] e1000e: "Detected Hardware Unit Hang"
> 
> Hi Aaron,
> 
> On Monday, 23. March 2015 22:37:08 Brown, Aaron F wrote:
> > > > And with an internal reproduction of the issue I have created an internal
> > > > bug report, described my set of reproductions, referenced the similar
> > > > external ones and assigned it to our current e1000e developer.
> > >
> > > just wanted to quickly check if there has been any progress
> > > since the internal bug report has been filed?
> >
> > No, no updates beyond a bit of investigation.
> 
> any news on this from the Intel labs?

Nothing significant.  Another one of our testers (who works more closely with the current e1000e driver owner than I do) has managed to replicate it on several systems, and I know the developer spent some time poking around the setup, but I don't think he's found the root cause yet and he has been busy chasing a number of other issues.

> 
> Another two months passed ;) It would be nice to get rid
> of the workaround that limits the max fragment size to 4096.
> 
> Thanks,
> Thomas


Thomas Jarosch July 29, 2015, 8:51 a.m. UTC | #14

Hi Jeff and Yanir,

On Saturday, 30. May 2015 01:18:44 Brown, Aaron F wrote:
> > any news on this from the Intel labs?
> 
> Nothing significant.  Another one of our testers (whom works more closely
> with the current e1000e driver owner than I) has managed to replicate it
> on several systems and I know the developer spent some time poking around
> the setup, but I don't think he's found the root cause yet and has been
> busy chasing a number of other issues.

so, any news from the Intel labs? I've seen some "hang fixes"
on 03.06.2015, but I'm not sure if they are related to this issue.

This problem is pretty annoying: We have a performance penalty for all 
network cards right now as the buffer size of the core network stack
had to be decreased to 4096 bytes again on our side.
(https://www.marc.info/?l=linux-netdev&m=142131668206333)
Better than no e1000e network connectivity though.

The initial report on this issue was on 14.01.2015:
https://www.marc.info/?l=linux-netdev&m=142124954120315

Best regards,
Thomas


Juliana Rodrigueiro May 2, 2019, 12:58 p.m. UTC | #15

Hi All.

While updating to kernel 4.19, we realised that a problem reported in 2015 for
kernel 3.7 is still around. Please see this link for more details:
https://marc.info/?l=linux-netdev&m=142124954120315

Basically, when using the e1000e driver, every few minutes the following
messages appear in dmesg or the system log.

[12465.174759] e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
  TDH                  <c6>
  TDT                  <fb>
  next_to_use          <fb>
  next_to_clean        <c4>
buffer_info[next_to_clean]:
  time_stamp           <2e5e92>
  next_to_watch        <c6>
  jiffies              <2e67e8>
  next_to_watch.status <0>
MAC Status             <40080083>
PHY Status             <796d>
PHY 1000BASE-T Status  <7800>
PHY Extended Status    <3000>
PCI Status             <10>
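
A few numbers can be read straight off the dump above. TDH and TDT are the transmit descriptor head and tail registers, so their distance on the ring is the number of descriptors still outstanding, and jiffies minus time_stamp is how long the buffer at next_to_clean has been waiting. The following is a minimal sketch in C; ring_distance() and jiffies_elapsed() are hypothetical helper names, and the ring size of 256 entries is an assumption, since the dump does not state it.

```c
#include <stdint.h>

/*
 * Distance from index `from` to index `to` on a descriptor ring.
 * ring_size must be a power of two (256 is assumed in the note below).
 */
static uint32_t ring_distance(uint32_t from, uint32_t to, uint32_t ring_size)
{
    return (to - from) & (ring_size - 1);
}

/* Jiffies elapsed since the buffer was time-stamped (wrap-safe for u32). */
static uint32_t jiffies_elapsed(uint32_t time_stamp, uint32_t jiffies)
{
    return jiffies - time_stamp;
}
```

With the values from the dump, ring_distance(0xc6, 0xfb, 256) is 53 descriptors between TDH and TDT, ring_distance(0xc4, 0xfb, 256) is 55 between next_to_clean and next_to_use, and jiffies_elapsed(0x2e5e92, 0x2e67e8) is 2390 jiffies. Converting that to seconds needs the kernel's HZ setting, which is not given in this thread.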

Back in 2015, we applied a workaround that decreases the page size:

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 85ab7d7..9f0ef97 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2108,7 +2108,7 @@ static inline void __skb_queue_purge(struct sk_buff_head *list)
                kfree_skb(skb);
 }
 
-#define NETDEV_FRAG_PAGE_MAX_ORDER get_order(32768)
+#define NETDEV_FRAG_PAGE_MAX_ORDER get_order(4096)
 #define NETDEV_FRAG_PAGE_MAX_SIZE  (PAGE_SIZE << NETDEV_FRAG_PAGE_MAX_ORDER)
 #define NETDEV_PAGECNT_MAX_BIAS           NETDEV_FRAG_PAGE_MAX_SIZE
 

Testing kernel 4.19 with the same hardware showed the same problems, so we 
tried to adapt the old workaround to the current code:

diff -u -r -p linux-4.19.i686/net/core/sock.c linux-4.19.i686.e1000e/net/core/sock.c
--- linux-4.19.i686/net/core/sock.c     2019-03-22 13:55:24.198266383 +0100
+++ linux-4.19.i686.e1000e/net/core/sock.c      2019-03-22 13:56:43.165765856 +0100
@@ -2183,7 +2183,8 @@ static void sk_leave_memory_pressure(str
 }
 
 /* On 32bit arches, an skb frag is limited to 2^15 */
-#define SKB_FRAG_PAGE_ORDER    get_order(32768)
+/* Limit to 4096 instead of 32768 */
+#define SKB_FRAG_PAGE_ORDER    get_order(4096)
 
 /**
  * skb_page_frag_refill - check that a page_frag contains enough room


Unfortunately, this patch no longer helps with the "Unit Hang" messages;
the problem occurs with any page size.


Any insight into how to deal with this problem would be very much appreciated.

Thank you!
