Message ID | b0b17697-f23e-8fa5-3757-604a86f3a095@nvidia.com |
---|---|
State | Not Applicable |
Headers | show |
Series | [BUG] net: stmmac: Panic observed in stmmac_napi_poll_rx() | expand |
On Fri, May 14, 2021 at 03:24:58PM +0100, Jon Hunter wrote: > Hello! > > I have been looking into some random crashes that appear to stem from > the stmmac_napi_poll_rx() function. There are two different panics I > have observed which are ... [...] > The bug being triggered in skbuff.h is the following ... > > void *skb_pull(struct sk_buff *skb, unsigned int len); > static inline void *__skb_pull(struct sk_buff *skb, unsigned int len) > { > skb->len -= len; > BUG_ON(skb->len < skb->data_len); > return skb->data += len; > } > > Looking into the above panic triggered in skbuff.h, when this occurs > I have noticed that the value of skb->data_len is unusually large ... > > __skb_pull: len 1500 (14), data_len 4294967274 [...] The big value looks suspiciously similar to (unsigned)-EINVAL. > I then added some traces to stmmac_napi_poll_rx() and > stmmac_rx_buf2_len() to trace the values of various various variables > and when the problem occurs I see ... > > stmmac_napi_poll_rx: stmmac_rx: count 0, len 1518, buf1 66, buf2 1452 > stmmac_napi_poll_rx: stmmac_rx_buf2_len: len 66, plen 1518 > stmmac_napi_poll_rx: stmmac_rx: count 1, len 1518, buf1 66, buf2 1452 > stmmac_napi_poll_rx: stmmac_rx_buf2_len: len 66, plen 1536 > stmmac_napi_poll_rx: stmmac_rx: count 2, len 1602, buf1 66, buf2 1536 > stmmac_napi_poll_rx: stmmac_rx_buf2_len: len 1602, plen 1518 > stmmac_napi_poll_rx: stmmac_rx: count 2, len 1518, buf1 0, buf2 4294967212 > stmmac_napi_poll_rx: stmmac_rx: dma_buf_sz 1536, buf1 0, buf2 4294967212 And this one to (unsigned)-EILSEQ. > Note I added the above BUG_ON to trap unusually large buffers. Let > me know if you have any thoughts. Do above ring any bell? Best Regards Michał Mirosław
On 14/05/2021 22:49, Michał Mirosław wrote: > On Fri, May 14, 2021 at 03:24:58PM +0100, Jon Hunter wrote: >> Hello! >> >> I have been looking into some random crashes that appear to stem from >> the stmmac_napi_poll_rx() function. There are two different panics I >> have observed which are ... > [...] >> The bug being triggered in skbuff.h is the following ... >> >> void *skb_pull(struct sk_buff *skb, unsigned int len); >> static inline void *__skb_pull(struct sk_buff *skb, unsigned int len) >> { >> skb->len -= len; >> BUG_ON(skb->len < skb->data_len); >> return skb->data += len; >> } >> >> Looking into the above panic triggered in skbuff.h, when this occurs >> I have noticed that the value of skb->data_len is unusually large ... >> >> __skb_pull: len 1500 (14), data_len 4294967274 > [...] > > The big value looks suspiciously similar to (unsigned)-EINVAL. Yes it does and at first, I thought it was being set to -EINVAL. However, from tracing the length variables I can see that this is not the case. >> I then added some traces to stmmac_napi_poll_rx() and >> stmmac_rx_buf2_len() to trace the values of various various variables >> and when the problem occurs I see ... >> >> stmmac_napi_poll_rx: stmmac_rx: count 0, len 1518, buf1 66, buf2 1452 >> stmmac_napi_poll_rx: stmmac_rx_buf2_len: len 66, plen 1518 >> stmmac_napi_poll_rx: stmmac_rx: count 1, len 1518, buf1 66, buf2 1452 >> stmmac_napi_poll_rx: stmmac_rx_buf2_len: len 66, plen 1536 >> stmmac_napi_poll_rx: stmmac_rx: count 2, len 1602, buf1 66, buf2 1536 >> stmmac_napi_poll_rx: stmmac_rx_buf2_len: len 1602, plen 1518 >> stmmac_napi_poll_rx: stmmac_rx: count 2, len 1518, buf1 0, buf2 4294967212 >> stmmac_napi_poll_rx: stmmac_rx: dma_buf_sz 1536, buf1 0, buf2 4294967212 > > And this one to (unsigned)-EILSEQ. Yes but this simply comes from 1518-1602 = -84. So it is purely coincidence. Jon
On 5/17/21 1:39 PM, Jon Hunter wrote: > > On 14/05/2021 22:49, Michał Mirosław wrote: >> On Fri, May 14, 2021 at 03:24:58PM +0100, Jon Hunter wrote: >>> Hello! >>> >>> I have been looking into some random crashes that appear to stem from >>> the stmmac_napi_poll_rx() function. There are two different panics I >>> have observed which are ... >> [...] >>> The bug being triggered in skbuff.h is the following ... >>> >>> void *skb_pull(struct sk_buff *skb, unsigned int len); >>> static inline void *__skb_pull(struct sk_buff *skb, unsigned int len) >>> { >>> skb->len -= len; >>> BUG_ON(skb->len < skb->data_len); >>> return skb->data += len; >>> } >>> >>> Looking into the above panic triggered in skbuff.h, when this occurs >>> I have noticed that the value of skb->data_len is unusually large ... >>> >>> __skb_pull: len 1500 (14), data_len 4294967274 >> [...] >> >> The big value looks suspiciously similar to (unsigned)-EINVAL. > > Yes it does and at first, I thought it was being set to -EINVAL. > However, from tracing the length variables I can see that this is not > the case. > >>> I then added some traces to stmmac_napi_poll_rx() and >>> stmmac_rx_buf2_len() to trace the values of various various variables >>> and when the problem occurs I see ... >>> >>> stmmac_napi_poll_rx: stmmac_rx: count 0, len 1518, buf1 66, buf2 1452 >>> stmmac_napi_poll_rx: stmmac_rx_buf2_len: len 66, plen 1518 >>> stmmac_napi_poll_rx: stmmac_rx: count 1, len 1518, buf1 66, buf2 1452 >>> stmmac_napi_poll_rx: stmmac_rx_buf2_len: len 66, plen 1536 >>> stmmac_napi_poll_rx: stmmac_rx: count 2, len 1602, buf1 66, buf2 1536 >>> stmmac_napi_poll_rx: stmmac_rx_buf2_len: len 1602, plen 1518 >>> stmmac_napi_poll_rx: stmmac_rx: count 2, len 1518, buf1 0, buf2 4294967212 >>> stmmac_napi_poll_rx: stmmac_rx: dma_buf_sz 1536, buf1 0, buf2 4294967212 >> >> And this one to (unsigned)-EILSEQ. > > Yes but this simply comes from 1518-1602 = -84. So it is purely > coincidence. > > Jon > I dug around this a little bit. It looks like the issue occurs when we get (pardon my terminology, I haven't dealt with networking stuff much) a split packet. What happens is we first process the first frame, growing 'len'. buf1_len, I think, hits the "First descriptor, get split header length" case and the length is 66. buf2_len hits the rx_not_ls case and the length is 1536. In total 1602. Then the condition 'likely(status & rx_not_ls)' passes and we goto back to 'read_again', and read the next frame. Here we eventually get to buf2_len again. stmmac_get_rx_frame_len returns 1518 for this frame which sounds reasonable, that's what we normally get for non-split frames. So what we get is 1518 - 1602 which overflows. I can dig around a bit more but it would be nice if someone with a bit more knowledge of the hardware could comment on the above. Thanks, Mikko
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c index 4749bd0af160..26336de80fe8 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -3728,7 +3728,7 @@ static unsigned int stmmac_rx_buf2_len(struct stmmac_priv *priv, /* Not last descriptor */ if (status & rx_not_ls) - return priv->dma_buf_sz; + return priv->dma_buf_sz - len; plen = stmmac_get_rx_frame_len(priv, p, coe); @@ -3848,7 +3848,7 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue) buf1_len = stmmac_rx_buf1_len(priv, p, status, len); len += buf1_len; - buf2_len = stmmac_rx_buf2_len(priv, p, status, len); + buf2_len = stmmac_rx_buf2_len(priv, p, status, buf1_len); len += buf2_len; The patch I added to trace the various variable is as follows ... diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c index 4749bd0af160..975059bb3165 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -3723,15 +3723,20 @@ static unsigned int stmmac_rx_buf2_len(struct stmmac_priv *priv, unsigned int plen = 0; /* Not split header, buffer is not available */ - if (!priv->sph) + if (!priv->sph) { + trace_printk("%s: len %u, plen %u\n", __func__, len, priv->dma_buf_sz); return 0; + } /* Not last descriptor */ - if (status & rx_not_ls) + if (status & rx_not_ls) { + trace_printk("%s: len %u, plen %u\n", __func__, len, priv->dma_buf_sz); return priv->dma_buf_sz; + } plen = stmmac_get_rx_frame_len(priv, p, coe); + trace_printk("%s: len %u, plen %u\n", __func__, len, plen); /* Last descriptor */ return plen - len; } @@ -3850,7 +3855,13 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue) len += buf1_len; buf2_len = stmmac_rx_buf2_len(priv, p, status, len); len += buf2_len; - + trace_printk("%s: count %u, len %u, buf1 %u, buf2 %u\n", __func__, + count, len, buf1_len, buf2_len); + if (buf1_len > priv->dma_buf_sz || buf2_len > priv->dma_buf_sz) { + trace_printk("%s: dma_buf_sz %u, buf1 %u, buf2 %u\n", + __func__, priv->dma_buf_sz, buf1_len, buf2_len); + BUG_ON(1); + } /* ACS is set; GMAC core strips PAD/FCS for IEEE 802.3 * Type frames (LLC/LLC-SNAP) *