| Message ID | 1439456364-4530-1-git-send-email-mhocko@kernel.org |
|---|---|
| State | Not Applicable, archived |
| Delegated to | David Miller |
On 08/13/2015 10:58 AM, mhocko@kernel.org wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> The patch c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb")
> added the checks for page->pfmemalloc to __skb_fill_page_desc():
>
>         if (page->pfmemalloc && !page->mapping)
>                 skb->pfmemalloc = true;
>
> It assumes page->mapping == NULL implies that page->pfmemalloc can be
> trusted. However, __delete_from_page_cache() can set page->mapping to
> NULL and leave page->index alone. Because the two fields share a union,
> a non-zero page->index will be interpreted as page->pfmemalloc == true.
>
> So the assumption is invalid if the networking code can see such a
> page. And it seems it can. We have encountered this with an NFS over
> loopback setup when such a page is attached to a new skb. There is no
> copying going on in this case so the page confuses __skb_fill_page_desc,
> which interprets the index as the pfmemalloc flag, and the network stack
> then drops packets that have been allocated using the reserves unless
> they are to be queued on sockets handling the swapping, which is the
> case here

  ^ not ?

The full story (according to Jiri Bohac and my understanding, I don't know
much about netdev) is that __skb_fill_page_desc() is invoked here during
*sending*, and normally the skb->pfmemalloc would be ignored in the end. But
because it is a localhost connection, the receiving code thinks it was a
memalloc allocation during receive, and then applies the socket restriction.

Given that this apparently isn't the first case of this localhost issue, I
wonder if the network code should just clear skb->pfmemalloc during send (or
maybe just when sending over localhost). That would probably be easier than
distinguishing the __skb_fill_page_desc() callers for send vs. receive.

> and that leads to hangs when the NFS client waits for a response from
> the server which has been dropped and thus never arrives.
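A minimal userspace sketch of the union aliasing described above; struct
fake_page is an illustration-only stand-in for struct page, not the real
kernel layout:

    #include <stdbool.h>
    #include <stdio.h>

    /* Mimics the index/pfmemalloc union inside struct page. */
    struct fake_page {
            void *mapping;
            union {
                    unsigned long index;    /* offset within the mapping */
                    bool pfmemalloc;        /* shares storage with index */
            };
    };

    int main(void)
    {
            struct fake_page page = { .mapping = (void *)0x1, .index = 42 };

            /* __delete_from_page_cache() clears mapping but leaves index. */
            page.mapping = NULL;

            /* The check from c48a11c7ad26 now misreads the stale index. */
            if (page.pfmemalloc && !page.mapping)
                    printf("bogus pfmemalloc (stale index %lu)\n", page.index);
            return 0;
    }

Compiled and run, this prints the bogus-pfmemalloc line, which is exactly
the misinterpretation the changelog describes.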
On Thu 13-08-15 11:13:04, Vlastimil Babka wrote:
> On 08/13/2015 10:58 AM, mhocko@kernel.org wrote:
> > From: Michal Hocko <mhocko@suse.com>
> >
> > The patch c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb")
> > added the checks for page->pfmemalloc to __skb_fill_page_desc():
> >
> >         if (page->pfmemalloc && !page->mapping)
> >                 skb->pfmemalloc = true;
> >
> > It assumes page->mapping == NULL implies that page->pfmemalloc can be
> > trusted. However, __delete_from_page_cache() can set page->mapping to
> > NULL and leave page->index alone. Because the two fields share a union,
> > a non-zero page->index will be interpreted as page->pfmemalloc == true.
> >
> > So the assumption is invalid if the networking code can see such a
> > page. And it seems it can. [...] the network stack
> > then drops packets that have been allocated using the reserves unless
> > they are to be queued on sockets handling the swapping, which is the
> > case here
>
>   ^ not ?

Dohh, you are right of course, updated...

> The full story (according to Jiri Bohac and my understanding, I don't
> know much about netdev) is that __skb_fill_page_desc() is invoked here
> during *sending*, and normally the skb->pfmemalloc would be ignored in
> the end. But because it is a localhost connection, the receiving code
> thinks it was a memalloc allocation during receive, and then applies
> the socket restriction.
>
> Given that this apparently isn't the first case of this localhost issue,
> I wonder if the network code should just clear skb->pfmemalloc during
> send (or maybe just when sending over localhost). That would probably be
> easier than distinguishing the __skb_fill_page_desc() callers for send
> vs. receive.

Maybe the networking code can behave "better" in this particular case, but
the core problem remains: as you properly found out during the debugging,
page->mapping cannot be relied on for detecting pfmemalloc pages. So I would
argue that a more robust detection is really worthwhile. Note there are
other places which do not even bother to test for mapping - maybe they are
safe, but I got lost quickly when trying to track the allocation source to
be sure that nothing could have stepped in in the meantime.
On Thu, Aug 13, 2015 at 10:58:54AM +0200, mhocko@kernel.org wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> The patch c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb")
> added the checks for page->pfmemalloc to __skb_fill_page_desc():
>
>         if (page->pfmemalloc && !page->mapping)
>                 skb->pfmemalloc = true;
>
> It assumes page->mapping == NULL implies that page->pfmemalloc can be
> trusted. However, __delete_from_page_cache() can set page->mapping to
> NULL and leave page->index alone. Because the two fields share a union,
> a non-zero page->index will be interpreted as page->pfmemalloc == true.
>
> So the assumption is invalid if the networking code can see such a
> page. And it seems it can. We have encountered this with an NFS over
> loopback setup when such a page is attached to a new skb. There is no
> copying going on in this case so the page confuses __skb_fill_page_desc,
> which interprets the index as the pfmemalloc flag, and the network stack
> then drops packets that have been allocated using the reserves unless
> they are to be queued on sockets handling the swapping, which is not the
> case here, and that leads to hangs when the NFS client waits for a
> response from the server which has been dropped and thus never arrives.
>
> The struct page is already heavily packed, so rather than finding
> another hole to put it in, let's do a trick instead. We can reuse the
> index again but define it to an impossible value (-1UL). This is the
> page index, so it should never see a value that large. Replace all
> direct users of page->pfmemalloc with page_is_pfmemalloc(), which will
> hide this nastiness from unspoiled eyes.
>
> The information will obviously get lost if somebody wants to use
> page->index, but that was the case before and the original code expected
> that the information should be persisted somewhere else if it is really
> needed (e.g. what SLAB and SLUB do).
>
> Fixes: c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb")
> Cc: stable # 3.6+
> Debugged-by: Vlastimil Babka <vbabka@suse.com>
> Debugged-by: Jiri Bohac <jbohac@suse.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Mel Gorman <mgorman@suse.de>
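The sentinel trick itself is easy to demonstrate outside the kernel. A
hedged userspace sketch, reusing the illustration-only stand-in for struct
page from above; the helpers mirror the ones the patch adds to
include/linux/mm.h:

    #include <assert.h>
    #include <stdbool.h>

    struct fake_page {
            void *mapping;
            unsigned long index;    /* also encodes the pfmemalloc flag */
    };

    /* -1UL is an impossible page index, so it is free to act as a flag. */
    static bool page_is_pfmemalloc(const struct fake_page *page)
    {
            return page->index == -1UL;
    }

    static void set_page_pfmemalloc(struct fake_page *page)
    {
            page->index = -1UL;
    }

    static void clear_page_pfmemalloc(struct fake_page *page)
    {
            page->index = 0;
    }

    int main(void)
    {
            struct fake_page page = { .mapping = (void *)0x1, .index = 42 };

            /* A stale page-cache index no longer reads as pfmemalloc... */
            page.mapping = NULL;
            assert(!page_is_pfmemalloc(&page));

            /* ...while a page marked by the allocator still does. */
            set_page_pfmemalloc(&page);
            assert(page_is_pfmemalloc(&page));

            clear_page_pfmemalloc(&page);
            assert(!page_is_pfmemalloc(&page));
            return 0;
    }

The sentinel costs no extra space in struct page, at the price of losing
page->index for pages marked pfmemalloc, which the changelog argues was
already the contract.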
On Thu, 2015-08-13 at 11:13 +0200, Vlastimil Babka wrote:
> Given that this apparently isn't the first case of this localhost issue,
> I wonder if the network code should just clear skb->pfmemalloc during
> send (or maybe just when sending over localhost). That would probably be
> easier than distinguishing the __skb_fill_page_desc() callers for send
> vs. receive.

Would this still be needed after this patch?

It is sad we do not have an SNMP counter to at least count how often we
drop an skb because pfmemalloc is set.

I'll provide such a patch.
On 08/13/2015 04:40 PM, Eric Dumazet wrote:
> On Thu, 2015-08-13 at 11:13 +0200, Vlastimil Babka wrote:
>
>> Given that this apparently isn't the first case of this localhost issue,
>> I wonder if the network code should just clear skb->pfmemalloc during
>> send (or maybe just when sending over localhost). That would probably be
>> easier than distinguishing the __skb_fill_page_desc() callers for send
>> vs. receive.
>
> Would this still be needed after this patch?

Not until another corner case is discovered :) Or until something passes a
genuine pfmemalloc page to a socket (sending the contents of some slab
objects perhaps, where the slab page was allocated as pfmemalloc? Dunno if
that can happen right now).

> It is sad we do not have an SNMP counter to at least count how often we
> drop an skb because pfmemalloc is set.
>
> I'll provide such a patch.
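Counting such drops is straightforward in principle. A hedged userspace
sketch of the kind of counter Eric has in mind - a single shared counter
bumped wherever a pfmemalloc skb is refused by a non-SOCK_MEMALLOC socket.
All names here are invented for illustration and are not taken from his
eventual patch:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    static atomic_ulong pfmemalloc_drops;

    /* Stand-in for the sk_filter()-style check: refuse pfmemalloc skbs
     * unless the socket is itself helping to free memory. */
    static bool deliver_skb(bool skb_pfmemalloc, bool sock_memalloc)
    {
            if (skb_pfmemalloc && !sock_memalloc) {
                    atomic_fetch_add_explicit(&pfmemalloc_drops, 1,
                                              memory_order_relaxed);
                    return false;   /* dropped, and now counted */
            }
            return true;            /* delivered */
    }

    int main(void)
    {
            deliver_skb(true, false);   /* counted as a drop */
            deliver_skb(true, true);    /* memalloc socket keeps it */
            deliver_skb(false, false);  /* ordinary skb, delivered */
            printf("PfMemallocDrop: %lu\n",
                   atomic_load_explicit(&pfmemalloc_drops,
                                        memory_order_relaxed));
            return 0;
    }

With such a counter, the silent NFS-over-loopback hang described above would
at least have shown up as a steadily climbing drop statistic.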
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index 982fdcdc795b..b5b2925103ec 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -216,7 +216,7 @@ static void fm10k_reuse_rx_page(struct fm10k_ring *rx_ring,
 
 static inline bool fm10k_page_is_reserved(struct page *page)
 {
-	return (page_to_nid(page) != numa_mem_id()) || page->pfmemalloc;
+	return (page_to_nid(page) != numa_mem_id()) || page_is_pfmemalloc(page);
 }
 
 static bool fm10k_can_reuse_rx_page(struct fm10k_rx_buffer *rx_buffer,
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 2f70a9b152bd..830466c49987 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6566,7 +6566,7 @@ static void igb_reuse_rx_page(struct igb_ring *rx_ring,
 
 static inline bool igb_page_is_reserved(struct page *page)
 {
-	return (page_to_nid(page) != numa_mem_id()) || page->pfmemalloc;
+	return (page_to_nid(page) != numa_mem_id()) || page_is_pfmemalloc(page);
 }
 
 static bool igb_can_reuse_rx_page(struct igb_rx_buffer *rx_buffer,
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 9aa6104e34ea..ae21e0b06c3a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1832,7 +1832,7 @@ static void ixgbe_reuse_rx_page(struct ixgbe_ring *rx_ring,
 
 static inline bool ixgbe_page_is_reserved(struct page *page)
 {
-	return (page_to_nid(page) != numa_mem_id()) || page->pfmemalloc;
+	return (page_to_nid(page) != numa_mem_id()) || page_is_pfmemalloc(page);
 }
 
 /**
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index e71cdde9cb01..1d7b00b038a2 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -765,7 +765,7 @@ static void ixgbevf_reuse_rx_page(struct ixgbevf_ring *rx_ring,
 
 static inline bool ixgbevf_page_is_reserved(struct page *page)
 {
-	return (page_to_nid(page) != numa_mem_id()) || page->pfmemalloc;
+	return (page_to_nid(page) != numa_mem_id()) || page_is_pfmemalloc(page);
 }
 
 /**
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e872f92dbac..bf6f117fcf4d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1003,6 +1003,34 @@ static inline int page_mapped(struct page *page)
 }
 
 /*
+ * Return true only if the page has been allocated with
+ * ALLOC_NO_WATERMARKS and the low watermark was not
+ * met implying that the system is under some pressure.
+ */
+static inline bool page_is_pfmemalloc(struct page *page)
+{
+	/*
+	 * Page index cannot be this large so this must be
+	 * a pfmemalloc page.
+	 */
+	return page->index == -1UL;
+}
+
+/*
+ * Only to be called by the page allocator on a freshly allocated
+ * page.
+ */
+static inline void set_page_pfmemalloc(struct page *page)
+{
+	page->index = -1UL;
+}
+
+static inline void clear_page_pfmemalloc(struct page *page)
+{
+	page->index = 0;
+}
+
+/*
  * Different kinds of faults, as returned by handle_mm_fault().
  * Used to decide whether a process gets delivered SIGBUS or
  * just gets major/minor fault counters bumped up.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0038ac7466fd..15549578d559 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -63,15 +63,6 @@ struct page {
 	union {
 		pgoff_t index;		/* Our offset within mapping. */
 		void *freelist;		/* sl[aou]b first free object */
-		bool pfmemalloc;	/* If set by the page allocator,
-					 * ALLOC_NO_WATERMARKS was set
-					 * and the low watermark was not
-					 * met implying that the system
-					 * is under some pressure. The
-					 * caller should try ensure
-					 * this page is only used to
-					 * free other pages.
-					 */
 	};
 
 	union {
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d6cdd6e87d53..6dcee34a476f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1602,20 +1602,16 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
 	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 
 	/*
-	 * Propagate page->pfmemalloc to the skb if we can. The problem is
-	 * that not all callers have unique ownership of the page. If
-	 * pfmemalloc is set, we check the mapping as a mapping implies
-	 * page->index is set (index and pfmemalloc share space).
-	 * If it's a valid mapping, we cannot use page->pfmemalloc but we
-	 * do not lose pfmemalloc information as the pages would not be
-	 * allocated using __GFP_MEMALLOC.
+	 * Propagate page pfmemalloc to the skb if we can. The problem is
+	 * that not all callers have unique ownership of the page but rely
+	 * on page_is_pfmemalloc doing the right thing(tm).
 	 */
 	frag->page.p = page;
 	frag->page_offset = off;
 	skb_frag_size_set(frag, size);
 
 	page = compound_head(page);
-	if (page->pfmemalloc && !page->mapping)
+	if (page_is_pfmemalloc(page))
 		skb->pfmemalloc = true;
 }
 
@@ -2263,7 +2259,7 @@ static inline struct page *dev_alloc_page(void)
 static inline void skb_propagate_pfmemalloc(struct page *page,
 					    struct sk_buff *skb)
 {
-	if (page && page->pfmemalloc)
+	if (page_is_pfmemalloc(page))
 		skb->pfmemalloc = true;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index beda41710802..a0e65dc89784 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1343,12 +1343,15 @@ static int prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags
 	set_page_owner(page, order, gfp_flags);
 
 	/*
-	 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was necessary to
+	 * page is set pfmemalloc when ALLOC_NO_WATERMARKS was necessary to
 	 * allocate the page. The expectation is that the caller is taking
 	 * steps that will free more memory. The caller should avoid the page
 	 * being used for !PFMEMALLOC purposes.
 	 */
-	page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
+	if (alloc_flags & ALLOC_NO_WATERMARKS)
+		set_page_pfmemalloc(page);
+	else
+		clear_page_pfmemalloc(page);
 
 	return 0;
 }
@@ -3345,7 +3348,7 @@ void *__alloc_page_frag(struct page_frag_cache *nc,
 		atomic_add(size - 1, &page->_count);
 
 		/* reset page count bias and offset to start of new frag */
-		nc->pfmemalloc = page->pfmemalloc;
+		nc->pfmemalloc = page_is_pfmemalloc(page);
 		nc->pagecnt_bias = size;
 		nc->offset = size;
 	}
diff --git a/mm/slab.c b/mm/slab.c
index 200e22412a16..bbd0b47dc6a9 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1603,7 +1603,7 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
 	}
 
 	/* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
-	if (unlikely(page->pfmemalloc))
+	if (page_is_pfmemalloc(page))
 		pfmemalloc_active = true;
 
 	nr_pages = (1 << cachep->gfporder);
@@ -1614,7 +1614,7 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
 		add_zone_page_state(page_zone(page),
 			NR_SLAB_UNRECLAIMABLE, nr_pages);
 	__SetPageSlab(page);
-	if (page->pfmemalloc)
+	if (page_is_pfmemalloc(page))
 		SetPageSlabPfmemalloc(page);
 
 	if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) {
diff --git a/mm/slub.c b/mm/slub.c
index 816df0016555..0eacabff35ab 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1427,7 +1427,7 @@ static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
 	inc_slabs_node(s, page_to_nid(page), page->objects);
 	page->slab_cache = s;
 	__SetPageSlab(page);
-	if (page->pfmemalloc)
+	if (page_is_pfmemalloc(page))
 		SetPageSlabPfmemalloc(page);
 
 	start = page_address(page);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b6a19ca0f99e..6bb3d82c9bbe 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -340,7 +340,7 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
 
 	if (skb && frag_size) {
 		skb->head_frag = 1;
-		if (virt_to_head_page(data)->pfmemalloc)
+		if (page_is_pfmemalloc(virt_to_head_page(data)))
 			skb->pfmemalloc = 1;
 	}
 	return skb;