diff mbox

[04/16] mm: allow PF_MEMALLOC from softirq context

Message ID 1340375443-22455-5-git-send-email-mgorman@suse.de
State Not Applicable, archived
Delegated to: David Miller
Headers show

Commit Message

Mel Gorman June 22, 2012, 2:30 p.m. UTC
This is needed to allow network softirq packet processing to make
use of PF_MEMALLOC.

Currently softirq context cannot use PF_MEMALLOC due to it not being
associated with a task, and therefore not having task flags to fiddle
with - thus the gfp to alloc flag mapping ignores the task flags when
in interrupts (hard or soft) context.

Allowing softirqs to make use of PF_MEMALLOC therefore requires some
trickery. This patch borrows the task flags from whatever process happens
to be preempted by the softirq. It then modifies the gfp to alloc flags
mapping to not exclude task flags in softirq context, and modify the
softirq code to save, clear and restore the PF_MEMALLOC flag.

The save and clear, ensures the preempted task's PF_MEMALLOC flag
doesn't leak into the softirq. The restore ensures a softirq's
PF_MEMALLOC flag cannot leak back into the preempted process. This
should be safe due to the following reasons

Softirqs can run on multiple CPUs sure but the same task should not be
	executing the same softirq code. Neither should the softirq
	handler be preempted by any other softirq handler so the flags
	should not leak to an unrelated softirq.

Softirqs re-enable hardware interrupts in __do_softirq() so can be
	preempted by hardware interrupts so PF_MEMALLOC is inherited
	by the hard IRQ. However, this is similar to a process in
	reclaim being preempted by a hardirq. While PF_MEMALLOC is
	set, gfp_to_alloc_flags() distinguishes between hard and
	soft irqs and avoids giving a hardirq the ALLOC_NO_WATERMARKS
	flag.

If the softirq is deferred to ksoftirq then its flags may be used
        instead of a normal tasks but as the softirq cannot be preempted,
        the PF_MEMALLOC flag does not leak to other code by accident.

[davem@davemloft.net: Document why PF_MEMALLOC is safe]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |    7 +++++++
 kernel/softirq.c      |    9 +++++++++
 mm/page_alloc.c       |    6 +++++-
 3 files changed, 21 insertions(+), 1 deletion(-)

Comments

Sebastian Andrzej Siewior June 26, 2012, 4:55 p.m. UTC | #1
On Fri, Jun 22, 2012 at 03:30:31PM +0100, Mel Gorman wrote:
> This is needed to allow network softirq packet processing to make
> use of PF_MEMALLOC.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b6c0727..5c6d9c6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2265,7 +2265,11 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
>  		if (gfp_mask & __GFP_MEMALLOC)
>  			alloc_flags |= ALLOC_NO_WATERMARKS;
> -		else if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
> +		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
> +			alloc_flags |= ALLOC_NO_WATERMARKS;
> +		else if (!in_interrupt() &&
> +				((current->flags & PF_MEMALLOC) ||
> +				 unlikely(test_thread_flag(TIF_MEMDIE))))
>  			alloc_flags |= ALLOC_NO_WATERMARKS;
>  	}

You allocate in RX path with __GFP_MEMALLOC and your sk->sk_allocation has
also __GFP_MEMALLOC set. That means you should get ALLOC_NO_WATERMARKS in
alloc_flags. Is this to done to avoid GFP annotations in skb_share_check() and
friends on your __netif_receive_skb() path?

Sebastian
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mel Gorman June 27, 2012, 8:26 a.m. UTC | #2
On Tue, Jun 26, 2012 at 06:55:13PM +0200, Sebastian Andrzej Siewior wrote:
> On Fri, Jun 22, 2012 at 03:30:31PM +0100, Mel Gorman wrote:
> > This is needed to allow network softirq packet processing to make
> > use of PF_MEMALLOC.
> 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index b6c0727..5c6d9c6 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2265,7 +2265,11 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> >  	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> >  		if (gfp_mask & __GFP_MEMALLOC)
> >  			alloc_flags |= ALLOC_NO_WATERMARKS;
> > -		else if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
> > +		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
> > +			alloc_flags |= ALLOC_NO_WATERMARKS;
> > +		else if (!in_interrupt() &&
> > +				((current->flags & PF_MEMALLOC) ||
> > +				 unlikely(test_thread_flag(TIF_MEMDIE))))
> >  			alloc_flags |= ALLOC_NO_WATERMARKS;
> >  	}
> 
> You allocate in RX path with __GFP_MEMALLOC and your sk->sk_allocation has
> also __GFP_MEMALLOC set. That means you should get ALLOC_NO_WATERMARKS in
> alloc_flags.

In the cases where they are annotated correctly, yes. It is recordeed if
the page gets allocated from the PFMEMALLOC reserves. If the received
packet is not SOCK_MEMALLOC and the page was allocated from PFMEMALLOC
reserves it is then discarded and the packet must be retransmitted.

> Is this to done to avoid GFP annotations in skb_share_check() and
> friends on your __netif_receive_skb() path?
> 

I don't get your question as the annotations are not being avoided. If they
are set, they are used. In the __netif_receive_skb path, PF_MEMALLOC is
set for PFMEMALLOC skbs to avoid having to annotate every single allocation
call site.
Sebastian Andrzej Siewior July 8, 2012, 6:12 p.m. UTC | #3
On Wed, Jun 27, 2012 at 09:26:14AM +0100, Mel Gorman wrote:
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index b6c0727..5c6d9c6 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -2265,7 +2265,11 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > >  	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> > >  		if (gfp_mask & __GFP_MEMALLOC)
> > >  			alloc_flags |= ALLOC_NO_WATERMARKS;
> > > -		else if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
> > > +		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
> > > +			alloc_flags |= ALLOC_NO_WATERMARKS;
> > > +		else if (!in_interrupt() &&
> > > +				((current->flags & PF_MEMALLOC) ||
> > > +				 unlikely(test_thread_flag(TIF_MEMDIE))))
> > >  			alloc_flags |= ALLOC_NO_WATERMARKS;
> > >  	}
> > 
> > You allocate in RX path with __GFP_MEMALLOC and your sk->sk_allocation has
> > also __GFP_MEMALLOC set. That means you should get ALLOC_NO_WATERMARKS in
> > alloc_flags.
> 
> In the cases where they are annotated correctly, yes. It is recordeed if
> the page gets allocated from the PFMEMALLOC reserves. If the received
> packet is not SOCK_MEMALLOC and the page was allocated from PFMEMALLOC
> reserves it is then discarded and the packet must be retransmitted.

Let me try again:
- lets assume your allocation happens with alloc_page(), without
  __GFP_MEMALLOC in GFP_FLAGS and with PF_MEMALLOC in current->flags. Now
  you may get memory which you wouldn't receive otherwise (without
  PF_MEMALLOC). Okay, understood. So you don't have to annotate each page
  allocation in your receive path for instance as long as the process has the
  flag set.
- lets assume your allocation happens with kmalloc() without __GFP_MEMALLOC
  and current->flags has PF_MEMALLOC ORed and your SLAB pool is empty. This
  forces SLAB to allocate more pages from the buddy allocator with it will
  receive more likely (due to ->current->flags + PF_MEMALLOC) but SLAB will
  drop this extra memory because the page has ->pf_memory (or something like
  that) set and the GFP_FLAGS do not have __GFP_MEMALLOC set.

Is there something I missed?

Sebastian
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mel Gorman July 9, 2012, 10:04 a.m. UTC | #4
On Sun, Jul 08, 2012 at 08:12:11PM +0200, Sebastian Andrzej Siewior wrote:
> On Wed, Jun 27, 2012 at 09:26:14AM +0100, Mel Gorman wrote:
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index b6c0727..5c6d9c6 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -2265,7 +2265,11 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > > >  	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> > > >  		if (gfp_mask & __GFP_MEMALLOC)
> > > >  			alloc_flags |= ALLOC_NO_WATERMARKS;
> > > > -		else if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
> > > > +		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
> > > > +			alloc_flags |= ALLOC_NO_WATERMARKS;
> > > > +		else if (!in_interrupt() &&
> > > > +				((current->flags & PF_MEMALLOC) ||
> > > > +				 unlikely(test_thread_flag(TIF_MEMDIE))))
> > > >  			alloc_flags |= ALLOC_NO_WATERMARKS;
> > > >  	}
> > > 
> > > You allocate in RX path with __GFP_MEMALLOC and your sk->sk_allocation has
> > > also __GFP_MEMALLOC set. That means you should get ALLOC_NO_WATERMARKS in
> > > alloc_flags.
> > 
> > In the cases where they are annotated correctly, yes. It is recordeed if
> > the page gets allocated from the PFMEMALLOC reserves. If the received
> > packet is not SOCK_MEMALLOC and the page was allocated from PFMEMALLOC
> > reserves it is then discarded and the packet must be retransmitted.
> 
> Let me try again:
> - lets assume your allocation happens with alloc_page(), without
>   __GFP_MEMALLOC in GFP_FLAGS and with PF_MEMALLOC in current->flags. Now
>   you may get memory which you wouldn't receive otherwise (without
>   PF_MEMALLOC). Okay, understood. So you don't have to annotate each page
>   allocation in your receive path for instance as long as the process has the
>   flag set.

Yes.

> - lets assume your allocation happens with kmalloc() without __GFP_MEMALLOC
>   and current->flags has PF_MEMALLOC ORed and your SLAB pool is empty. This
>   forces SLAB to allocate more pages from the buddy allocator with it will
>   receive more likely (due to ->current->flags + PF_MEMALLOC) but SLAB will
>   drop this extra memory because the page has ->pf_memory (or something like
>   that) set and the GFP_FLAGS do not have __GFP_MEMALLOC set.
> 

It's recorded if the slab page was allocated from PFMEMALLOC reserves (see
patch 2 from the swap over NBD series). slab will use this page for objects
but only allocate them to callers that pass a gfp_pfmemalloc_allowed() check.
kmalloc() users with either __GFP_MEMALLOC or PF_MEMALLOC will get
the pages they need but they will not "leak" to !_GFP_MEMALLOC users as
that would potentially deadlock.
Sebastian Andrzej Siewior July 9, 2012, 4:57 p.m. UTC | #5
On Mon, Jul 09, 2012 at 11:04:42AM +0100, Mel Gorman wrote:
> > - lets assume your allocation happens with kmalloc() without __GFP_MEMALLOC
> >   and current->flags has PF_MEMALLOC ORed and your SLAB pool is empty. This
> >   forces SLAB to allocate more pages from the buddy allocator with it will
> >   receive more likely (due to ->current->flags + PF_MEMALLOC) but SLAB will
> >   drop this extra memory because the page has ->pf_memory (or something like
> >   that) set and the GFP_FLAGS do not have __GFP_MEMALLOC set.
> > 
> 
> It's recorded if the slab page was allocated from PFMEMALLOC reserves (see
> patch 2 from the swap over NBD series). slab will use this page for objects
> but only allocate them to callers that pass a gfp_pfmemalloc_allowed() check.
> kmalloc() users with either __GFP_MEMALLOC or PF_MEMALLOC will get
> the pages they need but they will not "leak" to !_GFP_MEMALLOC users as
> that would potentially deadlock.

Argh, I missed that gfp_to_alloc_flags() is not only called from
within the buddy allocater but also from slab. So this is fine then :)

One thing:
You only get current->flags |= PF_MEMALLOC in softirq _if_ the skb, which is 
passed to netif_receive_skb(), was allocated with __GFP_MEMALLOC. That
means if the NIC's RX allocation did not require an allocation from the
emergency pool (without ->pfmemalloc set) then you never use this extra
pool, even if this skb would end up in your swap socket. Also, the other way
around, where you allocate it from the emergency pool but it is a user
socket and you could drop it.

What about extending sk_set_memalloc() to record socket's ips + ports
in a separate list so that skb_pfmemalloc_protocol() might use that
information and decide on per-protocol basis if the skb is worth to
spend more ressource to deliver it. That means you would enable the
extra pool if the currently received skb is part of your swap socket and
not if the skb was allocated from the emergency pool.

That said, there is nothing wrong with the code as of now and this
optimization could be added later (if at all).

Sebastian
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mel Gorman July 10, 2012, 11:09 a.m. UTC | #6
On Mon, Jul 09, 2012 at 06:57:10PM +0200, Sebastian Andrzej Siewior wrote:
> On Mon, Jul 09, 2012 at 11:04:42AM +0100, Mel Gorman wrote:
> > > - lets assume your allocation happens with kmalloc() without __GFP_MEMALLOC
> > >   and current->flags has PF_MEMALLOC ORed and your SLAB pool is empty. This
> > >   forces SLAB to allocate more pages from the buddy allocator with it will
> > >   receive more likely (due to ->current->flags + PF_MEMALLOC) but SLAB will
> > >   drop this extra memory because the page has ->pf_memory (or something like
> > >   that) set and the GFP_FLAGS do not have __GFP_MEMALLOC set.
> > > 
> > 
> > It's recorded if the slab page was allocated from PFMEMALLOC reserves (see
> > patch 2 from the swap over NBD series). slab will use this page for objects
> > but only allocate them to callers that pass a gfp_pfmemalloc_allowed() check.
> > kmalloc() users with either __GFP_MEMALLOC or PF_MEMALLOC will get
> > the pages they need but they will not "leak" to !_GFP_MEMALLOC users as
> > that would potentially deadlock.
> 
> Argh, I missed that gfp_to_alloc_flags() is not only called from
> within the buddy allocater but also from slab. So this is fine then :)
> 

Good to hear. I appreciate you taking the time to give it a solid review
like this looking for holes.

> One thing:
> You only get current->flags |= PF_MEMALLOC in softirq _if_ the skb, which is 
> passed to netif_receive_skb(), was allocated with __GFP_MEMALLOC. That
> means if the NIC's RX allocation did not require an allocation from the
> emergency pool (without ->pfmemalloc set) then you never use this extra
> pool, even if this skb would end up in your swap socket. Also, the other way
> around, where you allocate it from the emergency pool but it is a user
> socket and you could drop it.
> 

While there is a possibility that packets may get dropped later like this,
they still get retransmitted and eventually it'll get through.  This is
not optimal but optimised swap-over-network was not the primary goal of
the series, deadlock avoidance was.

> What about extending sk_set_memalloc() to record socket's ips + ports
> in a separate list so that skb_pfmemalloc_protocol() might use that
> information and decide on per-protocol basis if the skb is worth to
> spend more ressource to deliver it. That means you would enable the
> extra pool if the currently received skb is part of your swap socket and
> not if the skb was allocated from the emergency pool.
> 
> That said, there is nothing wrong with the code as of now and this
> optimization could be added later (if at all).
> 

I think it is a good idea but it could also be done later iff a user had
a serious problem with the performance and that this made a measurable
difference. The series is already quite complex and I'd rather not add to
that complexity without strong motivation.
diff mbox

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 08384db..706e405 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1913,6 +1913,13 @@  static inline void rcu_switch_from(struct task_struct *prev)
 
 #endif
 
+static inline void tsk_restore_flags(struct task_struct *task,
+				unsigned long orig_flags, unsigned long flags)
+{
+	task->flags &= ~flags;
+	task->flags |= orig_flags & flags;
+}
+
 #ifdef CONFIG_SMP
 extern void do_set_cpus_allowed(struct task_struct *p,
 			       const struct cpumask *new_mask);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 671f959..b73e681 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -210,6 +210,14 @@  asmlinkage void __do_softirq(void)
 	__u32 pending;
 	int max_restart = MAX_SOFTIRQ_RESTART;
 	int cpu;
+	unsigned long old_flags = current->flags;
+
+	/*
+	 * Mask out PF_MEMALLOC s current task context is borrowed for the
+	 * softirq. A softirq handled such as network RX might set PF_MEMALLOC
+	 * again if the socket is related to swap
+	 */
+	current->flags &= ~PF_MEMALLOC;
 
 	pending = local_softirq_pending();
 	account_system_vtime(current);
@@ -265,6 +273,7 @@  restart:
 
 	account_system_vtime(current);
 	__local_bh_enable(SOFTIRQ_OFFSET);
+	tsk_restore_flags(current, old_flags, PF_MEMALLOC);
 }
 
 #ifndef __ARCH_HAS_DO_SOFTIRQ
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b6c0727..5c6d9c6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2265,7 +2265,11 @@  gfp_to_alloc_flags(gfp_t gfp_mask)
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
 		if (gfp_mask & __GFP_MEMALLOC)
 			alloc_flags |= ALLOC_NO_WATERMARKS;
-		else if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
+		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (!in_interrupt() &&
+				((current->flags & PF_MEMALLOC) ||
+				 unlikely(test_thread_flag(TIF_MEMDIE))))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}