diff mbox

[Bugme-new,Bug,33502] New: Caught 64-bit read from uninitialized memory in __alloc_skb

Message ID alpine.DEB.2.00.1105101242420.2875@router.home
State Not Applicable, archived
Delegated to: David Miller
Headers show

Commit Message

Christoph Lameter (Ampere) May 10, 2011, 5:43 p.m. UTC
Draft for a patch


Subject: slub: Make CONFIG_PAGE_ALLOC work with new fastpath

Fastpath can do a speculative access to a page that CONFIG_PAGE_ALLOC may have
marked as invalid to retrieve the pointer to the next free object.

Probe that address before dereferencing the pointer to the page.
All of that needs to occur with interrupts disabled since an interrupt
could cause the page status to change (as pointed out by Eric).

Signed-off-by: Christoph Lameter <cl@linux.com>
---
 mm/slub.c |   23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet May 10, 2011, 6:05 p.m. UTC | #1
Le mardi 10 mai 2011 à 12:43 -0500, Christoph Lameter a écrit :
> Draft for a patch
> 
> 
> Subject: slub: Make CONFIG_PAGE_ALLOC work with new fastpath
> 
> Fastpath can do a speculative access to a page that CONFIG_PAGE_ALLOC may have
> marked as invalid to retrieve the pointer to the next free object.
> 
> Probe that address before dereferencing the pointer to the page.
> All of that needs to occur with interrupts disabled since an interrupt
> could cause the page status to change (as pointed out by Eric).
> 
> Signed-off-by: Christoph Lameter <cl@linux.com>
> ---
>  mm/slub.c |   23 ++++++++++++++++++++++-
>  1 file changed, 22 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c	2011-05-10 12:35:30.000000000 -0500
> +++ linux-2.6/mm/slub.c	2011-05-10 12:38:53.000000000 -0500
> @@ -261,6 +261,27 @@ static inline void *get_freepointer(stru
>  	return *(void **)(object + s->offset);
>  }
> 
> +static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
> +{
> +	void *p;
> +
> +#ifdef CONFIG_PAGE_ALLOC
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +
> +	if (probe_kernel_address(object))
> +		p = NULL;	/* Invalid */
> +	else
> +		p = get_freepointer(s, object);
> +
> +	local_irq_restore(flags);
> +#else
> +	p = get_freepointer(s, object);
> +#endif
> +	return p;
> +}
> +
>  static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
>  {
>  	*(void **)(object + s->offset) = fp;
> @@ -1933,7 +1954,7 @@ redo:
>  		if (unlikely(!irqsafe_cpu_cmpxchg_double(
>  				s->cpu_slab->freelist, s->cpu_slab->tid,
>  				object, tid,
> -				get_freepointer(s, object), next_tid(tid)))) {
> +				get_freepointer_safe(s, object), next_tid(tid)))) {
> 
>  			note_cmpxchg_failure("slab_alloc", s, tid);
>  			goto redo;


Really this wont work Stephen

You have to disable IRQ _before_ even fetching 'object'

Or else, you can have an IRQ, allocate this object, pass to another cpu.

This other cpu can free the object and unmap page right after you did
the probe_kernel_address(object) (successfully), and before your cpu :

p = get_freepointer(s, object); << BUG >>

I really dont understand your motivation to keep the buggy commit.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter (Ampere) May 10, 2011, 6:28 p.m. UTC | #2
On Tue, 10 May 2011, Eric Dumazet wrote:

> > +	else
> > +		p = get_freepointer(s, object);
> > +
> > +	local_irq_restore(flags);
> > +#else
> > +	p = get_freepointer(s, object);
> > +#endif
> > +	return p;
> > +}
> > +
> >  static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
> >  {
> >  	*(void **)(object + s->offset) = fp;
> > @@ -1933,7 +1954,7 @@ redo:
> >  		if (unlikely(!irqsafe_cpu_cmpxchg_double(
> >  				s->cpu_slab->freelist, s->cpu_slab->tid,
> >  				object, tid,
> > -				get_freepointer(s, object), next_tid(tid)))) {
> > +				get_freepointer_safe(s, object), next_tid(tid)))) {
> >
> >  			note_cmpxchg_failure("slab_alloc", s, tid);
> >  			goto redo;
>
>
> Really this wont work Stephen

I am not Stephen.

> You have to disable IRQ _before_ even fetching 'object'

The object pointer is being obtained from a per cpu structure and
not from the page. What is the problem with fetching the object pointer?

> Or else, you can have an IRQ, allocate this object, pass to another cpu.

If that occurs then TID is being incremented and we will restart the loop
getting the new object pointer from the per cpu structure. The object
pointer that we were considering is irrelevant.

> This other cpu can free the object and unmap page right after you did
> the probe_kernel_address(object) (successfully), and before your cpu :
>
> p = get_freepointer(s, object); << BUG >>

If the other cpu frees the object and unmaps the page then
get_freepointer_safe() can obtain an arbitrary value since the TID was
incremented. We will restart the loop and discard the value retrieved.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter (Ampere) May 10, 2011, 7:05 p.m. UTC | #3
On Tue, 10 May 2011, Christoph Lameter wrote:

> > This other cpu can free the object and unmap page right after you did
> > the probe_kernel_address(object) (successfully), and before your cpu :
> >
> > p = get_freepointer(s, object); << BUG >>
>
> If the other cpu frees the object and unmaps the page then
> get_freepointer_safe() can obtain an arbitrary value since the TID was
> incremented. We will restart the loop and discard the value retrieved.

Ok. Forgot the element there of a different cpu. A different cpu cannot
unmap the page or free the page since the page is in a frozen state while
we allocate from it.

The page is only handled by the cpu it was assigned to until the cpu which
froze it releases it.

The only case that we need to protect against here is the case when an
interrupt or reschedule causes the *same* cpu to release the page. In that
case the TID must have been incremented.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet May 10, 2011, 7:32 p.m. UTC | #4
Le mardi 10 mai 2011 à 13:28 -0500, Christoph Lameter a écrit :
> On Tue, 10 May 2011, Eric Dumazet wrote:
> 
> > > +	else
> > > +		p = get_freepointer(s, object);
> > > +
> > > +	local_irq_restore(flags);
> > > +#else
> > > +	p = get_freepointer(s, object);
> > > +#endif
> > > +	return p;
> > > +}
> > > +
> > >  static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
> > >  {
> > >  	*(void **)(object + s->offset) = fp;
> > > @@ -1933,7 +1954,7 @@ redo:
> > >  		if (unlikely(!irqsafe_cpu_cmpxchg_double(
> > >  				s->cpu_slab->freelist, s->cpu_slab->tid,
> > >  				object, tid,
> > > -				get_freepointer(s, object), next_tid(tid)))) {
> > > +				get_freepointer_safe(s, object), next_tid(tid)))) {
> > >
> > >  			note_cmpxchg_failure("slab_alloc", s, tid);
> > >  			goto redo;
> >
> >
> > Really this wont work Stephen
> 
> I am not Stephen.
> 

Yes, sorry Christoph.

> > You have to disable IRQ _before_ even fetching 'object'
> 
> The object pointer is being obtained from a per cpu structure and
> not from the page. What is the problem with fetching the object pointer?
> 
> > Or else, you can have an IRQ, allocate this object, pass to another cpu.
> 
> If that occurs then TID is being incremented and we will restart the loop
> getting the new object pointer from the per cpu structure. The object
> pointer that we were considering is irrelevant.
> 

Problem is not restart the loop, but avoiding accessing a non valid
memory area.

> > This other cpu can free the object and unmap page right after you did
> > the probe_kernel_address(object) (successfully), and before your cpu :
> >
> > p = get_freepointer(s, object); << BUG >>
> 
> If the other cpu frees the object and unmaps the page then
> get_freepointer_safe() can obtain an arbitrary value since the TID was
> incremented. We will restart the loop and discard the value retrieved.
> 



In current code I see :

tid = c->tid;
barrier();
object = c->freelist;

There is no guarantee c->tid is fetched before c->freelist by cpu.

You need rmb() here.


I claim all this would be far more simple disabling IRQ before fetching
c->tid and c->freelist, in DEBUG_PAGE_ALLOC case.

You would not even need to use magic probe_kernel_read()


Why do you try so _hard_ trying to optimize this, I really wonder.
Nobody is able to read this code anymore and prove its correct.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter (Ampere) May 10, 2011, 7:38 p.m. UTC | #5
On Tue, 10 May 2011, Eric Dumazet wrote:

> > > You have to disable IRQ _before_ even fetching 'object'
> >
> > The object pointer is being obtained from a per cpu structure and
> > not from the page. What is the problem with fetching the object pointer?
> >
> > > Or else, you can have an IRQ, allocate this object, pass to another cpu.
> >
> > If that occurs then TID is being incremented and we will restart the loop
> > getting the new object pointer from the per cpu structure. The object
> > pointer that we were considering is irrelevant.
> >
>
> Problem is not restart the loop, but avoiding accessing a non valid
> memory area.

Yes and could you please explain clearly what the problem is?

>
> > > This other cpu can free the object and unmap page right after you did
> > > the probe_kernel_address(object) (successfully), and before your cpu :
> > >
> > > p = get_freepointer(s, object); << BUG >>
> >
> > If the other cpu frees the object and unmaps the page then
> > get_freepointer_safe() can obtain an arbitrary value since the TID was
> > incremented. We will restart the loop and discard the value retrieved.
> >
>
>
>
> In current code I see :
>
> tid = c->tid;
> barrier();
> object = c->freelist;
>
> There is no guarantee c->tid is fetched before c->freelist by cpu.
>
> You need rmb() here.

Nope. This is not processor to processor concurrency. this_cpu operations
only deal with concurrency issues on the same processor. I.e. interrupts
and preemption.

> I claim all this would be far more simple disabling IRQ before fetching
> c->tid and c->freelist, in DEBUG_PAGE_ALLOC case.
>
> You would not even need to use magic probe_kernel_read()
>
>
> Why do you try so _hard_ trying to optimize this, I really wonder.
> Nobody is able to read this code anymore and prove its correct.

Optimizing? You think about this as concurrency issue between multiple
cpus. That is fundamentally wrong. This is dealing with access to per cpu
data and the concurrency issues are only with code running on the *same*
cpu.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet May 10, 2011, 8:06 p.m. UTC | #6
Le mardi 10 mai 2011 à 14:38 -0500, Christoph Lameter a écrit :

> Optimizing? You think about this as concurrency issue between multiple
> cpus. That is fundamentally wrong. This is dealing with access to per cpu
> data and the concurrency issues are only with code running on the *same*
> cpu.
> 

If you enable irqs, then this object can be allocated by _this_ cpu and
given to another one.

Another cpu can free the page, forcing you to call a very expensive
function, that might give obsolete result as soon it returns.

Maybe I am just tired tonight, this seems very obvious, I must miss
something.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter (Ampere) May 10, 2011, 8:33 p.m. UTC | #7
On Tue, 10 May 2011, Eric Dumazet wrote:

> Le mardi 10 mai 2011 à 14:38 -0500, Christoph Lameter a écrit :
>
> > Optimizing? You think about this as concurrency issue between multiple
> > cpus. That is fundamentally wrong. This is dealing with access to per cpu
> > data and the concurrency issues are only with code running on the *same*
> > cpu.
> >
>
> If you enable irqs, then this object can be allocated by _this_ cpu and
> given to another one.

That will cause an incrementing of the tid.

> Another cpu can free the page, forcing you to call a very expensive
> function, that might give obsolete result as soon it returns.

No the other cpu cannot free the page since the page is pinned by
the current cpu (see PageFrozen()).

> Maybe I am just tired tonight, this seems very obvious, I must miss
> something.

Yeah you are way off thinking about cpu to cpu concurrency issues that do
not apply here.
Eric Dumazet May 10, 2011, 8:45 p.m. UTC | #8
Le mardi 10 mai 2011 à 15:33 -0500, Christoph Lameter a écrit :
> On Tue, 10 May 2011, Eric Dumazet wrote:
> 
> > Le mardi 10 mai 2011 à 14:38 -0500, Christoph Lameter a écrit :
> >
> > > Optimizing? You think about this as concurrency issue between multiple
> > > cpus. That is fundamentally wrong. This is dealing with access to per cpu
> > > data and the concurrency issues are only with code running on the *same*
> > > cpu.
> > >
> >
> > If you enable irqs, then this object can be allocated by _this_ cpu and
> > given to another one.
> 
> That will cause an incrementing of the tid.
> 
> > Another cpu can free the page, forcing you to call a very expensive
> > function, that might give obsolete result as soon it returns.
> 
> No the other cpu cannot free the page since the page is pinned by
> the current cpu (see PageFrozen()).
> 

What happens then ? Other cpu calls kfree() on last nonfreed object for
this slab, and yet the page stay frozen ? How this page is going to be
freed at all ?

> > Maybe I am just tired tonight, this seems very obvious, I must miss
> > something.
> 
> Yeah you are way off thinking about cpu to cpu concurrency issues that do
> not apply here.

I fail to understand how current cpu can assert page ownership, if IRQs
are enabled, this seems obvious it cannot.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter (Ampere) May 10, 2011, 9:22 p.m. UTC | #9
On Tue, 10 May 2011, Eric Dumazet wrote:

> > No the other cpu cannot free the page since the page is pinned by
> > the current cpu (see PageFrozen()).
> >
>
> What happens then ? Other cpu calls kfree() on last nonfreed object for
> this slab, and yet the page stay frozen ? How this page is going to be
> freed at all ?

Yes the page stays frozen. The freed objects are used to replenish the
percpu free list when it becomes empty.

The page is going to be freed when a kmalloc() finds that the per cpu
freelist is empty and that the freelist of the page is also empty. Then
interrupts are disabled, the old page is unfrozen and a new
page is acquired for allocation.

> > > Maybe I am just tired tonight, this seems very obvious, I must miss
> > > something.
> >
> > Yeah you are way off thinking about cpu to cpu concurrency issues that do
> > not apply here.
>
> I fail to understand how current cpu can assert page ownership, if IRQs
> are enabled, this seems obvious it cannot.

The cpu sets a page flag called PageFrozen() and points a per cpu pointer
to the page.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet May 11, 2011, 3:12 a.m. UTC | #10
Le mardi 10 mai 2011 à 16:22 -0500, Christoph Lameter a écrit :
> On Tue, 10 May 2011, Eric Dumazet wrote:
> 
> > > No the other cpu cannot free the page since the page is pinned by
> > > the current cpu (see PageFrozen()).
> > >
> >
> > What happens then ? Other cpu calls kfree() on last nonfreed object for
> > this slab, and yet the page stay frozen ? How this page is going to be
> > freed at all ?
> 
> Yes the page stays frozen. The freed objects are used to replenish the
> percpu free list when it becomes empty.
> 
> The page is going to be freed when a kmalloc() finds that the per cpu
> freelist is empty and that the freelist of the page is also empty. Then
> interrupts are disabled, the old page is unfrozen and a new
> page is acquired for allocation.
> 
> > > > Maybe I am just tired tonight, this seems very obvious, I must miss
> > > > something.
> > >
> > > Yeah you are way off thinking about cpu to cpu concurrency issues that do
> > > not apply here.
> >
> > I fail to understand how current cpu can assert page ownership, if IRQs
> > are enabled, this seems obvious it cannot.
> 
> The cpu sets a page flag called PageFrozen() and points a per cpu pointer
> to the page.
> 
> 

So, if I understand you, there is no problem at all and no patch even
needed ? I can start a stress test and you guarantee there wont be a
crash ?

Sorry, its 5h11 in the morning here ;)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter (Ampere) May 12, 2011, 2:36 p.m. UTC | #11
On Wed, 11 May 2011, Eric Dumazet wrote:

> > The cpu sets a page flag called PageFrozen() and points a per cpu pointer
> > to the page.
> >
> >
>
> So, if I understand you, there is no problem at all and no patch even
> needed ? I can start a stress test and you guarantee there wont be a
> crash ?
>
> Sorry, its 5h11 in the morning here ;)

There is a problem if an interrupt or a preemption occurs and there is no
object left on the page. Then the current page will be unfrozen and a new
page put into place for allocation. The old page may then be freed by some other
process on another processor before we continue the interrupted
slab_alloc().

When slab_alloc() resumes in this scenario then it will ultimately see
that the tid was incremented and so the cmpxchg will fail. But before we
do the cmpxchgwe determine the pointer to the next object. And for that
we access the old page. The access must not cause a page fault (which it
currently does with CONFIG_DEBUG_PAGEALLOC). That is why we need the patch introducing

	get_freepointer_safe()

The result does not matter since we will repeat the cmpxchg loop.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2011-05-10 12:35:30.000000000 -0500
+++ linux-2.6/mm/slub.c	2011-05-10 12:38:53.000000000 -0500
@@ -261,6 +261,27 @@  static inline void *get_freepointer(stru
 	return *(void **)(object + s->offset);
 }

+static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
+{
+	void *p;
+
+#ifdef CONFIG_PAGE_ALLOC
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	if (probe_kernel_address(object))
+		p = NULL;	/* Invalid */
+	else
+		p = get_freepointer(s, object);
+
+	local_irq_restore(flags);
+#else
+	p = get_freepointer(s, object);
+#endif
+	return p;
+}
+
 static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
 {
 	*(void **)(object + s->offset) = fp;
@@ -1933,7 +1954,7 @@  redo:
 		if (unlikely(!irqsafe_cpu_cmpxchg_double(
 				s->cpu_slab->freelist, s->cpu_slab->tid,
 				object, tid,
-				get_freepointer(s, object), next_tid(tid)))) {
+				get_freepointer_safe(s, object), next_tid(tid)))) {

 			note_cmpxchg_failure("slab_alloc", s, tid);
 			goto redo;