Message ID | alpine.DEB.2.00.1105101242420.2875@router.home |
---|---|
State | Not Applicable, archived |
Delegated to: | David Miller |
Headers | show |
Le mardi 10 mai 2011 à 12:43 -0500, Christoph Lameter a écrit : > Draft for a patch > > > Subject: slub: Make CONFIG_PAGE_ALLOC work with new fastpath > > Fastpath can do a speculative access to a page that CONFIG_PAGE_ALLOC may have > marked as invalid to retrieve the pointer to the next free object. > > Probe that address before dereferencing the pointer to the page. > All of that needs to occur with interrupts disabled since an interrupt > could cause the page status to change (as pointed out by Eric). > > Signed-off-by: Christoph Lameter <cl@linux.com> > --- > mm/slub.c | 23 ++++++++++++++++++++++- > 1 file changed, 22 insertions(+), 1 deletion(-) > > Index: linux-2.6/mm/slub.c > =================================================================== > --- linux-2.6.orig/mm/slub.c 2011-05-10 12:35:30.000000000 -0500 > +++ linux-2.6/mm/slub.c 2011-05-10 12:38:53.000000000 -0500 > @@ -261,6 +261,27 @@ static inline void *get_freepointer(stru > return *(void **)(object + s->offset); > } > > +static inline void *get_freepointer_safe(struct kmem_cache *s, void *object) > +{ > + void *p; > + > +#ifdef CONFIG_PAGE_ALLOC > + unsigned long flags; > + > + local_irq_save(flags); > + > + if (probe_kernel_address(object)) > + p = NULL; /* Invalid */ > + else > + p = get_freepointer(s, object); > + > + local_irq_restore(flags); > +#else > + p = get_freepointer(s, object); > +#endif > + return p; > +} > + > static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp) > { > *(void **)(object + s->offset) = fp; > @@ -1933,7 +1954,7 @@ redo: > if (unlikely(!irqsafe_cpu_cmpxchg_double( > s->cpu_slab->freelist, s->cpu_slab->tid, > object, tid, > - get_freepointer(s, object), next_tid(tid)))) { > + get_freepointer_safe(s, object), next_tid(tid)))) { > > note_cmpxchg_failure("slab_alloc", s, tid); > goto redo; Really this wont work Stephen You have to disable IRQ _before_ even fetching 'object' Or else, you can have an IRQ, allocate this object, pass to another cpu. This other cpu can free the object and unmap page right after you did the probe_kernel_address(object) (successfully), and before your cpu : p = get_freepointer(s, object); << BUG >> I really dont understand your motivation to keep the buggy commit. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 10 May 2011, Eric Dumazet wrote: > > + else > > + p = get_freepointer(s, object); > > + > > + local_irq_restore(flags); > > +#else > > + p = get_freepointer(s, object); > > +#endif > > + return p; > > +} > > + > > static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp) > > { > > *(void **)(object + s->offset) = fp; > > @@ -1933,7 +1954,7 @@ redo: > > if (unlikely(!irqsafe_cpu_cmpxchg_double( > > s->cpu_slab->freelist, s->cpu_slab->tid, > > object, tid, > > - get_freepointer(s, object), next_tid(tid)))) { > > + get_freepointer_safe(s, object), next_tid(tid)))) { > > > > note_cmpxchg_failure("slab_alloc", s, tid); > > goto redo; > > > Really this wont work Stephen I am not Stephen. > You have to disable IRQ _before_ even fetching 'object' The object pointer is being obtained from a per cpu structure and not from the page. What is the problem with fetching the object pointer? > Or else, you can have an IRQ, allocate this object, pass to another cpu. If that occurs then TID is being incremented and we will restart the loop getting the new object pointer from the per cpu structure. The object pointer that we were considering is irrelevant. > This other cpu can free the object and unmap page right after you did > the probe_kernel_address(object) (successfully), and before your cpu : > > p = get_freepointer(s, object); << BUG >> If the other cpu frees the object and unmaps the page then get_freepointer_safe() can obtain an arbitrary value since the TID was incremented. We will restart the loop and discard the value retrieved. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 10 May 2011, Christoph Lameter wrote: > > This other cpu can free the object and unmap page right after you did > > the probe_kernel_address(object) (successfully), and before your cpu : > > > > p = get_freepointer(s, object); << BUG >> > > If the other cpu frees the object and unmaps the page then > get_freepointer_safe() can obtain an arbitrary value since the TID was > incremented. We will restart the loop and discard the value retrieved. Ok. Forgot the element there of a different cpu. A different cpu cannot unmap the page or free the page since the page is in a frozen state while we allocate from it. The page is only handled by the cpu it was assigned to until the cpu which froze it releases it. The only case that we need to protect against here is the case when an interrupt or reschedule causes the *same* cpu to release the page. In that case the TID must have been incremented. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Le mardi 10 mai 2011 à 13:28 -0500, Christoph Lameter a écrit : > On Tue, 10 May 2011, Eric Dumazet wrote: > > > > + else > > > + p = get_freepointer(s, object); > > > + > > > + local_irq_restore(flags); > > > +#else > > > + p = get_freepointer(s, object); > > > +#endif > > > + return p; > > > +} > > > + > > > static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp) > > > { > > > *(void **)(object + s->offset) = fp; > > > @@ -1933,7 +1954,7 @@ redo: > > > if (unlikely(!irqsafe_cpu_cmpxchg_double( > > > s->cpu_slab->freelist, s->cpu_slab->tid, > > > object, tid, > > > - get_freepointer(s, object), next_tid(tid)))) { > > > + get_freepointer_safe(s, object), next_tid(tid)))) { > > > > > > note_cmpxchg_failure("slab_alloc", s, tid); > > > goto redo; > > > > > > Really this wont work Stephen > > I am not Stephen. > Yes, sorry Christoph. > > You have to disable IRQ _before_ even fetching 'object' > > The object pointer is being obtained from a per cpu structure and > not from the page. What is the problem with fetching the object pointer? > > > Or else, you can have an IRQ, allocate this object, pass to another cpu. > > If that occurs then TID is being incremented and we will restart the loop > getting the new object pointer from the per cpu structure. The object > pointer that we were considering is irrelevant. > Problem is not restart the loop, but avoiding accessing a non valid memory area. > > This other cpu can free the object and unmap page right after you did > > the probe_kernel_address(object) (successfully), and before your cpu : > > > > p = get_freepointer(s, object); << BUG >> > > If the other cpu frees the object and unmaps the page then > get_freepointer_safe() can obtain an arbitrary value since the TID was > incremented. We will restart the loop and discard the value retrieved. > In current code I see : tid = c->tid; barrier(); object = c->freelist; There is no guarantee c->tid is fetched before c->freelist by cpu. You need rmb() here. I claim all this would be far more simple disabling IRQ before fetching c->tid and c->freelist, in DEBUG_PAGE_ALLOC case. You would not even need to use magic probe_kernel_read() Why do you try so _hard_ trying to optimize this, I really wonder. Nobody is able to read this code anymore and prove its correct. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 10 May 2011, Eric Dumazet wrote: > > > You have to disable IRQ _before_ even fetching 'object' > > > > The object pointer is being obtained from a per cpu structure and > > not from the page. What is the problem with fetching the object pointer? > > > > > Or else, you can have an IRQ, allocate this object, pass to another cpu. > > > > If that occurs then TID is being incremented and we will restart the loop > > getting the new object pointer from the per cpu structure. The object > > pointer that we were considering is irrelevant. > > > > Problem is not restart the loop, but avoiding accessing a non valid > memory area. Yes and could you please explain clearly what the problem is? > > > > This other cpu can free the object and unmap page right after you did > > > the probe_kernel_address(object) (successfully), and before your cpu : > > > > > > p = get_freepointer(s, object); << BUG >> > > > > If the other cpu frees the object and unmaps the page then > > get_freepointer_safe() can obtain an arbitrary value since the TID was > > incremented. We will restart the loop and discard the value retrieved. > > > > > > In current code I see : > > tid = c->tid; > barrier(); > object = c->freelist; > > There is no guarantee c->tid is fetched before c->freelist by cpu. > > You need rmb() here. Nope. This is not processor to processor concurrency. this_cpu operations only deal with concurrency issues on the same processor. I.e. interrupts and preemption. > I claim all this would be far more simple disabling IRQ before fetching > c->tid and c->freelist, in DEBUG_PAGE_ALLOC case. > > You would not even need to use magic probe_kernel_read() > > > Why do you try so _hard_ trying to optimize this, I really wonder. > Nobody is able to read this code anymore and prove its correct. Optimizing? You think about this as concurrency issue between multiple cpus. That is fundamentally wrong. This is dealing with access to per cpu data and the concurrency issues are only with code running on the *same* cpu. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Le mardi 10 mai 2011 à 14:38 -0500, Christoph Lameter a écrit : > Optimizing? You think about this as concurrency issue between multiple > cpus. That is fundamentally wrong. This is dealing with access to per cpu > data and the concurrency issues are only with code running on the *same* > cpu. > If you enable irqs, then this object can be allocated by _this_ cpu and given to another one. Another cpu can free the page, forcing you to call a very expensive function, that might give obsolete result as soon it returns. Maybe I am just tired tonight, this seems very obvious, I must miss something. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 10 May 2011, Eric Dumazet wrote: > Le mardi 10 mai 2011 à 14:38 -0500, Christoph Lameter a écrit : > > > Optimizing? You think about this as concurrency issue between multiple > > cpus. That is fundamentally wrong. This is dealing with access to per cpu > > data and the concurrency issues are only with code running on the *same* > > cpu. > > > > If you enable irqs, then this object can be allocated by _this_ cpu and > given to another one. That will cause an incrementing of the tid. > Another cpu can free the page, forcing you to call a very expensive > function, that might give obsolete result as soon it returns. No the other cpu cannot free the page since the page is pinned by the current cpu (see PageFrozen()). > Maybe I am just tired tonight, this seems very obvious, I must miss > something. Yeah you are way off thinking about cpu to cpu concurrency issues that do not apply here.
Le mardi 10 mai 2011 à 15:33 -0500, Christoph Lameter a écrit : > On Tue, 10 May 2011, Eric Dumazet wrote: > > > Le mardi 10 mai 2011 à 14:38 -0500, Christoph Lameter a écrit : > > > > > Optimizing? You think about this as concurrency issue between multiple > > > cpus. That is fundamentally wrong. This is dealing with access to per cpu > > > data and the concurrency issues are only with code running on the *same* > > > cpu. > > > > > > > If you enable irqs, then this object can be allocated by _this_ cpu and > > given to another one. > > That will cause an incrementing of the tid. > > > Another cpu can free the page, forcing you to call a very expensive > > function, that might give obsolete result as soon it returns. > > No the other cpu cannot free the page since the page is pinned by > the current cpu (see PageFrozen()). > What happens then ? Other cpu calls kfree() on last nonfreed object for this slab, and yet the page stay frozen ? How this page is going to be freed at all ? > > Maybe I am just tired tonight, this seems very obvious, I must miss > > something. > > Yeah you are way off thinking about cpu to cpu concurrency issues that do > not apply here. I fail to understand how current cpu can assert page ownership, if IRQs are enabled, this seems obvious it cannot. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 10 May 2011, Eric Dumazet wrote: > > No the other cpu cannot free the page since the page is pinned by > > the current cpu (see PageFrozen()). > > > > What happens then ? Other cpu calls kfree() on last nonfreed object for > this slab, and yet the page stay frozen ? How this page is going to be > freed at all ? Yes the page stays frozen. The freed objects are used to replenish the percpu free list when it becomes empty. The page is going to be freed when a kmalloc() finds that the per cpu freelist is empty and that the freelist of the page is also empty. Then interrupts are disabled, the old page is unfrozen and a new page is acquired for allocation. > > > Maybe I am just tired tonight, this seems very obvious, I must miss > > > something. > > > > Yeah you are way off thinking about cpu to cpu concurrency issues that do > > not apply here. > > I fail to understand how current cpu can assert page ownership, if IRQs > are enabled, this seems obvious it cannot. The cpu sets a page flag called PageFrozen() and points a per cpu pointer to the page. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Le mardi 10 mai 2011 à 16:22 -0500, Christoph Lameter a écrit : > On Tue, 10 May 2011, Eric Dumazet wrote: > > > > No the other cpu cannot free the page since the page is pinned by > > > the current cpu (see PageFrozen()). > > > > > > > What happens then ? Other cpu calls kfree() on last nonfreed object for > > this slab, and yet the page stay frozen ? How this page is going to be > > freed at all ? > > Yes the page stays frozen. The freed objects are used to replenish the > percpu free list when it becomes empty. > > The page is going to be freed when a kmalloc() finds that the per cpu > freelist is empty and that the freelist of the page is also empty. Then > interrupts are disabled, the old page is unfrozen and a new > page is acquired for allocation. > > > > > Maybe I am just tired tonight, this seems very obvious, I must miss > > > > something. > > > > > > Yeah you are way off thinking about cpu to cpu concurrency issues that do > > > not apply here. > > > > I fail to understand how current cpu can assert page ownership, if IRQs > > are enabled, this seems obvious it cannot. > > The cpu sets a page flag called PageFrozen() and points a per cpu pointer > to the page. > > So, if I understand you, there is no problem at all and no patch even needed ? I can start a stress test and you guarantee there wont be a crash ? Sorry, its 5h11 in the morning here ;) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 11 May 2011, Eric Dumazet wrote: > > The cpu sets a page flag called PageFrozen() and points a per cpu pointer > > to the page. > > > > > > So, if I understand you, there is no problem at all and no patch even > needed ? I can start a stress test and you guarantee there wont be a > crash ? > > Sorry, its 5h11 in the morning here ;) There is a problem if an interrupt or a preemption occurs and there is no object left on the page. Then the current page will be unfrozen and a new page put into place for allocation. The old page may then be freed by some other process on another processor before we continue the interrupted slab_alloc(). When slab_alloc() resumes in this scenario then it will ultimately see that the tid was incremented and so the cmpxchg will fail. But before we do the cmpxchgwe determine the pointer to the next object. And for that we access the old page. The access must not cause a page fault (which it currently does with CONFIG_DEBUG_PAGEALLOC). That is why we need the patch introducing get_freepointer_safe() The result does not matter since we will repeat the cmpxchg loop. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Index: linux-2.6/mm/slub.c =================================================================== --- linux-2.6.orig/mm/slub.c 2011-05-10 12:35:30.000000000 -0500 +++ linux-2.6/mm/slub.c 2011-05-10 12:38:53.000000000 -0500 @@ -261,6 +261,27 @@ static inline void *get_freepointer(stru return *(void **)(object + s->offset); } +static inline void *get_freepointer_safe(struct kmem_cache *s, void *object) +{ + void *p; + +#ifdef CONFIG_PAGE_ALLOC + unsigned long flags; + + local_irq_save(flags); + + if (probe_kernel_address(object)) + p = NULL; /* Invalid */ + else + p = get_freepointer(s, object); + + local_irq_restore(flags); +#else + p = get_freepointer(s, object); +#endif + return p; +} + static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp) { *(void **)(object + s->offset) = fp; @@ -1933,7 +1954,7 @@ redo: if (unlikely(!irqsafe_cpu_cmpxchg_double( s->cpu_slab->freelist, s->cpu_slab->tid, object, tid, - get_freepointer(s, object), next_tid(tid)))) { + get_freepointer_safe(s, object), next_tid(tid)))) { note_cmpxchg_failure("slab_alloc", s, tid); goto redo;
Draft for a patch Subject: slub: Make CONFIG_PAGE_ALLOC work with new fastpath Fastpath can do a speculative access to a page that CONFIG_PAGE_ALLOC may have marked as invalid to retrieve the pointer to the next free object. Probe that address before dereferencing the pointer to the page. All of that needs to occur with interrupts disabled since an interrupt could cause the page status to change (as pointed out by Eric). Signed-off-by: Christoph Lameter <cl@linux.com> --- mm/slub.c | 23 ++++++++++++++++++++++- 1 file changed, 22 insertions(+), 1 deletion(-) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html