Message ID | alpine.LSU.2.00.1103192041390.1592@sister.anvils (mailing list archive)
---|---
State | Not Applicable
Headers | show
On Sat, 2011-03-19 at 21:11 -0700, Hugh Dickins wrote:
>
> As I warned a few weeks ago, Jeremy has vmalloc apply_to_pte_range
> patches in mmotm, which again assault PowerPC's expectations, and
> cause lots of noise with CONFIG_PREEMPT=y CONFIG_PREEMPT_DEBUG=y.
>
> This time in vmalloc as well as vfree; and Peter's fix to the last
> lot, which went into 2.6.38, doesn't protect against these ones.
> Here's what I now see when I swapon and swapoff:

Right. And we said from day one we had the HARD WIRED assumption that
arch_enter/leave_lazy_mmu_mode() was ALWAYS going to be called within
a PTE lock section, and we did get reassurance that it was going to
remain so.

So why is it ok for them to change those and break us like that ?

Seriously, this is going out of control. If we can't even rely on
fundamental locking assumptions in the VM to remain reasonably stable,
or at least get some amount of -care- from who changes them as to
whether they break others, and work with us to fix them, wtf ?

I don't know what the right way to fix that is. We have an absolute
requirement that the batching we start within a lazy MMU section is
complete and flushed before any other PTE in that section can be
touched by anything else. Do we -at least- keep that guarantee ?

If yes, then maybe preempt_disable/enable() around
arch_enter/leave_lazy_mmu_mode() in apply_to_pte_range() would do...

Or maybe I should just prevent any batching of init_mm :-(

Cheers,
Ben.
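[The invariant Ben describes — a batch opened by arch_enter_lazy_mmu_mode() is fully flushed before any other path can touch those PTEs — comes from apply_to_pte_range() bracketing the lazy section inside the PTE lock for user mms. A minimal userspace model of that pattern; every name here is a mock, not kernel code:]

```c
/* Userspace model of the invariant above: the lazy-MMU batch is opened
 * and flushed entirely inside the PTE-lock section, so it is empty
 * before anyone else can observe those PTEs. All names are mocks. */
int ptl_held;          /* models the PTE lock */
int batch_pending;     /* deferred PTE updates */

void arch_enter_lazy_mmu_mode(void) { }
void arch_leave_lazy_mmu_mode(void) { batch_pending = 0; /* flush */ }

void set_pte(void) { batch_pending++; }   /* deferred while lazy */

/* The apply_to_pte_range() shape for user mms. */
void apply_range(int npages)
{
    ptl_held = 1;                  /* pte_alloc_map_lock() */
    arch_enter_lazy_mmu_mode();
    for (int i = 0; i < npages; i++)
        set_pte();
    arch_leave_lazy_mmu_mode();    /* batch flushed ... */
    ptl_held = 0;                  /* ... before the lock drops */
}
```

[The vmalloc patches under discussion break this shape for init_mm: the lazy section there runs without any lock, and preemptibly.]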
On Mon, 21 Mar 2011, Benjamin Herrenschmidt wrote:
> On Sat, 2011-03-19 at 21:11 -0700, Hugh Dickins wrote:
> >
> > As I warned a few weeks ago, Jeremy has vmalloc apply_to_pte_range
> > patches in mmotm, which again assault PowerPC's expectations, and
> > cause lots of noise with CONFIG_PREEMPT=y CONFIG_PREEMPT_DEBUG=y.
> >
> > This time in vmalloc as well as vfree; and Peter's fix to the last
> > lot, which went into 2.6.38, doesn't protect against these ones.
> > Here's what I now see when I swapon and swapoff:
>
> Right. And we said from day one we had the HARD WIRED assumption that
> arch_enter/leave_lazy_mmu_mode() was ALWAYS going to be called within
> a PTE lock section, and we did get reassurance that it was going to
> remain so.
>
> So why is it ok for them to change those and break us like that ?

It's not ok. Sounds like Andrew should not forward

mm-remove-unused-token-argument-from-apply_to_page_range-callback.patch
mm-add-apply_to_page_range_batch.patch
ioremap-use-apply_to_page_range_batch-for-ioremap_page_range.patch
vmalloc-use-plain-pte_clear-for-unmaps.patch
vmalloc-use-apply_to_page_range_batch-for-vunmap_page_range.patch
vmalloc-use-apply_to_page_range_batch-for-vmap_page_range_noflush.patch
vmalloc-use-apply_to_page_range_batch-in-alloc_vm_area.patch
xen-mmu-use-apply_to_page_range_batch-in-xen_remap_domain_mfn_range.patch
xen-grant-table-use-apply_to_page_range_batch.patch

or some subset (the vmalloc-use-apply ones? and the ioremap one?) of
that set to Linus for 2.6.39. Your call.

> Seriously, this is going out of control. If we can't even rely on
> fundamental locking assumptions in the VM to remain reasonably stable
> or at least get some amount of -care- from who changes them as to
> whether they break others and work with us to fix them, wtf ?

I know next to nothing of arch_enter/leave_lazy_mmu_mode(), and the
same is probably true of most mm developers. The only people who have
it defined to anything interesting appear to be powerpc and xen and
lguest: so it would be a gentleman's agreement between you and Jeremy
and Rusty. If Jeremy has changed the rules without your agreement, then
you can fight a duel at daybreak, or, since your daybreaks are at
different times, Jeremy's patches just shouldn't go forward yet.

> I don't know what the right way to fix that is. We have an absolute
> requirement that the batching we start within a lazy MMU section
> is complete and flushed before any other PTE in that section can be
> touched by anything else. Do we -at least- keep that guarantee ?

I'm guessing it's a guarantee of the same kind as led me to skip
page_table_lock on init_mm in 2.6.15: no locking to guarantee it, but
it would have to be a kernel bug, in a driver or wherever, for us to be
accessing such a section while it was in transit (short of speculative
access prior to tlb flush).

> If yes, then maybe preempt_disable/enable() around
> arch_enter/leave_lazy_mmu_mode() in apply_to_pte_range() would do...
>
> Or maybe I should just prevent any batching of init_mm :-(

I don't see where you're doing batching on init_mm today: it looks as
if Jeremy's patches, by using the same code as he has for user mms, are
now enabling batching on init_mm, and you should :-)

But I may be all wrong: it's between you and Jeremy, and until he
defends them, his patches should not go forward.

Hugh
On Sun, 2011-03-20 at 18:41 -0700, Hugh Dickins wrote:
> > I don't know what the right way to fix that is. We have an absolute
> > requirement that the batching we start within a lazy MMU section
> > is complete and flushed before any other PTE in that section can be
> > touched by anything else. Do we -at least- keep that guarantee ?
>
> I'm guessing it's a guarantee of the same kind as led me to skip
> page_table_lock on init_mm in 2.6.15: no locking to guarantee it,
> but it would have to be a kernel bug, in a driver or wherever,
> for us to be accessing such a section while it was in transit
> (short of speculative access prior to tlb flush).

As long as the races to avoid are between map/unmap vs. access, yes, it
-should- be fine, and we used to not do demand faulting on kernel space
(but for how long ?). I'm wondering why we don't just stick a ptl in
there, or is there a good reason why we can't ?

> I don't see where you're doing batching on init_mm today:
> it looks as if Jeremy's patches, by using the same code as he has
> for user mms, are now enabling batching on init_mm, and you should :-)
>
> But I may be all wrong: it's between you and Jeremy,
> and until he defends them, his patches should not go forward.

We don't do it today (batching). Jeremy's patches have the side effect
of "enabling" it, which isn't wrong per se ... but on our side relies
on some locking assumptions we are missing.

Cheers,
Ben.
On Mon, 21 Mar 2011, Benjamin Herrenschmidt wrote:
> As long as the races to avoid are between map/unmap vs. access, yes, it
> -should- be fine, and we used to not do demand faulting on kernel space
> (but for how long ?). I'm wondering why we don't just stick a ptl in
> there or is there a good reason why we can't ?

We can - but we usually prefer to avoid unnecessary locking.
An arch function which locks init_mm.page_table_lock on powerpc,
but does nothing on others?

Hugh
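[Hugh's suggestion — an arch hook that takes init_mm.page_table_lock on powerpc but compiles to nothing elsewhere — might look roughly like the userspace sketch below. The hook names and the config symbol are hypothetical, not existing kernel API, and the "lock" is modelled as a plain depth counter:]

```c
/* Sketch of Hugh's suggestion: an arch hook that takes
 * init_mm.page_table_lock on powerpc but is a no-op on other arches.
 * Userspace model: the "lock" is a depth counter. The hook names and
 * the config symbol are hypothetical, not existing kernel API. */
#define CONFIG_PPC_KERNEL_PTL 1   /* pretend this is a powerpc build */

struct mm_struct {
    int page_table_lock;          /* lock depth, 0 = unlocked */
};

struct mm_struct init_mm = { 0 };

#ifdef CONFIG_PPC_KERNEL_PTL
void arch_kernel_pte_lock(struct mm_struct *mm)
{
    if (mm == &init_mm)
        mm->page_table_lock++;    /* spin_lock(&init_mm.page_table_lock) */
}
void arch_kernel_pte_unlock(struct mm_struct *mm)
{
    if (mm == &init_mm)
        mm->page_table_lock--;    /* spin_unlock(...) */
}
#else
void arch_kernel_pte_lock(struct mm_struct *mm)   { (void)mm; }
void arch_kernel_pte_unlock(struct mm_struct *mm) { (void)mm; }
#endif
```

[User mms keep their normal pte locking; only init_mm updates gain the extra lock, and only on the one architecture that needs it.]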
On Sun, 2011-03-20 at 19:20 -0700, Hugh Dickins wrote:
> > As long as the races to avoid are between map/unmap vs. access, yes, it
> > -should- be fine, and we used to not do demand faulting on kernel space
> > (but for how long ?). I'm wondering why we don't just stick a ptl in
> > there or is there a good reason why we can't ?
>
> We can - but we usually prefer to avoid unnecessary locking.
> An arch function which locks init_mm.page_table_lock on powerpc,
> but does nothing on others?

That still means gratuitous differences between how the normal and
kernel page tables are handled. Maybe that's not worth bothering ...

Cheers,
Ben.
On 03/20/2011 11:53 PM, Benjamin Herrenschmidt wrote:
> On Sat, 2011-03-19 at 21:11 -0700, Hugh Dickins wrote:
>> As I warned a few weeks ago, Jeremy has vmalloc apply_to_pte_range
>> patches in mmotm, which again assault PowerPC's expectations, and
>> cause lots of noise with CONFIG_PREEMPT=y CONFIG_PREEMPT_DEBUG=y.
>>
>> This time in vmalloc as well as vfree; and Peter's fix to the last
>> lot, which went into 2.6.38, doesn't protect against these ones.
>> Here's what I now see when I swapon and swapoff:
>
> Right. And we said from day one we had the HARD WIRED assumption that
> arch_enter/leave_lazy_mmu_mode() was ALWAYS going to be called within
> a PTE lock section, and we did get reassurance that it was going to
> remain so.
>
> So why is it ok for them to change those and break us like that ?

In general, the pagetable locking rules are that all *usermode* pte
updates have to be done under a pte lock, but kernel mode ones do not;
they generally have some kind of per-subsystem ad-hoc locking where
needed, which may or may not be no-preempt.

Originally enter/leave_lazy_mmu_mode did require preemption to be
disabled for the whole time, but that was incompatible with the above
locking rules, and resulted in preemption being disabled for long
periods when using lazy mode, which wouldn't normally happen. This
raised a number of complaints.

To address this, I changed the x86 implementation to deal with
preemption in lazy mode by dropping out of lazy mode at context switch
time (recording the fact that we were in lazy mode with a TIF flag, and
re-entering on the next context switch).

> Seriously, this is going out of control. If we can't even rely on
> fundamental locking assumptions in the VM to remain reasonably stable
> or at least get some amount of -care- from who changes them as to
> whether they break others and work with us to fix them, wtf ?
>
> I don't know what the right way to fix that is. We have an absolute
> requirement that the batching we start within a lazy MMU section
> is complete and flushed before any other PTE in that section can be
> touched by anything else. Do we -at least- keep that guarantee ?
>
> If yes, then maybe preempt_disable/enable() around
> arch_enter/leave_lazy_mmu_mode() in apply_to_pte_range() would do...
>
> Or maybe I should just prevent any batching of init_mm :-(

I'm very sorry about that, I didn't realize power was also using that
interface. Unfortunately, the "no preemption" definition was an error,
and had to be changed to match the pre-existing locking rules.

Could you implement a similar "flush batched pte updates on context
switch" scheme as x86?

J
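[The x86 scheme Jeremy describes — drop out of lazy mode when a task is scheduled out, remember it with a TIF flag, and re-enter on the next switch-in — can be modelled in userspace roughly as below. The structures and names are illustrative only, not the real x86 implementation:]

```c
/* Userspace model of the x86 scheme described above: leave lazy MMU
 * mode on context switch (flushing pending updates), remember it with
 * a per-task flag, and re-enter on the next switch-in. The structures
 * and names are illustrative, not the real x86 implementation. */
struct task {
    int tif_lazy_mmu;        /* models the TIF flag recording lazy mode */
};

int pending_updates;         /* batched PTE updates not yet flushed */

void flush_lazy_batch(void) { pending_updates = 0; }

void enter_lazy_mmu(struct task *t) { t->tif_lazy_mmu = 1; }
void leave_lazy_mmu(struct task *t)
{
    flush_lazy_batch();
    t->tif_lazy_mmu = 0;
}

void queue_pte_update(void) { pending_updates++; }

/* Called when switching from prev to next. */
void context_switch(struct task *prev, struct task *next)
{
    if (prev->tif_lazy_mmu)
        flush_lazy_batch();  /* drop out of lazy mode; the flag stays set
                                so prev re-enters on its next switch-in */
    (void)next;
}
```

[The key property: preemption never leaves a batch half-flushed, because the switch path itself completes the flush.]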
On Mon, 2011-03-21 at 11:24 +0000, Jeremy Fitzhardinge wrote:
> I'm very sorry about that, I didn't realize power was also using that
> interface. Unfortunately, the "no preemption" definition was an error,
> and had to be changed to match the pre-existing locking rules.
>
> Could you implement a similar "flush batched pte updates on context
> switch" as x86?

Well, we already do that for -rt & co.

However, we have another issue, which is the reason we used those
lazy_mmu hooks to do our flushing.

Our PTEs eventually get faulted into a hash table, which is what the
real MMU uses. We must never (ever) allow that hash table to contain a
duplicate entry for a given virtual address.

When we do a batch, we remove things from the linux PTE, keep a
reference in our batch structure, and only update the hash table at the
end of the batch.

That means that we must not allow a hash fault to populate the hash
with a "new" PTE value prior to the old one having been flushed out
(which is possible if they differ in protection attributes, for
example). For that to happen, we must basically not allow a page fault
to re-populate a PTE invalidated by a batch before that batch has
completed.

That translates to: batches must only happen within a PTE lock section.

Cheers,
Ben.
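[Ben's constraint can be sketched as a toy model: PTEs invalidated within a batch keep their old hash-table entries until the batch flushes, so a hash fault repopulating one of those addresses before the flush would create the forbidden duplicate. The structures below are hypothetical, not the kernel's real batch structures:]

```c
/* Toy model of the powerpc constraint described above: PTEs invalidated
 * within a batch keep their stale hash-table entries until the batch
 * flushes, so a hash fault repopulating one of those addresses first
 * would create a duplicate hash entry. Hypothetical structures only. */
#define BATCH_MAX 16

struct hash_batch {
    unsigned long vaddr[BATCH_MAX];  /* addresses whose stale hash
                                        entries are not yet evicted */
    int n;
};

/* Linux PTE cleared; the stale hash entry stays until flush. */
void batch_invalidate(struct hash_batch *b, unsigned long va)
{
    b->vaddr[b->n++] = va;
}

/* A hash fault on va while the batch is pending would insert a second
 * hash entry for the same virtual address. */
int hash_fault_would_duplicate(const struct hash_batch *b, unsigned long va)
{
    for (int i = 0; i < b->n; i++)
        if (b->vaddr[i] == va)
            return 1;
    return 0;
}

/* End of batch: stale entries evicted from the hash. */
void batch_flush(struct hash_batch *b) { b->n = 0; }
```

[Holding the PTE lock across the whole batch is what keeps the fault path from running during the window in which a duplicate would be possible.]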
On 03/21/2011 10:52 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2011-03-21 at 11:24 +0000, Jeremy Fitzhardinge wrote:
>> I'm very sorry about that, I didn't realize power was also using that
>> interface. Unfortunately, the "no preemption" definition was an error,
>> and had to be changed to match the pre-existing locking rules.
>>
>> Could you implement a similar "flush batched pte updates on context
>> switch" as x86?
>
> Well, we already do that for -rt & co.
>
> However, we have another issue which is the reason we used those
> lazy_mmu hooks to do our flushing.
>
> Our PTEs eventually get faulted into a hash table which is what the
> real MMU uses. We must never (ever) allow that hash table to contain a
> duplicate entry for a given virtual address.
>
> When we do a batch, we remove things from the linux PTE, and keep a
> reference in our batch structure, and only update the hash table at the
> end of the batch.

Wouldn't implicitly ending a batch on context switch get the same
effect?

> That means that we must not allow a hash fault to populate the hash
> with a "new" PTE value prior to the old one having been flushed out
> (which is possible if they differ in protection attributes, for
> example). For that to happen, we must basically not allow a page fault
> to re-populate a PTE invalidated by a batch before that batch has
> completed.

Kernel ptes are not generally populated on fault though, unless there's
something in power? On x86 it can happen when syncing a process's
kernel pmd with the init_mm one, but that shouldn't happen in the
middle of an update since you'd deadlock anyway.

If a particular kernel subsystem has its own locks to manage the ptes
for a kernel mapping, then that should prevent any nested updates
within a batch, shouldn't it?

> That translates to: batches must only happen within a PTE lock section.

Well, in that case, I guess your best bet is to disable batching for
kernel pagetable updates. These apply_to_page_range() changes are the
first time any attempt to batch kernel pagetable updates has been made
(otherwise you would have seen this problem earlier), so not batching
them will not be a regression for you. But I'm not sure what the proper
fix to get batching in your case will be.

But the assumption that there's a pte lock for kernel ptes is not
valid.

J
On Mon, 21 Mar 2011 13:22:30 +1100
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Sun, 2011-03-20 at 19:20 -0700, Hugh Dickins wrote:
> > > As long as the races to avoid are between map/unmap vs. access, yes, it
> > > -should- be fine, and we used to not do demand faulting on kernel space
> > > (but for how long ?). I'm wondering why we don't just stick a ptl in
> > > there or is there a good reason why we can't ?
> >
> > We can - but we usually prefer to avoid unnecessary locking.
> > An arch function which locks init_mm.page_table_lock on powerpc,
> > but does nothing on others?
>
> That still means gratuitous differences between how the normal and
> kernel page tables are handled. Maybe that's not worth bothering ...

So what will we do here? I still have

mm-remove-unused-token-argument-from-apply_to_page_range-callback.patch
mm-add-apply_to_page_range_batch.patch
ioremap-use-apply_to_page_range_batch-for-ioremap_page_range.patch
vmalloc-use-plain-pte_clear-for-unmaps.patch
vmalloc-use-apply_to_page_range_batch-for-vunmap_page_range.patch
vmalloc-use-apply_to_page_range_batch-for-vmap_page_range_noflush.patch
vmalloc-use-apply_to_page_range_batch-in-alloc_vm_area.patch
xen-mmu-use-apply_to_page_range_batch-in-xen_remap_domain_mfn_range.patch
xen-grant-table-use-apply_to_page_range_batch.patch

floating around, and at some stage they may cause merge problems.
On 03/30/2011 01:53 PM, Andrew Morton wrote:
> On Mon, 21 Mar 2011 13:22:30 +1100
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>
>> On Sun, 2011-03-20 at 19:20 -0700, Hugh Dickins wrote:
>>>> As long as the races to avoid are between map/unmap vs. access, yes, it
>>>> -should- be fine, and we used to not do demand faulting on kernel space
>>>> (but for how long ?). I'm wondering why we don't just stick a ptl in
>>>> there or is there a good reason why we can't ?
>>> We can - but we usually prefer to avoid unnecessary locking.
>>> An arch function which locks init_mm.page_table_lock on powerpc,
>>> but does nothing on others?
>> That still means gratuitous differences between how the normal and
>> kernel page tables are handled. Maybe that's not worth bothering ...
> So what will we do here? I still have
>
> mm-remove-unused-token-argument-from-apply_to_page_range-callback.patch
> mm-add-apply_to_page_range_batch.patch
> ioremap-use-apply_to_page_range_batch-for-ioremap_page_range.patch
> vmalloc-use-plain-pte_clear-for-unmaps.patch
> vmalloc-use-apply_to_page_range_batch-for-vunmap_page_range.patch
> vmalloc-use-apply_to_page_range_batch-for-vmap_page_range_noflush.patch
> vmalloc-use-apply_to_page_range_batch-in-alloc_vm_area.patch
> xen-mmu-use-apply_to_page_range_batch-in-xen_remap_domain_mfn_range.patch
> xen-grant-table-use-apply_to_page_range_batch.patch
>
> floating around and at some stage they may cause merge problems.

Well, my understanding of the situation is:

1. There's a basic asymmetry between user and kernel pagetables, in
   that the former has a standard pte locking scheme, whereas the
   latter uses ad-hoc locking depending on what particular subsystem is
   doing the changes (presumably to its own private piece of kernel
   virtual address space)
2. Power was assuming that all lazy mmu updates were done under
   spinlock, or were at least non-preemptible. This is incompatible
   with 1), but it was moot because no kernel updates were done lazily
3. These patches add the first instance of lazy mmu updates, which
   reveals the mismatch between Power's expectations and the actual
   locking rules.

So the options are:

1. Change the locking rules for kernel updates to also require a pte lock
2. Special-case batched kernel updates to include a pte lock
3. Make Power deal with preemption during a batched kernel update
4. Never do batched kernel updates on Power
5. Never do batched kernel updates (the current state)

1 seems like a big change. 2 is pretty awkward, and has the side-effect
of increasing preemption latencies (since if you're doing enough
updates to be worth batching, you'll be disabling preemption for a
longish time). I don't know how complex 3 is; I guess it depends on the
details of the batched hashtable update thingy. 4 looks like it should
be simple. 5 is the default do-nothing state, but it seems unfair on
anyone who can actually take advantage of batched updates.

Ben, how hard would something like 3 or 4 be to implement?

Thanks,
J
On Wed, 2011-03-30 at 14:07 -0700, Jeremy Fitzhardinge wrote:
> 1. Change the locking rules for kernel updates to also require a pte lock
> 2. Special-case batched kernel updates to include a pte lock
> 3. Make Power deal with preemption during a batched kernel update
> 4. Never do batched kernel updates on Power
> 5. Never do batched kernel updates (the current state)

We deal with preemption already, since the PTL turns into a mutex on
-rt, so we could bring that patch into mainline. The easiest approach
for now, however, would be to not do the batched kernel updates on
powerpc (solution 4), and I can sort it out later if I want to enable
it.

The problem is that it's hard for me to "fix" that with the current
accessors, as arch_enter_lazy_mmu_mode() doesn't get any argument that
could point me to which mm is being operated on.

Jeremy, I haven't had a chance to look at your patches in detail: do
you just use those accessors, or do you create new ones for batching
kernel updates, in which case powerpc could just make them do nothing?

Else, we could have one patch that adds an mm argument across the tree;
it shouldn't be too hard.

Later on, I can bring in the stuff from -rt to enable lazy batching of
kernel pages on power if I wish to do so.

Cheers,
Ben
On 03/30/2011 05:52 PM, Benjamin Herrenschmidt wrote:
> We deal with preemption already, since the PTL turns into a mutex on
> -rt, so we could bring that patch into mainline. The easiest approach
> for now, however, would be to not do the batched kernel updates on
> powerpc (solution 4), and I can sort it out later if I want to enable
> it.
>
> The problem is that it's hard for me to "fix" that with the current
> accessors, as arch_enter_lazy_mmu_mode() doesn't get any argument that
> could point me to which mm is being operated on.
>
> Jeremy, I haven't had a chance to look at your patches in detail: do
> you just use those accessors, or do you create new ones for batching
> kernel updates, in which case powerpc could just make them do nothing?
>
> Else, we could have one patch that adds an mm argument across the
> tree; it shouldn't be too hard.

No, it's the same accessors for both, since the need to distinguish
them hasn't really come up. Could you put an "if (preemptible())
return;" guard in your implementations?

Otherwise I have no objections to passing the mm in (we'll probably
just continue to ignore the arg in x86-land).

Thanks,
J
On Thu, 2011-03-31 at 10:21 -0700, Jeremy Fitzhardinge wrote:
> No, it's the same accessors for both, since the need to distinguish
> them hasn't really come up. Could you put an "if (preemptible())
> return;" guard in your implementations?

That would be a band-aid, but would probably do the trick for now for
!-rt, though it wouldn't do the right thing for -rt...

> Otherwise I have no objections to passing the mm in (we'll probably
> just continue to ignore the arg in x86-land).

Ben.
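[The band-aid guard under discussion can be modelled in userspace: the lazy hook refuses to open a batch when it is called preemptibly, i.e. for kernel (init_mm) updates made outside a PTE-lock section. preemptible() is mocked with a flag here; none of these names are real kernel code:]

```c
/* Userspace model of the band-aid guard discussed above: the lazy hook
 * refuses to open a batch when called preemptibly, i.e. for kernel
 * (init_mm) updates made outside a PTE-lock section. preemptible()
 * is mocked with a flag; none of these names are real kernel code. */
int preempt_count_mock;     /* non-zero models preemption disabled */
int batches_started;

int preemptible(void) { return preempt_count_mock == 0; }

void arch_enter_lazy_mmu_mode(void)
{
    if (preemptible())
        return;             /* no PTE lock held: skip batching */
    batches_started++;      /* safe to batch: we can't be scheduled away */
}
```

[This is why Ben calls it a band-aid: on -rt the PTL is a sleeping lock, so preemption stays enabled even inside a correctly locked section and the guard would wrongly skip batching there.]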
On 03/31/2011 01:38 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2011-03-31 at 10:21 -0700, Jeremy Fitzhardinge wrote:
>> No, it's the same accessors for both, since the need to distinguish
>> them hasn't really come up. Could you put an "if (preemptible())
>> return;" guard in your implementations?
> That would be a band-aid, but would probably do the trick for now for
> !-rt, though it wouldn't do the right thing for -rt...

Hi Ben,

Have you had a chance to look at doing a workaround/fix for these power
problems? I believe that's the only holdup to putting in the batching
changes. I'd like to get them in for the next window if possible, since
they're a pretty significant performance win for us.

Thanks,
J
--- mmotm/mm/memory.c
+++ fixed/mm/memory.c
@@ -2021,9 +2021,11 @@ static int apply_to_pte_range(struct mm_
 	int err;
 	spinlock_t *uninitialized_var(ptl);
 
-	pte = (mm == &init_mm) ?
-		pte_alloc_kernel(pmd, addr) :
-		pte_alloc_map_lock(mm, pmd, addr, &ptl);
+	if (mm == &init_mm) {
+		pte = pte_alloc_kernel(pmd, addr);
+		preempt_disable();
+	} else
+		pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
 	if (!pte)
 		return -ENOMEM;
 
@@ -2033,7 +2035,9 @@ static int apply_to_pte_range(struct mm_
 	err = fn(pte, (end - addr) / PAGE_SIZE, addr, data);
 
 	arch_leave_lazy_mmu_mode();
-	if (mm != &init_mm)
+	if (mm == &init_mm)
+		preempt_enable();
+	else
 		pte_unmap_unlock(pte, ptl);
 	return err;
 }