Message ID:    1425741651-29152-5-git-send-email-mgorman@suse.de (mailing list archive)
State:         Not Applicable
Delegated to:  Michael Ellerman
* Mel Gorman <mgorman@suse.de> wrote:

> Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226
>
>    Across the board the 4.0-rc1 numbers are much slower, and the
>    degradation is far worse when using the large memory footprint
>    configs. Perf points straight at the cause - this is from 4.0-rc1 on
>    the "-o bhash=101073" config:
>
> [...]
>
>                 4.0.0-rc1    4.0.0-rc1       3.19.0
>                   vanilla  slowscan-v2      vanilla
> User             53384.29     56093.11     46119.12
> System             692.14       311.64       306.41
> Elapsed           1236.87      1328.61      1039.88
>
> Note that the system CPU usage is now similar to 3.19-vanilla.

Similar, but still worse, and also the elapsed time is still much
worse. User time is much higher, although it's the same amount of work
done on every kernel, right?

> I also tested with a workload very similar to Dave's. The machine
> configuration and storage is completely different so it's not an
> equivalent test unfortunately. It's reporting the elapsed time and
> CPU time while fsmark is running to create the inodes and when
> running xfsrepair afterwards
>
> xfsrepair
>                                  4.0.0-rc1            4.0.0-rc1               3.19.0
>                                    vanilla          slowscan-v2              vanilla
> Min      real-fsmark     1157.41 (  0.00%)    1150.38 (  0.61%)    1164.44 ( -0.61%)
> Min      syst-fsmark     3998.06 (  0.00%)    3988.42 (  0.24%)    4016.12 ( -0.45%)
> Min      real-xfsrepair   497.64 (  0.00%)     456.87 (  8.19%)     442.64 ( 11.05%)
> Min      syst-xfsrepair   500.61 (  0.00%)     263.41 ( 47.38%)     194.97 ( 61.05%)
> Amean    real-fsmark     1166.63 (  0.00%)    1155.97 (  0.91%)    1166.28 (  0.03%)
> Amean    syst-fsmark     4020.94 (  0.00%)    4004.19 (  0.42%)    4025.87 ( -0.12%)
> Amean    real-xfsrepair   507.85 (  0.00%)     459.58 (  9.50%)     447.66 ( 11.85%)
> Amean    syst-xfsrepair   519.88 (  0.00%)     281.63 ( 45.83%)     202.93 ( 60.97%)
> Stddev   real-fsmark        6.55 (  0.00%)       3.97 ( 39.30%)       1.44 ( 77.98%)
> Stddev   syst-fsmark       16.22 (  0.00%)      15.09 (  6.96%)       9.76 ( 39.86%)
> Stddev   real-xfsrepair    11.17 (  0.00%)       3.41 ( 69.43%)       5.57 ( 50.17%)
> Stddev   syst-xfsrepair    13.98 (  0.00%)      19.94 (-42.60%)       5.69 ( 59.31%)
> CoeffVar real-fsmark        0.56 (  0.00%)       0.34 ( 38.74%)       0.12 ( 77.97%)
> CoeffVar syst-fsmark        0.40 (  0.00%)       0.38 (  6.57%)       0.24 ( 39.93%)
> CoeffVar real-xfsrepair     2.20 (  0.00%)       0.74 ( 66.22%)       1.24 ( 43.47%)
> CoeffVar syst-xfsrepair     2.69 (  0.00%)       7.08 (-163.23%)      2.80 ( -4.23%)
> Max      real-fsmark     1171.98 (  0.00%)    1159.25 (  1.09%)    1167.96 (  0.34%)
> Max      syst-fsmark     4033.84 (  0.00%)    4024.53 (  0.23%)    4039.20 ( -0.13%)
> Max      real-xfsrepair   523.40 (  0.00%)     464.40 ( 11.27%)     455.42 ( 12.99%)
> Max      syst-xfsrepair   533.37 (  0.00%)     309.38 ( 42.00%)     207.94 ( 61.01%)
>
> The key point is that system CPU usage for xfsrepair (syst-xfsrepair)
> is almost cut in half. It's still not as low as 3.19-vanilla but it's
> much closer
>
>                            4.0.0-rc1    4.0.0-rc1       3.19.0
>                              vanilla  slowscan-v2      vanilla
> NUMA alloc hit             146138883    121929782    104019526
> NUMA alloc miss             13146328     11456356      7806370
> NUMA interleave hit                0            0            0
> NUMA alloc local           146060848    121865921    103953085
> NUMA base PTE updates      242201535    117237258    216624143
> NUMA huge PMD updates         113270        52121       127782
> NUMA page range updates    300195775    143923210    282048527
> NUMA hint faults           180388025     87299060    147235021
> NUMA hint local faults      72784532     32939258     61866265
> NUMA hint local percent           40           37           42
> NUMA pages migrated         71175262     41395302     23237799
>
> Note the big differences in faults trapped and pages migrated.
> 3.19-vanilla still migrated fewer pages but if necessary the
> threshold at which we start throttling migrations can be lowered.

This too is still worse than what v3.19 had.

So what worries me is that Dave bisected the regression to:

  4d9424669946 ("mm: convert p[te|md]_mknonnuma and remaining page table manipulations")

And clearly your patch #4 just tunes balancing/migration intensity -
is that a workaround for the real problem/bug?

And the patch Dave bisected to is a relatively simple patch. Why not
simply revert it to see whether that cures much of the problem?

Am I missing something fundamental?

Thanks,

	Ingo
On Sat, Mar 07, 2015 at 05:36:58PM +0100, Ingo Molnar wrote:
> * Mel Gorman <mgorman@suse.de> wrote:
>
> > Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226
> >
> >    Across the board the 4.0-rc1 numbers are much slower, and the
> >    degradation is far worse when using the large memory footprint
> >    configs. Perf points straight at the cause - this is from 4.0-rc1 on
> >    the "-o bhash=101073" config:
> >
> > [...]
> >
> >                 4.0.0-rc1    4.0.0-rc1       3.19.0
> >                   vanilla  slowscan-v2      vanilla
> > User             53384.29     56093.11     46119.12
> > System             692.14       311.64       306.41
> > Elapsed           1236.87      1328.61      1039.88
> >
> > Note that the system CPU usage is now similar to 3.19-vanilla.
>
> Similar, but still worse, and also the elapsed time is still much
> worse. User time is much higher, although it's the same amount of work
> done on every kernel, right?

Elapsed time is primarily worse on one benchmark -- numa01 which is an
adverse workload. The user time differences are also dominated by that
benchmark

                                     4.0.0-rc1             4.0.0-rc1                3.19.0
                                       vanilla         slowscan-v2r7               vanilla
Time User-NUMA01              32883.59 (  0.00%)    35288.00 ( -7.31%)    25695.96 ( 21.86%)
Time User-NUMA01_THEADLOCAL   17453.20 (  0.00%)    17765.79 ( -1.79%)    17404.36 (  0.28%)
Time User-NUMA02               2063.70 (  0.00%)     2063.22 (  0.02%)     2037.65 (  1.26%)
Time User-NUMA02_SMT            983.70 (  0.00%)      976.01 (  0.78%)      981.02 (  0.27%)

> > I also tested with a workload very similar to Dave's. The machine
> > configuration and storage is completely different so it's not an
> > equivalent test unfortunately. It's reporting the elapsed time and
> > CPU time while fsmark is running to create the inodes and when
> > running xfsrepair afterwards
> >
> > [xfsrepair table snipped]
> >
> > The key point is that system CPU usage for xfsrepair (syst-xfsrepair)
> > is almost cut in half. It's still not as low as 3.19-vanilla but it's
> > much closer
> >
> > [NUMA stats snipped]
> >
> > Note the big differences in faults trapped and pages migrated.
> > 3.19-vanilla still migrated fewer pages but if necessary the
> > threshold at which we start throttling migrations can be lowered.
>
> This too is still worse than what v3.19 had.

Yes.

> So what worries me is that Dave bisected the regression to:
>
>   4d9424669946 ("mm: convert p[te|md]_mknonnuma and remaining page table manipulations")
>
> And clearly your patch #4 just tunes balancing/migration intensity -
> is that a workaround for the real problem/bug?

The patch makes NUMA hinting faults use standard page table handling
routines and protections to trap the faults. Fundamentally it's safer
even though it appears to cause more traps to be handled. I've been
assuming this is related to the different permissions PTEs get and
when they are visible on all CPUs. This patch is addressing the
symptom that more faults are being handled and that it needs to be
less aggressive.

I've gone through that patch and didn't spot anything else it is doing
wrong that is not already handled in this series. Did you spot
anything obviously wrong in that patch that isn't addressed in this
series?

> And the patch Dave bisected to is a relatively simple patch.
> Why not simply revert it to see whether that cures much of the
> problem?

Because it also means reverting all the PROT_NONE handling and going
back to _PAGE_NUMA tricks which I expect would be NAKed by Linus.
On Sat, Mar 7, 2015 at 8:36 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> And the patch Dave bisected to is a relatively simple patch. Why not
> simply revert it to see whether that cures much of the problem?

So the problem with that is that "pmd_set_numa()" and friends simply
no longer exist. So we can't just revert that one patch, it's the
whole series, and the whole point of the series.

What confuses me is that the only real change that I can see in that
patch is the change to "change_huge_pmd()". Everything else is pretty
much a 100% equivalent transformation, afaik. Of course, I may be
wrong about that, and missing something silly.

And the changes to "change_huge_pmd()" were basically re-done
differently by subsequent patches anyway.

The *only* change I see remaining is that change_huge_pmd() now does

        entry = pmdp_get_and_clear_notify(mm, addr, pmd);
        entry = pmd_modify(entry, newprot);
        set_pmd_at(mm, addr, pmd, entry);

for all changes. It used to do that "pmdp_set_numa()" for the
prot_numa case, which did just

        pmd_t pmd = *pmdp;
        pmd = pmd_mknuma(pmd);
        set_pmd_at(mm, addr, pmdp, pmd);

instead.

I don't like the old pmdp_set_numa() because it can drop dirty bits,
so I think the old code was actively buggy. But I do *not* see why the
new code would cause more migrations to happen.

There's probably something really stupid I'm missing.

                      Linus
* Mel Gorman <mgorman@suse.de> wrote:

> xfsrepair
>                                  4.0.0-rc1            4.0.0-rc1               3.19.0
>                                    vanilla          slowscan-v2              vanilla
> Min      real-fsmark     1157.41 (  0.00%)    1150.38 (  0.61%)    1164.44 ( -0.61%)
> Min      syst-fsmark     3998.06 (  0.00%)    3988.42 (  0.24%)    4016.12 ( -0.45%)
> Min      real-xfsrepair   497.64 (  0.00%)     456.87 (  8.19%)     442.64 ( 11.05%)
> Min      syst-xfsrepair   500.61 (  0.00%)     263.41 ( 47.38%)     194.97 ( 61.05%)
> [...]
> Max      real-xfsrepair   523.40 (  0.00%)     464.40 ( 11.27%)     455.42 ( 12.99%)
> Max      syst-xfsrepair   533.37 (  0.00%)     309.38 ( 42.00%)     207.94 ( 61.01%)

Btw., I think it would be nice if these numbers listed v3.19
performance in the first column, to make it clear at a glance how much
regression we still have?

Thanks,

	Ingo
* Mel Gorman <mgorman@suse.de> wrote:

> Elapsed time is primarily worse on one benchmark -- numa01 which is
> an adverse workload. The user time differences are also dominated by
> that benchmark
>
>                                      4.0.0-rc1             4.0.0-rc1                3.19.0
>                                        vanilla         slowscan-v2r7               vanilla
> Time User-NUMA01              32883.59 (  0.00%)    35288.00 ( -7.31%)    25695.96 ( 21.86%)
> Time User-NUMA01_THEADLOCAL   17453.20 (  0.00%)    17765.79 ( -1.79%)    17404.36 (  0.28%)
> Time User-NUMA02               2063.70 (  0.00%)     2063.22 (  0.02%)     2037.65 (  1.26%)
> Time User-NUMA02_SMT            983.70 (  0.00%)      976.01 (  0.78%)      981.02 (  0.27%)

But even for 'numa02', the simplest of the workloads, there appears to
be somewhat of a regression relative to v3.19, which ought to be
beyond the noise of the measurement (which would be below 1% I
suspect), and as such relevant, right?

And the XFS numbers still show significant regression compared to
v3.19 - and that cannot be ignored as an artificial, 'adversarial'
workload, right? For example, from your numbers:

 xfsrepair
                                 4.0.0-rc1            4.0.0-rc1               3.19.0
                                   vanilla          slowscan-v2              vanilla
 ...
 Amean    real-xfsrepair   507.85 (  0.00%)     459.58 (  9.50%)     447.66 ( 11.85%)
 Amean    syst-xfsrepair   519.88 (  0.00%)     281.63 ( 45.83%)     202.93 ( 60.97%)

if I interpret the numbers correctly, it shows that compared to v3.19,
system time increased by 38% - which is rather significant!

> > So what worries me is that Dave bisected the regression to:
> >
> >   4d9424669946 ("mm: convert p[te|md]_mknonnuma and remaining page table manipulations")
> >
> > And clearly your patch #4 just tunes balancing/migration intensity
> > - is that a workaround for the real problem/bug?
>
> The patch makes NUMA hinting faults use standard page table handling
> routines and protections to trap the faults. Fundamentally it's
> safer even though it appears to cause more traps to be handled. I've
> been assuming this is related to the different permissions PTEs get
> and when they are visible on all CPUs. This patch is addressing the
> symptom that more faults are being handled and that it needs to be
> less aggressive.

But the whole cleanup ought to have been close to an identity
transformation from the CPU's point of view - and your measurements
seem to confirm Dave's findings.

And your measurement was on bare metal, while Dave's is on a VM, and
both show a significant slowdown on the xfs tests even with your
slow-tuning patch applied, so it's unlikely to be a measurement fluke
or some weird platform property.

> I've gone through that patch and didn't spot anything else it is
> doing wrong that is not already handled in this series. Did you spot
> anything obviously wrong in that patch that isn't addressed in this
> series?

I didn't spot anything wrong, but is that a basis to go forward and
work around the regression, in a way that doesn't even recover lost
performance?

> > And the patch Dave bisected to is a relatively simple patch. Why
> > not simply revert it to see whether that cures much of the
> > problem?
>
> Because it also means reverting all the PROT_NONE handling and going
> back to _PAGE_NUMA tricks which I expect would be NAKed by Linus.

Yeah, I realize that (and obviously I support the PROT_NONE direction
that Peter Zijlstra prototyped with the original sched/numa series),
but can we leave this much of a regression on the table?

I hate to be such a pain in the neck, but especially the 'down tuning'
of the scanning intensity will make an apples to apples comparison
harder!

I'd rather not do the slow-tuning part and leave sucky performance in
place for now and have an easy method plus the motivation to find and
fix the real cause of the regression, than to partially hide it this
way ...

Thanks,

	Ingo
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, Mar 7, 2015 at 8:36 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > And the patch Dave bisected to is a relatively simple patch. Why
> > not simply revert it to see whether that cures much of the
> > problem?
>
> So the problem with that is that "pmd_set_numa()" and friends simply
> no longer exist. So we can't just revert that one patch, it's the
> whole series, and the whole point of the series.

Yeah.

> What confuses me is that the only real change that I can see in that
> patch is the change to "change_huge_pmd()". Everything else is
> pretty much a 100% equivalent transformation, afaik. Of course, I
> may be wrong about that, and missing something silly.

Well, there's a difference in what we write to the pte:

  #define _PAGE_BIT_NUMA        (_PAGE_BIT_GLOBAL+1)
  #define _PAGE_BIT_PROTNONE    _PAGE_BIT_GLOBAL

and our expectation was that the two should be equivalent methods from
the POV of the NUMA balancing code, right?

> And the changes to "change_huge_pmd()" were basically re-done
> differently by subsequent patches anyway.
>
> The *only* change I see remaining is that change_huge_pmd() now does
>
>         entry = pmdp_get_and_clear_notify(mm, addr, pmd);
>         entry = pmd_modify(entry, newprot);
>         set_pmd_at(mm, addr, pmd, entry);
>
> for all changes. It used to do that "pmdp_set_numa()" for the
> prot_numa case, which did just
>
>         pmd_t pmd = *pmdp;
>         pmd = pmd_mknuma(pmd);
>         set_pmd_at(mm, addr, pmdp, pmd);
>
> instead.
>
> I don't like the old pmdp_set_numa() because it can drop dirty bits,
> so I think the old code was actively buggy.

Could we, as a silly testing hack not to be applied, write a
hack-patch that re-introduces the racy way of setting the NUMA bit, to
confirm that it is indeed this difference that changes pte visibility
across CPUs enough to create so many more faults?

Because if the answer is 'yes', then we can safely say: 'we regressed
performance because correctness [not dropping dirty bits] comes before
performance'.

If the answer is 'no', then we still have a mystery (and a regression)
to track down.

As a second hack (not to be applied), could we change:

  #define _PAGE_BIT_PROTNONE    _PAGE_BIT_GLOBAL

to:

  #define _PAGE_BIT_PROTNONE    (_PAGE_BIT_GLOBAL+1)

to double check that the position of the bit does not matter?

I don't think we've exhausted all avenues of analysis here.

Thanks,

	Ingo
On Sun, Mar 8, 2015 at 3:02 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> Well, there's a difference in what we write to the pte:
>
>   #define _PAGE_BIT_NUMA        (_PAGE_BIT_GLOBAL+1)
>   #define _PAGE_BIT_PROTNONE    _PAGE_BIT_GLOBAL
>
> and our expectation was that the two should be equivalent methods from
> the POV of the NUMA balancing code, right?

Right. But yes, we might have screwed something up. In particular,
there might be something that thinks it cares about the global bit,
but doesn't notice that the present bit isn't set, so it considers the
protnone mappings to be global and causes lots more tlb flushes etc.

> > I don't like the old pmdp_set_numa() because it can drop dirty bits,
> > so I think the old code was actively buggy.
>
> Could we, as a silly testing hack not to be applied, write a
> hack-patch that re-introduces the racy way of setting the NUMA bit, to
> confirm that it is indeed this difference that changes pte visibility
> across CPUs enough to create so many more faults?

So one of Mel's patches did that, but I don't know if Dave tested it.

And thinking about it, it *may* be safe for huge-pages, if they always
already have the dirty bit set to begin with. And I don't see how we
could have a clean hugepage (apart from the special case of the
zeropage, which is read-only, so races on the dirty bit aren't an
issue). So it might actually be that the non-atomic version is safe
for hpages.

And we could possibly get rid of the "atomic read-and-clear" even for
the non-numa case. I'd rather do it for both cases than for just one
of them. But:

> As a second hack (not to be applied), could we change:
>
>   #define _PAGE_BIT_PROTNONE    _PAGE_BIT_GLOBAL
>
> to:
>
>   #define _PAGE_BIT_PROTNONE    (_PAGE_BIT_GLOBAL+1)
>
> to double check that the position of the bit does not matter?

Agreed. We should definitely try that. Dave?

Also, is there some sane way for me to actually see this behavior on a
regular machine with just a single socket? Dave is apparently running
in some fake-numa setup; I'm wondering if this is easy enough to
reproduce that I could see it myself.

                      Linus
On Sun, Mar 8, 2015 at 11:35 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> > As a second hack (not to be applied), could we change:
> >
> >   #define _PAGE_BIT_PROTNONE    _PAGE_BIT_GLOBAL
> >
> > to:
> >
> >   #define _PAGE_BIT_PROTNONE    (_PAGE_BIT_GLOBAL+1)
> >
> > to double check that the position of the bit does not matter?
>
> Agreed. We should definitely try that.

There's a second reason to do that, actually: the __supported_pte_mask
thing, _and_ the pageattr stuff in __split_large_page() etc play games
with _PAGE_GLOBAL. As does drivers/lguest for some reason.

So looking at this all, there's a lot of room for confusion with
_PAGE_GLOBAL. That kind of confusion would certainly explain the whole
"the changes _look_ like they do the same thing, but don't" - because
of silly semantic conflicts with PROTNONE vs GLOBAL.

                      Linus
On Sun, Mar 08, 2015 at 11:02:23AM +0100, Ingo Molnar wrote:
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > So the problem with that is that "pmd_set_numa()" and friends simply
> > no longer exist. So we can't just revert that one patch, it's the
> > whole series, and the whole point of the series.
>
> Yeah.
>
> > What confuses me is that the only real change that I can see in that
> > patch is the change to "change_huge_pmd()". Everything else is
> > pretty much a 100% equivalent transformation, afaik. Of course, I
> > may be wrong about that, and missing something silly.
>
> Well, there's a difference in what we write to the pte:
>
>   #define _PAGE_BIT_NUMA        (_PAGE_BIT_GLOBAL+1)
>   #define _PAGE_BIT_PROTNONE    _PAGE_BIT_GLOBAL
>
> and our expectation was that the two should be equivalent methods from
> the POV of the NUMA balancing code, right?

Functionally yes but performance-wise no. We are now using the global
bit for NUMA faults at the very least.

> > I don't like the old pmdp_set_numa() because it can drop dirty bits,
> > so I think the old code was actively buggy.
>
> Could we, as a silly testing hack not to be applied, write a
> hack-patch that re-introduces the racy way of setting the NUMA bit, to
> confirm that it is indeed this difference that changes pte visibility
> across CPUs enough to create so many more faults?

This was already done and tested by Dave but while it helped, it was
not enough. As the approach was inherently unsafe it was dropped and
the throttling approach taken. However, the fact it made little
difference may indicate that this is somehow related to the global bit
being used.

> Because if the answer is 'yes', then we can safely say: 'we regressed
> performance because correctness [not dropping dirty bits] comes before
> performance'.
>
> If the answer is 'no', then we still have a mystery (and a regression)
> to track down.
>
> As a second hack (not to be applied), could we change:
>
>   #define _PAGE_BIT_PROTNONE    _PAGE_BIT_GLOBAL
>
> to:
>
>   #define _PAGE_BIT_PROTNONE    (_PAGE_BIT_GLOBAL+1)

In itself, that's not enough. The SWP_OFFSET_SHIFT would also need
updating as a partial revert of 21d9ee3eda7792c45880b2f11bff8e95c9a061fb
but it can be done.

> to double check that the position of the bit does not matter?

It's worth checking in case it's a case of how the global bit is
treated. However, note that Dave is currently travelling for LSF/MM in
Boston and there is a chance he cannot test this week at all. I'm just
after landing in the hotel myself. I'll try to find time during one of
the breaks tomorrow but if the wireless is too crap then accessing the
test machine remotely might be an issue.

> I don't think we've exhausted all avenues of analysis here.

True.
On Sun, Mar 08, 2015 at 11:35:59AM -0700, Linus Torvalds wrote:
> On Sun, Mar 8, 2015 at 3:02 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > As a second hack (not to be applied), could we change:
> >
> >   #define _PAGE_BIT_PROTNONE    _PAGE_BIT_GLOBAL
> >
> > to:
> >
> >   #define _PAGE_BIT_PROTNONE    (_PAGE_BIT_GLOBAL+1)
> >
> > to double check that the position of the bit does not matter?
>
> Agreed. We should definitely try that.
>
> Dave?

As Mel has already mentioned, I'm in Boston for LSFMM and don't have
access to the test rig I've used to generate this.

> Also, is there some sane way for me to actually see this behavior on a
> regular machine with just a single socket? Dave is apparently running
> in some fake-numa setup, I'm wondering if this is easy enough to
> reproduce that I could see it myself.

Should be - I don't actually use 500TB of storage to generate this -
50GB on an SSD is all you need from the storage side. I just use a
sparse backing file to make it look like a 500TB device. :P

i.e. create an XFS filesystem on a 500TB sparse file with
"mkfs.xfs -d size=500t,file=1 /path/to/file.img", mount it on loopback
or as a virtio,cache=none device for the guest vm and then use fsmark
to generate several million files spread across many, many directories
such as:

$ fs_mark -D 10000 -S0 -n 100000 -s 1 -L 32 \
        -d /mnt/scratch/0 -d /mnt/scratch/1 \
        -d /mnt/scratch/2 -d /mnt/scratch/3 \
        -d /mnt/scratch/4 -d /mnt/scratch/5 \
        -d /mnt/scratch/6 -d /mnt/scratch/7

That should only take a few minutes to run - if you throw 8p at it
then it should run at >100k files/s being created. Then unmount and
run "xfs_repair -o bhash=101703 /path/to/file.img" on the resultant
image file.

Cheers,

Dave.
On Mon, Mar 9, 2015 at 4:29 AM, Dave Chinner <david@fromorbit.com> wrote:
>
> > Also, is there some sane way for me to actually see this behavior on a
> > regular machine with just a single socket? Dave is apparently running
> > in some fake-numa setup, I'm wondering if this is easy enough to
> > reproduce that I could see it myself.
>
> Should be - I don't actually use 500TB of storage to generate this -
> 50GB on an SSD is all you need from the storage side. I just use a
> sparse backing file to make it look like a 500TB device. :P

What's your virtual environment setup? Kernel config, and
virtualization environment to actually get that odd fake NUMA thing
happening?

                      Linus
On Mon, Mar 09, 2015 at 09:52:18AM -0700, Linus Torvalds wrote:
> On Mon, Mar 9, 2015 at 4:29 AM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Should be - I don't actually use 500TB of storage to generate this -
> > 50GB on an SSD is all you need from the storage side. I just use a
> > sparse backing file to make it look like a 500TB device. :P
>
> What's your virtual environment setup? Kernel config, and
> virtualization environment to actually get that odd fake NUMA thing
> happening?

I don't have the exact .config with me (test machines at home are shut
down because I'm half a world away), but it's pretty much this (copied
and munged from a similar test vm on my laptop):

$ cat run-vm-4.sh
sudo qemu-system-x86_64 \
        -machine accel=kvm \
        -no-fd-bootchk \
        -localtime \
        -boot c \
        -serial pty \
        -nographic \
        -alt-grab \
        -smp 16 -m 16384 \
        -hda /data/vm-2/root.img \
        -drive file=/vm/vm-4/vm-4-test.img,if=virtio,cache=none \
        -drive file=/vm/vm-4/vm-4-scratch.img,if=virtio,cache=none \
        -drive file=/vm/vm-4/vm-4-500TB.img,if=virtio,cache=none \
        -kernel /vm/vm-4/vmlinuz \
        -append "console=ttyS0,115200 root=/dev/sda1,numa=fake=4"
$

And on the host I have /vm on a ssd that is an XFS filesystem, and
I've created /vm/vm-4/vm-4-500TB.img by doing:

$ xfs_io -f -c "truncate 500t" -c "extsize 1m" /vm/vm-4/vm-4-500TB.img

and in the guest the filesystem is created with:

# mkfs.xfs -f -mcrc=1,finobt=1 /dev/vdc

And that will create a 500TB filesystem that you can then mount and
run fsmark on, then unmount and run xfs_repair on.

The .config I have on my laptop is from 3.18-rc something, but it
should work with just a "make oldconfig" update. It's attached below.

Hopefully this will be sufficient for you, otherwise it'll have to
wait until I get home to get the exact configs for you.

Cheers,

Dave.
On Mon, Mar 9, 2015 at 12:19 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Mar 09, 2015 at 09:52:18AM -0700, Linus Torvalds wrote:
> >
> > What's your virtual environment setup? Kernel config, and
> > virtualization environment to actually get that odd fake NUMA thing
> > happening?
>
> I don't have the exact .config with me (test machines at home
> are shut down because I'm half a world away), but it's pretty much
> this (copied and munged from a similar test vm on my laptop):

[ snip snip ]

Ok, I hate debugging by symptoms anyway, so I didn't do any of this,
and went back to actually *thinking* about the code instead of trying
to reproduce this and figure things out by trial and error.

And I think I figured it out. Of course, since I didn't actually test
anything, what do I know, but I feel good about it, because I think I
can explain why that patch that on the face of it shouldn't change
anything actually did.

So, the old code just did all those manual page table changes,
clearing the present bit and setting the NUMA bit instead. The new
code _ostensibly_ does the same, except it clears the present bit and
sets the PROTNONE bit instead.

However, rather than playing special games with just those two bits,
it uses the normal pte accessor functions, and in particular uses
vma->vm_page_prot to reset the protections back. Which is a nice
cleanup and really makes the code look saner, and does the same thing.

Except it really isn't the same thing at all. Why? The protection bits
in the page tables are *not* the same as vma->vm_page_prot. Yes, they
start out that way, but they don't stay that way. And no, I'm not
talking about dirty and accessed bits.

The difference? COW. Any private mapping is marked read-only in
vma->vm_page_prot, and then the COW (or the initial write) makes it
read-write.

And so, when we did

-       pte = pte_mknonnuma(pte);
+       /* Make it present again */
+       pte = pte_modify(pte, vma->vm_page_prot);
+       pte = pte_mkyoung(pte);

that isn't equivalent at all - it makes the page read-only, because it
restores it to its original state.

Now, that isn't actually what hurts most, I suspect. Judging by the
profiles, we don't suddenly take a lot of new COW faults. No, what
hurts most is that the NUMA balancing code does this:

        /*
         * Avoid grouping on DSO/COW pages in specific and RO pages
         * in general, RO pages shouldn't hurt as much anyway since
         * they can be in shared cache state.
         */
        if (!pte_write(pte))
                flags |= TNF_NO_GROUP;

and that "!pte_write(pte)" is basically now *always* true for private
mappings (which is 99% of all mappings).

In other words, I think the patch unintentionally made the NUMA code
basically always do the TNF_NO_GROUP case.

I think that a quick hack for testing might be to just replace that
"!pte_write()" with "!pte_dirty()", and seeing how that acts.

Comments?

                      Linus
On Tue, Mar 10, 2015 at 04:55:52PM -0700, Linus Torvalds wrote:
> On Mon, Mar 9, 2015 at 12:19 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Mar 09, 2015 at 09:52:18AM -0700, Linus Torvalds wrote:
> >>
> >> What's your virtual environment setup? Kernel config, and
> >> virtualization environment to actually get that odd fake NUMA thing
> >> happening?
> >
> > I don't have the exact .config with me (test machines at home
> > are shut down because I'm half a world away), but it's pretty much
> > this (copied and munged from a similar test vm on my laptop):
>
> [ snip snip ]
>
> Ok, I hate debugging by symptoms anyway, so I didn't do any of this,
> and went back to actually *thinking* about the code instead of trying
> to reproduce this and figure things out by trial and error.
>
> And I think I figured it out.
> <SNIP>

I believe you're correct and it matches what was observed. I'm still travelling and wireless is dirt but managed to queue a test using pmd_dirty.

                                       3.19.0           4.0.0-rc1        4.0.0-rc1
                                      vanilla             vanilla   ptewrite-v1r20
Time User-NUMA01            25695.96 (  0.00%) 32883.59 (-27.97%) 24012.80 (  6.55%)
Time User-NUMA01_THEADLOCAL 17404.36 (  0.00%) 17453.20 ( -0.28%) 17950.54 ( -3.14%)
Time User-NUMA02             2037.65 (  0.00%)  2063.70 ( -1.28%)  2046.88 ( -0.45%)
Time User-NUMA02_SMT          981.02 (  0.00%)   983.70 ( -0.27%)   983.68 ( -0.27%)
Time System-NUMA01            194.70 (  0.00%)   602.44 (-209.42%)  158.90 ( 18.39%)
Time System-NUMA01_THEADLOCAL  98.52 (  0.00%)    78.10 ( 20.73%)   107.66 ( -9.28%)
Time System-NUMA02              9.28 (  0.00%)     6.47 ( 30.28%)     9.25 (  0.32%)
Time System-NUMA02_SMT          3.79 (  0.00%)     5.06 (-33.51%)     3.92 ( -3.43%)
Time Elapsed-NUMA01           558.84 (  0.00%)   755.96 (-35.27%)   532.41 (  4.73%)
Time Elapsed-NUMA01_THEADLOCAL 382.54 (  0.00%)   382.22 (  0.08%)   390.48 ( -2.08%)
Time Elapsed-NUMA02            49.83 (  0.00%)    49.38 (  0.90%)    49.79 (  0.08%)
Time Elapsed-NUMA02_SMT        46.59 (  0.00%)    47.70 ( -2.38%)    47.77 ( -2.53%)
Time CPU-NUMA01              4632.00 (  0.00%)  4429.00 (  4.38%)  4539.00 (  2.01%)
Time CPU-NUMA01_THEADLOCAL   4575.00 (  0.00%)  4586.00 ( -0.24%)  4624.00 ( -1.07%)
Time CPU-NUMA02              4107.00 (  0.00%)  4191.00 ( -2.05%)  4129.00 ( -0.54%)
Time CPU-NUMA02_SMT          2113.00 (  0.00%)  2072.00 (  1.94%)  2067.00 (  2.18%)

              3.19.0   4.0.0-rc1       4.0.0-rc1
             vanilla     vanilla  ptewrite-v1r20
User        46119.12    53384.29        44994.10
System        306.41      692.14          279.78
Elapsed      1039.88     1236.87         1022.92

There are still some differences but it's much closer to what it was. The balancing stats are looking almost similar to 3.19:

NUMA base PTE updates      222840103   304513172   230724075
NUMA huge PMD updates         434894      594467      450274
NUMA page range updates    445505831   608880276   461264363
NUMA hint faults              601358      733491      626176
NUMA hint local faults        371571      511530      359215
NUMA hint local percent           61          69          57
NUMA pages migrated          7073177    26366701     6829196

XFS repair on the same machine is not fully restored either, but it's a big enough move in the right direction to indicate this was the relevant change.

xfsrepair
                                3.19.0           4.0.0-rc1        4.0.0-rc1
                               vanilla             vanilla   ptewrite-v1r20
Amean real-fsmark    1166.28 (  0.00%) 1166.63 ( -0.03%) 1184.97 ( -1.60%)
Amean syst-fsmark    4025.87 (  0.00%) 4020.94 (  0.12%) 4071.10 ( -1.12%)
Amean real-xfsrepair  447.66 (  0.00%)  507.85 (-13.45%)  460.94 ( -2.97%)
Amean syst-xfsrepair  202.93 (  0.00%)  519.88 (-156.19%) 282.45 (-39.19%)
On Thu, Mar 12, 2015 at 6:10 AM, Mel Gorman <mgorman@suse.de> wrote: > > I believe you're correct and it matches what was observed. I'm still > travelling and wireless is dirt but managed to queue a test using pmd_dirty Ok, thanks. I'm not entirely happy with that change, and I suspect the whole heuristic should be looked at much more (maybe it should also look at whether it's executable, for example), but it's a step in the right direction. So I committed it and added a comment, and wrote a commit log about it. I suspect any further work is post-4.0-release, unless somebody comes up with something small and simple and obviously better. Linus
On Thu, Mar 12, 2015 at 09:20:36AM -0700, Linus Torvalds wrote:
> On Thu, Mar 12, 2015 at 6:10 AM, Mel Gorman <mgorman@suse.de> wrote:
> >
> > I believe you're correct and it matches what was observed. I'm still
> > travelling and wireless is dirt but managed to queue a test using pmd_dirty
>
> Ok, thanks.
>
> I'm not entirely happy with that change, and I suspect the whole
> heuristic should be looked at much more (maybe it should also look at
> whether it's executable, for example), but it's a step in the right
> direction.
>

I can follow up when I'm back in work properly. As you have already pulled this in directly, can you also consider pulling in "mm: thp: return the correct value for change_huge_pmd" please? The other two patches were very minor and can be resent through the normal paths later.
On Thu, Mar 12, 2015 at 06:49:26PM +0000, Mel Gorman wrote:
> On Thu, Mar 12, 2015 at 09:20:36AM -0700, Linus Torvalds wrote:
> > On Thu, Mar 12, 2015 at 6:10 AM, Mel Gorman <mgorman@suse.de> wrote:
> > >
> > > I believe you're correct and it matches what was observed. I'm still
> > > travelling and wireless is dirt but managed to queue a test using pmd_dirty
> >
> > Ok, thanks.
> >
> > I'm not entirely happy with that change, and I suspect the whole
> > heuristic should be looked at much more (maybe it should also look at
> > whether it's executable, for example), but it's a step in the right
> > direction.
> >
>
> I can follow up when I'm back in work properly. As you have already pulled
> this in directly, can you also consider pulling in "mm: thp: return the
> correct value for change_huge_pmd" please? The other two patches were very
> minor can be resent through the normal paths later.

To close the loop here, now I'm back home and can run tests:

config                          3.19     4.0-rc1    4.0-rc4
defaults                       8m08s       9m34s      9m14s
-o ag_stride=-1                4m04s       4m38s      4m11s
-o bhash=101073                6m04s      17m43s      7m35s
-o ag_stride=-1,bhash=101073   4m54s       9m58s      7m50s

It's better but there are still significant regressions, especially for the large memory footprint cases. I haven't had a chance to look at any stats or profiles yet, so I don't know yet whether this is still page fault related or some other problem....

Cheers, Dave
On Tue, Mar 17, 2015 at 12:06 AM, Dave Chinner <david@fromorbit.com> wrote: > > TO close the loop here, now I'm back home and can run tests: > > config 3.19 4.0-rc1 4.0-rc4 > defaults 8m08s 9m34s 9m14s > -o ag_stride=-1 4m04s 4m38s 4m11s > -o bhash=101073 6m04s 17m43s 7m35s > -o ag_stride=-1,bhash=101073 4m54s 9m58s 7m50s > > It's better but there are still significant regressions, especially > for the large memory footprint cases. I haven't had a chance to look > at any stats or profiles yet, so I don't know yet whether this is > still page fault related or some other problem.... Ok. I'd love to see some data on what changed between 3.19 and rc4 in the profiles, just to see whether it's "more page faults due to extra COW", or whether it's due to "more TLB flushes because of the pte_write() vs pte_dirty()" differences. I'm *guessing* lot of the remaining issues are due to extra page fault overhead because I'd expect write/dirty to be fairly 1:1, but there could be differences due to shared memory use and/or just writebacks of dirty pages that become clean. I guess you can also see in vmstat.mm_migrate_pages whether it's because of excessive migration (because of bad grouping) or not. So not just profiles data. At the same time, I feel fairly happy about the situation - we at least understand what is going on, and the "3x worse performance" case is at least gone. Even if that last case still looks horrible. So it's still a bad performance regression, but at the same time I think your test setup (big 500 TB filesystem, but then a fake-numa thing with just 4GB per node) is specialized and unrealistic enough that I don't feel it's all that relevant from a *real-world* standpoint, and so I wouldn't be uncomfortable saying "ok, the page table handling cleanup caused some issues, but we know about them and how to fix them longer-term". So I don't consider this a 4.0 showstopper or a "we need to revert for now" issue. 
If it's a case of "we take a lot more page faults because we handle the NUMA fault and then have a COW fault almost immediately", then the fix is likely to do the same early-cow that the normal non-numa-fault case does. In fact, my gut feel is that we should try to unify that numa/regular fault handling path a bit more, but that would be a pretty invasive patch.

       Linus
On Tue, Mar 17, 2015 at 09:53:57AM -0700, Linus Torvalds wrote: > On Tue, Mar 17, 2015 at 12:06 AM, Dave Chinner <david@fromorbit.com> wrote: > > > > TO close the loop here, now I'm back home and can run tests: > > > > config 3.19 4.0-rc1 4.0-rc4 > > defaults 8m08s 9m34s 9m14s > > -o ag_stride=-1 4m04s 4m38s 4m11s > > -o bhash=101073 6m04s 17m43s 7m35s > > -o ag_stride=-1,bhash=101073 4m54s 9m58s 7m50s > > > > It's better but there are still significant regressions, especially > > for the large memory footprint cases. I haven't had a chance to look > > at any stats or profiles yet, so I don't know yet whether this is > > still page fault related or some other problem.... > > Ok. I'd love to see some data on what changed between 3.19 and rc4 in > the profiles, just to see whether it's "more page faults due to extra > COW", or whether it's due to "more TLB flushes because of the > pte_write() vs pte_dirty()" differences. I'm *guessing* lot of the > remaining issues are due to extra page fault overhead because I'd > expect write/dirty to be fairly 1:1, but there could be differences > due to shared memory use and/or just writebacks of dirty pages that > become clean. > > I guess you can also see in vmstat.mm_migrate_pages whether it's > because of excessive migration (because of bad grouping) or not. So > not just profiles data. On the -o ag_stride=-1 -o bhash=101073 config, the 60s perf stat I was using during steady state shows: 471,752 migrate:mm_migrate_pages ( +- 7.38% ) The migrate pages rate is even higher than in 4.0-rc1 (~360,000) and 3.19 (~55,000), so that looks like even more of a problem than before. 
And the profile looks like:

-   43.73%     0.05%  [kernel]  [k] native_flush_tlb_others
   - native_flush_tlb_others
      - 99.87% flush_tlb_page
           ptep_clear_flush
           try_to_unmap_one
           rmap_walk
           try_to_unmap
           migrate_pages
           migrate_misplaced_page
         - handle_mm_fault
            - 99.84% __do_page_fault
                 trace_do_page_fault
                 do_async_page_fault
               + async_page_fault

(grrrr - running perf with call stack profiling for long enough oom-kills xfs_repair)

And the vmstats are:

3.19:
numa_hit 5163221
numa_miss 121274
numa_foreign 121274
numa_interleave 12116
numa_local 5153127
numa_other 131368
numa_pte_updates 36482466
numa_huge_pte_updates 0
numa_hint_faults 34816515
numa_hint_faults_local 9197961
numa_pages_migrated 1228114
pgmigrate_success 1228114
pgmigrate_fail 0

4.0-rc1:
numa_hit 36952043
numa_miss 92471
numa_foreign 92471
numa_interleave 10964
numa_local 36927384
numa_other 117130
numa_pte_updates 84010995
numa_huge_pte_updates 0
numa_hint_faults 81697505
numa_hint_faults_local 21765799
numa_pages_migrated 32916316
pgmigrate_success 32916316
pgmigrate_fail 0

4.0-rc4:
numa_hit 23447345
numa_miss 47471
numa_foreign 47471
numa_interleave 10877
numa_local 23438564
numa_other 56252
numa_pte_updates 60901329
numa_huge_pte_updates 0
numa_hint_faults 58777092
numa_hint_faults_local 16478674
numa_pages_migrated 20075156
pgmigrate_success 20075156
pgmigrate_fail 0

Page migrations are still up by a factor of ~20 on 3.19.

> At the same time, I feel fairly happy about the situation - we at
> least understand what is going on, and the "3x worse performance" case
> is at least gone. Even if that last case still looks horrible.
>
> So it's still a bad performance regression, but at the same time I
> think your test setup (big 500 TB filesystem, but then a fake-numa
> thing with just 4GB per node) is specialized and unrealistic enough
> that I don't feel it's all that relevant from a *real-world*
> standpoint,

I don't buy it. The regression is triggered by the search algorithm that xfs_repair uses, and it's fairly common.
It just uses a large hash table which has at least a 50% miss rate (every I/O misses on the initial lookup). The page faults are triggered by searching all these pointer chasing misses. IOWs the filesystem size under test is irrelevant - the amount of metadata in the FS determines the xfs_repair memory footprint. The fs size only determines concurrency, but this particular test case (ag_stride=-1) turns off the concurrency. Hence we'll see the same problem with a 1TB filesystem with 50 million inodes in it, and there's *lots* of those around. Also, to address the "fake-numa" setup - the problem is node-local allocation policy, not the size of the nodes. The repair threads wander all over the machine, even when there are only 8 threads running (ag_stride=-1): $ ps wauxH |grep xfs_repair |wc -l 8 $ top - 07:31:08 up 24 min, 2 users, load average: 1.75, 0.62, 0.69 Tasks: 240 total, 1 running, 239 sleeping, 0 stopped, 0 zombie %Cpu0 : 3.2 us, 13.4 sy, 0.0 ni, 63.6 id, 19.8 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu1 : 0.0 us, 0.0 sy, 0.0 ni, 99.7 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu2 : 1.4 us, 8.5 sy, 0.0 ni, 74.7 id, 15.4 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu3 : 1.0 us, 1.7 sy, 0.0 ni, 83.2 id, 14.1 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu4 : 0.0 us, 0.7 sy, 0.0 ni, 98.7 id, 0.7 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu5 : 4.0 us, 21.7 sy, 0.0 ni, 56.5 id, 17.8 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu6 : 1.0 us, 2.3 sy, 0.0 ni, 92.4 id, 4.3 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu7 : 2.0 us, 13.5 sy, 0.0 ni, 80.7 id, 3.7 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu8 : 18.8 us, 2.7 sy, 0.0 ni, 78.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu9 : 1.4 us, 10.2 sy, 0.0 ni, 87.4 id, 1.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu10 : 2.4 us, 13.2 sy, 0.0 ni, 84.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu11 : 1.0 us, 6.1 sy, 0.0 ni, 88.6 id, 4.4 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu12 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu13 : 0.7 us, 5.4 sy, 0.0 ni, 79.3 id, 14.6 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu14 : 0.3 us, 0.7 sy, 0.0 ni, 93.3 
id,  5.7 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu15 :  2.7 us,  9.3 sy,  0.0 ni, 87.7 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st

So, 8 threads doing work, 16 cpu cores, and only one fully idle processor core in the machine over a 5s sample period. Hence it seems to me that process memory is getting sprayed over all nodes because of the way the scheduler moves processes around, not because of the small memory size of the numa nodes.

> and so I wouldn't be uncomfortable saying "ok, the page
> table handling cleanup caused some issues, but we know about them and
> how to fix them longer-term". So I don't consider this a 4.0
> showstopper or a "we need to revert for now" issue.

I don't consider it necessary to revert, either, but I don't want it swept under the table because of "workload not relevant" arguments.

Cheers, Dave.
On Tue, Mar 17, 2015 at 1:51 PM, Dave Chinner <david@fromorbit.com> wrote: > > On the -o ag_stride=-1 -o bhash=101073 config, the 60s perf stat I > was using during steady state shows: > > 471,752 migrate:mm_migrate_pages ( +- 7.38% ) > > The migrate pages rate is even higher than in 4.0-rc1 (~360,000) > and 3.19 (~55,000), so that looks like even more of a problem than > before. Hmm. How stable are those numbers boot-to-boot? That kind of extreme spread makes me suspicious. It's also interesting that if the numbers really go up even more (and by that big amount), then why does there seem to be almost no correlation with performance (which apparently went up since rc1, despite migrate_pages getting even _worse_). > And the profile looks like: > > - 43.73% 0.05% [kernel] [k] native_flush_tlb_others Ok, that's down from rc1 (67%), but still hugely up from 3.19 (13.7%). And flush_tlb_page() does seem to be called about ten times more (flush_tlb_mm_range used to be 1.4% of the callers, now it's invisible at 0.13%) Damn. From a performance number standpoint, it looked like we zoomed in on the right thing. But now it's migrating even more pages than before. Odd. > And the vmstats are: > > 3.19: > > numa_hit 5163221 > numa_local 5153127 > 4.0-rc1: > > numa_hit 36952043 > numa_local 36927384 > > 4.0-rc4: > > numa_hit 23447345 > numa_local 23438564 > > Page migrations are still up by a factor of ~20 on 3.19. The thing is, those "numa_hit" things come from the zone_statistics() call in buffered_rmqueue(), which in turn is simple from the memory allocator. That has *nothing* to do with virtual memory, and everything to do with actual physical memory allocations. So the load is simply allocating a lot more pages, presumably for those stupid migration events. But then it doesn't correlate with performance anyway.. Can you do a simple stupid test? 
Apply that commit 53da3bc2ba9e ("mm: fix up numa read-only thread grouping logic") to 3.19, so that it uses the same "pte_dirty()" logic as 4.0-rc4. That *should* make the 3.19 and 4.0-rc4 numbers comparable. It does make me wonder if your load is "chaotic" wrt scheduling. The load presumably wants to spread out across all cpu's, but then the numa code tries to group things together for numa accesses, but depending on just random allocation patterns and layout in the hash tables, there either are patters with page access or there aren't. Which is kind of why I wonder how stable those numbers are boot to boot. Maybe this is at least partly about lucky allocation patterns. Linus
On Tue, Mar 17, 2015 at 02:30:57PM -0700, Linus Torvalds wrote: > On Tue, Mar 17, 2015 at 1:51 PM, Dave Chinner <david@fromorbit.com> wrote: > > > > On the -o ag_stride=-1 -o bhash=101073 config, the 60s perf stat I > > was using during steady state shows: > > > > 471,752 migrate:mm_migrate_pages ( +- 7.38% ) > > > > The migrate pages rate is even higher than in 4.0-rc1 (~360,000) > > and 3.19 (~55,000), so that looks like even more of a problem than > > before. > > Hmm. How stable are those numbers boot-to-boot? I've run the test several times but only profiles once so far. runtimes were 7m45, 7m50, 7m44s, 8m2s, and the profiles came from the 8m2s run. reboot, run again: $ sudo perf stat -a -r 6 -e migrate:mm_migrate_pages sleep 10 Performance counter stats for 'system wide' (6 runs): 572,839 migrate:mm_migrate_pages ( +- 3.15% ) 10.001664694 seconds time elapsed ( +- 0.00% ) $ And just to confirm, a minute later, still in phase 3: 590,974 migrate:mm_migrate_pages ( +- 2.86% ) Reboot, run again: 575,344 migrate:mm_migrate_pages ( +- 0.70% ) So there is boot-to-boot variation, but it doesn't look like it gets any better.... > That kind of extreme spread makes me suspicious. It's also interesting > that if the numbers really go up even more (and by that big amount), > then why does there seem to be almost no correlation with performance > (which apparently went up since rc1, despite migrate_pages getting > even _worse_). > > > And the profile looks like: > > > > - 43.73% 0.05% [kernel] [k] native_flush_tlb_others > > Ok, that's down from rc1 (67%), but still hugely up from 3.19 (13.7%). > And flush_tlb_page() does seem to be called about ten times more > (flush_tlb_mm_range used to be 1.4% of the callers, now it's invisible > at 0.13%) > > Damn. From a performance number standpoint, it looked like we zoomed > in on the right thing. But now it's migrating even more pages than > before. Odd. Throttling problem, like Mel originally suspected? 
> > And the vmstats are: > > > > 3.19: > > > > numa_hit 5163221 > > numa_local 5153127 > > > 4.0-rc1: > > > > numa_hit 36952043 > > numa_local 36927384 > > > > 4.0-rc4: > > > > numa_hit 23447345 > > numa_local 23438564 > > > > Page migrations are still up by a factor of ~20 on 3.19. > > The thing is, those "numa_hit" things come from the zone_statistics() > call in buffered_rmqueue(), which in turn is simple from the memory > allocator. That has *nothing* to do with virtual memory, and > everything to do with actual physical memory allocations. So the load > is simply allocating a lot more pages, presumably for those stupid > migration events. > > But then it doesn't correlate with performance anyway.. > > Can you do a simple stupid test? Apply that commit 53da3bc2ba9e ("mm: > fix up numa read-only thread grouping logic") to 3.19, so that it uses > the same "pte_dirty()" logic as 4.0-rc4. That *should* make the 3.19 > and 4.0-rc4 numbers comparable. patched 3.19 numbers on this test are slightly worse than stock 3.19, but nowhere near as bad as 4.0-rc4: 241,718 migrate:mm_migrate_pages ( +- 5.17% ) So that pte_write->pte_dirty change makes this go from ~55k to 240k, and runtime go from 4m54s to 5m20s. vmstats: numa_hit 9162476 numa_miss 0 numa_foreign 0 numa_interleave 10685 numa_local 9153740 numa_other 8736 numa_pte_updates 49582103 numa_huge_pte_updates 0 numa_hint_faults 48075098 numa_hint_faults_local 12974704 numa_pages_migrated 5748256 pgmigrate_success 5748256 pgmigrate_fail 0 Cheers, Dave.
On Tue, Mar 17, 2015 at 3:08 PM, Dave Chinner <david@fromorbit.com> wrote:
>>
>> Damn. From a performance number standpoint, it looked like we zoomed
>> in on the right thing. But now it's migrating even more pages than
>> before. Odd.
>
> Throttling problem, like Mel originally suspected?

That doesn't much make sense for the original bisect you did, though. Although if there are two different issues, maybe that bisect was wrong. Or rather, incomplete.

>> Can you do a simple stupid test? Apply that commit 53da3bc2ba9e ("mm:
>> fix up numa read-only thread grouping logic") to 3.19, so that it uses
>> the same "pte_dirty()" logic as 4.0-rc4. That *should* make the 3.19
>> and 4.0-rc4 numbers comparable.
>
> patched 3.19 numbers on this test are slightly worse than stock
> 3.19, but nowhere near as bad as 4.0-rc4:
>
>     241,718      migrate:mm_migrate_pages        ( +- 5.17% )

Ok, that's still much worse than plain 3.19, which was ~55,000. Assuming your memory/measurements were the same.

So apparently the pte_write() -> pte_dirty() check isn't equivalent at all. My thinking was that for the common case (ie private mappings) it would be *exactly* the same, because all normal COW pages turn dirty at the same time they turn writable (and, in page_mkclean_one(), turn clean and read-only again at the same time). But if the numbers change that much, then clearly my simplistic "they are the same in practice" is just complete BS.

So why am I wrong? Why is testing for dirty not the same as testing for writable? I can see a few cases:

 - your load has lots of writable (but not written-to) shared memory, and maybe the test should be something like

	pte_dirty(pte) || (vma->vm_flags & (VM_WRITE|VM_SHARED) == (VM_WRITE|VM_SHARED))

   and we really should have some helper function for this logic.

 - something completely different that I am entirely missing

What am I missing?

       Linus
On Wed, Mar 18, 2015 at 9:08 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > So why am I wrong? Why is testing for dirty not the same as testing > for writable? > > I can see a few cases: > > - your load has lots of writable (but not written-to) shared memory Hmm. I tried to look at the xfsprog sources, and I don't see any MAP_SHARED activity. It looks like it's just using pread64/pwrite64, and the only MAP_SHARED is for the xfsio mmap test thing, not for xfsrepair. So I don't see any shared mappings, but I don't know the code-base. > - something completely different that I am entirely missing So I think there's something I'm missing. For non-shared mappings, I still have the idea that pte_dirty should be the same as pte_write. And yet, your testing of 3.19 shows that it's a big difference. There's clearly something I'm completely missing. Linus
On Wed, Mar 18, 2015 at 10:31:28AM -0700, Linus Torvalds wrote: > On Wed, Mar 18, 2015 at 9:08 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > So why am I wrong? Why is testing for dirty not the same as testing > > for writable? > > > > I can see a few cases: > > > > - your load has lots of writable (but not written-to) shared memory > > Hmm. I tried to look at the xfsprog sources, and I don't see any > MAP_SHARED activity. It looks like it's just using pread64/pwrite64, > and the only MAP_SHARED is for the xfsio mmap test thing, not for > xfsrepair. > > So I don't see any shared mappings, but I don't know the code-base. Right - all the mmap activity in the xfs_repair test is coming from memory allocation through glibc - we don't use mmap() directly anywhere in xfs_repair. FWIW, all the IO into these pages that are allocated is being done via direct IO, if that makes any difference... > > - something completely different that I am entirely missing > > So I think there's something I'm missing. For non-shared mappings, I > still have the idea that pte_dirty should be the same as pte_write. > And yet, your testing of 3.19 shows that it's a big difference. > There's clearly something I'm completely missing. This level of pte interactions is beyond my level of knowledge, so I'm afraid at this point I'm not going to be much help other than to test patches and report the result. FWIW, here's the distribution of the hash table we are iterating over. 
There are a lot of search misses, which means we are doing a lot of pointer chasing, but the distribution is centred directly around the goal of 8 entries per chain and there is no long tail: libxfs_bcache: 0x67e110 Max supported entries = 808584 Max utilized entries = 808584 Active entries = 808583 Hash table size = 101073 Hits = 9789987 Misses = 8224234 Hit ratio = 54.35 MRU 0 entries = 4667 ( 0%) MRU 1 entries = 0 ( 0%) MRU 2 entries = 4 ( 0%) MRU 3 entries = 797447 ( 98%) MRU 4 entries = 653 ( 0%) MRU 5 entries = 0 ( 0%) MRU 6 entries = 2755 ( 0%) MRU 7 entries = 1518 ( 0%) MRU 8 entries = 1518 ( 0%) MRU 9 entries = 0 ( 0%) MRU 10 entries = 21 ( 0%) MRU 11 entries = 0 ( 0%) MRU 12 entries = 0 ( 0%) MRU 13 entries = 0 ( 0%) MRU 14 entries = 0 ( 0%) MRU 15 entries = 0 ( 0%) Hash buckets with 0 entries 30 ( 0%) Hash buckets with 1 entries 241 ( 0%) Hash buckets with 2 entries 1019 ( 0%) Hash buckets with 3 entries 2787 ( 1%) Hash buckets with 4 entries 5838 ( 2%) Hash buckets with 5 entries 9144 ( 5%) Hash buckets with 6 entries 12165 ( 9%) Hash buckets with 7 entries 14194 ( 12%) Hash buckets with 8 entries 14387 ( 14%) Hash buckets with 9 entries 12742 ( 14%) Hash buckets with 10 entries 10253 ( 12%) Hash buckets with 11 entries 7308 ( 9%) Hash buckets with 12 entries 4872 ( 7%) Hash buckets with 13 entries 2869 ( 4%) Hash buckets with 14 entries 1578 ( 2%) Hash buckets with 15 entries 894 ( 1%) Hash buckets with 16 entries 430 ( 0%) Hash buckets with 17 entries 188 ( 0%) Hash buckets with 18 entries 88 ( 0%) Hash buckets with 19 entries 24 ( 0%) Hash buckets with 20 entries 11 ( 0%) Hash buckets with 21 entries 10 ( 0%) Hash buckets with 22 entries 1 ( 0%) Cheers, Dave.
On Wed, Mar 18, 2015 at 10:31 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > So I think there's something I'm missing. For non-shared mappings, I > still have the idea that pte_dirty should be the same as pte_write. > And yet, your testing of 3.19 shows that it's a big difference. > There's clearly something I'm completely missing. Ahh. The normal page table scanning and page fault handling both clear and set the dirty bit together with the writable one. But "fork()" will clear the writable bit without clearing dirty. For some reason I thought it moved the dirty bit into the struct page like the VM scanning does, but that was just me having a brainfart. So yeah, pte_dirty doesn't have to match pte_write even under perfectly normal circumstances. Maybe there are other cases. Not that I see a lot of forking in the xfs repair case either, so.. Dave, mind re-running the plain 3.19 numbers to really verify that the pte_dirty/pte_write change really made that big of a difference. Maybe your recollection of ~55,000 migrate_pages events was faulty. If the pte_write ->pte_dirty change is the *only* difference, it's still very odd how that one difference would make migrate_rate go from ~55k to 471k. That's an order of magnitude difference, for what really shouldn't be a big change. I'm running a kernel right now with a hacky "update_mmu_cache()" that warns if pte_dirty is ever different from pte_write(). 
+void update_mmu_cache(struct vm_area_struct *vma,
+		unsigned long addr, pte_t *ptep)
+{
+	if (!(vma->vm_flags & VM_SHARED)) {
+		pte_t now = READ_ONCE(*ptep);
+		if (!pte_write(now) != !pte_dirty(now)) {
+			static int count = 20;
+			static unsigned int prev = 0;
+			unsigned int val = pte_val(now) & 0xfff;
+			if (prev != val && count) {
+				prev = val;
+				count--;
+				WARN(1, "pte value %x", val);
+			}
+		}
+	}
+}

I haven't seen a single warning so far (and there I wrote all that code to limit repeated warnings), although admittedly update_mmu_cache() isn't called for all cases where we change a pte (not for the fork case, for example). But it *is* called for the page faulting cases.

Maybe a system update has changed libraries and memory allocation patterns, and there is something bigger than that one-liner pte_dirty/write change going on?

       Linus
On Thu, Mar 19, 2015 at 02:41:48PM -0700, Linus Torvalds wrote:
> On Wed, Mar 18, 2015 at 10:31 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So I think there's something I'm missing. For non-shared mappings, I
> > still have the idea that pte_dirty should be the same as pte_write.
> > And yet, your testing of 3.19 shows that it's a big difference.
> > There's clearly something I'm completely missing.
>
> Ahh. The normal page table scanning and page fault handling both clear
> and set the dirty bit together with the writable one. But "fork()"
> will clear the writable bit without clearing dirty. For some reason I
> thought it moved the dirty bit into the struct page like the VM
> scanning does, but that was just me having a brainfart. So yeah,
> pte_dirty doesn't have to match pte_write even under perfectly normal
> circumstances. Maybe there are other cases.
>
> Not that I see a lot of forking in the xfs repair case either, so..
>
> Dave, mind re-running the plain 3.19 numbers to really verify that the
> pte_dirty/pte_write change really made that big of a difference. Maybe
> your recollection of ~55,000 migrate_pages events was faulty. If the
> pte_write ->pte_dirty change is the *only* difference, it's still very
> odd how that one difference would make migrate_rate go from ~55k to
> 471k. That's an order of magnitude difference, for what really
> shouldn't be a big change.

My recollection wasn't faulty - I pulled it from an earlier email. That said, the original measurement might have been faulty. I ran the numbers again on the 3.19 kernel I saved away from the original testing. That came up at 235k, which is pretty much the same as yesterday's test. The runtime, however, is unchanged from my original measurements of 4m54s (pte_hack came in at 5m20s).
Wondering where the 55k number came from, I played around with when I started the measurement - all the numbers since I did the bisect have come from starting it at roughly 130AGs into phase 3 where the memory footprint stabilises and the tlb flush overhead kicks in. However, if I start the measurement at the same time as the repair test, I get something much closer to the 55k number. I also note that my original 4.0-rc1 numbers were much lower than the more recent steady state measurements (360k vs 470k), so I'd say the original numbers weren't representative of the steady state behaviour and so can be ignored... > Maybe a system update has changed libraries and memory allocation > patterns, and there is something bigger than that one-liner > pte_dirty/write change going on? Possibly. The xfs_repair binary has definitely been rebuilt (testing unrelated bug fixes that only affect phase 6/7 behaviour), but otherwise the system libraries are unchanged. Cheers, Dave.
On Thu, Mar 19, 2015 at 3:41 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> My recollection wasn't faulty - I pulled it from an earlier email.
> That said, the original measurement might have been faulty. I ran
> the numbers again on the 3.19 kernel I saved away from the original
> testing. That came up at 235k, which is pretty much the same as
> yesterday's test. The runtime, however, is unchanged from my original
> measurements of 4m54s (pte_hack came in at 5m20s).

Ok. Good. So the "more than an order of magnitude difference" was really about measurement differences, not quite as real. Looks like more a "factor of two" than a factor of 20.

Did you do the profiles the same way? Because that would explain the differences in the TLB flush percentages too (the "1.4% from tlb_invalidate_range()" vs "pretty much everything from migration").

The runtime variation does show that there's some *big* subtle difference for the numa balancing in the exact TNF_NO_GROUP details. It must be *very* unstable for it to make that big of a difference. But I feel at least a *bit* better about "unstable algorithm changes a small variation into a factor-of-two" vs that crazy factor-of-20.

Can you try Mel's change to make it use

	if (!(vma->vm_flags & VM_WRITE))

instead of the pte details? Again, on otherwise plain 3.19, just so that we have a baseline. I'd be *so* much happier with checking the vma details over per-pte details, especially ones that change over the lifetime of the pte entry, and the NUMA code explicitly mucks with.

       Linus
On Thu, Mar 19, 2015 at 04:05:46PM -0700, Linus Torvalds wrote:
> On Thu, Mar 19, 2015 at 3:41 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > My recollection wasn't faulty - I pulled it from an earlier email.
> > That said, the original measurement might have been faulty. I ran
> > the numbers again on the 3.19 kernel I saved away from the original
> > testing. That came up at 235k, which is pretty much the same as
> > yesterday's test. The runtime, however, is unchanged from my original
> > measurements of 4m54s (pte_hack came in at 5m20s).
>
> Ok. Good. So the "more than an order of magnitude difference" was
> really about measurement differences, not quite as real. Looks like
> more a "factor of two" than a factor of 20.
>
> Did you do the profiles the same way? Because that would explain the
> differences in the TLB flush percentages too (the "1.4% from
> tlb_invalidate_range()" vs "pretty much everything from migration").

No, the profiles all came from steady state. The profiles from the
initial startup phase hammer the mmap_sem because of page fault vs
mprotect contention (glibc runs mprotect() on every chunk of memory
it allocates). It's not until the cache reaches "full" and it starts
recycling old buffers rather than allocating new ones that the tlb
flush problem dominates the profiles.

> The runtime variation does show that there's some *big* subtle
> difference for the numa balancing in the exact TNF_NO_GROUP details.
> It must be *very* unstable for it to make that big of a difference.
> But I feel at least a *bit* better about "unstable algorithm changes a
> small variation into a factor-of-two" vs that crazy factor-of-20.
>
> Can you try Mel's change to make it use
>
>    if (!(vma->vm_flags & VM_WRITE))
>
> instead of the pte details? Again, on otherwise plain 3.19, just so
> that we have a baseline.
> I'd be *so* much happier with checking the vma
> details over per-pte details, especially ones that change over the
> lifetime of the pte entry, and that the NUMA code explicitly mucks with.

Yup, will do. Might take an hour or two before I get to it, though...

Cheers,

Dave.
On Thu, Mar 19, 2015 at 04:05:46PM -0700, Linus Torvalds wrote:
> Can you try Mel's change to make it use
>
>    if (!(vma->vm_flags & VM_WRITE))
>
> instead of the pte details? Again, on otherwise plain 3.19, just so
> that we have a baseline. I'd be *so* much happier with checking the vma
> details over per-pte details, especially ones that change over the
> lifetime of the pte entry, and that the NUMA code explicitly mucks with.

$ sudo perf_3.18 stat -a -r 6 -e migrate:mm_migrate_pages sleep 10

 Performance counter stats for 'system wide' (6 runs):

           266,750      migrate:mm_migrate_pages      ( +-  7.43% )

      10.002032292 seconds time elapsed               ( +-  0.00% )

Bit more variance there than the pte checking, but runtime difference
is in the noise - 5m4s vs 4m54s - and profiles are identical to the
pte checking version.

Cheers,

Dave.
On Thu, Mar 19, 2015 at 5:23 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Bit more variance there than the pte checking, but runtime
> difference is in the noise - 5m4s vs 4m54s - and profiles are
> identical to the pte checking version.

Ahh, so that "!(vma->vm_flags & VM_WRITE)" test works _almost_ as well
as the original !pte_write() test.

Now, can you check that on top of rc4? If I've gotten everything
right, we now have:

 - plain 3.19 (pte_write): 4m54s
 - 3.19 with vm_flags & VM_WRITE: 5m4s
 - 3.19 with pte_dirty: 5m20s

so the pte_dirty version seems to have been a bad choice indeed.

For 4.0-rc4 (which uses pte_dirty) you had 7m50s, so it's still
_much_ worse, but I'm wondering whether that VM_WRITE test will at
least shrink the difference like it does for 3.19.

And the VM_WRITE test should be stable and not have any subtle
interaction with the other changes that the numa pte things
introduced. It would be good to see if the profiles then pop something
*else* up as the performance difference (which I'm sure will remain,
since the 7m50s was so far off).

Linus
On Thu, Mar 19, 2015 at 06:29:47PM -0700, Linus Torvalds wrote:
> On Thu, Mar 19, 2015 at 5:23 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Bit more variance there than the pte checking, but runtime
> > difference is in the noise - 5m4s vs 4m54s - and profiles are
> > identical to the pte checking version.
>
> Ahh, so that "!(vma->vm_flags & VM_WRITE)" test works _almost_ as well
> as the original !pte_write() test.
>
> Now, can you check that on top of rc4? If I've gotten everything
> right, we now have:
>
>  - plain 3.19 (pte_write): 4m54s
>  - 3.19 with vm_flags & VM_WRITE: 5m4s
>  - 3.19 with pte_dirty: 5m20s

*nod*

> so the pte_dirty version seems to have been a bad choice indeed.
>
> For 4.0-rc4 (which uses pte_dirty) you had 7m50s, so it's still
> _much_ worse, but I'm wondering whether that VM_WRITE test will at
> least shrink the difference like it does for 3.19.

Testing now. It's a bit faster - three runs gave 7m35s, 7m20s and
7m36s. IOW, it's a bit better, but not significantly. Page migrations
are pretty much unchanged, too:

     558,632      migrate:mm_migrate_pages      ( +-  6.38% )

> And the VM_WRITE test should be stable and not have any subtle
> interaction with the other changes that the numa pte things
> introduced. It would be good to see if the profiles then pop something
> *else* up as the performance difference (which I'm sure will remain,
> since the 7m50s was so far off).

No, nothing new pops up in the kernel profiles. All the system CPU
time is still being spent sending IPIs on the tlb flush path.

Cheers,

Dave.
On Thu, Mar 19, 2015 at 04:05:46PM -0700, Linus Torvalds wrote:
> On Thu, Mar 19, 2015 at 3:41 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > My recollection wasn't faulty - I pulled it from an earlier email.
> > That said, the original measurement might have been faulty. I ran
> > the numbers again on the 3.19 kernel I saved away from the original
> > testing. That came up at 235k, which is pretty much the same as
> > yesterday's test. The runtime, however, is unchanged from my original
> > measurements of 4m54s (pte_hack came in at 5m20s).
>
> Ok. Good. So the "more than an order of magnitude difference" was
> really about measurement differences, not quite as real. Looks like
> more a "factor of two" than a factor of 20.
>
> Did you do the profiles the same way? Because that would explain the
> differences in the TLB flush percentages too (the "1.4% from
> tlb_invalidate_range()" vs "pretty much everything from migration").
>
> The runtime variation does show that there's some *big* subtle
> difference for the numa balancing in the exact TNF_NO_GROUP details.

TNF_NO_GROUP affects whether the scheduler tries to group related
processes together. Whether migration occurs depends on what node a
process is scheduled on. If processes are grouped inappropriately or
too aggressively then it is possible there is a bug that causes the
load balancer to move a process off a node (one possible migration)
with NUMA balancing then trying to pull it back (another possible
migration). Small bugs there can result in excessive migration.
On Thu, Mar 19, 2015 at 06:29:47PM -0700, Linus Torvalds wrote:
> And the VM_WRITE test should be stable and not have any subtle
> interaction with the other changes that the numa pte things
> introduced. It would be good to see if the profiles then pop something
> *else* up as the performance difference (which I'm sure will remain,
> since the 7m50s was so far off).
>

As a side-note, I did test a patch that checked pte_write and preserved
it across both the faults and the setting of protections. It did not
alter migration activity much but there was a drop in minor faults - a
20% drop in autonumabench and a 58% drop in the xfsrepair workload. I'm
assuming this is due to refaults to mark pages writable. The patch
looks and is hacky so I won't post it, to save people bleaching their
eyes. I'll spend some time soon (hopefully today) on a smooth way of
falling through to WP checks after trapping a NUMA fault.
On Thu, Mar 19, 2015 at 9:13 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Testing now. It's a bit faster - three runs gave 7m35s, 7m20s and
> 7m36s. IOW, it's a bit better, but not significantly. Page migrations
> are pretty much unchanged, too:
>
>      558,632      migrate:mm_migrate_pages      ( +-  6.38% )

Ok. That was kind of the expected thing.

I don't really know the NUMA fault rate limiting code, but one thing
that strikes me is that if it tries to balance the NUMA faults against
the *regular* faults, then maybe just the fact that we end up taking
more COW faults after a NUMA fault means that the NUMA rate limiting
code now gets over-eager (because it sees all those extra non-numa
faults).

Mel, does that sound at all possible? I really have never looked at
the magic automatic rate handling..

Linus
On Fri, Mar 20, 2015 at 10:02:23AM -0700, Linus Torvalds wrote:
> On Thu, Mar 19, 2015 at 9:13 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Testing now. It's a bit faster - three runs gave 7m35s, 7m20s and
> > 7m36s. IOW, it's a bit better, but not significantly. Page migrations
> > are pretty much unchanged, too:
> >
> >      558,632      migrate:mm_migrate_pages      ( +-  6.38% )
>
> Ok. That was kind of the expected thing.
>
> I don't really know the NUMA fault rate limiting code, but one thing
> that strikes me is that if it tries to balance the NUMA faults against
> the *regular* faults, then maybe just the fact that we end up taking
> more COW faults after a NUMA fault means that the NUMA rate limiting
> code now gets over-eager (because it sees all those extra non-numa
> faults).
>
> Mel, does that sound at all possible? I really have never looked at
> the magic automatic rate handling..
>

It should not be trying to balance against regular faults as it has no
information on them. The trapping of additional faults to mark the PTE
writable will alter timing, so it indirectly affects how many migration
faults there are, but this is only a side-effect IMO.

There is more overhead now due to losing the writable information and
that should be reduced, so I tried a few approaches. Ultimately, the
one that performed best and was easiest to understand simply preserved
the writable bit across the protection update and page fault. I'll post
it later when I stick a changelog on it.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432e14ff..a419b65770d6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1625,11 +1625,11 @@ struct task_struct {
 	/*
 	 * numa_faults_locality tracks if faults recorded during the last
-	 * scan window were remote/local. The task scan period is adapted
-	 * based on the locality of the faults with different weights
-	 * depending on whether they were shared or private faults
+	 * scan window were remote/local or failed to migrate. The task scan
+	 * period is adapted based on the locality of the faults with different
+	 * weights depending on whether they were shared or private faults
 	 */
-	unsigned long numa_faults_locality[2];
+	unsigned long numa_faults_locality[3];
 
 	unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
@@ -1719,6 +1719,7 @@ struct task_struct {
 #define TNF_NO_GROUP	0x02
 #define TNF_SHARED	0x04
 #define TNF_FAULT_LOCAL	0x08
+#define TNF_MIGRATE_FAIL 0x10
 
 #ifdef CONFIG_NUMA_BALANCING
 extern void task_numa_fault(int last_node, int node, int pages, int flags);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ce18f3c097a..bcfe32088b37 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1609,9 +1609,11 @@ static void update_task_scan_period(struct task_struct *p,
 	/*
 	 * If there were no record hinting faults then either the task is
 	 * completely idle or all activity is areas that are not of interest
-	 * to automatic numa balancing. Scan slower
+	 * to automatic numa balancing. Related to that, if there were failed
+	 * migration then it implies we are migrating too quickly or the local
+	 * node is overloaded. In either case, scan slower
 	 */
-	if (local + shared == 0) {
+	if (local + shared == 0 || p->numa_faults_locality[2]) {
 		p->numa_scan_period = min(p->numa_scan_period_max,
 			p->numa_scan_period << 1);
@@ -2080,6 +2082,8 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 	if (migrated)
 		p->numa_pages_migrated += pages;
+	if (flags & TNF_MIGRATE_FAIL)
+		p->numa_faults_locality[2] += pages;
 
 	p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages;
 	p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ae13ad31e113..f508fda07d34 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1353,7 +1353,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (migrated) {
 		flags |= TNF_MIGRATED;
 		page_nid = target_nid;
-	}
+	} else
+		flags |= TNF_MIGRATE_FAIL;
 
 	goto out;
 clear_pmdnuma:
diff --git a/mm/memory.c b/mm/memory.c
index 8068893697bb..187daf695f88 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3097,7 +3097,8 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (migrated) {
 		page_nid = target_nid;
 		flags |= TNF_MIGRATED;
-	}
+	} else
+		flags |= TNF_MIGRATE_FAIL;
 
 out:
 	if (page_nid != -1)
Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

    Across the board the 4.0-rc1 numbers are much slower, and the
    degradation is far worse when using the large memory footprint
    configs. Perf points straight at the cause - this is from 4.0-rc1
    on the "-o bhash=101073" config:

    -   56.07%    56.07%  [kernel]  [k] default_send_IPI_mask_sequence_phys
       - default_send_IPI_mask_sequence_phys
          - 99.99% physflat_send_IPI_mask
             - 99.37% native_send_call_func_ipi
                  smp_call_function_many
                - native_flush_tlb_others
                   - 99.85% flush_tlb_page
                        ptep_clear_flush
                        try_to_unmap_one
                        rmap_walk
                        try_to_unmap
                        migrate_pages
                        migrate_misplaced_page
                      - handle_mm_fault
                         - 99.73% __do_page_fault
                              trace_do_page_fault
                              do_async_page_fault
                            + async_page_fault
            0.63% native_send_call_func_single_ipi
               generic_exec_single
               smp_call_function_single

This is showing excessive migration activity even though excessive
migrations are meant to get throttled. Normally, the scan rate is tuned
on a per-task basis depending on the locality of faults. However, if
migrations fail for any reason then the PTE scanner may scan faster if
the faults continue to be remote. This means there is higher system CPU
overhead and fault trapping at exactly the time we know that migrations
cannot happen. This patch tracks when migration failures occur and
slows the PTE scanner.

This was tested on a 4-socket bare-metal machine with 48 cores. The
results compare 4.0-rc1, the patches applied and 3.19-vanilla, which
was the last known good kernel.
This is the standard autonuma benchmark

                                      4.0.0-rc1             4.0.0-rc1                3.19.0
                                        vanilla           slowscan-v2               vanilla
Time System-NUMA01                602.44 (  0.00%)      209.42 ( 65.24%)      194.70 ( 67.68%)
Time System-NUMA01_THEADLOCAL      78.10 (  0.00%)       92.70 (-18.69%)       98.52 (-26.15%)
Time System-NUMA02                  6.47 (  0.00%)        6.06 (  6.34%)        9.28 (-43.43%)
Time System-NUMA02_SMT              5.06 (  0.00%)        3.39 ( 33.00%)        3.79 ( 25.10%)
Time Elapsed-NUMA01               755.96 (  0.00%)      833.63 (-10.27%)      558.84 ( 26.08%)
Time Elapsed-NUMA01_THEADLOCAL    382.22 (  0.00%)      395.45 ( -3.46%)      382.54 ( -0.08%)
Time Elapsed-NUMA02                49.38 (  0.00%)       50.21 ( -1.68%)       49.83 ( -0.91%)
Time Elapsed-NUMA02_SMT            47.70 (  0.00%)       48.55 ( -1.78%)       46.59 (  2.33%)

There is a performance drop as a result of this patch although in the
case of NUMA01 it is not a major concern as it's an adverse workload.
The important point is that in most cases system CPU usage is much
lower. Here are the totals

               4.0.0-rc1   4.0.0-rc1      3.19.0
                 vanilla  slowscan-v2    vanilla
User            53384.29    56093.11    46119.12
System            692.14      311.64      306.41
Elapsed          1236.87     1328.61     1039.88

Note that the system CPU usage is now similar to 3.19-vanilla.

I also tested with a workload very similar to Dave's. The machine
configuration and storage is completely different so it's not an
equivalent test, unfortunately.
It's reporting the elapsed time and CPU time while fsmark is running to
create the inodes and when running xfsrepair afterwards

xfsrepair
                                     4.0.0-rc1             4.0.0-rc1                3.19.0
                                       vanilla           slowscan-v2               vanilla
Min      real-fsmark      1157.41 (  0.00%)     1150.38 (  0.61%)     1164.44 ( -0.61%)
Min      syst-fsmark      3998.06 (  0.00%)     3988.42 (  0.24%)     4016.12 ( -0.45%)
Min      real-xfsrepair    497.64 (  0.00%)      456.87 (  8.19%)      442.64 ( 11.05%)
Min      syst-xfsrepair    500.61 (  0.00%)      263.41 ( 47.38%)      194.97 ( 61.05%)
Amean    real-fsmark      1166.63 (  0.00%)     1155.97 (  0.91%)     1166.28 (  0.03%)
Amean    syst-fsmark      4020.94 (  0.00%)     4004.19 (  0.42%)     4025.87 ( -0.12%)
Amean    real-xfsrepair    507.85 (  0.00%)      459.58 (  9.50%)      447.66 ( 11.85%)
Amean    syst-xfsrepair    519.88 (  0.00%)      281.63 ( 45.83%)      202.93 ( 60.97%)
Stddev   real-fsmark         6.55 (  0.00%)        3.97 ( 39.30%)        1.44 ( 77.98%)
Stddev   syst-fsmark        16.22 (  0.00%)       15.09 (  6.96%)        9.76 ( 39.86%)
Stddev   real-xfsrepair     11.17 (  0.00%)        3.41 ( 69.43%)        5.57 ( 50.17%)
Stddev   syst-xfsrepair     13.98 (  0.00%)       19.94 (-42.60%)        5.69 ( 59.31%)
CoeffVar real-fsmark         0.56 (  0.00%)        0.34 ( 38.74%)        0.12 ( 77.97%)
CoeffVar syst-fsmark         0.40 (  0.00%)        0.38 (  6.57%)        0.24 ( 39.93%)
CoeffVar real-xfsrepair      2.20 (  0.00%)        0.74 ( 66.22%)        1.24 ( 43.47%)
CoeffVar syst-xfsrepair      2.69 (  0.00%)        7.08 (-163.23%)       2.80 ( -4.23%)
Max      real-fsmark      1171.98 (  0.00%)     1159.25 (  1.09%)     1167.96 (  0.34%)
Max      syst-fsmark      4033.84 (  0.00%)     4024.53 (  0.23%)     4039.20 ( -0.13%)
Max      real-xfsrepair    523.40 (  0.00%)      464.40 ( 11.27%)      455.42 ( 12.99%)
Max      syst-xfsrepair    533.37 (  0.00%)      309.38 ( 42.00%)      207.94 ( 61.01%)

The key point is that system CPU usage for xfsrepair (syst-xfsrepair)
is almost cut in half.
It's still not as low as 3.19-vanilla but it's much closer

                           4.0.0-rc1   4.0.0-rc1      3.19.0
                             vanilla  slowscan-v2    vanilla
NUMA alloc hit             146138883   121929782   104019526
NUMA alloc miss             13146328    11456356     7806370
NUMA interleave hit                0           0           0
NUMA alloc local           146060848   121865921   103953085
NUMA base PTE updates      242201535   117237258   216624143
NUMA huge PMD updates         113270       52121      127782
NUMA page range updates    300195775   143923210   282048527
NUMA hint faults           180388025    87299060   147235021
NUMA hint local faults      72784532    32939258    61866265
NUMA hint local percent           40          37          42
NUMA pages migrated         71175262    41395302    23237799

Note the big differences in faults trapped and pages migrated.
3.19-vanilla still migrated fewer pages, but if necessary the threshold
at which we start throttling migrations can be lowered.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h | 9 +++++----
 kernel/sched/fair.c   | 8 ++++++--
 mm/huge_memory.c      | 3 ++-
 mm/memory.c           | 3 ++-
 4 files changed, 15 insertions(+), 8 deletions(-)