[RFC,0/3] um: clean up mm creation - another attempt

Message ID: 20230922223737.1206223-4-johannes@sipsolutions.net
Series: um: clean up mm creation - another attempt

Message

Johannes Berg Sept. 22, 2023, 10:37 p.m. UTC
This shouldn't pessimize since we keep copy_context_skas0() and
the setup there, but still clean up the code there to not rely on
current, and move the force_flush_all() to a more appropriate
place, which fixes the issue with that in the fork_handler().

johannes

Comments

Anton Ivanov Sept. 25, 2023, 1:29 p.m. UTC | #1
I have rebased the preempt patch on top of this series.

PREEMPT works with some performance decrease.

VOLUNTARY deadlocks early in boot around the time it starts loading modules.

non-preemptible deadlocks very early in boot.

I am going to investigate and come back with a revised PREEMPT patch (hopefully, this will also flush out any remaining gremlins).

A.

On 22/09/2023 23:37, Johannes Berg wrote:
> This shouldn't pessimize since we keep copy_context_skas0() and
> the setup there, but still clean up the code there to not rely on
> current, and move the force_flush_all() to a more appropriate
> place, which fixes the issue with that in the fork_handler().
>
> johannes
>
>
>
Johannes Berg Sept. 25, 2023, 1:33 p.m. UTC | #2
On Mon, 2023-09-25 at 14:29 +0100, Anton Ivanov wrote:
> I have rebased the preempt patch on top of this series.
> 
> PREEMPT works with some performance decrease.
> 
> VOLUNTARY deadlocks early in boot around the time it starts loading modules.
> 
> non-preemptible deadlocks very early in boot.
> 

Well I guess that means there's still some issue in here? Hmm.

Now I don't understand anything anymore, I guess.

johannes
Anton Ivanov Sept. 25, 2023, 1:34 p.m. UTC | #3
On 25/09/2023 14:33, Johannes Berg wrote:
> On Mon, 2023-09-25 at 14:29 +0100, Anton Ivanov wrote:
>> I have rebased the preempt patch on top of this series.
>>
>> PREEMPT works with some performance decrease.
>>
>> VOLUNTARY deadlocks early in boot around the time it starts loading modules.
>>
>> non-preemptible deadlocks very early in boot.
>>
> 
> Well I guess that means there's still some issue in here? Hmm.
> 
> Now I don't understand anything anymore, I guess.

UML VM and process handling.

And here be dragons :)

> 
> johannes
>
Anton Ivanov Sept. 25, 2023, 2:27 p.m. UTC | #4
On 25/09/2023 14:33, Johannes Berg wrote:
> On Mon, 2023-09-25 at 14:29 +0100, Anton Ivanov wrote:
>> I have rebased the preempt patch on top of this series.
>>
>> PREEMPT works with some performance decrease.
>>
>> VOLUNTARY deadlocks early in boot around the time it starts loading modules.
>>
>> non-preemptible deadlocks very early in boot.
>>
> 
> Well I guess that means there's still some issue in here? Hmm.
> 
> Now I don't understand anything anymore, I guess.

PEBKAC. The tree got corrupted somewhere during rebase. Reapplying everything on top of a clean master fixed it.

So it all works.

With some performance penalties compared to the old approach, but works.

> 
> johannes
>
Johannes Berg Sept. 25, 2023, 2:44 p.m. UTC | #5
On Mon, 2023-09-25 at 15:27 +0100, Anton Ivanov wrote:
> On 25/09/2023 14:33, Johannes Berg wrote:
> > On Mon, 2023-09-25 at 14:29 +0100, Anton Ivanov wrote:
> > > I have rebased the preempt patch on top of this series.
> > > 
> > > PREEMPT works with some performance decrease.
> > > 
> > > VOLUNTARY deadlocks early in boot around the time it starts loading modules.
> > > 
> > > non-preemptible deadlocks very early in boot.
> > > 
> > 
> > Well I guess that means there's still some issue in here? Hmm.
> > 
> > Now I don't understand anything anymore, I guess.
> 
> PEBKAC. The tree got corrupted somewhere during rebase. Reapplying everything on top of a clean master fixed it.
> 
> So it all works.

OK, whew. At least now I no longer _completely_ doubt the mental model I
have of UML VM :-)

> With some performance penalties compared to the old approach, but works.

I still find this odd though, I don't see what the flush would possibly
do in a new (mm host) process that's not achievable in arch_dup_mmap()?

OK, so let's see - arch_dup_mmap() is _earlier_ than the fork_handler,
because that only happens on the very first switch into the process.
This is only when it gets scheduled. So we'd be looking for something
that copy_process() changes in the MM after copy_mm() and before it can
get scheduled?

I guess we could even move the flush into copy_thread(), which is a
simpler patch too, but it felt a bit wrong, since that's about the
(guest!) process, not the mm.

But basically I don't see anything there - fork syscall tail-calls
kernel_clone(), which doesn't really do anything with the result of
copy_process() except wake it up, and copy_process() doesn't really do
anything either?
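
For reference, the ordering as I read it - rough sketch from memory of
kernel/fork.c, so the details may well be off:

    /* simplified call order, not actual code */
    kernel_clone()
      copy_process()
        copy_mm()
          dup_mm()
            dup_mmap()         /* copies VMAs, mostly not page tables */
              arch_dup_mmap()  /* <- runs here */
        copy_thread()          /* after copy_mm() */
      wake_up_new_task()       /* child can only run after this */
    /* fork_handler() runs on the first switch into the child */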

johannes
Anton Ivanov Sept. 25, 2023, 3:20 p.m. UTC | #6
On 25/09/2023 15:44, Johannes Berg wrote:
> On Mon, 2023-09-25 at 15:27 +0100, Anton Ivanov wrote:
>> On 25/09/2023 14:33, Johannes Berg wrote:
>>> On Mon, 2023-09-25 at 14:29 +0100, Anton Ivanov wrote:
>>>> I have rebased the preempt patch on top of this series.
>>>>
>>>> PREEMPT works with some performance decrease.
>>>>
>>>> VOLUNTARY deadlocks early in boot around the time it starts loading modules.
>>>>
>>>> non-preemptible deadlocks very early in boot.
>>>>
>>> Well I guess that means there's still some issue in here? Hmm.
>>>
>>> Now I don't understand anything anymore, I guess.
>> PEBKAC. The tree got corrupted somewhere during rebase. Reapplying everything on top of a clean master fixed it.
>>
>> So it all works.
> OK, whew. At least now I no longer _completely_ doubt the mental model I
> have of UML VM :-)
>
>> With some performance penalties compared to the old approach, but works.
> I still find this odd though, I don't see what the flush would possibly
> do in a new (mm host) process that's not achievable in arch_dup_mmap()?
>
> OK, so let's see - arch_dup_mmap() is _earlier_ than the fork_handler,
> because that only happens on the very first switch into the process.
> This is only when it gets scheduled. So we'd be looking for something
> that copy_process() changes in the MM after copy_mm() and before it can
> get scheduled?
>
> I guess we could even move the flush into copy_thread(), which is a
> simpler patch too, but it felt a bit wrong, since that's about the
> (guest!) process, not the mm.
>
> But basically I don't see anything there - fork syscall tail-calls
> kernel_clone(), which doesn't really do anything with the result of
> copy_process() except wake it up, and copy_process() doesn't really do
> anything either?

I am continuing to dig through that and looking for voluntary preemption points in the process.

We have none of our own and the generic ones are not invoked - voluntary for all practical purposes does not differ from no-preempt.

If I have some suspects, I will post their mugshots to the list :)

> johannes
>
Anton Ivanov Sept. 26, 2023, 12:16 p.m. UTC | #7
On 25/09/2023 16:20, Anton Ivanov wrote:
> 
> On 25/09/2023 15:44, Johannes Berg wrote:
>> On Mon, 2023-09-25 at 15:27 +0100, Anton Ivanov wrote:
>>> On 25/09/2023 14:33, Johannes Berg wrote:
>>>> On Mon, 2023-09-25 at 14:29 +0100, Anton Ivanov wrote:
>>>>> I have rebased the preempt patch on top of this series.
>>>>>
>>>>> PREEMPT works with some performance decrease.
>>>>>
>>>>> VOLUNTARY deadlocks early in boot around the time it starts loading modules.
>>>>>
>>>>> non-preemptible deadlocks very early in boot.
>>>>>
>>>> Well I guess that means there's still some issue in here? Hmm.
>>>>
>>>> Now I don't understand anything anymore, I guess.
>>> PEBKAC. The tree got corrupted somewhere during rebase. Reapplying everything on top of a clean master fixed it.
>>>
>>> So it all works.
>> OK, whew. At least now I no longer _completely_ doubt the mental model I
>> have of UML VM :-)
>>
>>> With some performance penalties compared to the old approach, but works.
>> I still find this odd though, I don't see what the flush would possibly
>> do in a new (mm host) process that's not achievable in arch_dup_mmap()?
>>
>> OK, so let's see - arch_dup_mmap() is _earlier_ than the fork_handler,
>> because that only happens on the very first switch into the process.
>> This is only when it gets scheduled. So we'd be looking for something
>> that copy_process() changes in the MM after copy_mm() and before it can
>> get scheduled?
>>
>> I guess we could even move the flush into copy_thread(), which is a
>> simpler patch too, but it felt a bit wrong, since that's about the
>> (guest!) process, not the mm.
>>
>> But basically I don't see anything there - fork syscall tail-calls
>> kernel_clone(), which doesn't really do anything with the result of
>> copy_process() except wake it up, and copy_process() doesn't really do
>> anything either?
> 
> I am continuing to dig through that and looking for voluntary preemption points in the process.
> 
> We have none of our own and the generic ones are not invoked - voluntary for all practical purposes does not differ from no-preempt.
> 
> If I have some suspects, I will post their mugshots to the list :)

For the time being it is mostly negative :)

1. The performance after the mm patch is down. By 30-40% on my standard bench.

2. The preemption patches work fine on top (all 3 cases). The performance difference stays.

3. We do not have anything of value to add in terms of cond_resched() to the drivers :(
Most drivers are fairly simplistic with no safe points to add this.

4. The upper layer resched points are not enough.
E.g. -EAGAIN from block, retries, etc. should all hit paths in the upper layer which cause a
cond_resched(). They are not noticeable. VOLUNTARY behaves nearly identically to non-preemptive.

5. There are a couple of places where an if (condition) schedule() should probably be replaced
with cond_resched(). E.g. interrupt_end(). While it looks more appropriate, the performance
effect is nil.

6. Do we still need force_flush_all() in the arch_dup_mmap()? This works with a non-forced tlb flush
using flush_tlb_mm(mm);

7. In all cases, UML is doing something silly.
The CPU usage while doing find -type f -exec cat {} > /dev/null measured from outside in non-preemptive and
PREEMPT_VOLUNTARY stays around 8-15%. The UML takes a sabbatical for the remaining 85% instead of actually
doing work. PREEMPT is slightly better at 60%, but still far from 100%. It just keeps going into idle and I
cannot understand why.

8. All in all I do not have an obvious suspect for the performance difference. No change candidates
either. I also do not have any ideas why it ends up in idle instead of doing work.

> 
>> johannes
>>
Johannes Berg Sept. 26, 2023, 12:38 p.m. UTC | #8
On Tue, 2023-09-26 at 13:16 +0100, Anton Ivanov wrote:
> 
> For the time being it is mostly negative :)

Oh well :)

> 1. The performance after the mm patch is down. By 30-40% on my standard bench.

For the record, you mean the three-patch series this thread is about?


Btw, Benjamin realized that MADV_DONTFORK is broken in UML, precisely
_because_ we fork/copy the whole mm process and then try to fix it up.
But we can only fix up things that actually have VMAs, and of course
there are no VMAs with VM_DONTCOPY (set by MADV_DONTFORK) in the new mm
after fork.
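
Untested sketch, but something like this should show it - on a normal
kernel the child dies with SIGSEGV when it touches the page, while on
UML today I'd expect it to survive:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        long len = sysconf(_SC_PAGESIZE);
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        int status;

        strcpy(p, "secret");
        /* sets VM_DONTCOPY: the mapping must not exist in children */
        madvise(p, len, MADV_DONTFORK);

        if (fork() == 0) {
            printf("child read: %s\n", p); /* should fault here */
            _exit(0);
        }
        wait(&status);
        printf("child %s\n", WIFSIGNALED(status) ?
               "was killed (correct)" : "survived (broken)");
        return 0;
    }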

To fix this, really we should either

1. Start from scratch, without copying, which my other patch [1] did.

   [1] https://lore.kernel.org/all/20230922131638.2c57ec713d1c.Id11dff4b349e6a8f0136bb6bb09f6e01a80befbb@changeid/

   But of course that's more expensive because we now have to page-fault
   everything in the new process, and page faults are expensive.

2. Compare the new mm and the old mm, which requires putting it into
   arch_dup_mmap() like these patches here - where I'm not sure I
   understand at all why they cause a perf regression - and remove the
   VMAs that are marked VM_DONTCOPY in the old one.


To be honest I don't really like _either_ of these approaches, nor the
current "fork the process" approach that UML takes. It's very magic, and
very much works around how Linux works.

Remember that basically the mm process contents should match the page
tables in the VMAs; but this is decidedly not true where fork() is
involved, because while the VMAs are copied, most of the page tables are
_not_ copied. Thus, we have a situation where after fork we don't take
page faults in UML that we would take in a normal system (this part is
good for performance), and I believe also vice versa, which would then
perhaps explain the flush_tlb_page() in handle_page_fault(), because
honestly I don't otherwise have an explanation for it.


I think the better approach for correctness and integration into the
kernel would be to actually admit that UML is special because page
faults are so expensive, and

 * start with a fresh mm process every time
 * have vma_needs_copy() return true
 * completely fill the mappings according to only the new mm's VMAs
   in arch_dup_mmap() or perhaps later

I don't know how that'd behave wrt. performance, though it likely cannot
be better than with these patches, but at least it'd be more correct,
and more obviously correct too, for starters, because then the actual
mappings in the UML mm process would actually reflect the PTEs that
Linux knows about.
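
The last step could be as small as this (rough sketch, untested) -
given your point 6, a plain flush_tlb_mm() should already do the walk:

    void arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
    {
        /* mm would be a fresh (empty) host process here, not a fork
         * of the parent, so just replay everything in its page
         * tables as host mmap()/mprotect() calls */
        flush_tlb_mm(mm);
    }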


> 2. The preemption patches work fine on top (all 3 cases). The performance difference stays.

OK.

> 3. We do not have anything of value to add in terms of cond_resched() to the drivers :(
> Most drivers are fairly simplistic with no safe points to add this.

Yeah, not surprised by this.

> 6. Do we still need force_flush_all() in the arch_dup_mmap()? This works with a non-forced tlb flush
> using flush_tlb_mm(mm);

Maybe not, does it make a difference though?

> 7. In all cases, UML is doing something silly.
> The CPU usage while doing find -type f -exec cat {} > /dev/null measured from outside in non-preemptive and
> PREEMPT_VOLUNTARY stays around 8-15%. The UML takes a sabbatical for the remaining 85% instead of actually
> doing work. PREEMPT is slightly better at 60%, but still far from 100%. It just keeps going into idle and I
> cannot understand why.

Is it just waiting for IO?

johannes
Anton Ivanov Sept. 26, 2023, 1:04 p.m. UTC | #9
On 26/09/2023 13:38, Johannes Berg wrote:
> On Tue, 2023-09-26 at 13:16 +0100, Anton Ivanov wrote:
>>
>> For the time being it is mostly negative :)
> 
> Oh well :)
> 
>> 1. The performance after the mm patch is down. By 30-40% on my standard bench.
> 
> For the record, you mean the three-patch series this thread is about?

Yes. It has no stability issues on its own as well as with the PREEMPT patch on top.

> 
> 
> Btw, Benjamin realized that MADV_DONTFORK is broken in UML, precisely
> _because_ we fork/copy the whole mm process and then try to fix it up.
> But we can only fix up things that actually have VMAs, and of course
> there are no VMAs with VM_DONTCOPY (set by MADV_DONTFORK) in the new mm
> after fork.
> 
> To fix this, really we should either
> 
> 1. Start from scratch, without copying, which my other patch [1] did.
> 
>     [1] https://lore.kernel.org/all/20230922131638.2c57ec713d1c.Id11dff4b349e6a8f0136bb6bb09f6e01a80befbb@changeid/
> 
>     But of course that's more expensive because we now have to page-fault
>     everything in the new process, and page faults are expensive.
> 
> 2. Compare the new mm and the old mm, which requires putting it into
>     arch_dup_mmap() like these patches here - where I'm not sure I
>     understand at all why they cause a perf regression - and remove the
>     VMAs that are marked VM_DONTCOPY in the old one.
> 
> 
> To be honest I don't really like _either_ of these approaches, nor the
> current "fork the process" approach that UML takes. It's very magic, and
> very much works around how Linux works.

+1

> 
> Remember that basically the mm process contents should match the page
> tables in the VMAs; but this is decidedly not true where fork() is
> involved, because while the VMAs are copied, most of the page tables are
> _not_ copied. Thus, we have a situation where after fork we don't take
> page faults in UML that we would take in a normal system (this part is
> good for performance), and I believe also vice versa, which would then
> perhaps explain the flush_tlb_page() in handle_page_fault(), because
> honestly I don't otherwise have an explanation for it.
> 
> 
> I think the better approach for correctness and integration into the
> kernel would be to actually admit that UML is special because page
> faults are so expensive, and
> 
>   * start with a fresh mm process every time
>   * have vma_needs_copy() return true
>   * completely fill the mappings according to only the new mm's VMAs
>     in arch_dup_mmap() or perhaps later
> 
> I don't know how that'd behave wrt. performance, though it likely cannot
> be better than with these patches, but at least it'd be more correct,
> and more obviously correct too, for starters, because then the actual
> mappings in the UML mm process would actually reflect the PTEs that
> Linux knows about.

We can try that.

> 
> 
>> 2. The preemption patches work fine on top (all 3 cases). The performance difference stays.
> 
> OK.
> 
>> 3. We do not have anything of value to add in terms of cond_resched() to the drivers :(
>> Most drivers are fairly simplistic with no safe points to add this.
> 
> Yeah, not surprised by this.
> 
>> 6. Do we still need force_flush_all() in the arch_dup_mmap()? This works with a non-forced tlb flush
>> using flush_tlb_mm(mm);
> 
> Maybe not, does it make a difference though?

Nope. Same numbers in both cases.

> 
>> 7. In all cases, UML is doing something silly.
>> The CPU usage while doing find -type f -exec cat {} > /dev/null measured from outside in non-preemptive and
>> PREEMPT_VOLUNTARY stays around 8-15%. The UML takes a sabbatical for the remaining 85% instead of actually
>> doing work. PREEMPT is slightly better at 60%, but still far from 100%. It just keeps going into idle and I
>> cannot understand why.
> 
> Is it just waiting for IO?

Nope. Nearly all I see on strace is wait4 and PTRACE. The epoll_waits are few and far between.

The bottleneck is mm and vm, not IO :(

> 
> johannes
>
Benjamin Berg Sept. 27, 2023, 9:52 a.m. UTC | #10
Hi,

On Tue, 2023-09-26 at 14:38 +0200, Johannes Berg wrote:
> [SNIP]
> 1. Start from scratch, without copying, which my other patch [1] did.

I really think we should go ahead with that approach. Then follow up
with optimizations.

> [SNIP]
> 
> I think the better approach for correctness and integration into the
> kernel would be to actually admit that UML is special because page
> faults are so expensive, and
> 
>  * start with a fresh mm process every time
>  * have vma_needs_copy() return true
>  * completely fill the mappings according to only the new mm's VMAs
>    in arch_dup_mmap() or perhaps later
> 
> I don't know how that'd behave wrt. performance, though it likely cannot
> be better than with these patches, but at least it'd be more correct,
> and more obviously correct too, for starters, because then the actual
> mappings in the UML mm process would actually reflect the PTEs that
> Linux knows about.

Yes, performance may degrade, but the implementation should be correct
in the first place. Note that even though we looked at it (and e.g.
found that MADV_DONTFORK is incorrect), we have not figured out why the
first approach is slower currently as everything interesting should be
getting unmapped by the force_flush_all.

Once we are there, we can look for optimizations. The fundamental
problem is that page faults (even minor ones) are extremely expensive
for us.

Just throwing out ideas on what we could do:
   1. SECCOMP as that reduces the amount of context switches.
      (Yes, I know I should resubmit the patchset)
   2. Maybe we can disable/cripple page access tracking? If we
      initially mark all pages as accessed by userspace (i.e.
      pte_mkyoung), then we avoid a minor page fault on first access.
      Doing that will mess with page eviction though.
   3. Do DAX (direct_access) for files. i.e. mmap files directly in the
      host kernel rather than through UM.
      With a hostfs like file system, one should be able to add an
      intermediate block device that maps host files to physical pages,
      then do DAX in the FS.
      For disk images, the existing iomem infrastructure should be
      usable, this should work with any DAX enabled filesystems (ext2,
      ext4, xfs, virtiofs, erofs).
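
For 2., I mean roughly this - sketch only, the function name is made
up, but it would sit wherever we install a PTE for userspace:

    static void um_set_pte_at(struct mm_struct *mm, unsigned long addr,
                              pte_t *ptep, pte_t pte)
    {
        /* pretend the page was already accessed, so userspace never
         * takes a minor fault purely for access tracking - this is
         * the part that will confuse page eviction */
        set_pte(ptep, pte_mkyoung(pte));
    }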

Benjamin

> 
> > 2. The preemption patches work fine on top (all 3 cases). The
> > performance difference stays.
> 
> OK.
> 
> > 3. We do not have anything of value to add in terms of
> > cond_resched() to the drivers :(
> > Most drivers are fairly simplistic with no safe points to add this.
> 
> Yeah, not surprised by this.
> 
> > 6. Do we still need force_flush_all() in the arch_dup_mmap()? This
> > works with a non-forced tlb flush
> > using flush_tlb_mm(mm);
> 
> Maybe not, does it make a difference though?
> 
> > 7. In all cases, UML is doing something silly.
> > The CPU usage while doing find -type f -exec cat {} > /dev/null
> > measured from outside in non-preemptive and
> > PREEMPT_VOLUNTARY stays around 8-15%. The UML takes a sabbatical
> > for the remaining 85% instead of actually
> > doing work. PREEMPT is slightly better at 60%, but still far from
> > 100%. It just keeps going into idle and I
> > cannot understand why.
> 
> Is it just waiting for IO?
> 
> johannes
> 
Anton Ivanov Sept. 27, 2023, 9:59 a.m. UTC | #11
On 27/09/2023 10:52, Benjamin Berg wrote:
> Hi,
> 
> On Tue, 2023-09-26 at 14:38 +0200, Johannes Berg wrote:
>> [SNIP]
>> 1. Start from scratch, without copying, which my other patch [1] did.
> 
> I really think we should go ahead with that approach. Then follow up
> with optimizations.

+1

> 
>> [SNIP]
>>
>> I think the better approach for correctness and integration into the
>> kernel would be to actually admit that UML is special because page
>> faults are so expensive, and
>>
>>   * start with a fresh mm process every time
>>   * have vma_needs_copy() return true
>>   * completely fill the mappings according to only the new mm's VMAs
>>     in arch_dup_mmap() or perhaps later
>>
>> I don't know how that'd behave wrt. performance, though it likely cannot
>> be better than with these patches, but at least it'd be more correct,
>> and more obviously correct too, for starters, because then the actual
>> mappings in the UML mm process would actually reflect the PTEs that
>> Linux knows about.
> 
> Yes, performance may degrade, but the implementation should be correct
> in the first place. Note that even though we looked at it (and e.g.
> found that MADV_DONTFORK is incorrect), we have not figured out why the
> first approach is slower currently as everything interesting should be
> getting unmapped by the force_flush_all.
> 
> Once we are there, we can look for optimizations. The fundamental
> problem is that page faults (even minor ones) are extremely expensive
> for us.
> 
> Just throwing out ideas on what we could do:
>     1. SECCOMP as that reduces the amount of context switches.
>        (Yes, I know I should resubmit the patchset)

Actually... YES, YES and YES.

I was just looking at all the workarounds which are in place to prevent
guest processes doing a syscall on the host. If this is prohibited at
a higher level we should get quite a boost as all these PTRACE_PEEKs
will become unnecessary.

>     2. Maybe we can disable/cripple page access tracking? If we
>        initially mark all pages as accessed by userspace (i.e.
>        pte_mkyoung), then we avoid a minor page fault on first access.
>        Doing that will mess with page eviction though.
>     3. Do DAX (direct_access) for files. i.e. mmap files directly in the
>        host kernel rather than through UM.
>        With a hostfs like file system, one should be able to add an
>        intermediate block device that maps host files to physical pages,
>        then do DAX in the FS.
>        For disk images, the existing iomem infrastructure should be
>        usable, this should work with any DAX enabled filesystems (ext2,
>        ext4, xfs, virtiofs, erofs).

I had some plans to do a ubd gen 2 which uses mmap and/or this. They are
presently way on the backburner. We can do some of that once we push
the new VM changes.

> 
> Benjamin
> 
>>
>>> 2. The preemption patches work fine on top (all 3 cases). The
>>> performance difference stays.
>>
>> OK.
>>
>>> 3. We do not have anything of value to add in terms of
>>> cond_resched() to the drivers :(
>>> Most drivers are fairly simplistic with no safe points to add this.
>>
>> Yeah, not surprised by this.
>>
>>> 6. Do we still need force_flush_all() in the arch_dup_mmap()? This
>>> works with a non-forced tlb flush
>>> using flush_tlb_mm(mm);
>>
>> Maybe not, does it make a difference though?
>>
>>> 7. In all cases, UML is doing something silly.
>>> The CPU usage while doing find -type f -exec cat {} > /dev/null
>>> measured from outside in non-preemptive and
>>> PREEMPT_VOLUNTARY stays around 8-15%. The UML takes a sabbatical
>>> for the remaining 85% instead of actually
>>> doing work. PREEMPT is slightly better at 60%, but still far from
>>> 100%. It just keeps going into idle and I
>>> cannot understand why.
>>
>> Is it just waiting for IO?
>>
>> johannes
>>
> 
>
Benjamin Berg Sept. 27, 2023, 10:42 a.m. UTC | #12
Hi,

On Wed, 2023-09-27 at 10:59 +0100, Anton Ivanov wrote:
> [SNIP]
> 
> I was just looking at all the workarounds which are in place to
> prevent guest processes doing a syscall on the host. If this is
> prohibited at a higher level we should get quite a boost as all these
> PTRACE_PEEKs will become unnecessary.

Are you maybe looking at the fallback code if "local_using_sysemu" is
not set? That code should be dropped either way, we can assume
PTRACE_SYSEMU support these days, so the code in e.g. handle_trap can
be removed. I had a patch for that already.

Benjamin

> 
> >     2. Maybe we can disable/cripple page access tracking? If we
> >        initially mark all pages as accessed by userspace (i.e.
> >        pte_mkyoung), then we avoid a minor page fault on first
> > access.
> >        Doing that will mess with page eviction though.
> >     3. Do DAX (direct_access) for files. i.e. mmap files directly
> > in the
> >        host kernel rather than through UM.
> >        With a hostfs like file system, one should be able to add an
> >        intermediate block device that maps host files to physical
> > pages,
> >        then do DAX in the FS.
> >        For disk images, the existing iomem infrastructure should be
> >        usable, this should work with any DAX enabled filesystems
> > (ext2,
> >        ext4, xfs, virtiofs, erofs).
> 
> I had some plans to do a ubd gen 2 which uses mmap and/or this. They
> are
> presently way on the backburner. We can do some of that once we push
> the new VM changes.
> 
> > 
> > Benjamin
> > 
> > > 
> > > > 2. The preemption patches work fine on top (all 3 cases). The
> > > > performance difference stays.
> > > 
> > > OK.
> > > 
> > > > 3. We do not have anything of value to add in terms of
> > > > cond_resched() to the drivers :(
> > > > Most drivers are fairly simplistic with no safe points to add
> > > > this.
> > > 
> > > Yeah, not surprised by this.
> > > 
> > > > 6. Do we still need force_flush_all() in the arch_dup_mmap()?
> > > > This
> > > > works with a non-forced tlb flush
> > > > using flush_tlb_mm(mm);
> > > 
> > > Maybe not, does it make a difference though?
> > > 
> > > > 7. In all cases, UML is doing something silly.
> > > > The CPU usage while doing find -type f -exec cat {} > /dev/null
> > > > measured from outside in non-preemptive and
> > > > PREEMPT_VOLUNTARY stays around 8-15%. The UML takes a
> > > > sabbatical
> > > > for the remaining 85% instead of actually
> > > > doing work. PREEMPT is slightly better at 60%, but still far
> > > > from
> > > > 100%. It just keeps going into idle and I
> > > > cannot understand why.
> > > 
> > > Is it just waiting for IO?
> > > 
> > > johannes
> > > 
> > 
> > 
>
Benjamin Berg Jan. 17, 2024, 5:17 p.m. UTC | #13
Hi,

On Wed, 2023-09-27 at 11:52 +0200, Benjamin Berg wrote:
> [SNIP]
> Once we are there, we can look for optimizations. The fundamental
> problem is that page faults (even minor ones) are extremely expensive
> for us.
> 
> Just throwing out ideas on what we could do:
>    1. SECCOMP as that reduces the amount of context switches.
>       (Yes, I know I should resubmit the patchset)
>    2. Maybe we can disable/cripple page access tracking? If we
>       initially mark all pages as accessed by userspace (i.e.
>       pte_mkyoung), then we avoid a minor page fault on first access.
>       Doing that will mess with page eviction though.
>    3. Do DAX (direct_access) for files. i.e. mmap files directly in the
>       host kernel rather than through UM.
>       With a hostfs like file system, one should be able to add an
>       intermediate block device that maps host files to physical pages,
>       then do DAX in the FS.
>       For disk images, the existing iomem infrastructure should be
>       usable, this should work with any DAX enabled filesystems (ext2,
>       ext4, xfs, virtiofs, erofs).

So, I experimented quite a bit over Christmas (including getting DAX to
work with virtiofs). At the end of all this my conclusion is that
insufficient page table synchronization is our main problem.

Basically, right now we rely on the flush_tlb_* functions from the
kernel, but these are only called when TLB entries are removed, *not*
when new PTEs are added (there is also update_mmu_cache, but it isn't
enough either). Effectively this means that new page table entries will
often only be synced because the userspace code runs into an
unnecessary segfault.
 
Really, what we need is a set_pte_at() implementation that marks the
memory range for synchronization. Then we can make sure we sync it
before switching to the userspace process (the equivalent of running
flush_tlb_mm_range right now).
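
Something like the following, handwaving all the details (the
sync_start/sync_end fields and both function names are invented):

    static void um_set_pte_at(struct mm_struct *mm, unsigned long addr,
                              pte_t *ptep, pte_t pte)
    {
        set_pte(ptep, pte);
        /* remember the span of addresses whose PTEs changed */
        mm->context.sync_start = min(mm->context.sync_start, addr);
        mm->context.sync_end = max(mm->context.sync_end,
                                   addr + PAGE_SIZE);
    }

    /* called before we switch to the userspace process */
    static void sync_to_host(struct mm_struct *mm)
    {
        if (mm->context.sync_end > mm->context.sync_start)
            /* i.e. what flush_tlb_mm_range() does for us today */
            fix_range(mm, mm->context.sync_start,
                      mm->context.sync_end, 0);
        mm->context.sync_start = ~0UL;
        mm->context.sync_end = 0;
    }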

I think we should:
 * Rewrite the userspace syscall code
   - Support delaying the execution of syscalls
   - Only support mmap/munmap/mprotect and LDT
   - Do simple compression of consecutive syscalls here
   - Drop the hand-written assembler
 * Improve the tlb.c code
   - remove the HVC abstraction
   - never force immediate syscall execution
 * Let set_pte_at() track which memory ranges need syncing
 * At that point we should be able to:
   - drop copy_context_skas0
   - make flush_tlb_* no-ops
   - drop flush_tlb_page from handle_page_fault
   - move unmap() from flush_thread to init_new_context
     (or do it as part of start_userspace)

So, I did try this using nasty hacks and IIRC one of my runs was going
from 21s to 16s and another from 63s to 56s. Which seems like a nice
improvement.

Benjamin


PS: As for DAX, it doesn't really seem to help performance. It didn't
seem to lower the number of page faults in UML. And, from my
perspective, it isn't really worth it just for the memory sharing.

PPS: dirty/young tracking seemed to only cause a small number of
page faults in the grand scheme. So probably not something worth
following up on.
Anton Ivanov Jan. 17, 2024, 7:45 p.m. UTC | #14
On 17/01/2024 17:17, Benjamin Berg wrote:
> Hi,
> 
> On Wed, 2023-09-27 at 11:52 +0200, Benjamin Berg wrote:
>> [SNIP]
>> Once we are there, we can look for optimizations. The fundamental
>> problem is that page faults (even minor ones) are extremely expensive
>> for us.
>>
>> Just throwing out ideas on what we could do:
>>     1. SECCOMP as that reduces the amount of context switches.
>>        (Yes, I know I should resubmit the patchset)
>>     2. Maybe we can disable/cripple page access tracking? If we
>>        initially mark all pages as accessed by userspace (i.e.
>>        pte_mkyoung), then we avoid a minor page fault on first access.
>>        Doing that will mess with page eviction though.
>>     3. Do DAX (direct_access) for files. i.e. mmap files directly in the
>>        host kernel rather than through UM.
>>        With a hostfs like file system, one should be able to add an
>>        intermediate block device that maps host files to physical pages,
>>        then do DAX in the FS.
>>        For disk images, the existing iomem infrastructure should be
>>        usable, this should work with any DAX enabled filesystems (ext2,
>>        ext4, xfs, virtiofs, erofs).
> 
> So, I experimented quite a bit over Christmas (including getting DAX to
> work with virtiofs). At the end of all this my conclusion is that
> insufficient page table synchronization is our main problem.
> 
> Basically, right now we rely on the flush_tlb_* functions from the
> kernel, but these are only called when TLB entries are removed, *not*
> when new PTEs are added (there is also update_mmu_cache, but it isn't
> enough either). Effectively this means that new page table entries will
> often only be synced because the userspace code runs into an
> unnecessary segfault.
>   
> Really, what we need is a set_pte_at() implementation that marks the
> memory range for synchronization. Then we can make sure we sync it
> before switching to the userspace process (the equivalent of running
> flush_tlb_mm_range right now).
> 
> I think we should:
>   * Rewrite the userspace syscall code
>     - Support delaying the execution of syscalls
>     - Only support mmap/munmap/mprotect and LDT
>     - Do simple compression of consecutive syscalls here
>     - Drop the hand-written assembler
>   * Improve the tlb.c code
>     - remove the HVC abstraction

Cool. That was not working particularly well. I tried to improve it a 
few times, but ripping it out and replacing it is probably a better idea.

>     - never force immediate syscall execution
>   * Let set_pte_at() track which memory ranges need syncing
>   * At that point we should be able to:
>     - drop copy_context_skas0
>     - make flush_tlb_* no-ops
>     - drop flush_tlb_page from handle_page_fault
>     - move unmap() from flush_thread to init_new_context
>       (or do it as part of start_userspace)
> 
> So, I did try this using nasty hacks and IIRC one of my runs was going
> from 21s to 16s and another from 63s to 56s. Which seems like a nice
> improvement.

Excellent. I assume you were using hostfs as usual, right? If so, the 
difference is likely to be even more noticeable on ubd.

> 
> Benjamin
> 
> 
> PS: As for DAX, it doesn't really seem to help performance. It didn't
> seem to lower the number of page faults in UML. And, from my
> perspective, it isn't really worth it just for the memory sharing.
> 
> PPS: dirty/young tracking seemed to only cause a small number of
> page faults in the grand scheme. So probably not something worth
> following up on.
>
Benjamin Berg Jan. 17, 2024, 7:54 p.m. UTC | #15
On Wed, 2024-01-17 at 19:45 +0000, Anton Ivanov wrote:
> On 17/01/2024 17:17, Benjamin Berg wrote:
> > Hi,
> > 
> > On Wed, 2023-09-27 at 11:52 +0200, Benjamin Berg wrote:
> > > [SNIP]
> > > Once we are there, we can look for optimizations. The fundamental
> > > problem is that page faults (even minor ones) are extremely expensive
> > > for us.
> > > 
> > > Just throwing out ideas on what we could do:
> > >     1. SECCOMP as that reduces the amount of context switches.
> > >        (Yes, I know I should resubmit the patchset)
> > >     2. Maybe we can disable/cripple page access tracking? If we
> > >        initially mark all pages as accessed by userspace (i.e.
> > >        pte_mkyoung), then we avoid a minor page fault on first access.
> > >        Doing that will mess with page eviction though.
> > >     3. Do DAX (direct_access) for files. i.e. mmap files directly in the
> > >        host kernel rather than through UM.
> > >        With a hostfs like file system, one should be able to add an
> > >        intermediate block device that maps host files to physical pages,
> > >        then do DAX in the FS.
> > >        For disk images, the existing iomem infrastructure should be
> > >        usable, this should work with any DAX enabled filesystems (ext2,
> > >        ext4, xfs, virtiofs, erofs).
> > 
> > So, I experimented quite a bit over Christmas (including getting DAX to
> > work with virtiofs). At the end of all this my conclusion is that
> > insufficient page table synchronization is our main problem.
> > 
> > Basically, right now we rely on the flush_tlb_* functions from the
> > kernel, but these are only called when TLB entries are removed, *not*
> > when new PTEs are added (there is also update_mmu_cache, but it isn't
> > enough either). Effectively this means that new page table entries will
> > often only be synced because the userspace code runs into an
> > unnecessary segfault.
> >   
> > Really, what we need is a set_pte_at() implementation that marks the
> > memory range for synchronization. Then we can make sure we sync it
> > before switching to the userspace process (the equivalent of running
> > flush_tlb_mm_range right now).
> > 
> > I think we should:
> >   * Rewrite the userspace syscall code
> >     - Support delaying the execution of syscalls
> >     - Only support mmap/munmap/mprotect and LDT
> >     - Do simple compression of consecutive syscalls here
> >     - Drop the hand-written assembler
> >   * Improve the tlb.c code
> >     - remove the HVC abstraction
> 
> Cool. That was not working particularly well. I tried to improve it a
> few times, but ripping it out and replacing it is probably a better idea.

Hm, now I realise that we still want mmap() syscall compression for the
kernel itself in tlb.c.
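
Roughly this kind of thing, as a sketch (the op struct and the emit
helper are made up):

    /* merge host mmap calls that are contiguous in both virtual
     * address and file offset, with identical protection */
    struct mmap_op {
        unsigned long virt, len, offset;
        int prot, fd;
    };

    static bool can_merge(const struct mmap_op *a, const struct mmap_op *b)
    {
        return a->virt + a->len == b->virt &&
               a->offset + a->len == b->offset &&
               a->fd == b->fd && a->prot == b->prot;
    }

    static void queue_op(struct mmap_op *pending, const struct mmap_op *op)
    {
        if (can_merge(pending, op)) {
            pending->len += op->len; /* extend instead of emitting */
            return;
        }
        emit_host_mmap(pending);     /* made-up: issue the host syscall */
        *pending = *op;
    }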

> >     - never force immediate syscall execution
> >   * Let set_pte_at() track which memory ranges need syncing
> >   * At that point we should be able to:
> >     - drop copy_context_skas0
> >     - make flush_tlb_* no-ops
> >     - drop flush_tlb_page from handle_page_fault
> >     - move unmap() from flush_thread to init_new_context
> >       (or do it as part of start_userspace)
> > 
> > So, I did try this using nasty hacks and IIRC one of my runs was going
> > from 21s to 16s and another from 63s to 56s. Which seems like a nice
> > improvement.
> 
> Excellent. I assume you were using hostfs as usual, right? If so, the
> difference is likely to be even more noticeable on ubd.

Yes, I was mostly testing hostfs. Initially also virtiofs with DAX, but
I went back as that didn't result in a pagefault count improvement once
I made some other adjustments.

Benjamin

> 
> > 
> > Benjamin
> > 
> > 
> > PS: As for DAX, it doesn't really seem to help performance. It didn't
> > seem to lower the number of page faults in UML. And, from my
> > perspective, it isn't really worth it just for the memory sharing.
> > 
> > PPS: dirty/young tracking seemed to only cause a small number of
> > page faults in the grand scheme. So probably not something worth
> > following up on.
> > 
>