Message ID | 20201116144426.8415-1-anton.ivanov@cambridgegreys.com
---|---
State | Superseded
Series | um: borrow bitops from the x86 tree
Hi Anton,

So I thought I'd test your performance patches here, and applied
(hopefully the latest versions of them) on top of 5.9:

  um: allow the use of glibc functions instead of builtins
  um: Fetch registers only for signals which need them
  um: enable the use of optimized xor routines in UML
  um: add a UML specific futex implementation
  um: Remove use of asprinf in umid.c
  um: "borrow" atomics from x86 architecture
  um: "borrow" cmpxchg from x86 tree in UML
  um: borrow bitops from the x86 tree

With the patches (compiled with glibc functions), one of my trivial
virtual lab tests gets:

  Time (mean ± σ):   15.918 s ± 0.833 s  [User: 10.977 s, System: 5.600 s]
  Range (min … max): 15.371 s … 17.986 s  10 runs

It's not a large improvement, but it seems noticeable; without the
patches I get:

  Time (mean ± σ):   16.525 s ± 0.884 s  [User: 11.355 s, System: 5.648 s]
  Range (min … max): 15.682 s … 18.088 s  10 runs

johannes
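[Editor's aside: the output above is in the format printed by a benchmarking tool such as hyperfine. For readers without one, a rough stand-in that reports the min/max wall-clock range over N runs can be sketched as below; this is an invented helper, not something from the thread, and it assumes GNU date for nanosecond timestamps.]

```shell
# Hypothetical helper, not part of the thread: time a command N times and
# report the fastest/slowest wall-clock runs, roughly what the
# "Range (min ... max)" line above summarizes.
# Assumes GNU date (for %N nanosecond resolution).
time_range() {
    runs="$1"; shift
    min=""; max=""
    i=1
    while [ "$i" -le "$runs" ]; do
        s=$(date +%s%N)
        "$@" > /dev/null 2>&1
        e=$(date +%s%N)
        d=$((e - s))
        if [ -z "$min" ] || [ "$d" -lt "$min" ]; then min=$d; fi
        if [ -z "$max" ] || [ "$d" -gt "$max" ]; then max=$d; fi
        i=$((i + 1))
    done
    echo "min ${min}ns max ${max}ns over ${runs} runs"
}

# Example: time a no-op three times.
time_range 3 true
```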
On 17/11/2020 11:05, Johannes Berg wrote:
> Hi Anton,
>
> So I thought I'd test your performance patches here, and applied
> (hopefully the latest versions of them) on top of 5.9:
>
>   um: allow the use of glibc functions instead of builtins
>   um: Fetch registers only for signals which need them
>   um: enable the use of optimized xor routines in UML
>   um: add a UML specific futex implementation
>   um: Remove use of asprinf in umid.c
>   um: "borrow" atomics from x86 architecture
>   um: "borrow" cmpxchg from x86 tree in UML
>   um: borrow bitops from the x86 tree
>
> With the patches (compiled with glibc functions), one of my trivial
> virtual lab tests gets:
>
>   Time (mean ± σ):   15.918 s ± 0.833 s  [User: 10.977 s, System: 5.600 s]
>   Range (min … max): 15.371 s … 17.986 s  10 runs
>
> It's not a large improvement, but it seems noticeable; without the
> patches I get:
>
>   Time (mean ± σ):   16.525 s ± 0.884 s  [User: 11.355 s, System: 5.648 s]
>   Range (min … max): 15.682 s … 18.088 s  10 runs
>
> johannes

This is similar to what I get.

My usual test is:

  time busybox find /usr/lib/ -type f -exec cat {} > /dev/null \;

I discard the first run and use only runs served from the fs cache.

With stock I get:

  real 34.0 - 36.0
  user 29.6 - 29.9
  sys   3.4 -  3.6

With the patch set I get:

  real 32.0 - 34.0
  user 28.2 - 29.2
  sys   3.0 -  3.4

dd if=/dev/zero of=/dev/null bs=1M on the whole UBD device, for the 2nd
run and later, gives 2.0 GB/s - 2.1 GB/s without the patches and
2.2 GB/s - 2.3 GB/s with them.

It is not a lot, but it is something - 2-5% on average, depending on the
actual test. The real gain will come from figuring out how to optimize
the memory mapper. It is the "handbrake" which slows down everything
else.
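[Editor's aside: the warm-cache methodology described above (one discarded cold run, then timing only cached runs) can be sketched as a small shell helper. This is an illustrative stand-in, not Anton's actual script: it uses plain find rather than busybox, GNU date for timestamps, and placeholder directory/run-count arguments.]

```shell
# Illustrative sketch of the warm-cache methodology from the thread,
# not the actual test script. The first pass populates the fs cache and
# is discarded; only subsequent runs are timed.
# Assumes plain find/cat and GNU date with %N support.
bench_cached() {
    dir="$1"; runs="${2:-3}"
    find "$dir" -type f -exec cat {} + > /dev/null 2>&1  # warm-up, discarded
    i=1
    while [ "$i" -le "$runs" ]; do
        s=$(date +%s%N)
        find "$dir" -type f -exec cat {} + > /dev/null
        e=$(date +%s%N)
        echo "run $i: $((e - s)) ns"
        i=$((i + 1))
    done
}

# Example on a throwaway directory:
d=$(mktemp -d)
echo test > "$d/file"
bench_cached "$d" 2
rm -rf "$d"
```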
On Tue, 2020-11-17 at 11:46 +0000, Anton Ivanov wrote:
>
> My usual test is:
>
>   time busybox find /usr/lib/ -type f -exec cat {} > /dev/null \;
>
> I discard the first run and use only runs from fs cache.

Oh. I didn't even run the timing inside. I ran it *outside*, something
like:

  time ./linux args... init=/path/to/test-script.sh

johannes
On 17/11/2020 12:11, Johannes Berg wrote:
> On Tue, 2020-11-17 at 11:46 +0000, Anton Ivanov wrote:
>>
>> My usual test is:
>>
>>   time busybox find /usr/lib/ -type f -exec cat {} > /dev/null \;
>>
>> I discard the first run and use only runs from fs cache.
>
> Oh. I didn't even run the timing inside. I ran it *outside*, something
> like
>
>   time ./linux args... init=/path/to/test-script.sh

I usually do a full set of tests on fs access, device IO access and
netperf after each patch.

Based on them, it looks like it is worth it.

The more interesting question is: is this the right organization?

We have stuff in multiple places now - arch/x86/um, arch/um, etc.

IMHO, we should probably look at getting it organized so that all
sub-arches are under the um tree at some point.

> johannes
>
> _______________________________________________
> linux-um mailing list
> linux-um@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-um
On 17/11/2020 12:53, Anton Ivanov wrote:
> On 17/11/2020 12:11, Johannes Berg wrote:
>> On Tue, 2020-11-17 at 11:46 +0000, Anton Ivanov wrote:
>>>
>>> My usual test is:
>>>
>>>   time busybox find /usr/lib/ -type f -exec cat {} > /dev/null \;
>>>
>>> I discard the first run and use only runs from fs cache.
>>
>> Oh. I didn't even run the timing inside. I ran it *outside*, something
>> like
>>
>>   time ./linux args... init=/path/to/test-script.sh
>
> I usually do a full set of tests on fs access, device IO access and a
> netperf after each patch.
>
> Based on them it looks like it is worth it.
>
> The more interesting question is - is this the right organization?
>
> We have stuff in multiple places now - arch/x86/um, arch/um, etc.
>
> IMHO, we should probably look at getting it organized so that all
> sub-arches are under the um tree at some point.

In the meantime, a backport of these patch sets (string, atomic, bitops,
xor, futex, etc.) to OpenWrt/UML has clocked 14 days as my main CPE. I
have not observed any stability issues, and there is some visible
improvement in CPU usage.
diff --git a/arch/um/include/asm/bitops-x86.h b/arch/um/include/asm/bitops-x86.h
new file mode 120000
index 000000000000..15a96ff554b2
--- /dev/null
+++ b/arch/um/include/asm/bitops-x86.h
@@ -0,0 +1 @@
+../../../x86/include/asm/bitops.h
\ No newline at end of file
diff --git a/arch/um/include/asm/bitops.h b/arch/um/include/asm/bitops.h
new file mode 100644
index 000000000000..e578c628a6d5
--- /dev/null
+++ b/arch/um/include/asm/bitops.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_UM_BITOPS_H
+#define _ASM_UM_BITOPS_H
+
+#ifdef CONFIG_64BIT
+
+#undef CONFIG_X86_32
+
+#ifndef CONFIG_X86_64
+#define CONFIG_X86_64
+#endif
+
+#else
+#define CONFIG_X86_32
+#endif
+
+#include <asm/bitops-x86.h>
+
+
+#endif