Message ID | 160674929554.2492771.17651548703390170573.stgit@lep8c.aus.stglabs.ibm.com |
---|---|
Series | spapr: nvdimm: Asynchronus flush hcall support |
On Mon, 30 Nov 2020 09:16:14 -0600
Shivaprasad G Bhat <sbhat@linux.ibm.com> wrote:

> The nvdimm devices are expected to ensure write persistence during
> power-failure scenarios.
>
> libpmem uses architecture-specific instructions, such as dcbf on POWER,
> to flush cached data to the backend nvdimm device during normal writes.
>
> QEMU virtual nvdimm devices are memory mapped. A dcbf in the guest does
> not translate to an actual flush of the backend file on the host in the
> case of file-backed vNVDIMMs. On x86_64 this is addressed by virtio-pmem,
> which issues asynchronous flushes.
>
> On PAPR, the issue is addressed by adding a new hcall through which the
> guest ndctl driver requests an explicit asynchronous flush when the
> backend nvdimm cannot ensure write persistence with dcbf alone. The
> approach here is to convey, via a device tree property, when the
> asynchronous flush is required; the guest makes the hcall when the
> property is present instead of relying on dcbf.
>
> The first patch adds the necessary asynchronous hcall infrastructure at
> the DRC level. The second patch implements the flush hcall using that
> infrastructure.
>
> The hcall semantics are still in review and not final.
>
> A new device property, sync-dax, is added to the nvdimm device. When
> sync-dax is off (the default), the asynchronous hcalls are used.
>
> With respect to saving on new QEMU and restoring on old QEMU, having
> sync-dax off by default (when not specified) causes I/O errors in the
> guest because the async hcall is not supported on old QEMU. Since the
> new hcall implementation is only supported on the new pseries machine
> version, the current machine version checks may be sufficient to prevent
> such migration. Please suggest what should be done.
>

First, all requests that are still not completed from the guest POV, i.e.
the hcall hasn't returned H_SUCCESS yet, are state that we should migrate
in theory. In this case, I guess we'd rather drain all pending requests on
the source in some pre-save handler.

Then, as explained in another mail, you should enforce stable behavior for
existing machine types with some hw_compat magic.

> The demonstration below shows the map_sync behavior with sync-dax on and
> off, using
> (https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)
>
> pmem0 is from the nvdimm with sync-dax=on, and pmem1 is from the nvdimm
> with sync-dax=off, mounted as
> /dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
>
> [root@atest-guest ~]# ./mapsync /mnt1/newfile    ----> when sync-dax=on
> [root@atest-guest ~]# ./mapsync /mnt2/newfile    ----> when sync-dax=off
> Failed to mmap with Operation not supported
>
> ---
> v1 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg06330.html
> Changes from v1:
> - Fixed a missed-out unlock
> - Use QLIST_FOREACH instead of QLIST_FOREACH_SAFE while generating the token
>
> Shivaprasad G Bhat (2):
>       spapr: drc: Add support for async hcalls at the drc level
>       spapr: nvdimm: Implement async flush hcalls
>
>
>  hw/mem/nvdimm.c            |    1
>  hw/ppc/spapr_drc.c         |  146 ++++++++++++++++++++++++++++++++++++++++++++
>  hw/ppc/spapr_nvdimm.c      |   79 ++++++++++++++++++++++++
>  include/hw/mem/nvdimm.h    |   10 +++
>  include/hw/ppc/spapr.h     |    3 +
>  include/hw/ppc/spapr_drc.h |   25 ++++++++
>  6 files changed, 263 insertions(+), 1 deletion(-)
>
> --
> Signature
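
As background for the dcbf/libpmem discussion above, here is a minimal
userspace sketch, not code from this series, of how a libpmem (PMDK)
application typically persists a write: pmem_persist() uses CPU cache
flush instructions (dcbf on POWER), while pmem_msync() falls back to
syncing the backing file. The file path is a placeholder.

/* Sketch only: persist a write either with CPU cache flushes
 * (pmem_persist -> dcbf on POWER) or by msync of the backing file
 * (pmem_msync). The path "/mnt1/newfile" is a placeholder. */
#include <libpmem.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t mapped_len;
    int is_pmem;
    char *addr = pmem_map_file("/mnt1/newfile", 4096,
                               PMEM_FILE_CREATE, 0666,
                               &mapped_len, &is_pmem);
    if (!addr) {
        perror("pmem_map_file");
        return 1;
    }

    strcpy(addr, "hello");

    if (is_pmem) {
        /* pmem path: flush CPU caches only. For a file-backed virtual
         * nvdimm this does not reach the host file, which is the gap
         * the async flush hcall is meant to close. */
        pmem_persist(addr, mapped_len);
    } else {
        /* Not pmem: fall back to msync() of the backing file. */
        pmem_msync(addr, mapped_len);
    }

    pmem_unmap(addr, mapped_len);
    return 0;
}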
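
For the MAP_SYNC demonstration, the standalone probe below is a rough
approximation of what the linked map_sync.c test does, assuming a Linux
guest; it is not the exact test from the avocado-misc-tests repository.
It illustrates why the mmap fails with "Operation not supported" when the
region cannot promise synchronous fault-time persistence.

/* Probe whether a DAX file accepts a MAP_SYNC mapping: this succeeds
 * when the kernel can guarantee synchronous persistence (the
 * sync-dax=on case above) and fails with EOPNOTSUPP otherwise. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file on a dax mount>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (ftruncate(fd, 4096) < 0) {
        perror("ftruncate");
        return 1;
    }

    /* MAP_SYNC is only valid together with MAP_SHARED_VALIDATE. */
    void *addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (addr == MAP_FAILED) {
        perror("Failed to mmap");
        close(fd);
        return 1;
    }

    munmap(addr, 4096);
    close(fd);
    return 0;
}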