From patchwork Fri Oct 21 17:11:44 2016
X-Patchwork-Submitter: Alex Williamson
X-Patchwork-Id: 685207
From: Alex Williamson
To: qemu-devel@nongnu.org
Date: Fri, 21 Oct 2016 11:11:44 -0600
Message-ID: <20161021171100.18049.96340.stgit@gimli.home>
Subject: [Qemu-devel] [RFC PATCH] memory: Don't use memcpy for ram marked as skip_dump
Cc: pbonzini@redhat.com, thorsten.kohfeldt@gmx.de

With a vfio assigned device we lay down a base MemoryRegion registered
as an IO region, giving us read & write accessors.  If the region
supports mmap, we lay down a higher priority sub-region MemoryRegion
on top of the base layer, initialized as a RAM pointer to the mmap.
Finally, if we have any quirks for the device (i.e. address ranges
that need additional virtualization support), we put another IO
sub-region on top of the mmap MemoryRegion.

When this is flattened, we potentially have sub-page mmap
MemoryRegions exposed which cannot be directly mapped through KVM.
This is as expected, but a subtle detail is that we end up with two
different access mechanisms through QEMU.  If we disable the mmap
MemoryRegion, we make use of the IO MemoryRegion and service accesses
using pread and pwrite on the vfio device file descriptor.  If the
mmap MemoryRegion is enabled and the access lands in one of these
sub-page gaps, QEMU handles the access as RAM, using memcpy to the
mmap.

Using the mmap through QEMU is a subtle difference, but it's fine; the
problem is the memcpy.  My assumption is that memcpy makes no
guarantees about access width and potentially uses all sorts of
optimized memory transfers that are not intended for talking to
device MMIO.
It turns out that this has been a problem for Realtek NIC assignment,
which has such a quirk that creates a sub-page mmap MemoryRegion
access.

My proposal to fix this is to leverage the skip_dump flag that we
already use for special handling of these device-backed MMIO ranges.
When skip_dump is set for a MemoryRegion, we mark memory access as
non-direct and automatically insert MemoryRegionOps with basic
semantics to handle accesses.  Note that we only enable accesses up to
dword (4-byte) width because some devices don't particularly like
qword accesses (Realtek NICs are such a device).  This also fixes
memory inspection via the xp command in the QEMU monitor.

Please comment.  Is this the best way to solve this problem?  Thanks,

Reported-by: Thorsten Kohfeldt
Signed-off-by: Alex Williamson
---
 include/exec/memory.h |    6 ++++--
 memory.c              |   44 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 10d7eac..a4c3acf 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1464,9 +1464,11 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr);
 static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_write)
 {
     if (is_write) {
-        return memory_region_is_ram(mr) && !mr->readonly;
+        return memory_region_is_ram(mr) &&
+               !mr->readonly && !memory_region_is_skip_dump(mr);
     } else {
-        return memory_region_is_ram(mr) || memory_region_is_romd(mr);
+        return (memory_region_is_ram(mr) && !memory_region_is_skip_dump(mr)) ||
+               memory_region_is_romd(mr);
     }
 }
 
diff --git a/memory.c b/memory.c
index 58f9269..7ed7ca9 100644
--- a/memory.c
+++ b/memory.c
@@ -1136,6 +1136,46 @@ const MemoryRegionOps unassigned_mem_ops = {
     .endianness = DEVICE_NATIVE_ENDIAN,
 };
 
+static uint64_t skip_dump_mem_read(void *opaque, hwaddr addr, unsigned size)
+{
+    uint64_t val = (uint64_t)~0;
+
+    switch (size) {
+    case 1:
+        val = *(uint8_t *)(opaque + addr);
+        break;
+    case 2:
+        val = *(uint16_t *)(opaque + addr);
+        break;
+    case 4:
+        val = *(uint32_t *)(opaque + addr);
+        break;
+    }
+
+    return val;
+}
+
+static void skip_dump_mem_write(void *opaque, hwaddr addr,
+                                uint64_t data, unsigned size)
+{
+    switch (size) {
+    case 1:
+        *(uint8_t *)(opaque + addr) = (uint8_t)data;
+        break;
+    case 2:
+        *(uint16_t *)(opaque + addr) = (uint16_t)data;
+        break;
+    case 4:
+        *(uint32_t *)(opaque + addr) = (uint32_t)data;
+        break;
+    }
+}
+
+const MemoryRegionOps skip_dump_mem_ops = {
+    .read = skip_dump_mem_read,
+    .write = skip_dump_mem_write,
+    .endianness = DEVICE_NATIVE_ENDIAN,
+};
+
 bool memory_region_access_valid(MemoryRegion *mr,
                                 hwaddr addr,
                                 unsigned size,
@@ -1366,6 +1406,10 @@ void memory_region_init_ram_ptr(MemoryRegion *mr,
 void memory_region_set_skip_dump(MemoryRegion *mr)
 {
     mr->skip_dump = true;
+    if (mr->ram && mr->ops == &unassigned_mem_ops) {
+        mr->ops = &skip_dump_mem_ops;
+        mr->opaque = mr->ram_block->host;
+    }
 }
 
 void memory_region_init_alias(MemoryRegion *mr,