From patchwork Fri Feb 28 14:35:05 2014
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Peter Lieven
X-Patchwork-Id: 325235
Return-Path:
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Received: from lists.gnu.org (lists.gnu.org [IPv6:2001:4830:134:3::11])
 (using TLSv1 with cipher AES256-SHA (256/256 bits))
 (No client certificate requested)
 by ozlabs.org (Postfix) with ESMTPS id 8A2912C00B2
 for ; Sat, 1 Mar 2014 01:35:57 +1100 (EST)
Received: from localhost ([::1]:51536 helo=lists.gnu.org)
 by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
 id 1WJOXb-0006TU-2z
 for incoming@patchwork.ozlabs.org; Fri, 28 Feb 2014 09:35:55 -0500
Received: from eggs.gnu.org ([2001:4830:134:3::10]:57410)
 by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
 id 1WJOXA-0006Sr-2T
 for qemu-devel@nongnu.org; Fri, 28 Feb 2014 09:35:33 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from ) id 1WJOX4-0005r0-Dm
 for qemu-devel@nongnu.org; Fri, 28 Feb 2014 09:35:27 -0500
Received: from mx.ipv6.kamp.de ([2a02:248:0:51::16]:47328 helo=mx01.kamp.de)
 by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
 id 1WJOX3-0005qE-UM
 for qemu-devel@nongnu.org; Fri, 28 Feb 2014 09:35:22 -0500
Received: (qmail 20060 invoked by uid 89); 28 Feb 2014 14:35:19 -0000
Received: from [195.62.97.28] by client-16-kamp (envelope-from , uid 89)
 with qmail-scanner-2010/03/19-MF
 (clamdscan: 0.98.1/18523. hbedv: 8.2.14.18/7.11.133.254. spamassassin: 3.3.1.
 Clear:RC:1(195.62.97.28):SA:0(-2.0/5.0):.
 Processed in 12.419547 secs); 28 Feb 2014 14:35:19 -0000
Received: from smtp.kamp.de (HELO submission.kamp.de) ([195.62.97.28])
 by mx01.kamp.de with SMTP; 28 Feb 2014 14:35:06 -0000
X-GL_Whitelist: yes
Received: (qmail 28571 invoked from network); 28 Feb 2014 14:35:04 -0000
Received: from lieven-pc.kamp-intra.net (HELO ?172.21.12.60?)
 (pl@kamp.de@172.21.12.60)
 by submission.kamp.de with ESMTPS (DHE-RSA-AES128-SHA encrypted) ESMTPA;
 28 Feb 2014 14:35:04 -0000
Message-ID: <53109E99.3020102@kamp.de>
Date: Fri, 28 Feb 2014 15:35:05 +0100
From: Peter Lieven
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101
 Thunderbird/24.3.0
MIME-Version: 1.0
To: Stefan Hajnoczi
References: <530DBE6C.5030502@kamp.de>
 <20140226154154.GB20820@stefanha-thinkpad.muc.redhat.com>
 <530E0FF0.20501@kamp.de>
 <20140227085711.GC21749@stefanha-thinkpad.redhat.com>
In-Reply-To: <20140227085711.GC21749@stefanha-thinkpad.redhat.com>
X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address
 (bad octet value).
X-Received-From: 2a02:248:0:51::16
Cc: Kevin Wolf, Stefan Hajnoczi, "qemu-devel@nongnu.org", Paolo Bonzini
Subject: Re: [Qemu-devel] qemu-img convert cache mode for source
X-BeenThere: qemu-devel@nongnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org
Sender: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org

On 27.02.2014 09:57, Stefan Hajnoczi wrote:
> On Wed, Feb 26, 2014 at 05:01:52PM +0100, Peter Lieven wrote:
>> On 26.02.2014 16:41, Stefan Hajnoczi wrote:
>>> On Wed, Feb 26, 2014 at 11:14:04AM +0100, Peter Lieven wrote:
>>>> I was wondering if it would be a good idea to set the O_DIRECT mode for the source
>>>> files of a qemu-img convert process if the source is a host_device?
>>>>
>>>> Currently the backup of a host device is polluting the page cache.
>>> Points to consider:
>>>
>>> 1. O_DIRECT does not work on Linux tmpfs, you get EINVAL when opening
>>> the file. A fallback is necessary.
>>>
>>> 2. O_DIRECT has no readahead so performance could actually decrease.
>>> The question is, how important is readahead versus polluting page
>>> cache?
>>>
>>> 3. For raw files it would make sense to tell the kernel that access is
>>> sequential and data will be used only once. Then we can get the best
>>> of both worlds (avoid polluting page cache but still get readahead).
>>> This is done using posix_fadvise(2).
>>>
>>> The problem is what to do for image formats. An image file can be
>>> very fragmented so the readahead might not be a win. Does this mean
>>> that for image formats we should tell the kernel access will be
>>> random?
>>>
>>> Furthermore, maybe it's best to do readahead inside QEMU so that even
>>> network protocols (nbd, iscsi, etc) can get good performance. They
>>> act like O_DIRECT is always on.
>> your comments are regarding qemu-img convert, right?
>> How would you implement this? A new open flag because
>> the fadvise had to goto inside the protocol driver.
>>
>> I would start with host_devices first and see how it performs there.
>>
>> For qemu-img convert I would issue a FADV_DONTNEED after
>> a write for the bytes that have been written
>> (i have tested this with Linux and it seems to work quite well).
>>
>> Question is, what is the right parameter for reads? Also FADV_DONTNEED?
> I think so but this should be justified with benchmark results.

I ran some benchmarks and found that a FADV_DONTNEED issued after a read
does not hurt performance. But it keeps the buffers from growing while I
read from a host_device or raw file.

For writes it only works if I issue an fdatasync after each write, but
this should be equivalent to O_DIRECT. So I would limit the patch to
qemu-img convert sources that are a host_device or file.

Here is a proposal for a patch:

...........................................................

KAMP Netzwerkdienste GmbH
Vestische Str. 89-91 | 46117 Oberhausen
Tel: +49 (0) 208.89 402-50 | Fax: +49 (0) 208.89 402-40
pl@kamp.de | http://www.kamp.de

Geschäftsführer: Heiner Lante | Michael Lante
Amtsgericht Duisburg | HRB Nr.
12154 USt-Id-Nr.: DE 120607556
...........................................................

diff --git a/block.c b/block.c
index 2fd5482..2445433 100644
--- a/block.c
+++ b/block.c
@@ -2626,6 +2626,14 @@ static int bdrv_prwv_co(BlockDriverState *bs, int64_t offset,
             qemu_aio_wait();
         }
     }
+
+#ifdef POSIX_FADV_DONTNEED
+    if (!rwco.ret && bs->open_flags & BDRV_O_SEQUENTIAL &&
+        bs->drv->bdrv_fadvise && !is_write) {
+        bs->drv->bdrv_fadvise(bs, offset, qiov->size, POSIX_FADV_DONTNEED);
+    }
+#endif
+
     return rwco.ret;
 }
 
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 161ea14..d8d78d8 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -1397,6 +1397,12 @@ static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
     return 0;
 }
 
+static int raw_fadvise(BlockDriverState *bs, off_t offset, off_t len, int advise)
+{
+    BDRVRawState *s = bs->opaque;
+    return posix_fadvise(s->fd, offset, len, advise);
+}
+
 static QEMUOptionParameter raw_create_options[] = {
     {
         .name = BLOCK_OPT_SIZE,
@@ -1433,6 +1439,7 @@ static BlockDriver bdrv_file = {
     .bdrv_get_info = raw_get_info,
     .bdrv_get_allocated_file_size
                         = raw_get_allocated_file_size,
+    .bdrv_fadvise = raw_fadvise,
 
     .create_options = raw_create_options,
 };
@@ -1811,6 +1818,7 @@ static BlockDriver bdrv_host_device = {
     .bdrv_get_info = raw_get_info,
     .bdrv_get_allocated_file_size
                         = raw_get_allocated_file_size,
+    .bdrv_fadvise = raw_fadvise,
 
     /* generic scsi device */
 #ifdef __linux__
diff --git a/block/raw_bsd.c b/block/raw_bsd.c
index 01ea692..f09bc70 100644
--- a/block/raw_bsd.c
+++ b/block/raw_bsd.c
@@ -171,6 +171,15 @@ static int raw_probe(const uint8_t *buf, int buf_size, const char *filename)
     return 1;
 }
 
+static int raw_fadvise(BlockDriverState *bs, off_t offset, off_t len, int advise)
+{
+    if (bs->file->drv->bdrv_fadvise) {
+        return bs->file->drv->bdrv_fadvise(bs->file, offset, len, advise);
+    }
+    return 0;
+}
+
+
 static BlockDriver bdrv_raw = {
     .format_name = "raw",
     .bdrv_probe = &raw_probe,
@@ -195,7 +204,8 @@ static BlockDriver bdrv_raw = {
     .bdrv_ioctl = &raw_ioctl,
     .bdrv_aio_ioctl = &raw_aio_ioctl,
     .create_options = &raw_create_options[0],
-    .bdrv_has_zero_init = &raw_has_zero_init
+    .bdrv_has_zero_init = &raw_has_zero_init,
+    .bdrv_fadvise = &raw_fadvise,
 };
 
 static void bdrv_raw_init(void)
diff --git a/include/block/block.h b/include/block/block.h
index 780f48b..a4dcc3c 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -105,6 +105,9 @@ typedef enum {
 #define BDRV_O_PROTOCOL    0x8000  /* if no block driver is explicitly given:
                                       select an appropriate protocol driver,
                                       ignoring the format layer */
+#define BDRV_O_SEQUENTIAL 0x10000 /* open device for sequential read/write */
+
+
 
 #define BDRV_O_CACHE_MASK  (BDRV_O_NOCACHE | BDRV_O_CACHE_WB | BDRV_O_NO_FLUSH)
 
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 0bcf1c9..7efad55 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -246,6 +246,8 @@ struct BlockDriver {
      * zeros, 0 otherwise.
      */
     int (*bdrv_has_zero_init)(BlockDriverState *bs);
+
+    int (*bdrv_fadvise)(BlockDriverState *bs, off_t offset, off_t len, int advise);
 
     QLIST_ENTRY(BlockDriver) list;
 };
diff --git a/qemu-img.c b/qemu-img.c
index 78fc868..2b900d0 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1298,7 +1298,8 @@ static int img_convert(int argc, char **argv)
 
     total_sectors = 0;
     for (bs_i = 0; bs_i < bs_n; bs_i++) {
-        bs[bs_i] = bdrv_new_open(argv[optind + bs_i], fmt, BDRV_O_FLAGS, true,
+        bs[bs_i] = bdrv_new_open(argv[optind + bs_i], fmt,
+                                 BDRV_O_FLAGS | BDRV_O_SEQUENTIAL, true,
                                  quiet);
         if (!bs[bs_i]) {
             error_report("Could not open '%s'", argv[optind + bs_i]);
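
For anyone who wants to try the read-side behaviour outside of QEMU, here is
a minimal standalone sketch of the idea discussed above: read a file
sequentially and issue POSIX_FADV_DONTNEED for each range once it has been
read. It is not part of the patch. The helper name read_sequential_dontneed()
and CHUNK_SIZE are made up for this illustration, and the advice is treated as
best effort, so a failing posix_fadvise() is ignored.

```c
/* Standalone sketch, not QEMU code. Names below are hypothetical. */
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600   /* expose pread() and posix_fadvise() */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK_SIZE (1024 * 1024)  /* assumed chunk size for this sketch */

/*
 * Read the file at 'path' from start to end. After each successful read,
 * tell the kernel that the range just read will not be needed again, so
 * its pages can be dropped from the page cache.
 * Returns the number of bytes read, or -1 on error.
 */
static long long read_sequential_dontneed(const char *path)
{
    long long total = 0;
    off_t offset = 0;
    char *buf;
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    buf = malloc(CHUNK_SIZE);
    if (!buf) {
        close(fd);
        return -1;
    }

    for (;;) {
        ssize_t n = pread(fd, buf, CHUNK_SIZE, offset);
        if (n < 0) {
            if (errno == EINTR) {
                continue;           /* interrupted, retry the same range */
            }
            total = -1;
            break;
        }
        if (n == 0) {
            break;                  /* end of file */
        }
#ifdef POSIX_FADV_DONTNEED
        /* Advisory only: the return value is deliberately ignored. */
        (void)posix_fadvise(fd, offset, n, POSIX_FADV_DONTNEED);
#endif
        offset += n;
        total += n;
    }

    free(buf);
    close(fd);
    return total;
}
```

Calling read_sequential_dontneed() on a large raw file while watching the
"Cached" value in /proc/meminfo should show whether the page cache still grows
during the read. This only exercises the read path; as noted above, the write
side behaved differently in my tests.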