From patchwork Thu Oct 29 03:07:32 2020
X-Patchwork-Submitter: Matthew Ruffell
X-Patchwork-Id: 1389820
From: Matthew Ruffell
To: kernel-team@lists.ubuntu.com
Subject: [SRU][F][G][PATCH 4/7] md/raid10: improve raid10 discard request
Date: Thu, 29 Oct 2020 16:07:32 +1300
Message-Id: <20201029030737.21204-6-matthew.ruffell@canonical.com>
X-Mailer: git-send-email 2.27.0
In-Reply-To: <20201029030737.21204-1-matthew.ruffell@canonical.com>
References: <20201029030737.21204-1-matthew.ruffell@canonical.com>
MIME-Version: 1.0

From: Xiao Ni

BugLink: https://bugs.launchpad.net/bugs/1896578

Currently the discard request is split by chunk size, so it takes a long
time to finish mkfs on disks which support the discard function. This
patch improves the handling of raid10 discard requests. It uses a similar
approach to patch 29efc390b ("md/md0: optimize raid0 discard handling"),
but it is a little more complex than raid0, because raid10 has a
different layout.

If raid10 uses the offset layout and the discard request is smaller than
the stripe size, there are holes when we submit the discard bio to the
underlying disks. For example, with five disks (disk1 - disk5):

	D01 D02 D03 D04 D05
	D05 D01 D02 D03 D04
	D06 D07 D08 D09 D10
	D10 D06 D07 D08 D09

Suppose the discard bio only wants to discard from D03 to D10. For disk3,
there is a hole between D03 and D08. For disk4, there is a hole between
D04 and D09. D03 is one chunk, and raid10_write_request can handle one
chunk perfectly, so the part that is not aligned with the stripe size is
still handled by raid10_write_request.

If a reshape is running when the discard bio arrives and the discard bio
spans the reshape position, raid10_write_request is responsible for
handling this discard bio.
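As an aside for reviewers, the grid above can be reproduced with a small
userspace sketch. Everything in it (the disk count, the rotate-by-one
copy rule) is an assumption matching the diagram, not code from the
patch:

	#include <stdio.h>

	#define DISKS 5

	int main(void)
	{
		/* Two data rows and their offset copies: each copy row
		 * repeats the previous data row rotated right by one disk,
		 * as in the diagram above.
		 */
		for (int row = 0; row < 4; row++) {
			int data_row = row / 2;	/* which set of DISKS chunks */
			int rotate = row % 2;	/* copy rows rotate by one */

			for (int col = 0; col < DISKS; col++) {
				int chunk = data_row * DISKS +
					    (col - rotate + DISKS) % DISKS;
				printf("D%02d ", chunk + 1);
			}
			printf("\n");
		}
		return 0;
	}

Reading D03..D10 off this grid shows that the discard cannot be issued
as a single contiguous range per member disk, which is why the unaligned
head and tail are split off and left to raid10_write_request.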
I did a test with this patch set.

Without patch:

time mkfs.xfs /dev/md0
real	4m39.775s
user	0m0.000s
sys	0m0.298s

With patch:

time mkfs.xfs /dev/md0
real	0m0.105s
user	0m0.000s
sys	0m0.007s

nvme3n1     259:1    0  477G  0 disk
└─nvme3n1p1 259:10   0   50G  0 part
nvme4n1     259:2    0  477G  0 disk
└─nvme4n1p1 259:11   0   50G  0 part
nvme5n1     259:6    0  477G  0 disk
└─nvme5n1p1 259:12   0   50G  0 part
nvme2n1     259:9    0  477G  0 disk
└─nvme2n1p1 259:15   0   50G  0 part
nvme0n1     259:13   0  477G  0 disk
└─nvme0n1p1 259:14   0   50G  0 part

Reviewed-by: Coly Li
Reviewed-by: Guoqing Jiang
Signed-off-by: Xiao Ni
Signed-off-by: Song Liu
(backported from commit bcc90d280465ebd51ab8688be86e1f00c62dccf9)
[mruffell: change submit_bio_noacct() to generic_make_request()]
Signed-off-by: Matthew Ruffell
---
 drivers/md/raid10.c | 256 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 255 insertions(+), 1 deletion(-)
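One reading aid before the diff: for the far-offset layout, the patch
trims the discard bio to stripe boundaries, splitting off an unaligned
head (up to the next stripe boundary) and an unaligned tail (past the
last stripe boundary). Here is a minimal sketch of just that arithmetic;
the sector values are hypothetical and the code is not part of the patch:

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		/* stripe_size = raid_disks << chunk_shift, in sectors;
		 * hypothetical values chosen only to show the arithmetic.
		 */
		uint64_t stripe_size = 5 * 1024;
		uint64_t bio_start = 7000, bio_end = 23000;

		uint64_t rem = bio_start % stripe_size;	/* div_u64_rem in the patch */
		if (rem) {
			/* head: handled via raid10_split_bio(..., false) */
			printf("head split: %llu sectors\n",
			       (unsigned long long)(stripe_size - rem));
			bio_start += stripe_size - rem;
		}

		rem = bio_end % stripe_size;
		if (rem) {
			/* tail: handled via raid10_split_bio(..., true) */
			printf("tail split: %llu sectors\n",
			       (unsigned long long)rem);
			bio_end -= rem;
		}

		printf("stripe-aligned region: sectors %llu..%llu\n",
		       (unsigned long long)bio_start,
		       (unsigned long long)bio_end);
		return 0;
	}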
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 51c483d71562..e67ecd1067a1 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1533,6 +1533,256 @@ static void __make_request(struct mddev *mddev, struct bio *bio, int sectors)
 	raid10_write_request(mddev, bio, r10_bio);
 }
 
+static struct bio *raid10_split_bio(struct r10conf *conf,
+			struct bio *bio, sector_t sectors, bool want_first)
+{
+	struct bio *split;
+
+	split = bio_split(bio, sectors, GFP_NOIO, &conf->bio_split);
+	bio_chain(split, bio);
+	allow_barrier(conf);
+	if (want_first) {
+		generic_make_request(bio);
+		bio = split;
+	} else
+		generic_make_request(split);
+	wait_barrier(conf);
+
+	return bio;
+}
+
+static void raid10_end_discard_request(struct bio *bio)
+{
+	struct r10bio *r10_bio = bio->bi_private;
+	struct r10conf *conf = r10_bio->mddev->private;
+	struct md_rdev *rdev = NULL;
+	int dev;
+	int slot, repl;
+
+	/*
+	 * We don't care about the return value of the discard bio
+	 */
+	if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
+		set_bit(R10BIO_Uptodate, &r10_bio->state);
+
+	dev = find_bio_disk(conf, r10_bio, bio, &slot, &repl);
+	if (repl)
+		rdev = conf->mirrors[dev].replacement;
+	if (!rdev) {
+		/* raid10_remove_disk uses smp_mb to make sure rdev is set to
+		 * replacement before setting replacement to NULL. It can read
+		 * rdev first without barrier protection even if replacement
+		 * is NULL
+		 */
+		smp_rmb();
+		rdev = conf->mirrors[dev].rdev;
+	}
+
+	if (atomic_dec_and_test(&r10_bio->remaining)) {
+		md_write_end(r10_bio->mddev);
+		raid_end_bio_io(r10_bio);
+	}
+
+	rdev_dec_pending(rdev, conf->mddev);
+}
+
+/* There are some limitations to handling a discard bio:
+ * 1st, the discard size must be bigger than stripe_size*2.
+ * 2nd, if the discard bio spans the reshape position, we use the old way
+ * to handle the discard bio
+ */
+static int raid10_handle_discard(struct mddev *mddev, struct bio *bio)
+{
+	struct r10conf *conf = mddev->private;
+	struct geom *geo = &conf->geo;
+	struct r10bio *r10_bio;
+
+	int disk;
+	sector_t chunk;
+	unsigned int stripe_size;
+	sector_t split_size;
+
+	sector_t bio_start, bio_end;
+	sector_t first_stripe_index, last_stripe_index;
+	sector_t start_disk_offset;
+	unsigned int start_disk_index;
+	sector_t end_disk_offset;
+	unsigned int end_disk_index;
+	unsigned int remainder;
+
+	if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
+		return -EAGAIN;
+
+	wait_barrier(conf);
+
+	/* Check reshape again to avoid a reshape happening after checking
+	 * MD_RECOVERY_RESHAPE and before wait_barrier
+	 */
+	if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
+		goto out;
+
+	stripe_size = geo->raid_disks << geo->chunk_shift;
+	bio_start = bio->bi_iter.bi_sector;
+	bio_end = bio_end_sector(bio);
+
+	/* Maybe one discard bio is smaller than the stripe size, or crosses
+	 * a stripe boundary while the discard region is larger than one
+	 * stripe size. For the far offset layout, if the discard region is
+	 * not aligned with the stripe size, there are holes when we submit
+	 * the discard bio to the member disks. For simplicity, we only
+	 * handle discard bios whose discard region is bigger than
+	 * stripe_size*2
+	 */
+	if (bio_sectors(bio) < stripe_size*2)
+		goto out;
+
+	/* For the far offset layout, if the bio is not aligned with the
+	 * stripe size, split off the parts that are not aligned with the
+	 * stripe size.
+	 */
+	div_u64_rem(bio_start, stripe_size, &remainder);
+	if (geo->far_offset && remainder) {
+		split_size = stripe_size - remainder;
+		bio = raid10_split_bio(conf, bio, split_size, false);
+	}
+	div_u64_rem(bio_end, stripe_size, &remainder);
+	if (geo->far_offset && remainder) {
+		split_size = bio_sectors(bio) - remainder;
+		bio = raid10_split_bio(conf, bio, split_size, true);
+	}
+
+	r10_bio = mempool_alloc(&conf->r10bio_pool, GFP_NOIO);
+	r10_bio->mddev = mddev;
+	r10_bio->state = 0;
+	r10_bio->sectors = 0;
+	memset(r10_bio->devs, 0, sizeof(r10_bio->devs[0]) * geo->raid_disks);
+
+	wait_blocked_dev(mddev, r10_bio);
+
+	r10_bio->master_bio = bio;
+
+	bio_start = bio->bi_iter.bi_sector;
+	bio_end = bio_end_sector(bio);
+
+	/* raid10 uses a chunk as the unit to store data, similar to raid0.
+	 * One stripe contains the chunks from all member disks (one chunk
+	 * from one disk at the same HBA address). For layout details, see
+	 * 'man md 4'
+	 */
+	chunk = bio_start >> geo->chunk_shift;
+	chunk *= geo->near_copies;
+	first_stripe_index = chunk;
+	start_disk_index = sector_div(first_stripe_index, geo->raid_disks);
+	if (geo->far_offset)
+		first_stripe_index *= geo->far_copies;
+	start_disk_offset = (bio_start & geo->chunk_mask) +
+			    (first_stripe_index << geo->chunk_shift);
+
+	chunk = bio_end >> geo->chunk_shift;
+	chunk *= geo->near_copies;
+	last_stripe_index = chunk;
+	end_disk_index = sector_div(last_stripe_index, geo->raid_disks);
+	if (geo->far_offset)
+		last_stripe_index *= geo->far_copies;
+	end_disk_offset = (bio_end & geo->chunk_mask) +
+			  (last_stripe_index << geo->chunk_shift);
+
+	rcu_read_lock();
+	for (disk = 0; disk < geo->raid_disks; disk++) {
+		struct md_rdev *rdev = rcu_dereference(conf->mirrors[disk].rdev);
+		struct md_rdev *rrdev = rcu_dereference(
+			conf->mirrors[disk].replacement);
+
+		r10_bio->devs[disk].bio = NULL;
+		r10_bio->devs[disk].repl_bio = NULL;
+
+		if (rdev && (test_bit(Faulty, &rdev->flags)))
+			rdev = NULL;
+		if (rrdev && (test_bit(Faulty, &rrdev->flags)))
+			rrdev = NULL;
+		if (!rdev && !rrdev)
+			continue;
+
+		if (rdev) {
+			r10_bio->devs[disk].bio = bio;
+			atomic_inc(&rdev->nr_pending);
+		}
+		if (rrdev) {
+			r10_bio->devs[disk].repl_bio = bio;
+			atomic_inc(&rrdev->nr_pending);
+		}
+	}
+	rcu_read_unlock();
+
+	atomic_set(&r10_bio->remaining, 1);
+	for (disk = 0; disk < geo->raid_disks; disk++) {
+		sector_t dev_start, dev_end;
+		struct bio *mbio, *rbio = NULL;
+		struct md_rdev *rdev = rcu_dereference(conf->mirrors[disk].rdev);
+		struct md_rdev *rrdev = rcu_dereference(
+			conf->mirrors[disk].replacement);
+
+		/*
+		 * Now calculate the start and end addresses for each disk.
+		 * The space between dev_start and dev_end is the discard
+		 * region.
+		 *
+		 * For dev_start, it needs to consider three conditions:
+		 * 1st, the disk is before start_disk_index: imagine the disk
+		 * in the next stripe, so dev_start is the start address of
+		 * the next stripe.
+		 * 2nd, the disk is after start_disk_index: the disk is in
+		 * the same stripe as the first disk
+		 * 3rd, the first disk itself: we can use start_disk_offset
+		 * directly
+		 */
+		if (disk < start_disk_index)
+			dev_start = (first_stripe_index + 1) * mddev->chunk_sectors;
+		else if (disk > start_disk_index)
+			dev_start = first_stripe_index * mddev->chunk_sectors;
+		else
+			dev_start = start_disk_offset;
+
+		if (disk < end_disk_index)
+			dev_end = (last_stripe_index + 1) * mddev->chunk_sectors;
+		else if (disk > end_disk_index)
+			dev_end = last_stripe_index * mddev->chunk_sectors;
+		else
+			dev_end = end_disk_offset;
+
+		/* This only handles discard bios whose size is >= the stripe
+		 * size, so dev_end > dev_start all the time
+		 */
+		if (r10_bio->devs[disk].bio) {
+			mbio = bio_clone_fast(bio, GFP_NOIO, &mddev->bio_set);
+			mbio->bi_end_io = raid10_end_discard_request;
+			mbio->bi_private = r10_bio;
+			r10_bio->devs[disk].bio = mbio;
+			r10_bio->devs[disk].devnum = disk;
+			atomic_inc(&r10_bio->remaining);
+			md_submit_discard_bio(mddev, rdev, mbio,
+					dev_start + choose_data_offset(r10_bio, rdev),
+					dev_end - dev_start);
+			bio_endio(mbio);
+		}
+		if (r10_bio->devs[disk].repl_bio) {
+			rbio = bio_clone_fast(bio, GFP_NOIO, &mddev->bio_set);
+			rbio->bi_end_io = raid10_end_discard_request;
+			rbio->bi_private = r10_bio;
+			r10_bio->devs[disk].repl_bio = rbio;
+			r10_bio->devs[disk].devnum = disk;
+			atomic_inc(&r10_bio->remaining);
+			md_submit_discard_bio(mddev, rrdev, rbio,
+					dev_start + choose_data_offset(r10_bio, rrdev),
+					dev_end - dev_start);
+			bio_endio(rbio);
+		}
+	}
+
+	if (atomic_dec_and_test(&r10_bio->remaining)) {
+		md_write_end(r10_bio->mddev);
+		raid_end_bio_io(r10_bio);
+	}
+
+	return 0;
+out:
+	allow_barrier(conf);
+	return -EAGAIN;
+}
+
 static bool raid10_make_request(struct mddev *mddev, struct bio *bio)
 {
 	struct r10conf *conf = mddev->private;
@@ -1547,6 +1797,10 @@ static bool raid10_make_request(struct mddev *mddev, struct bio *bio)
 	if (!md_write_start(mddev, bio))
 		return false;
 
+	if (unlikely(bio_op(bio) == REQ_OP_DISCARD))
+		if (!raid10_handle_discard(mddev, bio))
+			return true;
+
 	/*
 	 * If this request crosses a chunk boundary, we need to split
 	 * it.
@@ -3777,7 +4031,7 @@ static int raid10_run(struct mddev *mddev)
 	chunk_size = mddev->chunk_sectors << 9;
 	if (mddev->queue) {
 		blk_queue_max_discard_sectors(mddev->queue,
-						mddev->chunk_sectors);
+						UINT_MAX);
 		blk_queue_max_write_same_sectors(mddev->queue, 0);
 		blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
 		blk_queue_io_min(mddev->queue, chunk_size);
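One note on the completion accounting in raid10_handle_discard above:
r10_bio->remaining starts at 1 (the submitter's own reference) and is
incremented once per cloned per-disk bio, so raid_end_bio_io runs exactly
once no matter how the per-disk completions interleave with the
submission loop. A minimal userspace analogy of this pattern
(illustrative only; it uses C11 atomics in place of the kernel's
atomic_t and atomic_dec_and_test):

	#include <stdatomic.h>
	#include <stdio.h>

	static atomic_int remaining;

	/* Called once per sub-request completion and once by the submitter;
	 * whoever drops the count to zero finishes the parent request.
	 */
	static void put_request(const char *who)
	{
		if (atomic_fetch_sub(&remaining, 1) == 1)
			printf("%s: last reference dropped, finish parent\n",
			       who);
	}

	int main(void)
	{
		atomic_init(&remaining, 1);	/* submitter's own reference */

		for (int disk = 0; disk < 3; disk++) {
			atomic_fetch_add(&remaining, 1);   /* one ref per sub-bio */
			put_request("sub-bio endio");      /* may complete early */
		}

		put_request("submitter");	/* drop the initial reference */
		return 0;
	}

Because the submitter holds its own reference until the loop is done,
early completions of individual per-disk bios can never finish the
parent r10_bio prematurely.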