From patchwork Sun Oct 13 23:47:06 2019
X-Patchwork-Submitter: "Guilherme G. Piccoli"
X-Patchwork-Id: 1175926
From: "Guilherme G. Piccoli"
To: kernel-team@lists.ubuntu.com
Subject: [B] [PATCH] md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone
Date: Sun, 13 Oct 2019 20:47:06 -0300
Message-Id: <20191013234707.17334-2-gpiccoli@canonical.com>
In-Reply-To: <20191013234707.17334-1-gpiccoli@canonical.com>
References: <20191013234707.17334-1-gpiccoli@canonical.com>
Cc: gpiccoli@canonical.com, kernel@gpiccoli.net

BugLink: https://bugs.launchpad.net/bugs/1847773

Currently md raid0/linear are not provided with any mechanism to validate
whether an array member was removed or has failed. The driver keeps
sending BIOs regardless of the state of array members, and the kernel
shows state 'clean' in the 'array_state' sysfs attribute. This leads to
the following situation: if a raid0/linear array member is removed while
the array is mounted, a user writing to the array won't realize that
errors are happening unless they check dmesg or perform one fsync per
written file. Despite udev signaling that the member device is gone,
'mdadm' cannot issue the STOP_ARRAY ioctl successfully, given the array
is mounted.

In other words, no -EIO is returned and writes (except direct ones)
appear normal. This means the user might think the written data is
correctly stored in the array, when in fact garbage was written, since
raid0 does striping (and so requires all of its members to be working
in order not to corrupt data). For md/linear, writes to the available
members will work fine, but writes that go to the missing member(s)
cause file corruption, since the portion of the writes destined to the
missing device(s) is never effectively written.

This patch changes this behavior: we check if the block device's gendisk
is UP when submitting the BIO to the array member, and if it isn't, we
flag the md device as MD_BROKEN and fail subsequent I/Os to that device;
a read request to the array requiring data from a valid member is still
completed. While flagging the device as MD_BROKEN, we also show a
rate-limited warning in the kernel log.

A new array state 'broken' was added too: it mimics the state 'clean' in
every aspect, being useful only to distinguish whether the array has a
missing member. We rely on the MD_BROKEN flag to put the array in the
'broken' state. This state cannot be written to 'array_state', since it
only indicates that one or more members are missing while the array
still behaves like 'clean'; it wouldn't make sense to write it.

With this patch, the filesystem reacts much faster to a missing array
member: after some I/O errors, ext4 for instance aborts the journal and
prevents corruption. Without this change, we're able to keep writing to
the disk, and after a machine reboot e2fsck shows severe filesystem
errors that demand fixing. This patch was tested with the ext4 and xfs
filesystems, and requires an 'mdadm' counterpart to handle the 'broken'
state.

Cc: Song Liu
Reviewed-by: NeilBrown
Signed-off-by: Guilherme G. Piccoli
Signed-off-by: Song Liu
(backported from commit 62f7b1989c02feed9274131b2fd5e990de4aba6f)
[gpiccoli:
 - minimal code adjustment in md.c - new kernels have an extra check
   in the if() statement inside array_state_show().
 - context adjustment.]
Signed-off-by: Guilherme G. Piccoli
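For illustration only, not part of this patch: the 'broken' state surfaces
through the existing 'array_state' sysfs attribute, so userspace can detect
the condition without parsing dmesg. A minimal sketch, assuming an array
named md0:

#include <stdio.h>
#include <string.h>

/* Read /sys/block/md0/md/array_state; with this patch applied, a
 * raid0/linear array with a missing member reports "broken" here
 * instead of "clean". */
int main(void)
{
        char state[32] = "";
        FILE *f = fopen("/sys/block/md0/md/array_state", "r");

        if (!f) {
                perror("array_state");
                return 1;
        }
        if (fgets(state, sizeof(state), f))
                printf("md0 array_state: %s", state);
        fclose(f);

        return strncmp(state, "broken", 6) == 0 ? 2 : 0;
}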
---

Hi kernel team, I'd like to request an "exception" here - if possible,
this patch shouldn't go into the Xenial/HWE kernel. There's a counterpart
for this in mdadm that is ideally needed for this patch to be fully
functional; without it, the patch works fine, but mdadm's output differs
from what was intended. I didn't backport the mdadm patch to Xenial due
to regression risk (the mdadm code changed quite a bit between the Xenial
and Bionic versions, and the former is missing infrastructure my patch
needs), so if possible, we could get a revert of this one in Xenial 4.15
after the merge.

Thanks,

Guilherme

 drivers/md/md-linear.c |  5 +++++
 drivers/md/md.c        | 22 ++++++++++++++++++----
 drivers/md/md.h        | 16 ++++++++++++++++
 drivers/md/raid0.c     |  6 ++++++
 4 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md-linear.c b/drivers/md/md-linear.c
index 773fc70dced7..c8fb680bb952 100644
--- a/drivers/md/md-linear.c
+++ b/drivers/md/md-linear.c
@@ -266,6 +266,11 @@ static bool linear_make_request(struct mddev *mddev, struct bio *bio)
                     bio_sector < start_sector))
                goto out_of_bounds;
 
+       if (unlikely(is_mddev_broken(tmp_dev->rdev, "linear"))) {
+               bio_io_error(bio);
+               return true;
+       }
+
        if (unlikely(bio_end_sector(bio) > end_sector)) {
                /* This bio crosses a device boundary, so we have to split it */
                struct bio *split = bio_split(bio, end_sector - bio_sector,
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1c88626b2ec0..324d6116572e 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -319,6 +319,11 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
        unsigned int sectors;
        int cpu;
 
+       if (unlikely(test_bit(MD_BROKEN, &mddev->flags)) && (rw == WRITE)) {
+               bio_io_error(bio);
+               return BLK_QC_T_NONE;
+       }
+
        blk_queue_split(q, &bio);
 
        if (mddev == NULL || mddev->pers == NULL) {
@@ -4106,12 +4111,17 @@ __ATTR_PREALLOC(resync_start, S_IRUGO|S_IWUSR,
  * active-idle
  *     like active, but no writes have been seen for a while (100msec).
  *
+ * broken
+ *     RAID0/LINEAR-only: same as clean, but array is missing a member.
+ *     It's useful because RAID0/LINEAR mounted-arrays aren't stopped
+ *     when a member is gone, so this state will at least alert the
+ *     user that something is wrong.
  */
 enum array_state { clear, inactive, suspended, readonly, read_auto, clean, active,
-                  write_pending, active_idle, bad_word};
+                  write_pending, active_idle, broken, bad_word};
 static char *array_states[] = {
        "clear", "inactive", "suspended", "readonly", "read-auto", "clean", "active",
-       "write-pending", "active-idle", NULL };
+       "write-pending", "active-idle", "broken", NULL };
 
 static int match_word(const char *word, char **list)
 {
@@ -4127,7 +4137,7 @@ array_state_show(struct mddev *mddev, char *page)
 {
        enum array_state st = inactive;
 
-       if (mddev->pers)
+       if (mddev->pers) {
                switch(mddev->ro) {
                case 1:
                        st = readonly;
@@ -4147,7 +4157,10 @@ array_state_show(struct mddev *mddev, char *page)
                                st = active;
                        spin_unlock(&mddev->lock);
                }
-       else {
+
+               if (test_bit(MD_BROKEN, &mddev->flags) && st == clean)
+                       st = broken;
+       } else {
                if (list_empty(&mddev->disks) &&
                    mddev->raid_disks == 0 &&
                    mddev->dev_sectors == 0)
@@ -4261,6 +4274,7 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
                break;
        case write_pending:
        case active_idle:
+       case broken:
                /* these cannot be set */
                break;
        }
diff --git a/drivers/md/md.h b/drivers/md/md.h
index e0e32f9535cc..97ee8fb689db 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -243,6 +243,9 @@ enum mddev_flags {
        MD_UPDATING_SB,         /* md_check_recovery is updating the metadata
                                 * without explicitly holding reconfig_mutex.
                                 */
+       MD_BROKEN,              /* This is used in RAID-0/LINEAR only, to stop
+                                * I/O in case an array member is gone/failed.
+                                */
 };
 
 enum mddev_sb_flags {
@@ -708,6 +711,19 @@ extern void md_update_sb(struct mddev *mddev, int force);
 extern void md_kick_rdev_from_array(struct md_rdev * rdev);
 struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr);
 
+static inline bool is_mddev_broken(struct md_rdev *rdev, const char *md_type)
+{
+       int flags = rdev->bdev->bd_disk->flags;
+
+       if (!(flags & GENHD_FL_UP)) {
+               if (!test_and_set_bit(MD_BROKEN, &rdev->mddev->flags))
+                       pr_warn("md: %s: %s array has a missing/failed member\n",
+                               mdname(rdev->mddev), md_type);
+               return true;
+       }
+       return false;
+}
+
 static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
 {
        int faulty = test_bit(Faulty, &rdev->flags);
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 5ecba9eef441..48d6bb6b18e5 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -590,6 +590,12 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
 
        zone = find_zone(mddev->private, &sector);
        tmp_dev = map_sector(mddev, zone, sector, &sector);
+
+       if (unlikely(is_mddev_broken(tmp_dev, "raid0"))) {
+               bio_io_error(bio);
+               return true;
+       }
+
        bio_set_dev(bio, tmp_dev->bdev);
        bio->bi_iter.bi_sector = sector + zone->dev_start +
                tmp_dev->data_offset;
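For illustration only, not part of this patch: once a member is gone and
MD_BROKEN is set, md_make_request() fails write BIOs at submission, so a
direct write to the array device returns -EIO (buffered writeback is failed
the same way, which is what lets a mounted filesystem abort quickly). A
minimal sketch, assuming the array device node is /dev/md0:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        void *buf;
        int fd = open("/dev/md0", O_WRONLY | O_DIRECT);

        if (fd < 0) {
                perror("open /dev/md0");
                return 1;
        }
        /* O_DIRECT needs an aligned buffer; 4096 covers common sector sizes. */
        if (posix_memalign(&buf, 4096, 4096)) {
                close(fd);
                return 1;
        }
        memset(buf, 0xab, 4096);

        /* With this patch, this write is expected to fail with EIO once the
         * array is flagged MD_BROKEN. */
        if (write(fd, buf, 4096) < 0)
                printf("write failed: %s\n", strerror(errno));

        free(buf);
        close(fd);
        return 0;
}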