From patchwork Wed Apr 16 05:00:10 2014
X-Patchwork-Submitter: Amit Sahrawat
X-Patchwork-Id: 339431
Subject: Re: Ext4: deadlock occurs when running fsstress and ENOSPC errors are seen.
From: Amit Sahrawat
To: "Theodore Ts'o", Amit Sahrawat, Jan Kara, darrick.wong@oracle.com,
    linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org, LKML, Namjae Jeon
Date: Wed, 16 Apr 2014 10:30:10 +0530
In-Reply-To: <20140415124743.GD3403@thunk.org>
References: <20140415124743.GD3403@thunk.org>
X-Mailing-List: linux-ext4@vger.kernel.org

Thanks Ted, for the detailed reply. We could not retain the original HDD on
which we got the issue.
In order to replicate the problem, we did the following:

1) Made a 250MB partition.
2) Formatted the partition with block size 4K.
3) Using dumpe2fs, obtained the block group information:

Group 0: (Blocks 0-32767)
  Checksum 0x89ae, unused inodes 29109
  Primary superblock at 0, Group descriptors at 1-1
  Reserved GDT blocks at 2-15
  Block bitmap at 16 (+16), Inode bitmap at 32 (+32)
  Inode table at 48-957 (+48)
  26826 free blocks, 29109 free inodes, 2 directories, 29109 unused inodes
  Free blocks: 5942-32767
  Free inodes: 12-29120
Group 1: (Blocks 32768-58226) [INODE_UNINIT]
  Checksum 0xb43a, unused inodes 29120
  Backup superblock at 32768, Group descriptors at 32769-32769
  Reserved GDT blocks at 32770-32783
  Block bitmap at 17 (bg #0 + 17), Inode bitmap at 33 (bg #0 + 33)
  Inode table at 958-1867 (bg #0 + 958)
  25443 free blocks, 29120 free inodes, 0 directories, 29120 unused inodes
  Free blocks: 32784-58226
  Free inodes: 29121-58240

4) Corrupted block group 1 by overwriting its block bitmap with all 1s
   (we had a file filled with 1s), using dd:

   dd if=i_file of=/dev/sdb1 bs=4096 seek=17 count=1

After this we mounted the partition, created a few random-sized files and then
ran fsstress:

   fsstress -p 10 -n 100 -l 100 -d /mnt/test_dir

Before the hang we got logs like:

#fsstress -p 10 -n 100 -l 100 -d /mnt/test_dir
seed = 582332
EXT4-fs error (device sdb1): ext4_mb_generate_buddy:743: group 1, 20480 clusters in bitmap, 25443 in gd; block bitmap corrupt.
JBD2: Spotted dirty metadata buffer (dev = sdb1, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
EXT4-fs (sdb1): delayed block allocation failed for inode 26 at logical offset 181 with max blocks 2 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost
EXT4-fs (sdb1): Total free blocks count 0
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): free_blocks=25443
EXT4-fs (sdb1): dirty_blocks=51
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=9
EXT4-fs (sdb1): i_reserved_meta_blocks=1
EXT4-fs (sdb1): delayed block allocation failed for inode 101 at logical offset 198 with max blocks 1 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost
EXT4-fs (sdb1): Total free blocks count 0
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): free_blocks=25443
EXT4-fs (sdb1): dirty_blocks=43
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=1
EXT4-fs (sdb1): i_reserved_meta_blocks=1
EXT4-fs (sdb1): delayed block allocation failed for inode 94 at logical offset 450 with max blocks 1 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost
EXT4-fs (sdb1): Total free blocks count 0
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): free_blocks=25443
EXT4-fs (sdb1): dirty_blocks=36
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=12
EXT4-fs (sdb1): i_reserved_meta_blocks=2
…
EXT4-fs (sdb1): error count: 3
EXT4-fs (sdb1): initial error at 545: ext4_mb_generate_buddy:743
EXT4-fs (sdb1): last error at 42: ext4_mb_generate_buddy:743
…

Yes, we are running this on an ARM target and we did not set the time before
running these operations, so the timestamps correspond to:

date -d @545
Thu Jan 1 05:39:05 IST 1970
date -d @42
Thu Jan 1 05:30:42 IST 1970

Yes, we are running kernel version 3.8 and applied the patches from Darrick to
fix the first looping around in ext4_da_writepages. A consolidated sketch of
the reproduction steps above is included below for reference.
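The sketch assumes /dev/sdb1 as the ~250MB scratch partition and /mnt as the
mount point; the exact mkfs invocation and the contents of i_file are
assumptions based on the description above, and block 17 is group 1's block
bitmap taken from the dumpe2fs output:

   #!/bin/sh
   # Reproduction sketch (assumes /dev/sdb1 is a ~250MB scratch partition)
   mkfs.ext4 -b 4096 /dev/sdb1   # step 2: format with 4K block size
   dumpe2fs /dev/sdb1            # step 3: note group 1's block bitmap location (block 17 here)

   # step 4: overwrite group 1's block bitmap with a 4KB file full of 1s
   dd if=i_file of=/dev/sdb1 bs=4096 seek=17 count=1

   mount /dev/sdb1 /mnt
   mkdir -p /mnt/test_dir
   # create a few random-sized files here, then stress the filesystem
   fsstress -p 10 -n 100 -l 100 -d /mnt/test_dir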
As you suggested, we also made a change to mark the FS read-only in case of
corruption. The changes are:

 		ext4_da_block_invalidatepages(mpd);
@@ -2303,6 +2305,36 @@ out:
 	return ret;
 }
 
+static void ext4_invalidate_mapping(struct address_space *mapping)
+{
+	struct pagevec pvec;
+	unsigned int i;
+	pgoff_t index = 0;
+	struct inode *inode = mapping->host;
+
+	pagevec_init(&pvec, 0);
+	while (pagevec_lookup_tag(&pvec, mapping, &index, PAGECACHE_TAG_DIRTY,
+				  PAGEVEC_SIZE)) {
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			struct page *page = pvec.pages[i];
+			lock_page(page);
+			spin_lock_irq(&mapping->tree_lock);
+			if (test_bit(PG_dirty, &page->flags)) {
+				printk(KERN_ERR "INO[%lu], i(%d)\n", inode->i_ino, i);
+				radix_tree_tag_clear(&mapping->page_tree,
+						     page_index(page),
+						     PAGECACHE_TAG_DIRTY);
+				spin_unlock_irq(&mapping->tree_lock);
+				clear_page_dirty_for_io(page);
+				goto unlock_page;
+			}
+			spin_unlock_irq(&mapping->tree_lock);
+unlock_page:
+			unlock_page(page);
+		}
+		pagevec_release(&pvec);
+	}
+}
 
 static int ext4_da_writepages(struct address_space *mapping,
 			      struct writeback_control *wbc)
@@ -2419,6 +2451,14 @@ retry:
 			ext4_msg(inode->i_sb, KERN_CRIT, "%s: jbd2_start: "
 				 "%ld pages, ino %lu; err %d", __func__,
 				 wbc->nr_to_write, inode->i_ino, ret);
+
+			if (ret == -EROFS) {
+				ext4_msg(inode->i_sb, KERN_CRIT, "This is probably a result of corruption "
+					 "as the FS is marked RO in between writepages\n");
+				/* Invalidate mapping if RO encountered during writepages */
+				ext4_invalidate_mapping(mapping);
+				invalidate_inode_pages2(mapping);
+			}
 			blk_finish_plug(&plug);
 			goto out_writepages;
 		}

After making these changes, as soon as there is corruption the FS is marked RO
and all pending dirty writes are cleared, so the hang is not seen. But we have
a doubt about the possible side effects of this code. Can you please provide
your inputs on this? It will be of immense help.

Thanks & Regards,
Amit Sahrawat

On Tue, Apr 15, 2014 at 6:17 PM, Theodore Ts'o wrote:
> On Tue, Apr 15, 2014 at 01:24:26PM +0530, Amit Sahrawat wrote:
>> Initially, in the normal write path, when the disk was almost full we got
>> a hung task on 'sync' because the flusher (which is busy in writepages)
>> was not responding. Before the hung task, we also found logs like:
>>
>> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1493, 0
>> clusters in bitmap, 58339 in gd
>> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1000, 0
>> clusters in bitmap, 3 in gd
>> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1425, 0
>> clusters in bitmap, 1 in gd
>
> These errors indicate that several block groups have corrupt block
> bitmaps: block groups #1493, #1000, and #1425. The fact that there are
> 0 free blocks/clusters in the bitmap indicates that the bitmap was all
> zeroes, which could be the potential cause of the corruption.
>
> The other thing which is funny is the number of free blocks/clusters
> being greater than 32768 in block group #1493. Assuming a 4k page
> size, that shouldn't be possible. Can you send the output of
> "dumpe2fs -h /dev/sdXX" so we can take a look at the file system
> parameters?
>
> How much before the hung task did you see these messages? I normally
> recommend that the file system be set up to either panic the system,
> or force the file system to be remounted read/only when EXT4-fs error
> messages are reported, since that means that the file system is
> corrupted, and further operation can cause more data to be lost.
>
>> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
>> a risk of filesystem corruption in case of system crash.
>>
>> EXT4-fs (sda1): error count: 58
>> EXT4-fs (sda1): initial error at 607: ext4_mb_generate_buddy:742
>> EXT4-fs (sda1): last error at 58: ext4_mb_generate_buddy:742
>
> The "607" and "58" in "at 607" and "at 58" are normally supposed to be
> unix time_t values. That is, it's normally a number like 1397564866,
> and it can be decoded via:
>
> % date -d @1397564866
> Tue Apr 15 08:27:46 EDT 2014
>
> The fact that these numbers are numerically so small means that the
> time wasn't set correctly on your system. Was this a test system
> running under kvm without a proper real-time clock?
>
>> When we analysed the problem, it turned out to occur in the writepages
>> path in ext4, because of the difference between the free blocks
>> reported by the cluster bitmap and the number of free blocks reported
>> by the group descriptor.
>
> Yes, indicating that the file system was corrupt.
>
>> During ext4_fill_super, ext4 calculates the number of free blocks by
>> reading all the descriptors in ext4_count_free_clusters and stores the
>> result in the percpu counter s_freeclusters_counter:
>>
>> ext4_count_free_clusters:
>> 	desc_count = 0;
>> 	for (i = 0; i < ngroups; i++) {
>> 		gdp = ext4_get_group_desc(sb, i, NULL);
>> 		if (!gdp)
>> 			continue;
>> 		desc_count += ext4_free_group_clusters(sb, gdp);
>> 	}
>> 	return desc_count;
>>
>> During the write_begin call, ext4 checks this s_freeclusters_counter
>> to know whether there are free blocks present or not.
>> When the free blocks reported by the group descriptor are greater than
>> the actual free blocks reported by the bitmap, a call to write_begin
>> could still succeed even if the free blocks represented by the bitmaps
>> are 0.
>
> Yes. We used to have code that would optionally read every single
> bitmap, and verify that the descriptor counts match the values in the
> bitmap. However, that was expensive, and wasn't a full check of all
> possible file system inconsistencies that could lead to data loss, so
> we ultimately removed this code. If the file system is potentially
> corrupt, it is the system administrator's responsibility to force an
> fsck run to make sure the file system data structures are consistent.
>
>> While searching for the relevant problem in this path, we found the
>> patch set from Darrick which revolves around this problem:
>>
>>   ext4: error out if verifying the block bitmap fails
>>   ext4: fix type declaration of ext4_validate_block_bitmap
>>   ext4: mark block group as corrupt on block bitmap error
>>   ext4: mark block group as corrupt on inode bitmap error
>>   ext4: mark group corrupt on group descriptor checksum
>>   ext4: don't count free clusters from a corrupt block group
>>
>> After adopting the patch set and performing verification on a similar
>> setup, we ran fsstress. But now it results in a hang at different
>> points.
>>
>> In the current logs we got:
>>
>> EXT4-fs error (device sdb1): ext4_mb_generate_buddy:743: group 1,
>> 20480 clusters in bitmap, 25443 in gd; block bitmap corrupt.
>> JBD2: Spotted dirty metadata buffer (dev = sdb1, blocknr = 0). There's
>> a risk of filesystem corruption in case of system crash.
>
> OK, what version of the kernel are you using? The patches that you
> reference above have been in the upstream kernel since 3.12, so I'm
> assuming you're not using the latest upstream kernel, but rather an
> older kernel with some patches applied.
> Hmmm, skipping ahead:
>
>> Kernel Version: 3.8
>> Test command:
>> fsstress -p 10 -n 100 -l 100 -d /mnt/test_dir
>
> There is clearly either some kernel bug or hardware problem which is
> causing the file system corruption. Given that you are using a much
> older kernel, it's quite likely that there is some bug that has been
> fixed in a later version of the kernel (although we can't really rule
> out a hardware problem without knowing much more about your setup).
>
> Unfortunately, there have been a *large* number of changes since
> version 3.8, and I can't remember all of the changes and bug fixes
> that we might have made in the past year or more (v3.8 dates from
> March 2013).
>
> Something that might be helpful is for you to use xfstests. That's a
> much more thorough set of tests which we've been using, so if you must
> use an antique version of the kernel, that will probably be a much
> better set of tests. It includes fsstress, and much more besides.
> More importantly, there are times when fixes are identified in the
> commit logs by the xfstests failure that they fix. So that might help
> you find the bug fix that you need to backport.
>
> For your convenience, there is a simple test framework that makes it
> relatively easy to build and run xfstests under KVM. You can find it
> here:
>
>   git://git.kernel.org/pub/scm/fs/ext2/xfstests-bld.git
>
> See the documentation found at:
>
>   https://git.kernel.org/cgit/fs/ext2/xfstests-bld.git/tree/README
>
> for more details.
>
> I hope this helps,
>
> - Ted

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c0fbd96..04f3a66 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1641,8 +1641,10 @@ static void mpage_da_map_and_submit(struct mpage_da_data *mpd)
 			 mpd->b_size >> mpd->inode->i_blkbits, err);
 		ext4_msg(sb, KERN_CRIT,
 			"This should not happen!! Data will be lost\n");
-		if (err == -ENOSPC)
+		if (err == -ENOSPC) {
 			ext4_print_free_blocks(mpd->inode);
+			mpd->inode->i_sb->s_flags |= MS_RDONLY;
+		}
 	}
 	/* invalidate all the pages */
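For reference, the error behaviour Ted recommends above (remount read-only, or
panic, on an EXT4-fs error) can also be configured without any kernel change;
a minimal sketch, with the device name assumed:

   # persist the error behaviour in the superblock
   tune2fs -e remount-ro /dev/sdb1    # or: tune2fs -e panic /dev/sdb1

   # or request it for a single mount
   mount -o errors=remount-ro /dev/sdb1 /mnt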