[15/23] jbd2: Change disk layout for metadata checksumming

Message ID	20120306204941.1663.56283.stgit@elm3b70.beaverton.ibm.com
State	Superseded, archived
Delegated to:	Theodore Ts'o
Headers	show Return-Path: <linux-ext4-owner@vger.kernel.org> Gateway: Authorized Use Only! Violators will be prosecuted for <linux-ext4@vger.kernel.org> from <djwong@us.ibm.com>; Tue, 6 Mar 2012 13:49:49 -0700 Gateway: Authorized Use Only! Violators will be prosecuted; Tue, 6 Mar 2012 13:49:48 -0700 Subject: [PATCH 15/23] jbd2: Change disk layout for metadata checksumming To: Andreas Dilger <adilger.kernel@dilger.ca>, Theodore Tso <tytso@mit.edu>, "Darrick J. Wong" <djwong@us.ibm.com> From: "Darrick J. Wong" <djwong@us.ibm.com> Cc: Sunil Mushran <sunil.mushran@oracle.com>, Martin K Petersen <martin.petersen@oracle.com>, Greg Freemyer <greg.freemyer@gmail.com>, Amir Goldstein <amir73il@gmail.com>, linux-kernel <linux-kernel@vger.kernel.org>, Andi Kleen <andi@firstfloor.org>, Mingming Cao <cmm@us.ibm.com>, Joel Becker <jlbec@evilplan.org>, linux-fsdevel <linux-fsdevel@vger.kernel.org>, linux-ext4@vger.kernel.org, Coly Li <colyli@gmail.com> Date: Tue, 06 Mar 2012 12:49:41 -0800 Message-ID: <20120306204941.1663.56283.stgit@elm3b70.beaverton.ibm.com> In-Reply-To: <20120306204750.1663.96751.stgit@elm3b70.beaverton.ibm.com> References: <20120306204750.1663.96751.stgit@elm3b70.beaverton.ibm.com> User-Agent: StGit/0.15 MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk

Darrick J. Wong March 6, 2012, 8:49 p.m. UTC

Define flags and allocate space in on-disk journal structures to support
checksumming of journal metadata.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
---
 include/linux/jbd2.h |   28 +++++++++++++++++++++++++++-
 1 files changed, 27 insertions(+), 1 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Theodore Ts'o April 28, 2012, 2:19 p.m. UTC | #1

On Tue, Mar 06, 2012 at 12:49:41PM -0800, Darrick J. Wong wrote:
> @@ -177,11 +189,17 @@ typedef struct journal_block_tag_s
>  	__be32		t_blocknr;	/* The on-disk block number */
>  	__be32		t_flags;	/* See below */
>  	__be32		t_blocknr_high; /* most-significant high 32bits. */
> +	__be32		t_checksum;	/* crc32c(uuid+seq+block) */
>  } journal_block_tag_t;
>  
>  #define JBD2_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high))
>  #define JBD2_TAG_SIZE64 (sizeof(journal_block_tag_t))

There's a problem with this patch here --- we are changing the size of
journal_block_tag_t, which is an on-disk data structure.  So for
64-bit journals, this represents a format change.  This means that if
you have a 64-bit file system that needs to have its journal
recovered, if the journal was written with an older kernel, and then
we try to recover it with a new kernel, things won't be good.
Similarly, for e2fsck's recovery code, it's not going to be able to
recover 64-bit file systems using current coding, since this patch
series changes the size of JBD2_TAG_SIZE64.

What we need to do is something like this:

#define JBD2_TAG_SIZE64 (offsetof(journal_block_tag_t, t_checksum))
#define JBD2_TAG_SIZE_CSUM (sizeof(journal_block_tag_t))

And then change the code appropriately in e2fsprogs and in the kernel
to use the correct tag size depending on the journal options.

       	   	       	    	      	     - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andreas Dilger April 28, 2012, 10:58 p.m. UTC | #2

On 2012-04-28, at 8:19, Ted Ts'o <tytso@mit.edu> wrote:

> On Tue, Mar 06, 2012 at 12:49:41PM -0800, Darrick J. Wong wrote:
>> @@ -177,11 +189,17 @@ typedef struct journal_block_tag_s
>>    __be32        t_blocknr;    /* The on-disk block number */
>>    __be32        t_flags;    /* See below */
>>    __be32        t_blocknr_high; /* most-significant high 32bits. */
>> +    __be32        t_checksum;    /* crc32c(uuid+seq+block) */
>> } journal_block_tag_t;
>> 
>> #define JBD2_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high))
>> #define JBD2_TAG_SIZE64 (sizeof(journal_block_tag_t))
> 
> There's a problem with this patch here --- we are changing the size of
> journal_block_tag_t, which is an on-disk data structure.  So for
> 64-bit journals, this represents a format change.  This means that if
> you have a 64-bit file system that needs to have its journal
> recovered, if the journal was written with an older kernel, and then
> we try to recover it with a new kernel, things won't be good.
> Similarly, for e2fsck's recovery code, it's not going to be able to
> recover 64-bit file systems using current coding, since this patch
> series changes the size of JBD2_TAG_SIZE64.
> 
> What we need to do is something like this:
> 
> #define JBD2_TAG_SIZE64 (offsetof(journal_block_tag_t, t_checksum))
> #define JBD2_TAG_SIZE_CSUM (sizeof(journal_block_tag_t))
> 
> And then change the code appropriately in e2fsprogs and in the kernel
> to use the correct tag size depending on the journal options.

I thought we originally discussed using the high 16 bits of the t_flags field to store the checksum?  This would avoid the need to change the disk format. 

Since there is still a whole transaction checksum, it isn't so critical that the per-block checksum be strong.

One idea is to do the crc32c for each block, then store the high 16 bits into t_flags, and checksum the full 32-bit per-block checksums to make the commit block checksum, to avoid having to do the block checksums twice. 

Cheers, Andreas--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Theodore Ts'o April 29, 2012, 7:39 p.m. UTC | #3

[ I've trimmed the cc line to avoid spamming lots of folks who might not
  care about the details of jbd2 checksumming. ]

On Sat, Apr 28, 2012 at 04:58:12PM -0600, Andreas Dilger wrote:
> 
> I thought we originally discussed using the high 16 bits of the
> t_flags field to store the checksum?  This would avoid the need to
> change the disk format.

I don't recall that suggestion, but I like it.  One thing that will
get subtle about this is that t_flags is stored big-endian (jbd/jbd2
data structures are stored be, but the data structures in ext4 proper
are stored le; sigh).   So we'd have to do something like this:

typedef struct journal_block_tag_s
{
	__u32		t_blocknr;	/* The on-disk block number */
	__u16		t_checksum;	/* 16-bit checksum */
	__u16		t_flags;	/* See below */
	__u32		t_blocknr_high; /* most-significant high 32bits. */
} journal_block_tag_t;

... and then make sure we change all of the places that access t_flags
using cpu_to_be32() and be32_to_cpu() get changed to the 16-bit
variant.

> Since there is still a whole transaction checksum, it isn't so
> critical that the per-block checksum be strong.
>
> One idea is to do the crc32c for each block, then store the high 16
> bits into t_flags, and checksum the full 32-bit per-block checksums
> to make the commit block checksum, to avoid having to do the block
> checksums twice.

It's not critical because the hard drive is doing its own ECC.  So I'm
not that worried about detecting a large burst of bit errors, which is
the main advantage of using a larger CRC.  I'm more worried about a
disk block getting written to the wrong place, or not getting written
at all.  So whether the chance of detecting a wrong block is 99.9985%
at all (w/ a 16-bit checksum) or 99.9999% (with a 32-bit checksum),
at all, either is fine.

I'm not even sure I would worry combining the per-block checksums into
the commit block checksum.  In the rare case where there is an error
not detected by the 16-bit checksum which is detected in the commit
checksum, what are we supposed to do?  Throw away the entire commit
again?  Just simply testing to see what we do in this rare case is
going to be interesting / painful.

   	     	     	      	 	     - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andreas Dilger April 30, 2012, 4:08 a.m. UTC | #4

On 2012-04-29, at 1:39 PM, Ted Ts'o wrote:
> On Sat, Apr 28, 2012 at 04:58:12PM -0600, Andreas Dilger wrote:
>> 
>> I thought we originally discussed using the high 16 bits of the
>> t_flags field to store the checksum?  This would avoid the need to
>> change the disk format.
> 
> I don't recall that suggestion, but I like it.  One thing that will
> get subtle about this is that t_flags is stored big-endian (jbd/jbd2
> data structures are stored be, but the data structures in ext4 proper
> are stored le; sigh).   So we'd have to do something like this:
> 
> typedef struct journal_block_tag_s
> {
> 	__u32		t_blocknr;	/* The on-disk block number */
> 	__u16		t_checksum;	/* 16-bit checksum */
> 	__u16		t_flags;	/* See below */
> 	__u32		t_blocknr_high; /* most-significant high 32bits. */
> } journal_block_tag_t;
> 
> ... and then make sure we change all of the places that access t_flags
> using cpu_to_be32() and be32_to_cpu() get changed to the 16-bit
> variant.

Sigh...

>> Since there is still a whole transaction checksum, it isn't so
>> critical that the per-block checksum be strong.
>> 
>> One idea is to do the crc32c for each block, then store the high 16
>> bits into t_flags, and checksum the full 32-bit per-block checksums
>> to make the commit block checksum, to avoid having to do the block
>> checksums twice.
> 
> It's not critical because the hard drive is doing its own ECC.  So I'm
> not that worried about detecting a large burst of bit errors, which is
> the main advantage of using a larger CRC.  I'm more worried about a
> disk block getting written to the wrong place, or not getting written
> at all.  So whether the chance of detecting a wrong block is 99.9985%
> at all (w/ a 16-bit checksum) or 99.9999% (with a 32-bit checksum),
> at all, either is fine.
> 
> I'm not even sure I would worry combining the per-block checksums into
> the commit block checksum.  In the rare case where there is an error
> not detected by the 16-bit checksum which is detected in the commit
> checksum, what are we supposed to do?  Throw away the entire commit
> again?  Just simply testing to see what we do in this rare case is
> going to be interesting / painful.

The main reason to create the commit block checksum out of the
per-block checksums is to avoid having to checksum the data blocks
themselves twice.  That would be a significant overhead to compute,
say, 4096 * 4kB = 16MB of block data, while computing the commit
block checksum out of 4096 * 2 bytes = 8kB of the 16-bit checksums
is relatively minor.

Cheers, Andreas





--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Darrick J. Wong April 30, 2012, 3:53 p.m. UTC | #5

On Sat, Apr 28, 2012 at 10:19:33AM -0400, Ted Ts'o wrote:
> On Tue, Mar 06, 2012 at 12:49:41PM -0800, Darrick J. Wong wrote:
> > @@ -177,11 +189,17 @@ typedef struct journal_block_tag_s
> >  	__be32		t_blocknr;	/* The on-disk block number */
> >  	__be32		t_flags;	/* See below */
> >  	__be32		t_blocknr_high; /* most-significant high 32bits. */
> > +	__be32		t_checksum;	/* crc32c(uuid+seq+block) */
> >  } journal_block_tag_t;
> >  
> >  #define JBD2_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high))
> >  #define JBD2_TAG_SIZE64 (sizeof(journal_block_tag_t))
> 
> There's a problem with this patch here --- we are changing the size of
> journal_block_tag_t, which is an on-disk data structure.  So for
> 64-bit journals, this represents a format change.  This means that if
> you have a 64-bit file system that needs to have its journal
> recovered, if the journal was written with an older kernel, and then
> we try to recover it with a new kernel, things won't be good.
> Similarly, for e2fsck's recovery code, it's not going to be able to
> recover 64-bit file systems using current coding, since this patch
> series changes the size of JBD2_TAG_SIZE64.
> 
> What we need to do is something like this:
> 
> #define JBD2_TAG_SIZE64 (offsetof(journal_block_tag_t, t_checksum))
> #define JBD2_TAG_SIZE_CSUM (sizeof(journal_block_tag_t))
> 
> And then change the code appropriately in e2fsprogs and in the kernel
> to use the correct tag size depending on the journal options.

Oops.  I forgot to update JBD2_TAG_SIZE64.

I have a question, though -- it looks as though the code that handles reading
and writing tags from raw disk blocks calls journal_tag_bytes() to determine
the tag size, and manually increments a pointer "tagp" to step through the
block.  This construction seems to be be sufficient to deal with possible
differences between sizeof(journal_block_tag_t) and the on-disk tag size, and
both increases over the 32bit tag size are gated on INCOMPAT_64BIT and
INCOMPAT_CSUM_V2.

Had I defined JBD2_TAG_SIZE64 with offsetof() as Ted did above, I think that
journal_tag_bytes() would return the correct on-disk tag size, which should fix
the scenario Ted outlined above.  The tag checksum set/verify functions would
also need to be taught where t_checksum is (in the space occupied by
t_blocknr_high) on a 32bit journal.  Could those two suggestions fix the
problem without causing us to discard half the checksum bits?

Well, not quite -- the calculation of tags per block in journal.c below the
comment "journal descriptor can store up to n blocks -bzzz" probably ought to
be using journal_tag_bytes(), not sizeof(journal_block_tag_t) to figure out how
many tags can be crammed into a disk block, since right now I think it
underreports the number of tags per block on a 32bit journal.
journal_tag_disk_size() is a more descriptive name for journal_tag_bytes().

As for putting half the checksum into the upper 16 bits of the flags field --
is journal space at such a premium that we need to overload the field and
reduce the strength of the checksum?  Enabling journal checksums on a 4k block
filesystem causes tags_per_block to decrease from 512 to 341 on a 32bit journal
and from 341 to 256 on a 64bit journal.  Do transactions typically have that
many blocks?  I didn't think most transactions had 1-2MB of dirty data.

--D
> 
>        	   	       	    	      	     - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andreas Dilger April 30, 2012, 4:51 p.m. UTC | #6

On 2012-04-30, at 9:53 AM, djwong wrote:
> As for putting half the checksum into the upper 16 bits of the flags
> field -- is journal space at such a premium that we need to overload
> the field and reduce the strength of the checksum?  Enabling journal
> checksums on a 4k block filesystem causes tags_per_block to decrease
> from 512 to 341 on a 32bit journal and from 341 to 256 on a 64bit
> journal.  Do transactions typically have that many blocks?  I didn't
> think most transactions had 1-2MB of dirty data.

I think on a busy filesystem there can be many thousands of blocks in
a single transaction.  We run Lustre with 400MB journals, and under
metadata-intensive workloads we can hit the 100MB transaction size
limit easily.  However, this doesn't mean there are 25k blocks in
each transaction, since most of these blocks are reserved for the
worst case, but not used.

As for the impact of reducing the number of tags in each block, for a
4096-block transaction this would currently mean 8 32-bit tag blocks,
and it would grow to 12 or 13, which isn't significant in the end.

My suggestion was mostly to avoid problems with the disk format change.
If this can be handled in another manner, AND it doesn't break journal
recovery on older kernels/e2fsprogs, then I'm OK with the cleaner
approach.  Please ensure that this is tested.

Cheers, Andreas

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Darrick J. Wong May 18, 2012, 10:51 p.m. UTC | #7

On Sun, Apr 29, 2012 at 03:39:21PM -0400, Ted Ts'o wrote:
> [ I've trimmed the cc line to avoid spamming lots of folks who might not
>   care about the details of jbd2 checksumming. ]
> 
> On Sat, Apr 28, 2012 at 04:58:12PM -0600, Andreas Dilger wrote:
> > 
> > I thought we originally discussed using the high 16 bits of the
> > t_flags field to store the checksum?  This would avoid the need to
> > change the disk format.
> 
> I don't recall that suggestion, but I like it.  One thing that will
> get subtle about this is that t_flags is stored big-endian (jbd/jbd2
> data structures are stored be, but the data structures in ext4 proper
> are stored le; sigh).   So we'd have to do something like this:
> 
> typedef struct journal_block_tag_s
> {
> 	__u32		t_blocknr;	/* The on-disk block number */
> 	__u16		t_checksum;	/* 16-bit checksum */
> 	__u16		t_flags;	/* See below */
> 	__u32		t_blocknr_high; /* most-significant high 32bits. */
> } journal_block_tag_t;
> 
> ... and then make sure we change all of the places that access t_flags
> using cpu_to_be32() and be32_to_cpu() get changed to the 16-bit
> variant.
> 
> > Since there is still a whole transaction checksum, it isn't so
> > critical that the per-block checksum be strong.
> >
> > One idea is to do the crc32c for each block, then store the high 16
> > bits into t_flags, and checksum the full 32-bit per-block checksums
> > to make the commit block checksum, to avoid having to do the block
> > checksums twice.
> 
> It's not critical because the hard drive is doing its own ECC.  So I'm
> not that worried about detecting a large burst of bit errors, which is
> the main advantage of using a larger CRC.  I'm more worried about a
> disk block getting written to the wrong place, or not getting written
> at all.  So whether the chance of detecting a wrong block is 99.9985%
> at all (w/ a 16-bit checksum) or 99.9999% (with a 32-bit checksum),
> at all, either is fine.

Hmmm, what about 64k block filesystems?

Anyway, I revised two of the patches quite a while ago and apparently forgot to
send them. :(  They simply enlarge the journal tag struct and adjust the code
to use journal_tag_bytes() instead of the constants.

I was going to send them out, but I rebased off e2fsprogs head and 3.4-rc7 just
today and saw new regressions about group descriptor checksums.  Oh well.

--D

> 
> I'm not even sure I would worry combining the per-block checksums into
> the commit block checksum.  In the rare case where there is an error
> not detected by the 16-bit checksum which is detected in the commit
> checksum, what are we supposed to do?  Throw away the entire commit
> again?  Just simply testing to see what we do in this rare case is
> going to be interesting / painful.
> 
>    	     	     	      	 	     - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[15/23] jbd2: Change disk layout for metadata checksumming

Commit Message

Comments

Patch