Message ID | 20120306204941.1663.56283.stgit@elm3b70.beaverton.ibm.com |
---|---|
State | Superseded, archived |
Delegated to: | Theodore Ts'o |
Headers | show |
On Tue, Mar 06, 2012 at 12:49:41PM -0800, Darrick J. Wong wrote: > @@ -177,11 +189,17 @@ typedef struct journal_block_tag_s > __be32 t_blocknr; /* The on-disk block number */ > __be32 t_flags; /* See below */ > __be32 t_blocknr_high; /* most-significant high 32bits. */ > + __be32 t_checksum; /* crc32c(uuid+seq+block) */ > } journal_block_tag_t; > > #define JBD2_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high)) > #define JBD2_TAG_SIZE64 (sizeof(journal_block_tag_t)) There's a problem with this patch here --- we are changing the size of journal_block_tag_t, which is an on-disk data structure. So for 64-bit journals, this represents a format change. This means that if you have a 64-bit file system that needs to have its journal recovered, if the journal was written with an older kernel, and then we try to recover it with a new kernel, things won't be good. Similarly, for e2fsck's recovery code, it's not going to be able to recover 64-bit file systems using current coding, since this patch series changes the size of JBD2_TAG_SIZE64. What we need to do is something like this: #define JBD2_TAG_SIZE64 (offsetof(journal_block_tag_t, t_checksum)) #define JBD2_TAG_SIZE_CSUM (sizeof(journal_block_tag_t)) And then change the code appropriately in e2fsprogs and in the kernel to use the correct tag size depending on the journal options. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2012-04-28, at 8:19, Ted Ts'o <tytso@mit.edu> wrote: > On Tue, Mar 06, 2012 at 12:49:41PM -0800, Darrick J. Wong wrote: >> @@ -177,11 +189,17 @@ typedef struct journal_block_tag_s >> __be32 t_blocknr; /* The on-disk block number */ >> __be32 t_flags; /* See below */ >> __be32 t_blocknr_high; /* most-significant high 32bits. */ >> + __be32 t_checksum; /* crc32c(uuid+seq+block) */ >> } journal_block_tag_t; >> >> #define JBD2_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high)) >> #define JBD2_TAG_SIZE64 (sizeof(journal_block_tag_t)) > > There's a problem with this patch here --- we are changing the size of > journal_block_tag_t, which is an on-disk data structure. So for > 64-bit journals, this represents a format change. This means that if > you have a 64-bit file system that needs to have its journal > recovered, if the journal was written with an older kernel, and then > we try to recover it with a new kernel, things won't be good. > Similarly, for e2fsck's recovery code, it's not going to be able to > recover 64-bit file systems using current coding, since this patch > series changes the size of JBD2_TAG_SIZE64. > > What we need to do is something like this: > > #define JBD2_TAG_SIZE64 (offsetof(journal_block_tag_t, t_checksum)) > #define JBD2_TAG_SIZE_CSUM (sizeof(journal_block_tag_t)) > > And then change the code appropriately in e2fsprogs and in the kernel > to use the correct tag size depending on the journal options. I thought we originally discussed using the high 16 bits of the t_flags field to store the checksum? This would avoid the need to change the disk format. Since there is still a whole transaction checksum, it isn't so critical that the per-block checksum be strong. One idea is to do the crc32c for each block, then store the high 16 bits into t_flags, and checksum the full 32-bit per-block checksums to make the commit block checksum, to avoid having to do the block checksums twice. Cheers, Andreas-- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[ I've trimmed the cc line to avoid spamming lots of folks who might not care about the details of jbd2 checksumming. ] On Sat, Apr 28, 2012 at 04:58:12PM -0600, Andreas Dilger wrote: > > I thought we originally discussed using the high 16 bits of the > t_flags field to store the checksum? This would avoid the need to > change the disk format. I don't recall that suggestion, but I like it. One thing that will get subtle about this is that t_flags is stored big-endian (jbd/jbd2 data structures are stored be, but the data structures in ext4 proper are stored le; sigh). So we'd have to do something like this: typedef struct journal_block_tag_s { __u32 t_blocknr; /* The on-disk block number */ __u16 t_checksum; /* 16-bit checksum */ __u16 t_flags; /* See below */ __u32 t_blocknr_high; /* most-significant high 32bits. */ } journal_block_tag_t; ... and then make sure we change all of the places that access t_flags using cpu_to_be32() and be32_to_cpu() get changed to the 16-bit variant. > Since there is still a whole transaction checksum, it isn't so > critical that the per-block checksum be strong. > > One idea is to do the crc32c for each block, then store the high 16 > bits into t_flags, and checksum the full 32-bit per-block checksums > to make the commit block checksum, to avoid having to do the block > checksums twice. It's not critical because the hard drive is doing its own ECC. So I'm not that worried about detecting a large burst of bit errors, which is the main advantage of using a larger CRC. I'm more worried about a disk block getting written to the wrong place, or not getting written at all. So whether the chance of detecting a wrong block is 99.9985% at all (w/ a 16-bit checksum) or 99.9999% (with a 32-bit checksum), at all, either is fine. I'm not even sure I would worry combining the per-block checksums into the commit block checksum. In the rare case where there is an error not detected by the 16-bit checksum which is detected in the commit checksum, what are we supposed to do? Throw away the entire commit again? Just simply testing to see what we do in this rare case is going to be interesting / painful. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2012-04-29, at 1:39 PM, Ted Ts'o wrote: > On Sat, Apr 28, 2012 at 04:58:12PM -0600, Andreas Dilger wrote: >> >> I thought we originally discussed using the high 16 bits of the >> t_flags field to store the checksum? This would avoid the need to >> change the disk format. > > I don't recall that suggestion, but I like it. One thing that will > get subtle about this is that t_flags is stored big-endian (jbd/jbd2 > data structures are stored be, but the data structures in ext4 proper > are stored le; sigh). So we'd have to do something like this: > > typedef struct journal_block_tag_s > { > __u32 t_blocknr; /* The on-disk block number */ > __u16 t_checksum; /* 16-bit checksum */ > __u16 t_flags; /* See below */ > __u32 t_blocknr_high; /* most-significant high 32bits. */ > } journal_block_tag_t; > > ... and then make sure we change all of the places that access t_flags > using cpu_to_be32() and be32_to_cpu() get changed to the 16-bit > variant. Sigh... >> Since there is still a whole transaction checksum, it isn't so >> critical that the per-block checksum be strong. >> >> One idea is to do the crc32c for each block, then store the high 16 >> bits into t_flags, and checksum the full 32-bit per-block checksums >> to make the commit block checksum, to avoid having to do the block >> checksums twice. > > It's not critical because the hard drive is doing its own ECC. So I'm > not that worried about detecting a large burst of bit errors, which is > the main advantage of using a larger CRC. I'm more worried about a > disk block getting written to the wrong place, or not getting written > at all. So whether the chance of detecting a wrong block is 99.9985% > at all (w/ a 16-bit checksum) or 99.9999% (with a 32-bit checksum), > at all, either is fine. > > I'm not even sure I would worry combining the per-block checksums into > the commit block checksum. In the rare case where there is an error > not detected by the 16-bit checksum which is detected in the commit > checksum, what are we supposed to do? Throw away the entire commit > again? Just simply testing to see what we do in this rare case is > going to be interesting / painful. The main reason to create the commit block checksum out of the per-block checksums is to avoid having to checksum the data blocks themselves twice. That would be a significant overhead to compute, say, 4096 * 4kB = 16MB of block data, while computing the commit block checksum out of 4096 * 2 bytes = 8kB of the 16-bit checksums is relatively minor. Cheers, Andreas -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Apr 28, 2012 at 10:19:33AM -0400, Ted Ts'o wrote: > On Tue, Mar 06, 2012 at 12:49:41PM -0800, Darrick J. Wong wrote: > > @@ -177,11 +189,17 @@ typedef struct journal_block_tag_s > > __be32 t_blocknr; /* The on-disk block number */ > > __be32 t_flags; /* See below */ > > __be32 t_blocknr_high; /* most-significant high 32bits. */ > > + __be32 t_checksum; /* crc32c(uuid+seq+block) */ > > } journal_block_tag_t; > > > > #define JBD2_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high)) > > #define JBD2_TAG_SIZE64 (sizeof(journal_block_tag_t)) > > There's a problem with this patch here --- we are changing the size of > journal_block_tag_t, which is an on-disk data structure. So for > 64-bit journals, this represents a format change. This means that if > you have a 64-bit file system that needs to have its journal > recovered, if the journal was written with an older kernel, and then > we try to recover it with a new kernel, things won't be good. > Similarly, for e2fsck's recovery code, it's not going to be able to > recover 64-bit file systems using current coding, since this patch > series changes the size of JBD2_TAG_SIZE64. > > What we need to do is something like this: > > #define JBD2_TAG_SIZE64 (offsetof(journal_block_tag_t, t_checksum)) > #define JBD2_TAG_SIZE_CSUM (sizeof(journal_block_tag_t)) > > And then change the code appropriately in e2fsprogs and in the kernel > to use the correct tag size depending on the journal options. Oops. I forgot to update JBD2_TAG_SIZE64. I have a question, though -- it looks as though the code that handles reading and writing tags from raw disk blocks calls journal_tag_bytes() to determine the tag size, and manually increments a pointer "tagp" to step through the block. This construction seems to be be sufficient to deal with possible differences between sizeof(journal_block_tag_t) and the on-disk tag size, and both increases over the 32bit tag size are gated on INCOMPAT_64BIT and INCOMPAT_CSUM_V2. Had I defined JBD2_TAG_SIZE64 with offsetof() as Ted did above, I think that journal_tag_bytes() would return the correct on-disk tag size, which should fix the scenario Ted outlined above. The tag checksum set/verify functions would also need to be taught where t_checksum is (in the space occupied by t_blocknr_high) on a 32bit journal. Could those two suggestions fix the problem without causing us to discard half the checksum bits? Well, not quite -- the calculation of tags per block in journal.c below the comment "journal descriptor can store up to n blocks -bzzz" probably ought to be using journal_tag_bytes(), not sizeof(journal_block_tag_t) to figure out how many tags can be crammed into a disk block, since right now I think it underreports the number of tags per block on a 32bit journal. journal_tag_disk_size() is a more descriptive name for journal_tag_bytes(). As for putting half the checksum into the upper 16 bits of the flags field -- is journal space at such a premium that we need to overload the field and reduce the strength of the checksum? Enabling journal checksums on a 4k block filesystem causes tags_per_block to decrease from 512 to 341 on a 32bit journal and from 341 to 256 on a 64bit journal. Do transactions typically have that many blocks? I didn't think most transactions had 1-2MB of dirty data. --D > > - Ted > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2012-04-30, at 9:53 AM, djwong wrote: > As for putting half the checksum into the upper 16 bits of the flags > field -- is journal space at such a premium that we need to overload > the field and reduce the strength of the checksum? Enabling journal > checksums on a 4k block filesystem causes tags_per_block to decrease > from 512 to 341 on a 32bit journal and from 341 to 256 on a 64bit > journal. Do transactions typically have that many blocks? I didn't > think most transactions had 1-2MB of dirty data. I think on a busy filesystem there can be many thousands of blocks in a single transaction. We run Lustre with 400MB journals, and under metadata-intensive workloads we can hit the 100MB transaction size limit easily. However, this doesn't mean there are 25k blocks in each transaction, since most of these blocks are reserved for the worst case, but not used. As for the impact of reducing the number of tags in each block, for a 4096-block transaction this would currently mean 8 32-bit tag blocks, and it would grow to 12 or 13, which isn't significant in the end. My suggestion was mostly to avoid problems with the disk format change. If this can be handled in another manner, AND it doesn't break journal recovery on older kernels/e2fsprogs, then I'm OK with the cleaner approach. Please ensure that this is tested. Cheers, Andreas -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Apr 29, 2012 at 03:39:21PM -0400, Ted Ts'o wrote: > [ I've trimmed the cc line to avoid spamming lots of folks who might not > care about the details of jbd2 checksumming. ] > > On Sat, Apr 28, 2012 at 04:58:12PM -0600, Andreas Dilger wrote: > > > > I thought we originally discussed using the high 16 bits of the > > t_flags field to store the checksum? This would avoid the need to > > change the disk format. > > I don't recall that suggestion, but I like it. One thing that will > get subtle about this is that t_flags is stored big-endian (jbd/jbd2 > data structures are stored be, but the data structures in ext4 proper > are stored le; sigh). So we'd have to do something like this: > > typedef struct journal_block_tag_s > { > __u32 t_blocknr; /* The on-disk block number */ > __u16 t_checksum; /* 16-bit checksum */ > __u16 t_flags; /* See below */ > __u32 t_blocknr_high; /* most-significant high 32bits. */ > } journal_block_tag_t; > > ... and then make sure we change all of the places that access t_flags > using cpu_to_be32() and be32_to_cpu() get changed to the 16-bit > variant. > > > Since there is still a whole transaction checksum, it isn't so > > critical that the per-block checksum be strong. > > > > One idea is to do the crc32c for each block, then store the high 16 > > bits into t_flags, and checksum the full 32-bit per-block checksums > > to make the commit block checksum, to avoid having to do the block > > checksums twice. > > It's not critical because the hard drive is doing its own ECC. So I'm > not that worried about detecting a large burst of bit errors, which is > the main advantage of using a larger CRC. I'm more worried about a > disk block getting written to the wrong place, or not getting written > at all. So whether the chance of detecting a wrong block is 99.9985% > at all (w/ a 16-bit checksum) or 99.9999% (with a 32-bit checksum), > at all, either is fine. Hmmm, what about 64k block filesystems? Anyway, I revised two of the patches quite a while ago and apparently forgot to send them. :( They simply enlarge the journal tag struct and adjust the code to use journal_tag_bytes() instead of the constants. I was going to send them out, but I rebased off e2fsprogs head and 3.4-rc7 just today and saw new regressions about group descriptor checksums. Oh well. --D > > I'm not even sure I would worry combining the per-block checksums into > the commit block checksum. In the rare case where there is an error > not detected by the 16-bit checksum which is detected in the commit > checksum, what are we supposed to do? Throw away the entire commit > again? Just simply testing to see what we do in this rare case is > going to be interesting / painful. > > - Ted > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h index 5557bae..c286153 100644 --- a/include/linux/jbd2.h +++ b/include/linux/jbd2.h @@ -147,12 +147,24 @@ typedef struct journal_header_s #define JBD2_CRC32_CHKSUM 1 #define JBD2_MD5_CHKSUM 2 #define JBD2_SHA1_CHKSUM 3 +#define JBD2_CRC32C_CHKSUM 4 #define JBD2_CRC32_CHKSUM_SIZE 4 #define JBD2_CHECKSUM_BYTES (32 / sizeof(u32)) /* * Commit block header for storing transactional checksums: + * + * NOTE: If FEATURE_COMPAT_CHECKSUM (checksum v1) is set, the h_chksum* + * fields are used to store a checksum of the descriptor and data blocks. + * + * If FEATURE_INCOMPAT_CSUM_V2 (checksum v2) is set, then the h_chksum + * field is used to store crc32c(uuid+commit_block). Each journal metadata + * block gets its own checksum, and data block checksums are stored in + * journal_block_tag (in the descriptor). The other h_chksum* fields are + * not used. + * + * Checksum v1 and v2 are mutually exclusive features. */ struct commit_header { __be32 h_magic; @@ -177,11 +189,17 @@ typedef struct journal_block_tag_s __be32 t_blocknr; /* The on-disk block number */ __be32 t_flags; /* See below */ __be32 t_blocknr_high; /* most-significant high 32bits. */ + __be32 t_checksum; /* crc32c(uuid+seq+block) */ } journal_block_tag_t; #define JBD2_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high)) #define JBD2_TAG_SIZE64 (sizeof(journal_block_tag_t)) +/* Tail of descriptor block, for checksumming */ +struct jbd2_journal_block_tail { + __be32 t_checksum; /* crc32c(uuid+descr_block) */ +}; + /* * The revoke descriptor: used on disk to describe a series of blocks to * be revoked from the log @@ -192,6 +210,10 @@ typedef struct jbd2_journal_revoke_header_s __be32 r_count; /* Count of bytes used in the block */ } jbd2_journal_revoke_header_t; +/* Tail of revoke block, for checksumming */ +struct jbd2_journal_revoke_tail { + __be32 r_checksum; /* crc32c(uuid+revoke_block) */ +}; /* Definitions for the journal tag flags word: */ #define JBD2_FLAG_ESCAPE 1 /* on-disk block is escaped */ @@ -241,7 +263,10 @@ typedef struct journal_superblock_s __be32 s_max_trans_data; /* Limit of data blocks per trans. */ /* 0x0050 */ - __u32 s_padding[44]; + __u8 s_checksum_type; /* checksum type */ + __u8 s_padding2[3]; + __u32 s_padding[42]; + __be32 s_checksum; /* crc32c(superblock) */ /* 0x0100 */ __u8 s_users[16*48]; /* ids of all fs'es sharing the log */ @@ -263,6 +288,7 @@ typedef struct journal_superblock_s #define JBD2_FEATURE_INCOMPAT_REVOKE 0x00000001 #define JBD2_FEATURE_INCOMPAT_64BIT 0x00000002 #define JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT 0x00000004 +#define JBD2_FEATURE_INCOMPAT_CSUM_V2 0x00000008 /* Features known to this kernel version: */ #define JBD2_KNOWN_COMPAT_FEATURES JBD2_FEATURE_COMPAT_CHECKSUM
Define flags and allocate space in on-disk journal structures to support checksumming of journal metadata. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> --- include/linux/jbd2.h | 28 +++++++++++++++++++++++++++- 1 files changed, 27 insertions(+), 1 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html