Message ID | 20240529012003.4006535-3-harshadshirwadkar@gmail.com |
---|---|
State | Superseded |
Headers | show |
Series | Ext4 fast commit performance patch series | expand |
On Wed 29-05-24 01:19:55, Harshad Shirwadkar wrote: > If the inode that's being requested to track using ext4_fc_track_inode > is being committed, then wait until the inode finishes the > commit. Also, add calls to ext4_fc_track_inode at the right places. > > With this patch, now calling ext4_reserve_inode_write() results in > inode being tracked for next fast commit. A subtle lock ordering > requirement with i_data_sem (which is documented in the code) requires > that ext4_fc_track_inode() be called before grabbing i_data_sem. So, > this patch also adds explicit ext4_fc_track_inode() calls in places > where i_data_sem grabbed. > > Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> > --- > fs/ext4/fast_commit.c | 34 ++++++++++++++++++++++++++++++++++ > fs/ext4/inline.c | 3 +++ > fs/ext4/inode.c | 4 ++++ > 3 files changed, 41 insertions(+) > > diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c > index a1aadebfcd66..fa93ce399440 100644 > --- a/fs/ext4/fast_commit.c > +++ b/fs/ext4/fast_commit.c > @@ -581,6 +581,8 @@ static int __track_inode(struct inode *inode, void *arg, bool update) > > void ext4_fc_track_inode(handle_t *handle, struct inode *inode) > { > + struct ext4_inode_info *ei = EXT4_I(inode); > + wait_queue_head_t *wq; > int ret; > > if (S_ISDIR(inode->i_mode)) > @@ -598,6 +600,38 @@ void ext4_fc_track_inode(handle_t *handle, struct inode *inode) > if (ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE)) > return; > > + if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT) || > + (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) || > + ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE) || > + !list_empty(&ei->i_fc_list)) Indentation here is wrong (we never indent conditions by the same amount as subsequent code block). Also EXT4_MF_FC_INELIGIBLE was just tested above so why repeat it and ext4_fc_disabled() tested the EXT4_FC_REPLAY and JOURNAL_FAST_COMMIT flags. So list_empty() should be the only thing needed here. BTW a separate helper for ext4_fc_disabled() + EXT4_MF_FC_INELIGIBLE test would be a nice cleanup as it is a pattern happening in quite a few places. > + return; > + > + /* > + * If we come here, we may sleep while waiting for the inode to > + * commit. We shouldn't be holding i_data_sem in write mode when we go > + * to sleep since the commit path needs to grab the lock while > + * committing the inode. > + */ > + WARN_ON(lockdep_is_held_type(&ei->i_data_sem, 1)); Actually even holding it in read mode is problematic. rwsems can block other readers from acquiring rwsem if there's some writer waiting to acquire the lock (and that will be blocked behind us). Shortly, this should be just lockdep_is_held(). Otherwise the patch looks good. Honza
On Wed 29-05-24 01:19:55, Harshad Shirwadkar wrote: > If the inode that's being requested to track using ext4_fc_track_inode > is being committed, then wait until the inode finishes the > commit. Also, add calls to ext4_fc_track_inode at the right places. > > With this patch, now calling ext4_reserve_inode_write() results in > inode being tracked for next fast commit. A subtle lock ordering > requirement with i_data_sem (which is documented in the code) requires > that ext4_fc_track_inode() be called before grabbing i_data_sem. So, > this patch also adds explicit ext4_fc_track_inode() calls in places > where i_data_sem grabbed. > > Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> One more thing I've noticed: > @@ -5727,6 +5730,7 @@ ext4_reserve_inode_write(handle_t *handle, struct inode *inode, > brelse(iloc->bh); > iloc->bh = NULL; > } > + ext4_fc_track_inode(handle, inode); > } > ext4_std_error(inode->i_sb, err); > return err; Calling ext4_fc_track_inode() when ext4_get_write_access() failed is pointless (inode isn't going to be written) and confusing. We should do that only in the success case. Honza
On Wed, May 29, 2024 at 01:19:55AM +0000, Harshad Shirwadkar wrote: > If the inode that's being requested to track using ext4_fc_track_inode > is being committed, then wait until the inode finishes the > commit. Also, add calls to ext4_fc_track_inode at the right places. > > With this patch, now calling ext4_reserve_inode_write() results in > inode being tracked for next fast commit. A subtle lock ordering > requirement with i_data_sem (which is documented in the code) requires > that ext4_fc_track_inode() be called before grabbing i_data_sem. So, > this patch also adds explicit ext4_fc_track_inode() calls in places > where i_data_sem grabbed. > > Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> I tried applying this patchset to both the current ext4/dev branch as well as on to 6.10-rc5, and generic/241 is triggering large series of WARNINGS followed by a BUG (or in some cases, a soft lockup). A bisection leads me to this patch. The WARN stack trace: [ 4.061189] ------------[ cut here ]------------ [ 4.061848] WARNING: CPU: 1 PID: 2627 at fs/ext4/fast_commit.c:259 ext4_fc_del+0x7d/0x190 [ 4.062919] CPU: 1 PID: 2627 Comm: dbench Not tainted 6.10.0-rc5-xfstests-00033-gb6f5b0076b56 #350 [ 4.064070] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 4.065077] RIP: 0010:ext4_fc_del+0x7d/0x190 [ 4.065404] Code: 0f 84 f0 00 00 00 48 8b 83 48 ff ff ff 48 0f ba e0 2a 73 18 48 8b 43 28 48 8b 80 90 03 00 00 48 8b 80 80 00 00 00 a8 02 75 02 <0f> 0b 48 89 ef e8 09 ad 68 00 84 c0 74 0f 48 8b 53 98 48 8b 43 a0 [ 4.066190] RSP: 0018:ffffc90003707d98 EFLAGS: 00010246 [ 4.066415] RAX: 0000000000000001 RBX: ffff888013de5c08 RCX: 0000000000000000 [ 4.066718] RDX: 0000000000000001 RSI: 00000000ffffffff RDI: ffff88800a22c7f0 [ 4.067019] RBP: ffff888013de5ba0 R08: ffffffff827fc6fe R09: ffff888008bed000 [ 4.067323] R10: 0000000000000008 R11: 000000000000001e R12: ffff88800a22c7f0 [ 4.067629] R13: ffff88800a22c000 R14: ffff888013de5b90 R15: ffff88800ac0c000 [ 4.067935] FS: 00007fec79a4e740(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000 [ 4.068281] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 4.068530] CR2: 000055d2f34b87e8 CR3: 000000000f492003 CR4: 0000000000770ef0 [ 4.068834] PKRU: 55555554 [ 4.068952] Call Trace: [ 4.069061] <TASK> [ 4.069158] ? __warn+0x7b/0x120 [ 4.069338] ? ext4_fc_del+0x7d/0x190 [ 4.069497] ? report_bug+0x174/0x1c0 [ 4.069655] ? handle_bug+0x3a/0x70 [ 4.069809] ? exc_invalid_op+0x17/0x70 [ 4.069975] ? asm_exc_invalid_op+0x1a/0x20 [ 4.070156] ? ext4_fc_del+0x7d/0x190 [ 4.070309] ? ext4_fc_del+0x44/0x190 [ 4.070468] ext4_clear_inode+0x12/0xb0 [ 4.070636] ext4_free_inode+0x86/0x5a0 [ 4.070802] ext4_evict_inode+0x457/0x6b0 [ 4.070976] evict+0xcd/0x1d0 [ 4.071114] do_unlinkat+0x2de/0x330 [ 4.071271] __x64_sys_unlink+0x23/0x30 [ 4.071436] do_syscall_64+0x4b/0x110 [ 4.071596] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 4.071812] RIP: 0033:0x7fec79b4aa07 [ 4.071969] Code: f0 ff ff 73 01 c3 48 8b 0d f6 83 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 57 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c9 83 0d 00 f7 d8 64 89 01 48 [ 4.072760] RSP: 002b:00007ffed55de918 EFLAGS: 00000206 ORIG_RAX: 0000000000000057 [ 4.073081] RAX: ffffffffffffffda RBX: 00007ffed55dedb0 RCX: 00007fec79b4aa07 [ 4.073429] RDX: 0000000000000000 RSI: 00007ffed55dedb0 RDI: 00007ffed55dedb0 [ 4.073733] RBP: 000055d2f34a37f0 R08: 0fffffffffffffff R09: 0000000000000000 [ 4.074033] R10: 0000000000000000 R11: 0000000000000206 R12: 000055d2f34a35d0 [ 4.074338] R13: 00007fec79c35000 R14: 0000000000000004 R15: 00007fec79c35050 [ 4.074650] </TASK> [ 4.074747] ---[ end trace 0000000000000000 ]--- And here's the BUG: [ 5.121989] BUG: kernel NULL pointer dereference, address: 00000000000000b8 [ 5.122281] #PF: supervisor read access in kernel mode [ 5.122500] #PF: error_code(0x0000) - not-present page [ 5.122717] PGD 0 P4D 0 [ 5.122828] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI [ 5.123036] CPU: 0 PID: 2629 Comm: dbench Tainted: G W 6.10.0-rc5-xfstests-00033-gb6f5b0076b56 #350 [ 5.123470] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 5.123857] RIP: 0010:ext4_fc_perform_commit+0x303/0x4b0 [ 5.124082] Code: fd 48 2d a0 00 00 00 48 39 d3 75 af e9 ac fe ff ff 49 8b 4d 58 49 8d 45 58 48 39 c1 0f 84 9f 01 00 00 49 8b 45 58 49 63 4d 08 <48> 39 88 b8 00 00 00 4c 8d 78 78 0f 85 7f 01 00 00 48 89 ef e8 c4 [ 5.124855] RSP: 0018:ffffc90003727df8 EFLAGS: 00010246 [ 5.125072] RAX: 0000000000000000 RBX: ffff8880089e5f08 RCX: 00000000089e5f08 [ 5.125416] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88800a22c7f0 [ 5.125712] RBP: ffff88800a22c7f0 R08: ffff88807dc2fbc0 R09: 0000000000000000 [ 5.126009] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88800a22c7c8 [ 5.126310] R13: ffff8880089e5f08 R14: ffff88800a22c7a8 R15: ffff88800a22c708 [ 5.126609] FS: 00007fec79a4e740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000 [ 5.126943] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 5.127188] CR2: 00000000000000b8 CR3: 000000000f48e002 CR4: 0000000000770ef0 [ 5.127483] PKRU: 55555554 [ 5.127598] Call Trace: [ 5.127705] <TASK> [ 5.127838] ? __die+0x23/0x60 [ 5.127973] ? page_fault_oops+0xa3/0x160 [ 5.128143] ? exc_page_fault+0x6a/0x160 [ 5.128351] ? asm_exc_page_fault+0x26/0x30 [ 5.128530] ? ext4_fc_perform_commit+0x303/0x4b0 [ 5.128728] ? ext4_fc_perform_commit+0x36b/0x4b0 [ 5.128948] ext4_fc_commit+0x17f/0x300 [ 5.129116] ext4_sync_file+0x1ce/0x380 [ 5.129310] __x64_sys_fsync+0x3b/0x70 [ 5.129470] do_syscall_64+0x4b/0x110 [ 5.129627] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 5.129840] RIP: 0033:0x7fec79b4fb10 [ 5.129995] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 80 3d d1 ba 0d 00 00 74 17 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c [ 5.130764] RSP: 002b:00007ffed55de948 EFLAGS: 00000202 ORIG_RAX: 000000000000004a [ 5.131079] RAX: ffffffffffffffda RBX: 00007fec79c35450 RCX: 00007fec79b4fb10 [ 5.131381] RDX: 0000000000002b6f RSI: 0000000000002b6f RDI: 0000000000000005 [ 5.131681] RBP: 000055d2f34a3660 R08: 1999999999999999 R09: 0000000000000000 [ 5.131981] R10: 00007fec79bcdac0 R11: 0000000000000202 R12: 000055d2f34a35d0 [ 5.132280] R13: 00007fec79c35450 R14: 0000000000000003 R15: 00007fec79c354a0 [ 5.132575] </TASK> [ 5.132670] CR2: 00000000000000b8 [ 5.132812] ---[ end trace 0000000000000000 ]--- Harshad, can you take a look? Thanks! - Ted
Thank you for the review, Jan and the bug report Ted. I think I have found the issue here. In fast commit code, we are using s_fc_lock to protect additions and deletions to lists s_fc_q and s_fc_dentry_q. However, since s_fc_lock is a spin_lock, we have to give it up in many places where we do things like memory allocations / deletions, since those cannot be performed in an atomic context. I am realizing that unlocking that lock in the middle of the list traversal to allow it to perform these non-atomic tasks is opening a door to a lot of subtle concurrency bugs. Unlocking this lock in the middle of the traversal leaves the door open to ext4_fc_del, which may just remove an entry from the list, and leave the initial traversal in a broken state. So as an immediate remedy, I am thinking that we can convert s_fc_lock to mutex so that the commit code doesn't have to leave the lock in the middle of the loop. In the long run, we need to revisit the whole staging queue / main queue design. I think we can just simplify that such that there is really only one queue. Commit code just copies the queue in a local variable and initializes sbi->s_fc_q to empty. That would get rid of all the messy "if commit ongoing insert in staging otherwise insert in main" conditions and also simplify the overall code. I'll make the change to convert to mutex and handle all other Jan's comments (thanks Jan for the detailed feedback on other patches) to submit V7 of this patch series. Thank you, Harshad On Mon, Jul 1, 2024 at 3:08 PM Theodore Ts'o <tytso@mit.edu> wrote: > > On Wed, May 29, 2024 at 01:19:55AM +0000, Harshad Shirwadkar wrote: > > If the inode that's being requested to track using ext4_fc_track_inode > > is being committed, then wait until the inode finishes the > > commit. Also, add calls to ext4_fc_track_inode at the right places. > > > > With this patch, now calling ext4_reserve_inode_write() results in > > inode being tracked for next fast commit. A subtle lock ordering > > requirement with i_data_sem (which is documented in the code) requires > > that ext4_fc_track_inode() be called before grabbing i_data_sem. So, > > this patch also adds explicit ext4_fc_track_inode() calls in places > > where i_data_sem grabbed. > > > > Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> > > I tried applying this patchset to both the current ext4/dev branch as > well as on to 6.10-rc5, and generic/241 is triggering large series of > WARNINGS followed by a BUG (or in some cases, a soft lockup). A > bisection leads me to this patch. > > The WARN stack trace: > > [ 4.061189] ------------[ cut here ]------------ > [ 4.061848] WARNING: CPU: 1 PID: 2627 at fs/ext4/fast_commit.c:259 ext4_fc_del+0x7d/0x190 > [ 4.062919] CPU: 1 PID: 2627 Comm: dbench Not tainted 6.10.0-rc5-xfstests-00033-gb6f5b0076b56 #350 > [ 4.064070] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 > [ 4.065077] RIP: 0010:ext4_fc_del+0x7d/0x190 > [ 4.065404] Code: 0f 84 f0 00 00 00 48 8b 83 48 ff ff ff 48 0f ba e0 2a 73 18 48 8b 43 28 48 8b 80 90 03 00 00 48 8b 80 80 00 00 00 a8 02 75 02 <0f> 0b 48 89 ef e8 09 ad 68 00 84 c0 74 0f 48 8b 53 98 48 8b 43 a0 > [ 4.066190] RSP: 0018:ffffc90003707d98 EFLAGS: 00010246 > [ 4.066415] RAX: 0000000000000001 RBX: ffff888013de5c08 RCX: 0000000000000000 > [ 4.066718] RDX: 0000000000000001 RSI: 00000000ffffffff RDI: ffff88800a22c7f0 > [ 4.067019] RBP: ffff888013de5ba0 R08: ffffffff827fc6fe R09: ffff888008bed000 > [ 4.067323] R10: 0000000000000008 R11: 000000000000001e R12: ffff88800a22c7f0 > [ 4.067629] R13: ffff88800a22c000 R14: ffff888013de5b90 R15: ffff88800ac0c000 > [ 4.067935] FS: 00007fec79a4e740(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000 > [ 4.068281] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 4.068530] CR2: 000055d2f34b87e8 CR3: 000000000f492003 CR4: 0000000000770ef0 > [ 4.068834] PKRU: 55555554 > [ 4.068952] Call Trace: > [ 4.069061] <TASK> > [ 4.069158] ? __warn+0x7b/0x120 > [ 4.069338] ? ext4_fc_del+0x7d/0x190 > [ 4.069497] ? report_bug+0x174/0x1c0 > [ 4.069655] ? handle_bug+0x3a/0x70 > [ 4.069809] ? exc_invalid_op+0x17/0x70 > [ 4.069975] ? asm_exc_invalid_op+0x1a/0x20 > [ 4.070156] ? ext4_fc_del+0x7d/0x190 > [ 4.070309] ? ext4_fc_del+0x44/0x190 > [ 4.070468] ext4_clear_inode+0x12/0xb0 > [ 4.070636] ext4_free_inode+0x86/0x5a0 > [ 4.070802] ext4_evict_inode+0x457/0x6b0 > [ 4.070976] evict+0xcd/0x1d0 > [ 4.071114] do_unlinkat+0x2de/0x330 > [ 4.071271] __x64_sys_unlink+0x23/0x30 > [ 4.071436] do_syscall_64+0x4b/0x110 > [ 4.071596] entry_SYSCALL_64_after_hwframe+0x76/0x7e > [ 4.071812] RIP: 0033:0x7fec79b4aa07 > [ 4.071969] Code: f0 ff ff 73 01 c3 48 8b 0d f6 83 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 57 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c9 83 0d 00 f7 d8 64 89 01 48 > [ 4.072760] RSP: 002b:00007ffed55de918 EFLAGS: 00000206 ORIG_RAX: 0000000000000057 > [ 4.073081] RAX: ffffffffffffffda RBX: 00007ffed55dedb0 RCX: 00007fec79b4aa07 > [ 4.073429] RDX: 0000000000000000 RSI: 00007ffed55dedb0 RDI: 00007ffed55dedb0 > [ 4.073733] RBP: 000055d2f34a37f0 R08: 0fffffffffffffff R09: 0000000000000000 > [ 4.074033] R10: 0000000000000000 R11: 0000000000000206 R12: 000055d2f34a35d0 > [ 4.074338] R13: 00007fec79c35000 R14: 0000000000000004 R15: 00007fec79c35050 > [ 4.074650] </TASK> > [ 4.074747] ---[ end trace 0000000000000000 ]--- > > And here's the BUG: > > [ 5.121989] BUG: kernel NULL pointer dereference, address: 00000000000000b8 > [ 5.122281] #PF: supervisor read access in kernel mode > [ 5.122500] #PF: error_code(0x0000) - not-present page > [ 5.122717] PGD 0 P4D 0 > [ 5.122828] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI > [ 5.123036] CPU: 0 PID: 2629 Comm: dbench Tainted: G W 6.10.0-rc5-xfstests-00033-gb6f5b0076b56 #350 > [ 5.123470] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 > [ 5.123857] RIP: 0010:ext4_fc_perform_commit+0x303/0x4b0 > [ 5.124082] Code: fd 48 2d a0 00 00 00 48 39 d3 75 af e9 ac fe ff ff 49 8b 4d 58 49 8d 45 58 48 39 c1 0f 84 9f 01 00 00 49 8b 45 58 49 63 4d 08 <48> 39 88 b8 00 00 00 4c 8d 78 78 0f 85 7f 01 00 00 48 89 ef e8 c4 > [ 5.124855] RSP: 0018:ffffc90003727df8 EFLAGS: 00010246 > [ 5.125072] RAX: 0000000000000000 RBX: ffff8880089e5f08 RCX: 00000000089e5f08 > [ 5.125416] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88800a22c7f0 > [ 5.125712] RBP: ffff88800a22c7f0 R08: ffff88807dc2fbc0 R09: 0000000000000000 > [ 5.126009] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88800a22c7c8 > [ 5.126310] R13: ffff8880089e5f08 R14: ffff88800a22c7a8 R15: ffff88800a22c708 > [ 5.126609] FS: 00007fec79a4e740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000 > [ 5.126943] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 5.127188] CR2: 00000000000000b8 CR3: 000000000f48e002 CR4: 0000000000770ef0 > [ 5.127483] PKRU: 55555554 > [ 5.127598] Call Trace: > [ 5.127705] <TASK> > [ 5.127838] ? __die+0x23/0x60 > [ 5.127973] ? page_fault_oops+0xa3/0x160 > [ 5.128143] ? exc_page_fault+0x6a/0x160 > [ 5.128351] ? asm_exc_page_fault+0x26/0x30 > [ 5.128530] ? ext4_fc_perform_commit+0x303/0x4b0 > [ 5.128728] ? ext4_fc_perform_commit+0x36b/0x4b0 > [ 5.128948] ext4_fc_commit+0x17f/0x300 > [ 5.129116] ext4_sync_file+0x1ce/0x380 > [ 5.129310] __x64_sys_fsync+0x3b/0x70 > [ 5.129470] do_syscall_64+0x4b/0x110 > [ 5.129627] entry_SYSCALL_64_after_hwframe+0x76/0x7e > [ 5.129840] RIP: 0033:0x7fec79b4fb10 > [ 5.129995] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 80 3d d1 ba 0d 00 00 74 17 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c > [ 5.130764] RSP: 002b:00007ffed55de948 EFLAGS: 00000202 ORIG_RAX: 000000000000004a > [ 5.131079] RAX: ffffffffffffffda RBX: 00007fec79c35450 RCX: 00007fec79b4fb10 > [ 5.131381] RDX: 0000000000002b6f RSI: 0000000000002b6f RDI: 0000000000000005 > [ 5.131681] RBP: 000055d2f34a3660 R08: 1999999999999999 R09: 0000000000000000 > [ 5.131981] R10: 00007fec79bcdac0 R11: 0000000000000202 R12: 000055d2f34a35d0 > [ 5.132280] R13: 00007fec79c35450 R14: 0000000000000003 R15: 00007fec79c354a0 > [ 5.132575] </TASK> > [ 5.132670] CR2: 00000000000000b8 > [ 5.132812] ---[ end trace 0000000000000000 ]--- > > Harshad, can you take a look? Thanks! > > - Ted >
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c index a1aadebfcd66..fa93ce399440 100644 --- a/fs/ext4/fast_commit.c +++ b/fs/ext4/fast_commit.c @@ -581,6 +581,8 @@ static int __track_inode(struct inode *inode, void *arg, bool update) void ext4_fc_track_inode(handle_t *handle, struct inode *inode) { + struct ext4_inode_info *ei = EXT4_I(inode); + wait_queue_head_t *wq; int ret; if (S_ISDIR(inode->i_mode)) @@ -598,6 +600,38 @@ void ext4_fc_track_inode(handle_t *handle, struct inode *inode) if (ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE)) return; + if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT) || + (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) || + ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE) || + !list_empty(&ei->i_fc_list)) + return; + + /* + * If we come here, we may sleep while waiting for the inode to + * commit. We shouldn't be holding i_data_sem in write mode when we go + * to sleep since the commit path needs to grab the lock while + * committing the inode. + */ + WARN_ON(lockdep_is_held_type(&ei->i_data_sem, 1)); + + while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) { +#if (BITS_PER_LONG < 64) + DEFINE_WAIT_BIT(wait, &ei->i_state_flags, + EXT4_STATE_FC_COMMITTING); + wq = bit_waitqueue(&ei->i_state_flags, + EXT4_STATE_FC_COMMITTING); +#else + DEFINE_WAIT_BIT(wait, &ei->i_flags, + EXT4_STATE_FC_COMMITTING); + wq = bit_waitqueue(&ei->i_flags, + EXT4_STATE_FC_COMMITTING); +#endif + prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE); + if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) + schedule(); + finish_wait(wq, &wait.wq_entry); + } + ret = ext4_fc_track_template(handle, inode, __track_inode, NULL, 1); trace_ext4_fc_track_inode(handle, inode, ret); } diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c index d5bd1e3a5d36..bd5778e680c0 100644 --- a/fs/ext4/inline.c +++ b/fs/ext4/inline.c @@ -596,6 +596,7 @@ static int ext4_convert_inline_data_to_extent(struct address_space *mapping, goto out; } + ext4_fc_track_inode(handle, inode); ret = ext4_destroy_inline_data_nolock(handle, inode); if (ret) goto out; @@ -696,6 +697,7 @@ int ext4_try_to_write_inline_data(struct address_space *mapping, goto convert; } + ext4_fc_track_inode(handle, inode); ret = ext4_journal_get_write_access(handle, inode->i_sb, iloc.bh, EXT4_JTR_NONE); if (ret) @@ -948,6 +950,7 @@ int ext4_da_write_inline_data_begin(struct address_space *mapping, if (ret < 0) goto out_release_page; } + ext4_fc_track_inode(handle, inode); ret = ext4_journal_get_write_access(handle, inode->i_sb, iloc.bh, EXT4_JTR_NONE); if (ret) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 4bae9ccf5fe0..aa6440992a55 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -607,6 +607,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode, */ map->m_flags &= ~EXT4_MAP_FLAGS; + ext4_fc_track_inode(handle, inode); /* * New blocks allocate and/or writing to unwritten extent * will possibly result in updating i_data, so we take @@ -3978,6 +3979,7 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) if (stop_block > first_block) { ext4_lblk_t hole_len = stop_block - first_block; + ext4_fc_track_inode(handle, inode); down_write(&EXT4_I(inode)->i_data_sem); ext4_discard_preallocations(inode); @@ -4131,6 +4133,7 @@ int ext4_truncate(struct inode *inode) if (err) goto out_stop; + ext4_fc_track_inode(handle, inode); down_write(&EXT4_I(inode)->i_data_sem); ext4_discard_preallocations(inode); @@ -5727,6 +5730,7 @@ ext4_reserve_inode_write(handle_t *handle, struct inode *inode, brelse(iloc->bh); iloc->bh = NULL; } + ext4_fc_track_inode(handle, inode); } ext4_std_error(inode->i_sb, err); return err;
If the inode that's being requested to track using ext4_fc_track_inode is being committed, then wait until the inode finishes the commit. Also, add calls to ext4_fc_track_inode at the right places. With this patch, now calling ext4_reserve_inode_write() results in inode being tracked for next fast commit. A subtle lock ordering requirement with i_data_sem (which is documented in the code) requires that ext4_fc_track_inode() be called before grabbing i_data_sem. So, this patch also adds explicit ext4_fc_track_inode() calls in places where i_data_sem grabbed. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> --- fs/ext4/fast_commit.c | 34 ++++++++++++++++++++++++++++++++++ fs/ext4/inline.c | 3 +++ fs/ext4/inode.c | 4 ++++ 3 files changed, 41 insertions(+)