Message ID | 20240823082237.713543-1-zhaoyang.huang@unisoc.com |
---|---|
State | Rejected |
Series | [RFC,PATCHv2,1/1] fs: ext4: Don't use CMA for buffer_head |
On Fri, Aug 23, 2024 at 4:24 PM zhaoyang.huang <zhaoyang.huang@unisoc.com> wrote:
>
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>
> cma_alloc() keep failed in our system which thanks to a jh->bh->b_page
> can not be migrated out of CMA area[1] as the jh has one cp_transaction
> pending on it because of j_free > j_max_transaction_buffers[2][3][4][5][6].
> We temporarily solve this by launching jbd2_log_do_checkpoint forcefully
> somewhere. Since journal is common mechanism to all JFSs and
> cp_transaction has a little fewer opportunity to be launched, the
> cma_alloc() could be affected under the same scenario. This patch
> would like to have buffer_head of ext4 not use CMA pages when doing
> sb_getblk.
>
> [1]
> crash_arm64_v8.0.4++> kmem -p|grep ffffff808f0aa150(sb->s_bdev->bd_inode->i_mapping)
> fffffffe01a51c00  e9470000 ffffff808f0aa150   3  2 8000000008020 lru,private //within CMA area
> fffffffe03d189c0 174627000 ffffff808f0aa150   4  2 2004000000008020 lru,private
> fffffffe03d88e00 176238000 ffffff808f0aa150 3f9  2 2008000000008020 lru,private
> fffffffe03d88e40 176239000 ffffff808f0aa150   6  2 2008000000008020 lru,private
> fffffffe03d88e80 17623a000 ffffff808f0aa150   5  2 2008000000008020 lru,private
> fffffffe03d88ec0 17623b000 ffffff808f0aa150   1  2 2008000000008020 lru,private
> fffffffe03d88f00 17623c000 ffffff808f0aa150   0  2 2008000000008020 lru,private
> fffffffe040e6540 183995000 ffffff808f0aa150 3f4  2 2004000000008020 lru,private
>
> [2] page -> buffer_head
> crash_arm64_v8.0.4++> struct page.private fffffffe01a51c00 -x
>   private = 0xffffff802fca0c00
>
> [3] buffer_head -> journal_head
> crash_arm64_v8.0.4++> struct buffer_head.b_private 0xffffff802fca0c00
>   b_private = 0xffffff8041338e10,
>
> [4] journal_head -> b_cp_transaction
> crash_arm64_v8.0.4++> struct journal_head.b_cp_transaction 0xffffff8041338e10 -x
>   b_cp_transaction = 0xffffff80410f1900,
>
> [5] transaction_t -> journal
> crash_arm64_v8.0.4++> struct transaction_t.t_journal 0xffffff80410f1900 -x
>   t_journal = 0xffffff80e70f3000,
>
> [6] j_free & j_max_transaction_buffers
> crash_arm64_v8.0.4++> struct journal_t.j_free,j_max_transaction_buffers 0xffffff80e70f3000 -x
>   j_free = 0x3f1,
>   j_max_transaction_buffers = 0x100,
>
> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> ---
> v2: update commit message
> ---
> ---
>  fs/ext4/inode.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 941c1c0d5c6e..4422246851fe 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -869,7 +869,11 @@ struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,
>  	if (nowait)
>  		return sb_find_get_block(inode->i_sb, map.m_pblk);
>
> +#ifndef CONFIG_CMA
>  	bh = sb_getblk(inode->i_sb, map.m_pblk);
> +#else
> +	bh = sb_getblk_gfp(inode->i_sb, map.m_pblk, 0);
> +#endif
>  	if (unlikely(!bh))
>  		return ERR_PTR(-ENOMEM);
>  	if (map.m_flags & EXT4_MAP_NEW) {
> --
> 2.25.1
>

A hint for these, from my understanding, other FSs such as f2fs and erofs
have had the meta data avoid using the CMA area by using GFP_NOFS/GFP_KERNEL
on the meta inode.

struct inode *f2fs_iget(struct super_block *sb, unsigned long ino)
{
...
	if (ino == F2FS_NODE_INO(sbi)) {
		inode->i_mapping->a_ops = &f2fs_node_aops;
		mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
	} else if (ino == F2FS_META_INO(sbi)) {
		inode->i_mapping->a_ops = &f2fs_meta_aops;
		mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);

int erofs_init_managed_cache(struct super_block *sb)
{
	struct inode *const inode = new_inode(sb);

	if (!inode)
		return -ENOMEM;

	set_nlink(inode, 1);
	inode->i_size = OFFSET_MAX;
	inode->i_mapping->a_ops = &z_erofs_cache_aops;
	mapping_set_gfp_mask(inode->i_mapping, GFP_KERNEL);
On Fri, Aug 23, 2024 at 04:22:37PM +0800, zhaoyang.huang wrote:
>
> +#ifndef CONFIG_CMA
>  bh = sb_getblk(inode->i_sb, map.m_pblk);
> +#else
> + bh = sb_getblk_gfp(inode->i_sb, map.m_pblk, 0);
> +#endif

So all of these patches to try to work around your issue with CMA are a
bit ugly. But passing in a GFP mask of zero is definitely not the right
way to go about things, since there might be certain GFP masks that are
required by a particular block device. What I think you are trying to do
is to avoid setting the __GFP_MOVEABLE flag. So in that case, in the CMA
path something like this is what you want:

	bh = getblk_unmoveable(sb->s_bdev, map.m_pblk, sb->s_blocksize);

I'd also suggest only trying to use this if the file system has journaling
enabled. If the file system is an ext2 file system without a journal,
there's no reason to avoid using the CMA region --- and I assume the
reason why the buffer cache is trying to use the moveable flag is because
the amount of non-CMA memory might be a precious resource in some systems.

					- Ted
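To make the suggestion above concrete, a rough sketch of the ext4_getblk()
call site follows. This is only an illustration drawn from the thread, not a
posted patch: the helper is spelled getblk_unmovable() in
include/linux/buffer_head.h, and gating on EXT4_SB(inode->i_sb)->s_journal is
an assumption about how "only when journaling is enabled" might be expressed.

	/*
	 * Sketch only: keep metadata buffers out of CMA-eligible memory when
	 * a journal is present (jbd2 may pin them via journal heads);
	 * otherwise keep the default movable allocation.
	 */
	if (EXT4_SB(inode->i_sb)->s_journal)
		bh = getblk_unmovable(inode->i_sb->s_bdev, map.m_pblk,
				      inode->i_sb->s_blocksize);
	else
		bh = sb_getblk(inode->i_sb, map.m_pblk);
	if (unlikely(!bh))
		return ERR_PTR(-ENOMEM);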
On Tue, Sep 3, 2024 at 10:29 AM Theodore Ts'o <tytso@mit.edu> wrote: > > On Fri, Aug 23, 2024 at 04:22:37PM +0800, zhaoyang.huang wrote: > > > > +#ifndef CONFIG_CMA > > bh = sb_getblk(inode->i_sb, map.m_pblk); > > +#else > > + bh = sb_getblk_gfp(inode->i_sb, map.m_pblk, 0); > > +#endif > > So all of these patches to try to work around your issue with CMA are > a bit ugly. But passing in a GFP mask of zero is definitely not the > right way to go about thing, since there might be certain GFP masks > that are required by a particular block device. What I think you are > trying to do is to avoid setting the __GFP_MOVEABLE flag. So in that > case, in the CMA path something like this is what you want: > > bh = getblk_unmoveable(sb->s_bdev, map.m_pblk, sb->s_blocksize); > > I'd also sugest only trying to use this is the file system has > journaling enabled. If the file system is an ext2 file system without > a journal, there's no reason avoid using the CMA region agree. > assume the reason why the buffer cache is trying to use the moveable > flag is because the amount of non-CMA memory might be a precious > resource in some systems. I don't think so. All migrate type page blocks possess the same position as each other as they could fallback to all migrate types when current fails. I guess the purpose could be to enlarge the scope of available memory as __GFP_MOVEABLE has the capability of recruiting CMA. > > - Ted
On Tue, Sep 03, 2024 at 04:50:46PM +0800, Zhaoyang Huang wrote: > > I'd also sugest only trying to use this is the file system has > > journaling enabled. If the file system is an ext2 file system without > > a journal, there's no reason avoid using the CMA region > agree. > > assume the reason why the buffer cache is trying to use the moveable > > flag is because the amount of non-CMA memory might be a precious > > resource in some systems. > > I don't think so. All migrate type page blocks possess the same > position as each other as they could fallback to all migrate types > when current fails. I guess the purpose could be to enlarge the scope > of available memory as __GFP_MOVEABLE has the capability of recruiting > CMA. Well, I guess I'm a bit confused why the buffer cache is trying to use __GFP_MOVEABLE in the first place. In general CMA is to allow systems to be able to allocate big chunks of memory which have to be physically contiguous because the I/O device(s) are too primitive to be able to do scatter-gather, right? So why are we trying to use CMA eligible memory for 4k buffer cache pages? Presumably, because there's not enough non-CMA eligible memory? After all, using GFP_MOVEABLE memory seems to mean that the buffer cache might get thrashed a lot by having a lot of cached disk buffers getting ejected from memory to try to make room for some contiguous frame buffer memory, which means extra I/O overhead. So what's the upside of using GFP_MOVEABLE for the buffer cache? Just curious, because in general I'm blessed by not having to use CMA in the first place (not having I/O devices too primitive so they can't do scatter-gather :-). So I don't tend to use CMA, and obviously I'm missing some of the design considerations behind CMA. I thought in general CMA tends to used in early boot to allocate things like frame buffers, and after that CMA doesn't tend to get used at all? That's clearly not the case for you, apparently? Cheers, - Ted
On Tue, Sep 3, 2024 at 8:08 PM Theodore Ts'o <tytso@mit.edu> wrote: > > On Tue, Sep 03, 2024 at 04:50:46PM +0800, Zhaoyang Huang wrote: > > > I'd also sugest only trying to use this is the file system has > > > journaling enabled. If the file system is an ext2 file system without > > > a journal, there's no reason avoid using the CMA region > > agree. > > > assume the reason why the buffer cache is trying to use the moveable > > > flag is because the amount of non-CMA memory might be a precious > > > resource in some systems. > > > > I don't think so. All migrate type page blocks possess the same > > position as each other as they could fallback to all migrate types > > when current fails. I guess the purpose could be to enlarge the scope > > of available memory as __GFP_MOVEABLE has the capability of recruiting > > CMA. > > Well, I guess I'm a bit confused why the buffer cache is trying to use > __GFP_MOVEABLE in the first place. In general CMA is to allow systems > to be able to allocate big chunks of memory which have to be > physically contiguous because the I/O device(s) are too primitive to > be able to do scatter-gather, right? So why are we trying to use CMA > eligible memory for 4k buffer cache pages? Presumably, because > there's not enough non-CMA eligible memory? I suppose maybe you missed the way of how CMA work as the second client as the special fallback of MIGRATE_MOVABLE during normal alloc_pages.(cma_alloc is the first client of CMA area) //MIGRATE_MOVABLE->ALLOC_CMA gfp_to_alloc_flags_cma { #ifdef CONFIG_CMA if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE) alloc_flags |= ALLOC_CMA; #endif return alloc_flags; } //ALLOC_CMA->__rmqueue_cma_fallback __rmqueue(struct zone *zone, unsigned int order, int migratetype, unsigned int alloc_flags) { struct page *page; if (IS_ENABLED(CONFIG_CMA)) { if (alloc_flags & ALLOC_CMA && zone_page_state(zone, NR_FREE_CMA_PAGES) > zone_page_state(zone, NR_FREE_PAGES) / 2) { page = __rmqueue_cma_fallback(zone, order); if (page) return page; } } page = __rmqueue_smallest(zone, order, migratetype); ... } > > After all, using GFP_MOVEABLE memory seems to mean that the buffer > cache might get thrashed a lot by having a lot of cached disk buffers > getting ejected from memory to try to make room for some contiguous > frame buffer memory, which means extra I/O overhead. So what's the > upside of using GFP_MOVEABLE for the buffer cache? To my understanding, NO. using GFP_MOVEABLE memory doesn't introduce extra IO as they just be migrated to free pages instead of ejected directly when they are the target memory area. In terms of reclaiming, all migrate types of page blocks possess the same position. > > Just curious, because in general I'm blessed by not having to use CMA > in the first place (not having I/O devices too primitive so they can't > do scatter-gather :-). So I don't tend to use CMA, and obviously I'm > missing some of the design considerations behind CMA. I thought in > general CMA tends to used in early boot to allocate things like frame > buffers, and after that CMA doesn't tend to get used at all? That's > clearly not the case for you, apparently? Yes. CMA is designed for contiguous physical memory and has been used via cma_alloc during the whole lifetime especially on the system without SMMU, such as DRM driver. In terms of MIGRATE_MOVABLE page blocks, they also could have compaction path retry for many times which is common during high-order alloc_pages. > > Cheers, > > - Ted
On Wed, Sep 04, 2024 at 08:49:10AM +0800, Zhaoyang Huang wrote: > > > > After all, using GFP_MOVEABLE memory seems to mean that the buffer > > cache might get thrashed a lot by having a lot of cached disk buffers > > getting ejected from memory to try to make room for some contiguous > > frame buffer memory, which means extra I/O overhead. So what's the > > upside of using GFP_MOVEABLE for the buffer cache? > > To my understanding, NO. using GFP_MOVEABLE memory doesn't introduce > extra IO as they just be migrated to free pages instead of ejected > directly when they are the target memory area. In terms of reclaiming, > all migrate types of page blocks possess the same position. Where is that being done? I don't see any evidence of this kind of migration in fs/buffer.c. It's *possile* I suppose, but you'd have to remove the buffer_head so it can't be found by getblk(), and then wait for bh->b_count to go to zero, and then allocate a new page, and then copy buffer_head's page, update the buffer_head, and then rechain the bh into the buffer cache. And as I said, I can't see any kind of code like that. It would be much simpler to just try to eject the bh from the buffer cache. And that's consistent which what you've observed, which is that if the buffer_head is prevented from being ejected because it's held by the jbd2 layer until the buffer has been checkpointed. > > Just curious, because in general I'm blessed by not having to use CMA > > in the first place (not having I/O devices too primitive so they can't > > do scatter-gather :-). So I don't tend to use CMA, and obviously I'm > > missing some of the design considerations behind CMA. I thought in > > general CMA tends to used in early boot to allocate things like frame > > buffers, and after that CMA doesn't tend to get used at all? That's > > clearly not the case for you, apparently? > > Yes. CMA is designed for contiguous physical memory and has been used > via cma_alloc during the whole lifetime especially on the system > without SMMU, such as DRM driver. In terms of MIGRATE_MOVABLE page > blocks, they also could have compaction path retry for many times > which is common during high-order alloc_pages. But then what's the point of using CMA-eligible memory for the buffer cache, as opposed to just always using !__GFP_MOVEABLE for all buffer cache allocations? After all, that's what is being proposed for ext4's ext4_getblk(). What's the downside of avoiding the use of CMA-eligible memory for ext4's buffer cache? Why not do this for *all* buffers in the buffer cache? - Ted
On Wed, Sep 4, 2024 at 10:44 AM Theodore Ts'o <tytso@mit.edu> wrote: > > On Wed, Sep 04, 2024 at 08:49:10AM +0800, Zhaoyang Huang wrote: > > > > > > After all, using GFP_MOVEABLE memory seems to mean that the buffer > > > cache might get thrashed a lot by having a lot of cached disk buffers > > > getting ejected from memory to try to make room for some contiguous > > > frame buffer memory, which means extra I/O overhead. So what's the > > > upside of using GFP_MOVEABLE for the buffer cache? > > > > To my understanding, NO. using GFP_MOVEABLE memory doesn't introduce > > extra IO as they just be migrated to free pages instead of ejected > > directly when they are the target memory area. In terms of reclaiming, > > all migrate types of page blocks possess the same position. > > Where is that being done? I don't see any evidence of this kind of > migration in fs/buffer.c. The journaled pages which carry jh->bh are treated as file pages during isolation of a range of PFNs in the callstack below[1]. The bh will be migrated via each aops's migrate_folio and performs what you described below such as copy the content and reattach the bh to a new page. In terms of the journal enabled ext4 partition, the inode is a blockdev inode which applies buffer_migrate_folio_norefs as its migrate_folio[2]. [1] cma_alloc/alloc_contig_range __alloc_contig_migrate_range migrate_pages migrate_folio_move move_to_new_folio mapping->aops->migrate_folio(buffer_migrate_folio_norefs->__buffer_migrate_folio) [2] static int __buffer_migrate_folio(struct address_space *mapping, struct folio *dst, struct folio *src, enum migrate_mode mode, bool check_refs) { ... if (check_refs) { bool busy; bool invalidated = false; recheck_buffers: busy = false; spin_lock(&mapping->i_private_lock); bh = head; do { if (atomic_read(&bh->b_count)) { //My case failed here as bh is referred by a journal head. busy = true; break; } bh = bh->b_this_page; } while (bh != head); > > It's *possile* I suppose, but you'd have to remove the buffer_head so > it can't be found by getblk(), and then wait for bh->b_count to go to > zero, and then allocate a new page, and then copy buffer_head's page, > update the buffer_head, and then rechain the bh into the buffer cache. > And as I said, I can't see any kind of code like that. It would be > much simpler to just try to eject the bh from the buffer cache. And > that's consistent which what you've observed, which is that if the > buffer_head is prevented from being ejected because it's held by the > jbd2 layer until the buffer has been checkpointed. All of above is right except the buffer_head is going to be reattached to a new page instead of being ejected as it still point to checkpoint data. > > > > Just curious, because in general I'm blessed by not having to use CMA > > > in the first place (not having I/O devices too primitive so they can't > > > do scatter-gather :-). So I don't tend to use CMA, and obviously I'm > > > missing some of the design considerations behind CMA. I thought in > > > general CMA tends to used in early boot to allocate things like frame > > > buffers, and after that CMA doesn't tend to get used at all? That's > > > clearly not the case for you, apparently? > > > > Yes. CMA is designed for contiguous physical memory and has been used > > via cma_alloc during the whole lifetime especially on the system > > without SMMU, such as DRM driver. 
In terms of MIGRATE_MOVABLE page > > blocks, they also could have compaction path retry for many times > > which is common during high-order alloc_pages. > > But then what's the point of using CMA-eligible memory for the buffer > cache, as opposed to just always using !__GFP_MOVEABLE for all buffer > cache allocations? After all, that's what is being proposed for > ext4's ext4_getblk(). What's the downside of avoiding the use of > CMA-eligible memory for ext4's buffer cache? Why not do this for > *all* buffers in the buffer cache? Since migration which arised from alloc_pages or cma_alloc always happens, we need appropriate users over MOVABLE pages. AFAIU, buffer cache pages under regular files are the best candidate for migration as we just need to modify page cache and PTE. Actually, all FSs apply GFP_MOVABLE on their regular files via the below functions. new_inode alloc_inode inode_init_always(struct super_block *sb, struct inode *inode) { ... mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE); static int filemap_create_folio(struct file *file, struct address_space *mapping, pgoff_t index, struct folio_batch *fbatch) { struct folio *folio; int error; folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0); > > - Ted
On Wed 04-09-24 14:56:29, Zhaoyang Huang wrote: > On Wed, Sep 4, 2024 at 10:44 AM Theodore Ts'o <tytso@mit.edu> wrote: > > On Wed, Sep 04, 2024 at 08:49:10AM +0800, Zhaoyang Huang wrote: > > > > > > > > After all, using GFP_MOVEABLE memory seems to mean that the buffer > > > > cache might get thrashed a lot by having a lot of cached disk buffers > > > > getting ejected from memory to try to make room for some contiguous > > > > frame buffer memory, which means extra I/O overhead. So what's the > > > > upside of using GFP_MOVEABLE for the buffer cache? > > > > > > To my understanding, NO. using GFP_MOVEABLE memory doesn't introduce > > > extra IO as they just be migrated to free pages instead of ejected > > > directly when they are the target memory area. In terms of reclaiming, > > > all migrate types of page blocks possess the same position. > > > > Where is that being done? I don't see any evidence of this kind of > > migration in fs/buffer.c. > The journaled pages which carry jh->bh are treated as file pages > during isolation of a range of PFNs in the callstack below[1]. The bh > will be migrated via each aops's migrate_folio and performs what you > described below such as copy the content and reattach the bh to a new > page. In terms of the journal enabled ext4 partition, the inode is a > blockdev inode which applies buffer_migrate_folio_norefs as its > migrate_folio[2]. > > [1] > cma_alloc/alloc_contig_range > __alloc_contig_migrate_range > migrate_pages > migrate_folio_move > move_to_new_folio > > mapping->aops->migrate_folio(buffer_migrate_folio_norefs->__buffer_migrate_folio) > > [2] > static int __buffer_migrate_folio(struct address_space *mapping, > struct folio *dst, struct folio *src, enum migrate_mode mode, > bool check_refs) > { > ... > if (check_refs) { > bool busy; > bool invalidated = false; > > recheck_buffers: > busy = false; > spin_lock(&mapping->i_private_lock); > bh = head; > do { > if (atomic_read(&bh->b_count)) { > //My case failed here as bh is referred by a journal head. > busy = true; > break; > } > bh = bh->b_this_page; > } while (bh != head); Correct. Currently pages with journal heads attached cannot be migrated mostly out of pure caution that the generic code isn't sure what's happening with them. As I wrote in [1] we could make pages with jhs on checkpoint list only migratable as for them the buffer lock is enough to stop anybody from touching the bh data. Bhs which are part of a running / committing transaction are not realistically migratable but then these states are more shortlived so it shouldn't be a big problem. > > > > Just curious, because in general I'm blessed by not having to use CMA > > > > in the first place (not having I/O devices too primitive so they can't > > > > do scatter-gather :-). So I don't tend to use CMA, and obviously I'm > > > > missing some of the design considerations behind CMA. I thought in > > > > general CMA tends to used in early boot to allocate things like frame > > > > buffers, and after that CMA doesn't tend to get used at all? That's > > > > clearly not the case for you, apparently? > > > > > > Yes. CMA is designed for contiguous physical memory and has been used > > > via cma_alloc during the whole lifetime especially on the system > > > without SMMU, such as DRM driver. In terms of MIGRATE_MOVABLE page > > > blocks, they also could have compaction path retry for many times > > > which is common during high-order alloc_pages. 
> > > > But then what's the point of using CMA-eligible memory for the buffer > > cache, as opposed to just always using !__GFP_MOVEABLE for all buffer > > cache allocations? After all, that's what is being proposed for > > ext4's ext4_getblk(). What's the downside of avoiding the use of > > CMA-eligible memory for ext4's buffer cache? Why not do this for > > *all* buffers in the buffer cache? > Since migration which arised from alloc_pages or cma_alloc always > happens, we need appropriate users over MOVABLE pages. AFAIU, buffer > cache pages under regular files are the best candidate for migration > as we just need to modify page cache and PTE. Actually, all FSs apply > GFP_MOVABLE on their regular files via the below functions. > > new_inode > alloc_inode > inode_init_always(struct super_block *sb, struct inode *inode) > { > ... > mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE); Here you speak about data page cache pages. Indeed they can be allocated from CMA area. But when Ted speaks about "buffer cache" he specifically means page cache of the block device inode and there I can see: struct block_device *bdev_alloc(struct gendisk *disk, u8 partno) { ... mapping_set_gfp_mask(&inode->i_data, GFP_USER); ... } so at this point I'm confused how come you can see block device pages in CMA area. Are you using data=journal mode of ext4 in your setup by any chance? That would explain it but then that is a horrible idea as well... Honza
On Thu, Sep 12, 2024 at 4:41 PM Jan Kara <jack@suse.cz> wrote: > > On Wed 04-09-24 14:56:29, Zhaoyang Huang wrote: > > On Wed, Sep 4, 2024 at 10:44 AM Theodore Ts'o <tytso@mit.edu> wrote: > > > On Wed, Sep 04, 2024 at 08:49:10AM +0800, Zhaoyang Huang wrote: > > > > > > > > > > After all, using GFP_MOVEABLE memory seems to mean that the buffer > > > > > cache might get thrashed a lot by having a lot of cached disk buffers > > > > > getting ejected from memory to try to make room for some contiguous > > > > > frame buffer memory, which means extra I/O overhead. So what's the > > > > > upside of using GFP_MOVEABLE for the buffer cache? > > > > > > > > To my understanding, NO. using GFP_MOVEABLE memory doesn't introduce > > > > extra IO as they just be migrated to free pages instead of ejected > > > > directly when they are the target memory area. In terms of reclaiming, > > > > all migrate types of page blocks possess the same position. > > > > > > Where is that being done? I don't see any evidence of this kind of > > > migration in fs/buffer.c. > > The journaled pages which carry jh->bh are treated as file pages > > during isolation of a range of PFNs in the callstack below[1]. The bh > > will be migrated via each aops's migrate_folio and performs what you > > described below such as copy the content and reattach the bh to a new > > page. In terms of the journal enabled ext4 partition, the inode is a > > blockdev inode which applies buffer_migrate_folio_norefs as its > > migrate_folio[2]. > > > > [1] > > cma_alloc/alloc_contig_range > > __alloc_contig_migrate_range > > migrate_pages > > migrate_folio_move > > move_to_new_folio > > > > mapping->aops->migrate_folio(buffer_migrate_folio_norefs->__buffer_migrate_folio) > > > > [2] > > static int __buffer_migrate_folio(struct address_space *mapping, > > struct folio *dst, struct folio *src, enum migrate_mode mode, > > bool check_refs) > > { > > ... > > if (check_refs) { > > bool busy; > > bool invalidated = false; > > > > recheck_buffers: > > busy = false; > > spin_lock(&mapping->i_private_lock); > > bh = head; > > do { > > if (atomic_read(&bh->b_count)) { > > //My case failed here as bh is referred by a journal head. > > busy = true; > > break; > > } > > bh = bh->b_this_page; > > } while (bh != head); > > Correct. Currently pages with journal heads attached cannot be migrated > mostly out of pure caution that the generic code isn't sure what's > happening with them. As I wrote in [1] we could make pages with jhs on > checkpoint list only migratable as for them the buffer lock is enough to > stop anybody from touching the bh data. Bhs which are part of a running / > committing transaction are not realistically migratable but then these > states are more shortlived so it shouldn't be a big problem. By observing from our test case, the jh remains there for a long time when journal->j_free is bigger than j_max_transaction_buffers which failed cma_alloc. So you think this is rare or abnormal? [6] j_free & j_max_transaction_buffers crash_arm64_v8.0.4++> struct journal_t.j_free,j_max_transaction_buffers 0xffffff80e70f3000 -x j_free = 0x3f1, j_max_transaction_buffers = 0x100, > > > > > > Just curious, because in general I'm blessed by not having to use CMA > > > > > in the first place (not having I/O devices too primitive so they can't > > > > > do scatter-gather :-). So I don't tend to use CMA, and obviously I'm > > > > > missing some of the design considerations behind CMA. 
I thought in > > > > > general CMA tends to used in early boot to allocate things like frame > > > > > buffers, and after that CMA doesn't tend to get used at all? That's > > > > > clearly not the case for you, apparently? > > > > > > > > Yes. CMA is designed for contiguous physical memory and has been used > > > > via cma_alloc during the whole lifetime especially on the system > > > > without SMMU, such as DRM driver. In terms of MIGRATE_MOVABLE page > > > > blocks, they also could have compaction path retry for many times > > > > which is common during high-order alloc_pages. > > > > > > But then what's the point of using CMA-eligible memory for the buffer > > > cache, as opposed to just always using !__GFP_MOVEABLE for all buffer > > > cache allocations? After all, that's what is being proposed for > > > ext4's ext4_getblk(). What's the downside of avoiding the use of > > > CMA-eligible memory for ext4's buffer cache? Why not do this for > > > *all* buffers in the buffer cache? > > Since migration which arised from alloc_pages or cma_alloc always > > happens, we need appropriate users over MOVABLE pages. AFAIU, buffer > > cache pages under regular files are the best candidate for migration > > as we just need to modify page cache and PTE. Actually, all FSs apply > > GFP_MOVABLE on their regular files via the below functions. > > > > new_inode > > alloc_inode > > inode_init_always(struct super_block *sb, struct inode *inode) > > { > > ... > > mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE); > > Here you speak about data page cache pages. Indeed they can be allocated > from CMA area. But when Ted speaks about "buffer cache" he specifically > means page cache of the block device inode and there I can see: > > struct block_device *bdev_alloc(struct gendisk *disk, u8 partno) > { > ... > mapping_set_gfp_mask(&inode->i_data, GFP_USER); > ... > } > > so at this point I'm confused how come you can see block device pages in > CMA area. Are you using data=journal mode of ext4 in your setup by any > chance? That would explain it but then that is a horrible idea as well... The page of 'fffffffe01a51c00'[1] which has bh attached comes from creating bitmap_blk by ext4_getblk->sb_getblk within process[2] where the gfpmask has GFP_MOVABLE. IMO, GFP_USER is used for regular file pages under the super_block but not available for metadata, right? 
[1] crash_arm64_v8.0.4++> kmem -p|grep ffffff808f0aa150(sb->s_bdev->bd_inode->i_mapping) fffffffe01a51c00 e9470000 ffffff808f0aa150 3 2 8000000008020 lru,private //within CMA area fffffffe03d189c0 174627000 ffffff808f0aa150 4 2 2004000000008020 lru,private fffffffe03d88e00 176238000 ffffff808f0aa150 3f9 2 2008000000008020 lru,private fffffffe03d88e40 176239000 ffffff808f0aa150 6 2 2008000000008020 lru,private fffffffe03d88e80 17623a000 ffffff808f0aa150 5 2 2008000000008020 lru,private fffffffe03d88ec0 17623b000 ffffff808f0aa150 1 2 2008000000008020 lru,private fffffffe03d88f00 17623c000 ffffff808f0aa150 0 2 2008000000008020 lru,private fffffffe040e6540 183995000 ffffff808f0aa150 3f4 2 2004000000008020 lru,private [2] 02148 < 4> [ 14.133703] [08-11 18:38:25.133] __find_get_block+0x29c/0x634 02149 < 4> [ 14.133711] [08-11 18:38:25.133] __getblk_gfp+0xa8/0x290 0214A < 4> [ 14.133716] [08-11 18:38:25.133] ext4_read_inode_bitmap+0xa0/0x6c4 0214B < 4> [ 14.133725] [08-11 18:38:25.133] __ext4_new_inode+0x34c/0x10d4 0214C < 4> [ 14.133730] [08-11 18:38:25.133] ext4_create+0xdc/0x1cc 0214D < 4> [ 14.133737] [08-11 18:38:25.133] path_openat+0x4fc/0xc84 0214E < 4> [ 14.133745] [08-11 18:38:25.133] do_filp_open+0xc0/0x16c 0214F < 4> [ 14.133751] [08-11 18:38:25.133] do_sys_openat2+0x8c/0xf8 02150 < 4> [ 14.133758] [08-11 18:38:25.133] __arm64_sys_openat+0x78/0xa4 02151 < 4> [ 14.133764] [08-11 18:38:25.133] invoke_syscall+0x60/0x11c 02152 < 4> [ 14.133771] [08-11 18:38:25.133] el0_svc_common+0xb4/0xe8 02153 < 4> [ 14.133777] [08-11 18:38:25.133] do_el0_svc+0x24/0x30 02154 < 4> [ 14.133783] [08-11 18:38:25.133] el0_svc+0x3c/0x70 > > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR
On Thu 12-09-24 17:10:44, Zhaoyang Huang wrote: > On Thu, Sep 12, 2024 at 4:41 PM Jan Kara <jack@suse.cz> wrote: > > > > On Wed 04-09-24 14:56:29, Zhaoyang Huang wrote: > > > On Wed, Sep 4, 2024 at 10:44 AM Theodore Ts'o <tytso@mit.edu> wrote: > > > > On Wed, Sep 04, 2024 at 08:49:10AM +0800, Zhaoyang Huang wrote: > > > > > > > > > > > > After all, using GFP_MOVEABLE memory seems to mean that the buffer > > > > > > cache might get thrashed a lot by having a lot of cached disk buffers > > > > > > getting ejected from memory to try to make room for some contiguous > > > > > > frame buffer memory, which means extra I/O overhead. So what's the > > > > > > upside of using GFP_MOVEABLE for the buffer cache? > > > > > > > > > > To my understanding, NO. using GFP_MOVEABLE memory doesn't introduce > > > > > extra IO as they just be migrated to free pages instead of ejected > > > > > directly when they are the target memory area. In terms of reclaiming, > > > > > all migrate types of page blocks possess the same position. > > > > > > > > Where is that being done? I don't see any evidence of this kind of > > > > migration in fs/buffer.c. > > > The journaled pages which carry jh->bh are treated as file pages > > > during isolation of a range of PFNs in the callstack below[1]. The bh > > > will be migrated via each aops's migrate_folio and performs what you > > > described below such as copy the content and reattach the bh to a new > > > page. In terms of the journal enabled ext4 partition, the inode is a > > > blockdev inode which applies buffer_migrate_folio_norefs as its > > > migrate_folio[2]. > > > > > > [1] > > > cma_alloc/alloc_contig_range > > > __alloc_contig_migrate_range > > > migrate_pages > > > migrate_folio_move > > > move_to_new_folio > > > > > > mapping->aops->migrate_folio(buffer_migrate_folio_norefs->__buffer_migrate_folio) > > > > > > [2] > > > static int __buffer_migrate_folio(struct address_space *mapping, > > > struct folio *dst, struct folio *src, enum migrate_mode mode, > > > bool check_refs) > > > { > > > ... > > > if (check_refs) { > > > bool busy; > > > bool invalidated = false; > > > > > > recheck_buffers: > > > busy = false; > > > spin_lock(&mapping->i_private_lock); > > > bh = head; > > > do { > > > if (atomic_read(&bh->b_count)) { > > > //My case failed here as bh is referred by a journal head. > > > busy = true; > > > break; > > > } > > > bh = bh->b_this_page; > > > } while (bh != head); > > > > Correct. Currently pages with journal heads attached cannot be migrated > > mostly out of pure caution that the generic code isn't sure what's > > happening with them. As I wrote in [1] we could make pages with jhs on > > checkpoint list only migratable as for them the buffer lock is enough to > > stop anybody from touching the bh data. Bhs which are part of a running / > > committing transaction are not realistically migratable but then these > > states are more shortlived so it shouldn't be a big problem. > By observing from our test case, the jh remains there for a long time > when journal->j_free is bigger than j_max_transaction_buffers which > failed cma_alloc. So you think this is rare or abnormal? > > [6] j_free & j_max_transaction_buffers > crash_arm64_v8.0.4++> struct > journal_t.j_free,j_max_transaction_buffers 0xffffff80e70f3000 -x > j_free = 0x3f1, > j_max_transaction_buffers = 0x100, So jh can stay attached to the bh for a very long time (basically only memory pressure will evict it) and this is what blocks migration. 
But what I meant is that in fact, most of the time we can migrate bh with jh attached just fine. There are only relatively short moments (max 5s) where a buffer (jh) is part of a running or committing transaction and then we cannot really migrate. > > > > > > Just curious, because in general I'm blessed by not having to use CMA > > > > > > in the first place (not having I/O devices too primitive so they can't > > > > > > do scatter-gather :-). So I don't tend to use CMA, and obviously I'm > > > > > > missing some of the design considerations behind CMA. I thought in > > > > > > general CMA tends to used in early boot to allocate things like frame > > > > > > buffers, and after that CMA doesn't tend to get used at all? That's > > > > > > clearly not the case for you, apparently? > > > > > > > > > > Yes. CMA is designed for contiguous physical memory and has been used > > > > > via cma_alloc during the whole lifetime especially on the system > > > > > without SMMU, such as DRM driver. In terms of MIGRATE_MOVABLE page > > > > > blocks, they also could have compaction path retry for many times > > > > > which is common during high-order alloc_pages. > > > > > > > > But then what's the point of using CMA-eligible memory for the buffer > > > > cache, as opposed to just always using !__GFP_MOVEABLE for all buffer > > > > cache allocations? After all, that's what is being proposed for > > > > ext4's ext4_getblk(). What's the downside of avoiding the use of > > > > CMA-eligible memory for ext4's buffer cache? Why not do this for > > > > *all* buffers in the buffer cache? > > > Since migration which arised from alloc_pages or cma_alloc always > > > happens, we need appropriate users over MOVABLE pages. AFAIU, buffer > > > cache pages under regular files are the best candidate for migration > > > as we just need to modify page cache and PTE. Actually, all FSs apply > > > GFP_MOVABLE on their regular files via the below functions. > > > > > > new_inode > > > alloc_inode > > > inode_init_always(struct super_block *sb, struct inode *inode) > > > { > > > ... > > > mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE); > > > > Here you speak about data page cache pages. Indeed they can be allocated > > from CMA area. But when Ted speaks about "buffer cache" he specifically > > means page cache of the block device inode and there I can see: > > > > struct block_device *bdev_alloc(struct gendisk *disk, u8 partno) > > { > > ... > > mapping_set_gfp_mask(&inode->i_data, GFP_USER); > > ... > > } > > > > so at this point I'm confused how come you can see block device pages in > > CMA area. Are you using data=journal mode of ext4 in your setup by any > > chance? That would explain it but then that is a horrible idea as well... > The page of 'fffffffe01a51c00'[1] which has bh attached comes from > creating bitmap_blk by ext4_getblk->sb_getblk within process[2] where > the gfpmask has GFP_MOVABLE. IMO, GFP_USER is used for regular file > pages under the super_block but not available for metadata, right? Ah, right, __getblk() overrides the GFP mode set in bdev's mapping_gfp_mask and sets __GFP_MOVABLE there. This behavior goes way back to 2007 (769848c03895b ("Add __GFP_MOVABLE for callers to flag allocations from high memory that may be migrated")). So another clean and simple way to fix this (as Ted suggests) would be to stop setting __GFP_MOVABLE in __getblk(). 
For ext4 there would be no practical difference to current situation where metadata pages are practically unmovable due to attached jhs, other filesystems can set __GFP_MOVABLE in bdev's mapping_gfp_mask if they care (but I don't think there are other big buffer cache users on systems which need CMA). Honza
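As an illustration of the opt-in mentioned above, a filesystem that still
wanted movable buffer-cache pages could re-enable __GFP_MOVABLE on its block
device mapping at mount time, roughly as sketched below. This is an
assumption rather than an actual patch, and the way to reach the bdev's
address_space (bd_inode->i_mapping here) differs between kernel versions.

	/*
	 * Sketch: opt a filesystem's buffer cache back into movable
	 * (CMA-eligible) pages if __getblk() no longer forces __GFP_MOVABLE
	 * itself.
	 */
	static void fs_allow_movable_buffer_cache(struct super_block *sb)
	{
		struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;

		mapping_set_gfp_mask(mapping,
				     mapping_gfp_mask(mapping) | __GFP_MOVABLE);
	}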
On Thu, Sep 12, 2024 at 6:16 PM Jan Kara <jack@suse.cz> wrote: > > On Thu 12-09-24 17:10:44, Zhaoyang Huang wrote: > > On Thu, Sep 12, 2024 at 4:41 PM Jan Kara <jack@suse.cz> wrote: > > > > > > On Wed 04-09-24 14:56:29, Zhaoyang Huang wrote: > > > > On Wed, Sep 4, 2024 at 10:44 AM Theodore Ts'o <tytso@mit.edu> wrote: > > > > > On Wed, Sep 04, 2024 at 08:49:10AM +0800, Zhaoyang Huang wrote: > > > > > > > > > > > > > > After all, using GFP_MOVEABLE memory seems to mean that the buffer > > > > > > > cache might get thrashed a lot by having a lot of cached disk buffers > > > > > > > getting ejected from memory to try to make room for some contiguous > > > > > > > frame buffer memory, which means extra I/O overhead. So what's the > > > > > > > upside of using GFP_MOVEABLE for the buffer cache? > > > > > > > > > > > > To my understanding, NO. using GFP_MOVEABLE memory doesn't introduce > > > > > > extra IO as they just be migrated to free pages instead of ejected > > > > > > directly when they are the target memory area. In terms of reclaiming, > > > > > > all migrate types of page blocks possess the same position. > > > > > > > > > > Where is that being done? I don't see any evidence of this kind of > > > > > migration in fs/buffer.c. > > > > The journaled pages which carry jh->bh are treated as file pages > > > > during isolation of a range of PFNs in the callstack below[1]. The bh > > > > will be migrated via each aops's migrate_folio and performs what you > > > > described below such as copy the content and reattach the bh to a new > > > > page. In terms of the journal enabled ext4 partition, the inode is a > > > > blockdev inode which applies buffer_migrate_folio_norefs as its > > > > migrate_folio[2]. > > > > > > > > [1] > > > > cma_alloc/alloc_contig_range > > > > __alloc_contig_migrate_range > > > > migrate_pages > > > > migrate_folio_move > > > > move_to_new_folio > > > > > > > > mapping->aops->migrate_folio(buffer_migrate_folio_norefs->__buffer_migrate_folio) > > > > > > > > [2] > > > > static int __buffer_migrate_folio(struct address_space *mapping, > > > > struct folio *dst, struct folio *src, enum migrate_mode mode, > > > > bool check_refs) > > > > { > > > > ... > > > > if (check_refs) { > > > > bool busy; > > > > bool invalidated = false; > > > > > > > > recheck_buffers: > > > > busy = false; > > > > spin_lock(&mapping->i_private_lock); > > > > bh = head; > > > > do { > > > > if (atomic_read(&bh->b_count)) { > > > > //My case failed here as bh is referred by a journal head. > > > > busy = true; > > > > break; > > > > } > > > > bh = bh->b_this_page; > > > > } while (bh != head); > > > > > > Correct. Currently pages with journal heads attached cannot be migrated > > > mostly out of pure caution that the generic code isn't sure what's > > > happening with them. As I wrote in [1] we could make pages with jhs on > > > checkpoint list only migratable as for them the buffer lock is enough to > > > stop anybody from touching the bh data. Bhs which are part of a running / > > > committing transaction are not realistically migratable but then these > > > states are more shortlived so it shouldn't be a big problem. > > By observing from our test case, the jh remains there for a long time > > when journal->j_free is bigger than j_max_transaction_buffers which > > failed cma_alloc. So you think this is rare or abnormal? 
> > > > [6] j_free & j_max_transaction_buffers > > crash_arm64_v8.0.4++> struct > > journal_t.j_free,j_max_transaction_buffers 0xffffff80e70f3000 -x > > j_free = 0x3f1, > > j_max_transaction_buffers = 0x100, > > So jh can stay attached to the bh for a very long time (basically only > memory pressure will evict it) and this is what blocks migration. But what > I meant is that in fact, most of the time we can migrate bh with jh > attached just fine. There are only relatively short moments (max 5s) where > a buffer (jh) is part of a running or committing transaction and then we > cannot really migrate. Please correct me if I am wrong. According to __buffer_migrate_folio, the bh can not be migrated as long as it has jh attached which could remain until the next cp transaction is launched. In my case, the jbd2' log space is big enough( j_free = 0x3f1 > j_max_transaction_buffers = 0x100) to escape the launch. static int __buffer_migrate_folio(struct address_space *mapping, struct folio *dst, struct folio *src, enum migrate_mode mode, bool check_refs) { ... recheck_buffers: busy = false; spin_lock(&mapping->i_private_lock); bh = head; do { if (atomic_read(&bh->b_count)) { //migrate will fail here as bh->jh attached busy = true; break; } bh = bh->b_this_page; } while (bh != head); > > > > > > > > Just curious, because in general I'm blessed by not having to use CMA > > > > > > > in the first place (not having I/O devices too primitive so they can't > > > > > > > do scatter-gather :-). So I don't tend to use CMA, and obviously I'm > > > > > > > missing some of the design considerations behind CMA. I thought in > > > > > > > general CMA tends to used in early boot to allocate things like frame > > > > > > > buffers, and after that CMA doesn't tend to get used at all? That's > > > > > > > clearly not the case for you, apparently? > > > > > > > > > > > > Yes. CMA is designed for contiguous physical memory and has been used > > > > > > via cma_alloc during the whole lifetime especially on the system > > > > > > without SMMU, such as DRM driver. In terms of MIGRATE_MOVABLE page > > > > > > blocks, they also could have compaction path retry for many times > > > > > > which is common during high-order alloc_pages. > > > > > > > > > > But then what's the point of using CMA-eligible memory for the buffer > > > > > cache, as opposed to just always using !__GFP_MOVEABLE for all buffer > > > > > cache allocations? After all, that's what is being proposed for > > > > > ext4's ext4_getblk(). What's the downside of avoiding the use of > > > > > CMA-eligible memory for ext4's buffer cache? Why not do this for > > > > > *all* buffers in the buffer cache? > > > > Since migration which arised from alloc_pages or cma_alloc always > > > > happens, we need appropriate users over MOVABLE pages. AFAIU, buffer > > > > cache pages under regular files are the best candidate for migration > > > > as we just need to modify page cache and PTE. Actually, all FSs apply > > > > GFP_MOVABLE on their regular files via the below functions. > > > > > > > > new_inode > > > > alloc_inode > > > > inode_init_always(struct super_block *sb, struct inode *inode) > > > > { > > > > ... > > > > mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE); > > > > > > Here you speak about data page cache pages. Indeed they can be allocated > > > from CMA area. 
But when Ted speaks about "buffer cache" he specifically > > > means page cache of the block device inode and there I can see: > > > > > > struct block_device *bdev_alloc(struct gendisk *disk, u8 partno) > > > { > > > ... > > > mapping_set_gfp_mask(&inode->i_data, GFP_USER); > > > ... > > > } > > > > > > so at this point I'm confused how come you can see block device pages in > > > CMA area. Are you using data=journal mode of ext4 in your setup by any > > > chance? That would explain it but then that is a horrible idea as well... > > The page of 'fffffffe01a51c00'[1] which has bh attached comes from > > creating bitmap_blk by ext4_getblk->sb_getblk within process[2] where > > the gfpmask has GFP_MOVABLE. IMO, GFP_USER is used for regular file > > pages under the super_block but not available for metadata, right? > > Ah, right, __getblk() overrides the GFP mode set in bdev's mapping_gfp_mask > and sets __GFP_MOVABLE there. This behavior goes way back to 2007 > (769848c03895b ("Add __GFP_MOVABLE for callers to flag allocations from > high memory that may be migrated")). So another clean and simple way to fix > this (as Ted suggests) would be to stop setting __GFP_MOVABLE in > __getblk(). For ext4 there would be no practical difference to current > situation where metadata pages are practically unmovable due to attached > jhs, other filesystems can set __GFP_MOVABLE in bdev's mapping_gfp_mask if > they care (but I don't think there are other big buffer cache users on > systems which need CMA). > > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR
On Fri 13-09-24 09:39:57, Zhaoyang Huang wrote: > On Thu, Sep 12, 2024 at 6:16 PM Jan Kara <jack@suse.cz> wrote: > > > > On Thu 12-09-24 17:10:44, Zhaoyang Huang wrote: > > > On Thu, Sep 12, 2024 at 4:41 PM Jan Kara <jack@suse.cz> wrote: > > > > > > > > On Wed 04-09-24 14:56:29, Zhaoyang Huang wrote: > > > > > On Wed, Sep 4, 2024 at 10:44 AM Theodore Ts'o <tytso@mit.edu> wrote: > > > > > > On Wed, Sep 04, 2024 at 08:49:10AM +0800, Zhaoyang Huang wrote: > > > > > > > > > > > > > > > > After all, using GFP_MOVEABLE memory seems to mean that the buffer > > > > > > > > cache might get thrashed a lot by having a lot of cached disk buffers > > > > > > > > getting ejected from memory to try to make room for some contiguous > > > > > > > > frame buffer memory, which means extra I/O overhead. So what's the > > > > > > > > upside of using GFP_MOVEABLE for the buffer cache? > > > > > > > > > > > > > > To my understanding, NO. using GFP_MOVEABLE memory doesn't introduce > > > > > > > extra IO as they just be migrated to free pages instead of ejected > > > > > > > directly when they are the target memory area. In terms of reclaiming, > > > > > > > all migrate types of page blocks possess the same position. > > > > > > > > > > > > Where is that being done? I don't see any evidence of this kind of > > > > > > migration in fs/buffer.c. > > > > > The journaled pages which carry jh->bh are treated as file pages > > > > > during isolation of a range of PFNs in the callstack below[1]. The bh > > > > > will be migrated via each aops's migrate_folio and performs what you > > > > > described below such as copy the content and reattach the bh to a new > > > > > page. In terms of the journal enabled ext4 partition, the inode is a > > > > > blockdev inode which applies buffer_migrate_folio_norefs as its > > > > > migrate_folio[2]. > > > > > > > > > > [1] > > > > > cma_alloc/alloc_contig_range > > > > > __alloc_contig_migrate_range > > > > > migrate_pages > > > > > migrate_folio_move > > > > > move_to_new_folio > > > > > > > > > > mapping->aops->migrate_folio(buffer_migrate_folio_norefs->__buffer_migrate_folio) > > > > > > > > > > [2] > > > > > static int __buffer_migrate_folio(struct address_space *mapping, > > > > > struct folio *dst, struct folio *src, enum migrate_mode mode, > > > > > bool check_refs) > > > > > { > > > > > ... > > > > > if (check_refs) { > > > > > bool busy; > > > > > bool invalidated = false; > > > > > > > > > > recheck_buffers: > > > > > busy = false; > > > > > spin_lock(&mapping->i_private_lock); > > > > > bh = head; > > > > > do { > > > > > if (atomic_read(&bh->b_count)) { > > > > > //My case failed here as bh is referred by a journal head. > > > > > busy = true; > > > > > break; > > > > > } > > > > > bh = bh->b_this_page; > > > > > } while (bh != head); > > > > > > > > Correct. Currently pages with journal heads attached cannot be migrated > > > > mostly out of pure caution that the generic code isn't sure what's > > > > happening with them. As I wrote in [1] we could make pages with jhs on > > > > checkpoint list only migratable as for them the buffer lock is enough to > > > > stop anybody from touching the bh data. Bhs which are part of a running / > > > > committing transaction are not realistically migratable but then these > > > > states are more shortlived so it shouldn't be a big problem. > > > By observing from our test case, the jh remains there for a long time > > > when journal->j_free is bigger than j_max_transaction_buffers which > > > failed cma_alloc. 
So you think this is rare or abnormal? > > > > > > [6] j_free & j_max_transaction_buffers > > > crash_arm64_v8.0.4++> struct > > > journal_t.j_free,j_max_transaction_buffers 0xffffff80e70f3000 -x > > > j_free = 0x3f1, > > > j_max_transaction_buffers = 0x100, > > > > So jh can stay attached to the bh for a very long time (basically only > > memory pressure will evict it) and this is what blocks migration. But what > > I meant is that in fact, most of the time we can migrate bh with jh > > attached just fine. There are only relatively short moments (max 5s) where > > a buffer (jh) is part of a running or committing transaction and then we > > cannot really migrate. > Please correct me if I am wrong. According to __buffer_migrate_folio, > the bh can not be migrated as long as it has jh attached which could > remain until the next cp transaction is launched. In my case, the > jbd2' log space is big enough( j_free = 0x3f1 > > j_max_transaction_buffers = 0x100) to escape the launch. Currently yes. My proposal was to make bh migratable even with jh attached (which obviously needs some tweaks to __buffer_migrate_folio()). Honza
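A very rough sketch of the direction described above is given below. It is
only an assumption about how __buffer_migrate_folio() could be tweaked;
bh_ref_is_checkpoint_jh_only() is a hypothetical helper that jbd2 would have
to provide, meaning "the only elevated reference comes from a journal head
that sits solely on a checkpoint list", with the buffer lock taken by the
migration path keeping such data stable during the copy.

	/*
	 * Sketch only: tolerate an elevated b_count when it is known to come
	 * from a checkpoint-list-only journal head (hypothetical helper).
	 */
	bh = head;
	do {
		if (atomic_read(&bh->b_count) &&
		    !bh_ref_is_checkpoint_jh_only(bh)) {
			busy = true;
			break;
		}
		bh = bh->b_this_page;
	} while (bh != head);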
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 941c1c0d5c6e..4422246851fe 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -869,7 +869,11 @@ struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,
 	if (nowait)
 		return sb_find_get_block(inode->i_sb, map.m_pblk);
 
+#ifndef CONFIG_CMA
 	bh = sb_getblk(inode->i_sb, map.m_pblk);
+#else
+	bh = sb_getblk_gfp(inode->i_sb, map.m_pblk, 0);
+#endif
 	if (unlikely(!bh))
 		return ERR_PTR(-ENOMEM);
 	if (map.m_flags & EXT4_MAP_NEW) {