
[RFC,PATCHv2,1/1] fs: ext4: Don't use CMA for buffer_head

Message ID 20240823082237.713543-1-zhaoyang.huang@unisoc.com
State Rejected
Series [RFC,PATCHv2,1/1] fs: ext4: Don't use CMA for buffer_head

Commit Message

zhaoyang.huang Aug. 23, 2024, 8:22 a.m. UTC
From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

cma_alloc() keeps failing in our system because a jh->bh->b_page cannot
be migrated out of the CMA area[1]: the jh has a cp_transaction pending
on it, and since j_free > j_max_transaction_buffers, jbd2 sees no
pressure to checkpoint, so the jh stays attached[2][3][4][5][6]. We
temporarily work around this by forcefully launching
jbd2_log_do_checkpoint somewhere. Since journaling is a mechanism common
to all journaling filesystems, and cp_transaction is launched relatively
rarely, cma_alloc() could be affected the same way elsewhere. This patch
makes ext4's buffer_heads avoid CMA pages when doing sb_getblk.

[1]
crash_arm64_v8.0.4++> kmem -p|grep ffffff808f0aa150(sb->s_bdev->bd_inode->i_mapping)
fffffffe01a51c00  e9470000 ffffff808f0aa150        3  2 8000000008020 lru,private //within CMA area
fffffffe03d189c0 174627000 ffffff808f0aa150        4  2 2004000000008020 lru,private
fffffffe03d88e00 176238000 ffffff808f0aa150      3f9  2 2008000000008020 lru,private
fffffffe03d88e40 176239000 ffffff808f0aa150        6  2 2008000000008020 lru,private
fffffffe03d88e80 17623a000 ffffff808f0aa150        5  2 2008000000008020 lru,private
fffffffe03d88ec0 17623b000 ffffff808f0aa150        1  2 2008000000008020 lru,private
fffffffe03d88f00 17623c000 ffffff808f0aa150        0  2 2008000000008020 lru,private
fffffffe040e6540 183995000 ffffff808f0aa150      3f4  2 2004000000008020 lru,private

[2] page -> buffer_head
crash_arm64_v8.0.4++> struct page.private fffffffe01a51c00 -x
      private = 0xffffff802fca0c00

[3] buffer_head -> journal_head
crash_arm64_v8.0.4++> struct buffer_head.b_private 0xffffff802fca0c00
  b_private = 0xffffff8041338e10,

[4] journal_head -> b_cp_transaction
crash_arm64_v8.0.4++> struct journal_head.b_cp_transaction 0xffffff8041338e10 -x
  b_cp_transaction = 0xffffff80410f1900,

[5] transaction_t -> journal
crash_arm64_v8.0.4++> struct transaction_t.t_journal 0xffffff80410f1900 -x
  t_journal = 0xffffff80e70f3000,

[6] j_free & j_max_transaction_buffers
crash_arm64_v8.0.4++> struct journal_t.j_free,j_max_transaction_buffers 0xffffff80e70f3000 -x
  j_free = 0x3f1,
  j_max_transaction_buffers = 0x100,

Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
---
v2: update commit message
---
 fs/ext4/inode.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Zhaoyang Huang Aug. 23, 2024, 8:53 a.m. UTC | #1
On Fri, Aug 23, 2024 at 4:24 PM zhaoyang.huang
<zhaoyang.huang@unisoc.com> wrote:
>
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>
> cma_alloc() keeps failing in our system because a jh->bh->b_page cannot
> be migrated out of the CMA area[1]: the jh has a cp_transaction pending
> on it, and since j_free > j_max_transaction_buffers, jbd2 sees no
> pressure to checkpoint, so the jh stays attached[2][3][4][5][6]. We
> temporarily work around this by forcefully launching
> jbd2_log_do_checkpoint somewhere. Since journaling is a mechanism common
> to all journaling filesystems, and cp_transaction is launched relatively
> rarely, cma_alloc() could be affected the same way elsewhere. This patch
> makes ext4's buffer_heads avoid CMA pages when doing sb_getblk.
>
> [1]
> crash_arm64_v8.0.4++> kmem -p|grep ffffff808f0aa150(sb->s_bdev->bd_inode->i_mapping)
> fffffffe01a51c00  e9470000 ffffff808f0aa150        3  2 8000000008020 lru,private //within CMA area
> fffffffe03d189c0 174627000 ffffff808f0aa150        4  2 2004000000008020 lru,private
> fffffffe03d88e00 176238000 ffffff808f0aa150      3f9  2 2008000000008020 lru,private
> fffffffe03d88e40 176239000 ffffff808f0aa150        6  2 2008000000008020 lru,private
> fffffffe03d88e80 17623a000 ffffff808f0aa150        5  2 2008000000008020 lru,private
> fffffffe03d88ec0 17623b000 ffffff808f0aa150        1  2 2008000000008020 lru,private
> fffffffe03d88f00 17623c000 ffffff808f0aa150        0  2 2008000000008020 lru,private
> fffffffe040e6540 183995000 ffffff808f0aa150      3f4  2 2004000000008020 lru,private
>
> [2] page -> buffer_head
> crash_arm64_v8.0.4++> struct page.private fffffffe01a51c00 -x
>       private = 0xffffff802fca0c00
>
> [3] buffer_head -> journal_head
> crash_arm64_v8.0.4++> struct buffer_head.b_private 0xffffff802fca0c00
>   b_private = 0xffffff8041338e10,
>
> [4] journal_head -> b_cp_transaction
> crash_arm64_v8.0.4++> struct journal_head.b_cp_transaction 0xffffff8041338e10 -x
>   b_cp_transaction = 0xffffff80410f1900,
>
> [5] transaction_t -> journal
> crash_arm64_v8.0.4++> struct transaction_t.t_journal 0xffffff80410f1900 -x
>   t_journal = 0xffffff80e70f3000,
>
> [6] j_free & j_max_transaction_buffers
> crash_arm64_v8.0.4++> struct journal_t.j_free,j_max_transaction_buffers 0xffffff80e70f3000 -x
>   j_free = 0x3f1,
>   j_max_transaction_buffers = 0x100,
>
> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> ---
> v2: update commit message
> ---
>  fs/ext4/inode.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 941c1c0d5c6e..4422246851fe 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -869,7 +869,11 @@ struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,
>         if (nowait)
>                 return sb_find_get_block(inode->i_sb, map.m_pblk);
>
> +#ifndef CONFIG_CMA
>         bh = sb_getblk(inode->i_sb, map.m_pblk);
> +#else
> +       bh = sb_getblk_gfp(inode->i_sb, map.m_pblk, 0);
> +#endif
>         if (unlikely(!bh))
>                 return ERR_PTR(-ENOMEM);
>         if (map.m_flags & EXT4_MAP_NEW) {
> --
> 2.25.1
>
A hint for this, from my understanding: other filesystems such as f2fs
and erofs already keep their metadata out of the CMA area by setting
GFP_NOFS/GFP_KERNEL on the meta inode's mapping.

struct inode *f2fs_iget(struct super_block *sb, unsigned long ino)
{
...
        if (ino == F2FS_NODE_INO(sbi)) {
                inode->i_mapping->a_ops = &f2fs_node_aops;
                mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
        } else if (ino == F2FS_META_INO(sbi)) {
                inode->i_mapping->a_ops = &f2fs_meta_aops;
                mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);

int erofs_init_managed_cache(struct super_block *sb)
{
        struct inode *const inode = new_inode(sb);

        if (!inode)
                return -ENOMEM;

        set_nlink(inode, 1);
        inode->i_size = OFFSET_MAX;
        inode->i_mapping->a_ops = &z_erofs_cache_aops;
        mapping_set_gfp_mask(inode->i_mapping, GFP_KERNEL);
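For comparison, a minimal sketch of what the analogous restriction could
look like on the ext4 side, assuming one clamped the backing device's
mapping at mount time. This is hypothetical and not part of the posted
patch; as noted later in this thread, __getblk() overrides this mask
with __GFP_MOVABLE, so on its own it would not be sufficient:

/* Hypothetical sketch: mirror the f2fs/erofs approach by removing
 * __GFP_MOVABLE (and __GFP_FS) from the buffer cache mapping of the
 * backing device, so that its pages are no longer CMA-eligible. */
static void ext4_restrict_bdev_gfp(struct super_block *sb)
{
        struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;

        /* GFP_NOFS carries neither __GFP_MOVABLE nor __GFP_FS. */
        mapping_set_gfp_mask(mapping, GFP_NOFS);
}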
Theodore Ts'o Sept. 3, 2024, 2:29 a.m. UTC | #2
On Fri, Aug 23, 2024 at 04:22:37PM +0800, zhaoyang.huang wrote:
>  
> +#ifndef CONFIG_CMA
>  	bh = sb_getblk(inode->i_sb, map.m_pblk);
> +#else
> +	bh = sb_getblk_gfp(inode->i_sb, map.m_pblk, 0);
> +#endif

So all of these patches to try to work around your issue with CMA are
a bit ugly.  But passing in a GFP mask of zero is definitely not the
right way to go about things, since there might be certain GFP masks
that are required by a particular block device.  What I think you are
trying to do is to avoid setting the __GFP_MOVABLE flag.  So in that
case, in the CMA path something like this is what you want:

	bh = getblk_unmovable(sb->s_bdev, map.m_pblk, sb->s_blocksize);

I'd also suggest only trying to use this if the file system has
journaling enabled.  If the file system is an ext2 file system without
a journal, there's no reason to avoid using the CMA region --- and I
assume the reason why the buffer cache is trying to use the movable
flag is because the amount of non-CMA memory might be a precious
resource in some systems.

				- Ted
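For illustration, a rough sketch of how the ext4_getblk() hunk might
look following this suggestion, gating the unmovable allocation on the
presence of a journal (a sketch only, not a posted patch):

	/* Sketch: use unmovable (non-CMA-eligible) buffer cache pages
	 * only when a journal head may pin the bh for a long time. */
	if (EXT4_SB(inode->i_sb)->s_journal)
		bh = getblk_unmovable(inode->i_sb->s_bdev, map.m_pblk,
				      inode->i_sb->s_blocksize);
	else
		bh = sb_getblk(inode->i_sb, map.m_pblk);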
Zhaoyang Huang Sept. 3, 2024, 8:50 a.m. UTC | #3
On Tue, Sep 3, 2024 at 10:29 AM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Fri, Aug 23, 2024 at 04:22:37PM +0800, zhaoyang.huang wrote:
> >
> > +#ifndef CONFIG_CMA
> >       bh = sb_getblk(inode->i_sb, map.m_pblk);
> > +#else
> > +     bh = sb_getblk_gfp(inode->i_sb, map.m_pblk, 0);
> > +#endif
>
> So all of these patches to try to work around your issue with CMA are
> a bit ugly.  But passing in a GFP mask of zero is definitely not the
> right way to go about thing, since there might be certain GFP masks
> that are required by a particular block device.  What I think you are
> trying to do is to avoid setting the __GFP_MOVEABLE flag.  So in that
> case, in the CMA path something like this is what you want:
>
>         bh = getblk_unmoveable(sb->s_bdev, map.m_pblk, sb->s_blocksize);
>
> I'd also sugest only trying to use this is the file system has
> journaling enabled.  If the file system is an ext2 file system without
> a journal, there's no reason avoid using the CMA region
agree.
> assume the reason why the buffer cache is trying to use the moveable
> flag is because the amount of non-CMA memory might be a precious
> resource in some systems.
I don't think so. All migrate-type page blocks stand in the same
position with respect to reclaim, since an allocation can fall back to
the other migrate types when the current one fails. I guess the purpose
is rather to enlarge the pool of available memory, since __GFP_MOVABLE
allocations are also allowed to draw from CMA.
>
>                                 - Ted
Theodore Ts'o Sept. 3, 2024, 12:08 p.m. UTC | #4
On Tue, Sep 03, 2024 at 04:50:46PM +0800, Zhaoyang Huang wrote:
> > I'd also sugest only trying to use this is the file system has
> > journaling enabled.  If the file system is an ext2 file system without
> > a journal, there's no reason avoid using the CMA region
> agree.
> > assume the reason why the buffer cache is trying to use the moveable
> > flag is because the amount of non-CMA memory might be a precious
> > resource in some systems.
>  
> I don't think so. All migrate type page blocks possess the same
> position as each other as they could fallback to all migrate types
> when current fails. I guess the purpose could be to enlarge the scope
> of available memory as __GFP_MOVEABLE has the capability of recruiting
> CMA.

Well, I guess I'm a bit confused why the buffer cache is trying to use
__GFP_MOVABLE in the first place.  In general, CMA is there to allow
systems to allocate big chunks of memory which have to be physically
contiguous because the I/O device(s) are too primitive to be able to do
scatter-gather, right?  So why are we trying to use CMA-eligible memory
for 4k buffer cache pages?  Presumably, because there's not enough
non-CMA-eligible memory?

After all, using __GFP_MOVABLE memory seems to mean that the buffer
cache might get thrashed a lot by having a lot of cached disk buffers
getting ejected from memory to try to make room for some contiguous
frame buffer memory, which means extra I/O overhead.  So what's the
upside of using __GFP_MOVABLE for the buffer cache?

Just curious, because in general I'm blessed by not having to use CMA
in the first place (not having I/O devices too primitive to do
scatter-gather :-).  So I don't tend to use CMA, and obviously I'm
missing some of the design considerations behind CMA.  I thought that in
general CMA tends to be used in early boot to allocate things like frame
buffers, and after that CMA doesn't tend to get used at all?  That's
clearly not the case for you, apparently?

Cheers,

							- Ted
Zhaoyang Huang Sept. 4, 2024, 12:49 a.m. UTC | #5
On Tue, Sep 3, 2024 at 8:08 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Tue, Sep 03, 2024 at 04:50:46PM +0800, Zhaoyang Huang wrote:
> > > I'd also sugest only trying to use this is the file system has
> > > journaling enabled.  If the file system is an ext2 file system without
> > > a journal, there's no reason avoid using the CMA region
> > agree.
> > > assume the reason why the buffer cache is trying to use the moveable
> > > flag is because the amount of non-CMA memory might be a precious
> > > resource in some systems.
> >
> > I don't think so. All migrate type page blocks possess the same
> > position as each other as they could fallback to all migrate types
> > when current fails. I guess the purpose could be to enlarge the scope
> > of available memory as __GFP_MOVEABLE has the capability of recruiting
> > CMA.
>
> Well, I guess I'm a bit confused why the buffer cache is trying to use
> __GFP_MOVEABLE in the first place.  In general CMA is to allow systems
> to be able to allocate big chunks of memory which have to be
> physically contiguous because the I/O device(s) are too primitive to
> be able to do scatter-gather, right?  So why are we trying to use CMA
> eligible memory for 4k buffer cache pages?  Presumably, because
> there's not enough non-CMA eligible memory?
I suppose you may have missed how CMA works as a second client, i.e. as
the special fallback for MIGRATE_MOVABLE during normal alloc_pages
(cma_alloc is the first client of the CMA area).

//MIGRATE_MOVABLE->ALLOC_CMA
static inline unsigned int gfp_to_alloc_flags_cma(gfp_t gfp_mask,
						  unsigned int alloc_flags)
{
#ifdef CONFIG_CMA
        if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
                alloc_flags |= ALLOC_CMA;
#endif
        return alloc_flags;
}

//ALLOC_CMA->__rmqueue_cma_fallback
static __always_inline struct page *
__rmqueue(struct zone *zone, unsigned int order, int migratetype,
                                                unsigned int alloc_flags)
{
        struct page *page;

        if (IS_ENABLED(CONFIG_CMA)) {
                if (alloc_flags & ALLOC_CMA &&
                    zone_page_state(zone, NR_FREE_CMA_PAGES) >
                    zone_page_state(zone, NR_FREE_PAGES) / 2) {
                        page = __rmqueue_cma_fallback(zone, order);
                        if (page)
                                return page;
                }
        }

        page = __rmqueue_smallest(zone, order, migratetype);
       ...
}

>
> After all, using GFP_MOVEABLE memory seems to mean that the buffer
> cache might get thrashed a lot by having a lot of cached disk buffers
> getting ejected from memory to try to make room for some contiguous
> frame buffer memory, which means extra I/O overhead.  So what's the
> upside of using GFP_MOVEABLE for the buffer cache?
To my understanding, no. Using __GFP_MOVABLE memory doesn't introduce
extra I/O: such pages just get migrated to free pages instead of being
ejected directly when they sit in the target memory area. In terms of
reclaim, all migrate types of page blocks are in the same position.
>
> Just curious, because in general I'm blessed by not having to use CMA
> in the first place (not having I/O devices too primitive so they can't
> do scatter-gather :-).  So I don't tend to use CMA, and obviously I'm
> missing some of the design considerations behind CMA.  I thought in
> general CMA tends to used in early boot to allocate things like frame
> buffers, and after that CMA doesn't tend to get used at all?  That's
> clearly not the case for you, apparently?
Yes. CMA is designed for contiguous physical memory and is used via
cma_alloc during the system's whole lifetime, especially on systems
without an SMMU, e.g. by DRM drivers. As for MIGRATE_MOVABLE page
blocks, they may also see the compaction path retried many times, which
is common during high-order alloc_pages.
>
> Cheers,
>
>                                                         - Ted
Theodore Ts'o Sept. 4, 2024, 2:44 a.m. UTC | #6
On Wed, Sep 04, 2024 at 08:49:10AM +0800, Zhaoyang Huang wrote:
> >
> > After all, using GFP_MOVEABLE memory seems to mean that the buffer
> > cache might get thrashed a lot by having a lot of cached disk buffers
> > getting ejected from memory to try to make room for some contiguous
> > frame buffer memory, which means extra I/O overhead.  So what's the
> > upside of using GFP_MOVEABLE for the buffer cache?
>
> To my understanding, NO. using GFP_MOVEABLE memory doesn't introduce
> extra IO as they just be migrated to free pages instead of ejected
> directly when they are the target memory area. In terms of reclaiming,
> all migrate types of page blocks possess the same position.

Where is that being done?  I don't see any evidence of this kind of
migration in fs/buffer.c.

It's *possible* I suppose, but you'd have to remove the buffer_head so
it can't be found by getblk(), and then wait for bh->b_count to go to
zero, and then allocate a new page, and then copy the buffer_head's
page, update the buffer_head, and then rechain the bh into the buffer
cache.  And as I said, I can't see any kind of code like that.  It would
be much simpler to just try to eject the bh from the buffer cache.  And
that's consistent with what you've observed, which is that the
buffer_head is prevented from being ejected because it's held by the
jbd2 layer until the buffer has been checkpointed.

> > Just curious, because in general I'm blessed by not having to use CMA
> > in the first place (not having I/O devices too primitive so they can't
> > do scatter-gather :-).  So I don't tend to use CMA, and obviously I'm
> > missing some of the design considerations behind CMA.  I thought in
> > general CMA tends to used in early boot to allocate things like frame
> > buffers, and after that CMA doesn't tend to get used at all?  That's
> > clearly not the case for you, apparently?
>
> Yes. CMA is designed for contiguous physical memory and has been used
> via cma_alloc during the whole lifetime especially on the system
> without SMMU, such as DRM driver. In terms of MIGRATE_MOVABLE page
> blocks, they also could have compaction path retry for many times
> which is common during high-order alloc_pages.

But then what's the point of using CMA-eligible memory for the buffer
cache, as opposed to just always using !__GFP_MOVABLE for all buffer
cache allocations?  After all, that's what is being proposed for
ext4's ext4_getblk().  What's the downside of avoiding the use of
CMA-eligible memory for ext4's buffer cache?  Why not do this for
*all* buffers in the buffer cache?

					- Ted
Zhaoyang Huang Sept. 4, 2024, 6:56 a.m. UTC | #7
On Wed, Sep 4, 2024 at 10:44 AM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Wed, Sep 04, 2024 at 08:49:10AM +0800, Zhaoyang Huang wrote:
> > >
> > > After all, using GFP_MOVEABLE memory seems to mean that the buffer
> > > cache might get thrashed a lot by having a lot of cached disk buffers
> > > getting ejected from memory to try to make room for some contiguous
> > > frame buffer memory, which means extra I/O overhead.  So what's the
> > > upside of using GFP_MOVEABLE for the buffer cache?
> >
> > To my understanding, NO. using GFP_MOVEABLE memory doesn't introduce
> > extra IO as they just be migrated to free pages instead of ejected
> > directly when they are the target memory area. In terms of reclaiming,
> > all migrate types of page blocks possess the same position.
>
> Where is that being done?  I don't see any evidence of this kind of
> migration in fs/buffer.c.
Journaled pages carrying jh->bh are treated as file pages during
isolation of a range of PFNs in the call stack below[1]. The bh is
migrated via each aops's migrate_folio, which performs what you
described below, such as copying the content and reattaching the bh to
a new page. For a journal-enabled ext4 partition, the inode is a
blockdev inode, which uses buffer_migrate_folio_norefs as its
migrate_folio[2].

[1]
cma_alloc/alloc_contig_range
    __alloc_contig_migrate_range
        migrate_pages
            migrate_folio_move
                move_to_new_folio

mapping->aops->migrate_folio(buffer_migrate_folio_norefs->__buffer_migrate_folio)

[2]
static int __buffer_migrate_folio(struct address_space *mapping,
                struct folio *dst, struct folio *src, enum migrate_mode mode,
                bool check_refs)
{
...
        if (check_refs) {
                bool busy;
                bool invalidated = false;

recheck_buffers:
                busy = false;
                spin_lock(&mapping->i_private_lock);
                bh = head;
                do {
                        if (atomic_read(&bh->b_count)) {
          //My case failed here as the bh is referenced by a journal head.
                                busy = true;
                                break;
                        }
                        bh = bh->b_this_page;
                } while (bh != head);

>
> It's *possile* I suppose, but you'd have to remove the buffer_head so
> it can't be found by getblk(), and then wait for bh->b_count to go to
> zero, and then allocate a new page, and then copy buffer_head's page,
> update the buffer_head, and then rechain the bh into the buffer cache.
> And as I said, I can't see any kind of code like that.  It would be
> much simpler to just try to eject the bh from the buffer cache.  And
> that's consistent which what you've observed, which is that if the
> buffer_head is prevented from being ejected because it's held by the
> jbd2 layer until the buffer has been checkpointed.
All of the above is right, except that the buffer_head is reattached
to a new page instead of being ejected, as it still points to
checkpoint data.
>
> > > Just curious, because in general I'm blessed by not having to use CMA
> > > in the first place (not having I/O devices too primitive so they can't
> > > do scatter-gather :-).  So I don't tend to use CMA, and obviously I'm
> > > missing some of the design considerations behind CMA.  I thought in
> > > general CMA tends to used in early boot to allocate things like frame
> > > buffers, and after that CMA doesn't tend to get used at all?  That's
> > > clearly not the case for you, apparently?
> >
> > Yes. CMA is designed for contiguous physical memory and has been used
> > via cma_alloc during the whole lifetime especially on the system
> > without SMMU, such as DRM driver. In terms of MIGRATE_MOVABLE page
> > blocks, they also could have compaction path retry for many times
> > which is common during high-order alloc_pages.
>
> But then what's the point of using CMA-eligible memory for the buffer
> cache, as opposed to just always using !__GFP_MOVEABLE for all buffer
> cache allocations?  After all, that's what is being proposed for
> ext4's ext4_getblk().  What's the downside of avoiding the use of
> CMA-eligible memory for ext4's buffer cache?  Why not do this for
> *all* buffers in the buffer cache?
Since migration arising from alloc_pages or cma_alloc can happen at any
time, we need appropriate users of MOVABLE pages. AFAIU, page cache
pages of regular files are the best candidates for migration, as we just
need to update the page cache and PTEs. Actually, all filesystems apply
__GFP_MOVABLE to their regular files via the functions below.

new_inode
    alloc_inode
        inode_init_always(struct super_block *sb, struct inode *inode)
        {
         ...
            mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);

static int filemap_create_folio(struct file *file,
                struct address_space *mapping, pgoff_t index,
                struct folio_batch *fbatch)
{
        struct folio *folio;
        int error;

        folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
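For context, what makes these data pages CMA-eligible is the
__GFP_MOVABLE bit carried by that mask; the definition below is
paraphrased from include/linux/gfp_types.h (modulo KASAN-related bits
that vary by kernel version):

#define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)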

>
>                                         - Ted
Jan Kara Sept. 12, 2024, 8:41 a.m. UTC | #8
On Wed 04-09-24 14:56:29, Zhaoyang Huang wrote:
> On Wed, Sep 4, 2024 at 10:44 AM Theodore Ts'o <tytso@mit.edu> wrote:
> > On Wed, Sep 04, 2024 at 08:49:10AM +0800, Zhaoyang Huang wrote:
> > > >
> > > > After all, using GFP_MOVEABLE memory seems to mean that the buffer
> > > > cache might get thrashed a lot by having a lot of cached disk buffers
> > > > getting ejected from memory to try to make room for some contiguous
> > > > frame buffer memory, which means extra I/O overhead.  So what's the
> > > > upside of using GFP_MOVEABLE for the buffer cache?
> > >
> > > To my understanding, NO. using GFP_MOVEABLE memory doesn't introduce
> > > extra IO as they just be migrated to free pages instead of ejected
> > > directly when they are the target memory area. In terms of reclaiming,
> > > all migrate types of page blocks possess the same position.
> >
> > Where is that being done?  I don't see any evidence of this kind of
> > migration in fs/buffer.c.
> The journaled pages which carry jh->bh are treated as file pages
> during isolation of a range of PFNs in the callstack below[1]. The bh
> will be migrated via each aops's migrate_folio and performs what you
> described below such as copy the content and reattach the bh to a new
> page. In terms of the journal enabled ext4 partition, the inode is a
> blockdev inode which applies buffer_migrate_folio_norefs as its
> migrate_folio[2].
> 
> [1]
> cma_alloc/alloc_contig_range
>     __alloc_contig_migrate_range
>         migrate_pages
>             migrate_folio_move
>                 move_to_new_folio
> 
> mapping->aops->migrate_folio(buffer_migrate_folio_norefs->__buffer_migrate_folio)
> 
> [2]
> static int __buffer_migrate_folio(struct address_space *mapping,
>                 struct folio *dst, struct folio *src, enum migrate_mode mode,
>                 bool check_refs)
> {
> ...
>         if (check_refs) {
>                 bool busy;
>                 bool invalidated = false;
> 
> recheck_buffers:
>                 busy = false;
>                 spin_lock(&mapping->i_private_lock);
>                 bh = head;
>                 do {
>                         if (atomic_read(&bh->b_count)) {
>           //My case failed here as bh is referred by a journal head.
>                                 busy = true;
>                                 break;
>                         }
>                         bh = bh->b_this_page;
>                 } while (bh != head);

Correct. Currently pages with journal heads attached cannot be migrated,
mostly out of pure caution, because the generic code isn't sure what's
happening with them. As I wrote in [1], we could make pages whose jhs
are only on the checkpoint list migratable, as for them the buffer lock
is enough to stop anybody from touching the bh data. Bhs which are part
of a running / committing transaction are not realistically migratable,
but those states are more short-lived, so it shouldn't be a big problem.
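For illustration, a rough sketch of the kind of tweak being described,
assuming a hypothetical jbd2 helper that reports whether a jh's only
business is checkpoint-list membership (no such helper exists today):

		/* Hypothetical variant of the busy check in
		 * __buffer_migrate_folio(): tolerate a bh reference that
		 * comes only from a checkpoint-list journal head, relying
		 * on the buffer lock taken during migration to keep the
		 * data stable. jbd2_jh_checkpoint_only() is an assumed,
		 * not-yet-existing helper. */
		do {
			if (atomic_read(&bh->b_count) &&
			    !jbd2_jh_checkpoint_only(bh)) {
				busy = true;
				break;
			}
			bh = bh->b_this_page;
		} while (bh != head);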

> > > > Just curious, because in general I'm blessed by not having to use CMA
> > > > in the first place (not having I/O devices too primitive so they can't
> > > > do scatter-gather :-).  So I don't tend to use CMA, and obviously I'm
> > > > missing some of the design considerations behind CMA.  I thought in
> > > > general CMA tends to used in early boot to allocate things like frame
> > > > buffers, and after that CMA doesn't tend to get used at all?  That's
> > > > clearly not the case for you, apparently?
> > >
> > > Yes. CMA is designed for contiguous physical memory and has been used
> > > via cma_alloc during the whole lifetime especially on the system
> > > without SMMU, such as DRM driver. In terms of MIGRATE_MOVABLE page
> > > blocks, they also could have compaction path retry for many times
> > > which is common during high-order alloc_pages.
> >
> > But then what's the point of using CMA-eligible memory for the buffer
> > cache, as opposed to just always using !__GFP_MOVEABLE for all buffer
> > cache allocations?  After all, that's what is being proposed for
> > ext4's ext4_getblk().  What's the downside of avoiding the use of
> > CMA-eligible memory for ext4's buffer cache?  Why not do this for
> > *all* buffers in the buffer cache?
> Since migration which arised from alloc_pages or cma_alloc always
> happens, we need appropriate users over MOVABLE pages. AFAIU, buffer
> cache pages under regular files are the best candidate for migration
> as we just need to modify page cache and PTE. Actually, all FSs apply
> GFP_MOVABLE on their regular files via the below functions.
> 
> new_inode
>     alloc_inode
>         inode_init_always(struct super_block *sb, struct inode *inode)
>         {
>          ...
>             mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);

Here you speak about data page cache pages. Indeed, they can be
allocated from the CMA area. But when Ted speaks about the "buffer
cache" he specifically means the page cache of the block device inode,
and there I can see:

struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
{
...
        mapping_set_gfp_mask(&inode->i_data, GFP_USER);
...
}

so at this point I'm confused about how you can see block device pages
in the CMA area. Are you using the data=journal mode of ext4 in your
setup by any chance? That would explain it, but then that is a horrible
idea as well...

								Honza
Zhaoyang Huang Sept. 12, 2024, 9:10 a.m. UTC | #9
On Thu, Sep 12, 2024 at 4:41 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 04-09-24 14:56:29, Zhaoyang Huang wrote:
> > On Wed, Sep 4, 2024 at 10:44 AM Theodore Ts'o <tytso@mit.edu> wrote:
> > > On Wed, Sep 04, 2024 at 08:49:10AM +0800, Zhaoyang Huang wrote:
> > > > >
> > > > > After all, using GFP_MOVEABLE memory seems to mean that the buffer
> > > > > cache might get thrashed a lot by having a lot of cached disk buffers
> > > > > getting ejected from memory to try to make room for some contiguous
> > > > > frame buffer memory, which means extra I/O overhead.  So what's the
> > > > > upside of using GFP_MOVEABLE for the buffer cache?
> > > >
> > > > To my understanding, NO. using GFP_MOVEABLE memory doesn't introduce
> > > > extra IO as they just be migrated to free pages instead of ejected
> > > > directly when they are the target memory area. In terms of reclaiming,
> > > > all migrate types of page blocks possess the same position.
> > >
> > > Where is that being done?  I don't see any evidence of this kind of
> > > migration in fs/buffer.c.
> > The journaled pages which carry jh->bh are treated as file pages
> > during isolation of a range of PFNs in the callstack below[1]. The bh
> > will be migrated via each aops's migrate_folio and performs what you
> > described below such as copy the content and reattach the bh to a new
> > page. In terms of the journal enabled ext4 partition, the inode is a
> > blockdev inode which applies buffer_migrate_folio_norefs as its
> > migrate_folio[2].
> >
> > [1]
> > cma_alloc/alloc_contig_range
> >     __alloc_contig_migrate_range
> >         migrate_pages
> >             migrate_folio_move
> >                 move_to_new_folio
> >
> > mapping->aops->migrate_folio(buffer_migrate_folio_norefs->__buffer_migrate_folio)
> >
> > [2]
> > static int __buffer_migrate_folio(struct address_space *mapping,
> >                 struct folio *dst, struct folio *src, enum migrate_mode mode,
> >                 bool check_refs)
> > {
> > ...
> >         if (check_refs) {
> >                 bool busy;
> >                 bool invalidated = false;
> >
> > recheck_buffers:
> >                 busy = false;
> >                 spin_lock(&mapping->i_private_lock);
> >                 bh = head;
> >                 do {
> >                         if (atomic_read(&bh->b_count)) {
> >           //My case failed here as bh is referred by a journal head.
> >                                 busy = true;
> >                                 break;
> >                         }
> >                         bh = bh->b_this_page;
> >                 } while (bh != head);
>
> Correct. Currently pages with journal heads attached cannot be migrated
> mostly out of pure caution that the generic code isn't sure what's
> happening with them. As I wrote in [1] we could make pages with jhs on
> checkpoint list only migratable as for them the buffer lock is enough to
> stop anybody from touching the bh data. Bhs which are part of a running /
> committing transaction are not realistically migratable but then these
> states are more shortlived so it shouldn't be a big problem.
From observing our test case, the jh remains there for a long time when
journal->j_free is bigger than j_max_transaction_buffers, which makes
cma_alloc fail. So do you think this is rare or abnormal?

[6] j_free & j_max_transaction_buffers
crash_arm64_v8.0.4++> struct journal_t.j_free,j_max_transaction_buffers 0xffffff80e70f3000 -x
  j_free = 0x3f1,
  j_max_transaction_buffers = 0x100,

>
> > > > > Just curious, because in general I'm blessed by not having to use CMA
> > > > > in the first place (not having I/O devices too primitive so they can't
> > > > > do scatter-gather :-).  So I don't tend to use CMA, and obviously I'm
> > > > > missing some of the design considerations behind CMA.  I thought in
> > > > > general CMA tends to used in early boot to allocate things like frame
> > > > > buffers, and after that CMA doesn't tend to get used at all?  That's
> > > > > clearly not the case for you, apparently?
> > > >
> > > > Yes. CMA is designed for contiguous physical memory and has been used
> > > > via cma_alloc during the whole lifetime especially on the system
> > > > without SMMU, such as DRM driver. In terms of MIGRATE_MOVABLE page
> > > > blocks, they also could have compaction path retry for many times
> > > > which is common during high-order alloc_pages.
> > >
> > > But then what's the point of using CMA-eligible memory for the buffer
> > > cache, as opposed to just always using !__GFP_MOVEABLE for all buffer
> > > cache allocations?  After all, that's what is being proposed for
> > > ext4's ext4_getblk().  What's the downside of avoiding the use of
> > > CMA-eligible memory for ext4's buffer cache?  Why not do this for
> > > *all* buffers in the buffer cache?
> > Since migration which arised from alloc_pages or cma_alloc always
> > happens, we need appropriate users over MOVABLE pages. AFAIU, buffer
> > cache pages under regular files are the best candidate for migration
> > as we just need to modify page cache and PTE. Actually, all FSs apply
> > GFP_MOVABLE on their regular files via the below functions.
> >
> > new_inode
> >     alloc_inode
> >         inode_init_always(struct super_block *sb, struct inode *inode)
> >         {
> >          ...
> >             mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
>
> Here you speak about data page cache pages. Indeed they can be allocated
> from CMA area. But when Ted speaks about "buffer cache" he specifically
> means page cache of the block device inode and there I can see:
>
> struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
> {
> ...
>         mapping_set_gfp_mask(&inode->i_data, GFP_USER);
> ...
> }
>
> so at this point I'm confused how come you can see block device pages in
> CMA area. Are you using data=journal mode of ext4 in your setup by any
> chance? That would explain it but then that is a horrible idea as well...
The page 'fffffffe01a51c00'[1], which has a bh attached, comes from
creating a bitmap block via ext4_getblk->sb_getblk in the call stack
shown in [2], where the gfp mask carries __GFP_MOVABLE. IMO, GFP_USER is
used for the regular file pages under the super_block but does not apply
to the metadata, right?

[1]
crash_arm64_v8.0.4++> kmem -p|grep ffffff808f0aa150(sb->s_bdev->bd_inode->i_mapping)
fffffffe01a51c00  e9470000 ffffff808f0aa150        3  2 8000000008020 lru,private //within CMA area
fffffffe03d189c0 174627000 ffffff808f0aa150        4  2 2004000000008020 lru,private
fffffffe03d88e00 176238000 ffffff808f0aa150      3f9  2 2008000000008020 lru,private
fffffffe03d88e40 176239000 ffffff808f0aa150        6  2 2008000000008020 lru,private
fffffffe03d88e80 17623a000 ffffff808f0aa150        5  2 2008000000008020 lru,private
fffffffe03d88ec0 17623b000 ffffff808f0aa150        1  2 2008000000008020 lru,private
fffffffe03d88f00 17623c000 ffffff808f0aa150        0  2 2008000000008020 lru,private
fffffffe040e6540 183995000 ffffff808f0aa150      3f4  2 2004000000008020 lru,private

[2]
02148 < 4> [   14.133703] [08-11 18:38:25.133] __find_get_block+0x29c/0x634
02149 < 4> [   14.133711] [08-11 18:38:25.133] __getblk_gfp+0xa8/0x290
0214A < 4> [   14.133716] [08-11 18:38:25.133] ext4_read_inode_bitmap+0xa0/0x6c4
0214B < 4> [   14.133725] [08-11 18:38:25.133] __ext4_new_inode+0x34c/0x10d4
0214C < 4> [   14.133730] [08-11 18:38:25.133] ext4_create+0xdc/0x1cc
0214D < 4> [   14.133737] [08-11 18:38:25.133] path_openat+0x4fc/0xc84
0214E < 4> [   14.133745] [08-11 18:38:25.133] do_filp_open+0xc0/0x16c
0214F < 4> [   14.133751] [08-11 18:38:25.133] do_sys_openat2+0x8c/0xf8
02150 < 4> [   14.133758] [08-11 18:38:25.133] __arm64_sys_openat+0x78/0xa4
02151 < 4> [   14.133764] [08-11 18:38:25.133] invoke_syscall+0x60/0x11c
02152 < 4> [   14.133771] [08-11 18:38:25.133] el0_svc_common+0xb4/0xe8
02153 < 4> [   14.133777] [08-11 18:38:25.133] do_el0_svc+0x24/0x30
02154 < 4> [   14.133783] [08-11 18:38:25.133] el0_svc+0x3c/0x70

>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Jan Kara Sept. 12, 2024, 10:16 a.m. UTC | #10
On Thu 12-09-24 17:10:44, Zhaoyang Huang wrote:
> On Thu, Sep 12, 2024 at 4:41 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Wed 04-09-24 14:56:29, Zhaoyang Huang wrote:
> > > On Wed, Sep 4, 2024 at 10:44 AM Theodore Ts'o <tytso@mit.edu> wrote:
> > > > On Wed, Sep 04, 2024 at 08:49:10AM +0800, Zhaoyang Huang wrote:
> > > > > >
> > > > > > After all, using GFP_MOVEABLE memory seems to mean that the buffer
> > > > > > cache might get thrashed a lot by having a lot of cached disk buffers
> > > > > > getting ejected from memory to try to make room for some contiguous
> > > > > > frame buffer memory, which means extra I/O overhead.  So what's the
> > > > > > upside of using GFP_MOVEABLE for the buffer cache?
> > > > >
> > > > > To my understanding, NO. using GFP_MOVEABLE memory doesn't introduce
> > > > > extra IO as they just be migrated to free pages instead of ejected
> > > > > directly when they are the target memory area. In terms of reclaiming,
> > > > > all migrate types of page blocks possess the same position.
> > > >
> > > > Where is that being done?  I don't see any evidence of this kind of
> > > > migration in fs/buffer.c.
> > > The journaled pages which carry jh->bh are treated as file pages
> > > during isolation of a range of PFNs in the callstack below[1]. The bh
> > > will be migrated via each aops's migrate_folio and performs what you
> > > described below such as copy the content and reattach the bh to a new
> > > page. In terms of the journal enabled ext4 partition, the inode is a
> > > blockdev inode which applies buffer_migrate_folio_norefs as its
> > > migrate_folio[2].
> > >
> > > [1]
> > > cma_alloc/alloc_contig_range
> > >     __alloc_contig_migrate_range
> > >         migrate_pages
> > >             migrate_folio_move
> > >                 move_to_new_folio
> > >
> > > mapping->aops->migrate_folio(buffer_migrate_folio_norefs->__buffer_migrate_folio)
> > >
> > > [2]
> > > static int __buffer_migrate_folio(struct address_space *mapping,
> > >                 struct folio *dst, struct folio *src, enum migrate_mode mode,
> > >                 bool check_refs)
> > > {
> > > ...
> > >         if (check_refs) {
> > >                 bool busy;
> > >                 bool invalidated = false;
> > >
> > > recheck_buffers:
> > >                 busy = false;
> > >                 spin_lock(&mapping->i_private_lock);
> > >                 bh = head;
> > >                 do {
> > >                         if (atomic_read(&bh->b_count)) {
> > >           //My case failed here as bh is referred by a journal head.
> > >                                 busy = true;
> > >                                 break;
> > >                         }
> > >                         bh = bh->b_this_page;
> > >                 } while (bh != head);
> >
> > Correct. Currently pages with journal heads attached cannot be migrated
> > mostly out of pure caution that the generic code isn't sure what's
> > happening with them. As I wrote in [1] we could make pages with jhs on
> > checkpoint list only migratable as for them the buffer lock is enough to
> > stop anybody from touching the bh data. Bhs which are part of a running /
> > committing transaction are not realistically migratable but then these
> > states are more shortlived so it shouldn't be a big problem.
> By observing from our test case, the jh remains there for a long time
> when journal->j_free is bigger than j_max_transaction_buffers which
> failed cma_alloc. So you think this is rare or abnormal?
> 
> [6] j_free & j_max_transaction_buffers
> crash_arm64_v8.0.4++> struct
> journal_t.j_free,j_max_transaction_buffers 0xffffff80e70f3000 -x
>   j_free = 0x3f1,
>   j_max_transaction_buffers = 0x100,

So a jh can stay attached to the bh for a very long time (basically only
memory pressure will evict it), and this is what blocks migration. But
what I meant is that in fact, most of the time, we can migrate a bh with
a jh attached just fine. There are only relatively short moments (max
5s) where a buffer (jh) is part of a running or committing transaction,
and then we cannot really migrate.

> > > > > > Just curious, because in general I'm blessed by not having to use CMA
> > > > > > in the first place (not having I/O devices too primitive so they can't
> > > > > > do scatter-gather :-).  So I don't tend to use CMA, and obviously I'm
> > > > > > missing some of the design considerations behind CMA.  I thought in
> > > > > > general CMA tends to used in early boot to allocate things like frame
> > > > > > buffers, and after that CMA doesn't tend to get used at all?  That's
> > > > > > clearly not the case for you, apparently?
> > > > >
> > > > > Yes. CMA is designed for contiguous physical memory and has been used
> > > > > via cma_alloc during the whole lifetime especially on the system
> > > > > without SMMU, such as DRM driver. In terms of MIGRATE_MOVABLE page
> > > > > blocks, they also could have compaction path retry for many times
> > > > > which is common during high-order alloc_pages.
> > > >
> > > > But then what's the point of using CMA-eligible memory for the buffer
> > > > cache, as opposed to just always using !__GFP_MOVEABLE for all buffer
> > > > cache allocations?  After all, that's what is being proposed for
> > > > ext4's ext4_getblk().  What's the downside of avoiding the use of
> > > > CMA-eligible memory for ext4's buffer cache?  Why not do this for
> > > > *all* buffers in the buffer cache?
> > > Since migration which arised from alloc_pages or cma_alloc always
> > > happens, we need appropriate users over MOVABLE pages. AFAIU, buffer
> > > cache pages under regular files are the best candidate for migration
> > > as we just need to modify page cache and PTE. Actually, all FSs apply
> > > GFP_MOVABLE on their regular files via the below functions.
> > >
> > > new_inode
> > >     alloc_inode
> > >         inode_init_always(struct super_block *sb, struct inode *inode)
> > >         {
> > >          ...
> > >             mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
> >
> > Here you speak about data page cache pages. Indeed they can be allocated
> > from CMA area. But when Ted speaks about "buffer cache" he specifically
> > means page cache of the block device inode and there I can see:
> >
> > struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
> > {
> > ...
> >         mapping_set_gfp_mask(&inode->i_data, GFP_USER);
> > ...
> > }
> >
> > so at this point I'm confused how come you can see block device pages in
> > CMA area. Are you using data=journal mode of ext4 in your setup by any
> > chance? That would explain it but then that is a horrible idea as well...
> The page of 'fffffffe01a51c00'[1] which has bh attached comes from
> creating bitmap_blk by ext4_getblk->sb_getblk within process[2] where
> the gfpmask has GFP_MOVABLE. IMO, GFP_USER is used for regular file
> pages under the super_block but not available for metadata, right?

Ah, right, __getblk() overrides the GFP mask set in the bdev's
mapping_gfp_mask and sets __GFP_MOVABLE there. This behavior goes way
back to 2007 (769848c03895b ("Add __GFP_MOVABLE for callers to flag
allocations from high memory that may be migrated")). So another clean
and simple way to fix this (as Ted suggests) would be to stop setting
__GFP_MOVABLE in __getblk(). For ext4 there would be no practical
difference from the current situation, where metadata pages are
practically unmovable due to the attached jhs; other filesystems can set
__GFP_MOVABLE in the bdev's mapping_gfp_mask if they care (but I don't
think there are other big buffer cache users on systems which need CMA).

								Honza
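A minimal sketch of the alternative being outlined here, assuming the
fs/buffer.c interface of that era in which __getblk() wraps
__getblk_gfp() (hypothetical, not a posted patch):

/* Hypothetical: stop ORing __GFP_MOVABLE into buffer cache
 * allocations, so that bdev_alloc()'s GFP_USER mapping_gfp_mask stays
 * in control and buffer cache pages are no longer CMA-eligible. */
static inline struct buffer_head *
__getblk(struct block_device *bdev, sector_t block, unsigned size)
{
	return __getblk_gfp(bdev, block, size, 0 /* was __GFP_MOVABLE */);
}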
Zhaoyang Huang Sept. 13, 2024, 1:39 a.m. UTC | #11
On Thu, Sep 12, 2024 at 6:16 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 12-09-24 17:10:44, Zhaoyang Huang wrote:
> > On Thu, Sep 12, 2024 at 4:41 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Wed 04-09-24 14:56:29, Zhaoyang Huang wrote:
> > > > On Wed, Sep 4, 2024 at 10:44 AM Theodore Ts'o <tytso@mit.edu> wrote:
> > > > > On Wed, Sep 04, 2024 at 08:49:10AM +0800, Zhaoyang Huang wrote:
> > > > > > >
> > > > > > > After all, using GFP_MOVEABLE memory seems to mean that the buffer
> > > > > > > cache might get thrashed a lot by having a lot of cached disk buffers
> > > > > > > getting ejected from memory to try to make room for some contiguous
> > > > > > > frame buffer memory, which means extra I/O overhead.  So what's the
> > > > > > > upside of using GFP_MOVEABLE for the buffer cache?
> > > > > >
> > > > > > To my understanding, NO. using GFP_MOVEABLE memory doesn't introduce
> > > > > > extra IO as they just be migrated to free pages instead of ejected
> > > > > > directly when they are the target memory area. In terms of reclaiming,
> > > > > > all migrate types of page blocks possess the same position.
> > > > >
> > > > > Where is that being done?  I don't see any evidence of this kind of
> > > > > migration in fs/buffer.c.
> > > > The journaled pages which carry jh->bh are treated as file pages
> > > > during isolation of a range of PFNs in the callstack below[1]. The bh
> > > > will be migrated via each aops's migrate_folio and performs what you
> > > > described below such as copy the content and reattach the bh to a new
> > > > page. In terms of the journal enabled ext4 partition, the inode is a
> > > > blockdev inode which applies buffer_migrate_folio_norefs as its
> > > > migrate_folio[2].
> > > >
> > > > [1]
> > > > cma_alloc/alloc_contig_range
> > > >     __alloc_contig_migrate_range
> > > >         migrate_pages
> > > >             migrate_folio_move
> > > >                 move_to_new_folio
> > > >
> > > > mapping->aops->migrate_folio(buffer_migrate_folio_norefs->__buffer_migrate_folio)
> > > >
> > > > [2]
> > > > static int __buffer_migrate_folio(struct address_space *mapping,
> > > >                 struct folio *dst, struct folio *src, enum migrate_mode mode,
> > > >                 bool check_refs)
> > > > {
> > > > ...
> > > >         if (check_refs) {
> > > >                 bool busy;
> > > >                 bool invalidated = false;
> > > >
> > > > recheck_buffers:
> > > >                 busy = false;
> > > >                 spin_lock(&mapping->i_private_lock);
> > > >                 bh = head;
> > > >                 do {
> > > >                         if (atomic_read(&bh->b_count)) {
> > > >           //My case failed here as bh is referred by a journal head.
> > > >                                 busy = true;
> > > >                                 break;
> > > >                         }
> > > >                         bh = bh->b_this_page;
> > > >                 } while (bh != head);
> > >
> > > Correct. Currently pages with journal heads attached cannot be migrated
> > > mostly out of pure caution that the generic code isn't sure what's
> > > happening with them. As I wrote in [1] we could make pages with jhs on
> > > checkpoint list only migratable as for them the buffer lock is enough to
> > > stop anybody from touching the bh data. Bhs which are part of a running /
> > > committing transaction are not realistically migratable but then these
> > > states are more shortlived so it shouldn't be a big problem.
> > By observing from our test case, the jh remains there for a long time
> > when journal->j_free is bigger than j_max_transaction_buffers which
> > failed cma_alloc. So you think this is rare or abnormal?
> >
> > [6] j_free & j_max_transaction_buffers
> > crash_arm64_v8.0.4++> struct
> > journal_t.j_free,j_max_transaction_buffers 0xffffff80e70f3000 -x
> >   j_free = 0x3f1,
> >   j_max_transaction_buffers = 0x100,
>
> So jh can stay attached to the bh for a very long time (basically only
> memory pressure will evict it) and this is what blocks migration. But what
> I meant is that in fact, most of the time we can migrate bh with jh
> attached just fine. There are only relatively short moments (max 5s) where
> a buffer (jh) is part of a running or committing transaction and then we
> cannot really migrate.
Please correct me if I am wrong. According to __buffer_migrate_folio,
the bh cannot be migrated as long as it has a jh attached, which can
remain until the next cp transaction is launched. In my case, jbd2's
log space is big enough (j_free = 0x3f1 > j_max_transaction_buffers =
0x100) to escape that launch.

static int __buffer_migrate_folio(struct address_space *mapping,
                struct folio *dst, struct folio *src, enum migrate_mode mode,
                bool check_refs)
{
 ...
recheck_buffers:
                busy = false;
                spin_lock(&mapping->i_private_lock);
                bh = head;
                do {
                        if (atomic_read(&bh->b_count)) {
//migration will fail here as the bh has a jh attached
                                busy = true;
                                break;
                        }
                        bh = bh->b_this_page;
                } while (bh != head);
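For reference, a simplified paraphrase of the jbd2 condition being
described above, loosely based on add_transaction_credits() in
fs/jbd2/transaction.c (paraphrased, not a verbatim excerpt):

	/* Paraphrased: checkpointing is only forced when free journal
	 * space runs low, so while j_free stays comfortably above
	 * j_max_transaction_buffers nothing ever flushes the
	 * checkpoint jhs that pin these pages. */
	if (jbd2_log_space_left(journal) < journal->j_max_transaction_buffers)
		__jbd2_log_wait_for_space(journal);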

>
> > > > > > > Just curious, because in general I'm blessed by not having to use CMA
> > > > > > > in the first place (not having I/O devices too primitive so they can't
> > > > > > > do scatter-gather :-).  So I don't tend to use CMA, and obviously I'm
> > > > > > > missing some of the design considerations behind CMA.  I thought in
> > > > > > > general CMA tends to used in early boot to allocate things like frame
> > > > > > > buffers, and after that CMA doesn't tend to get used at all?  That's
> > > > > > > clearly not the case for you, apparently?
> > > > > >
> > > > > > Yes. CMA is designed for contiguous physical memory and has been used
> > > > > > via cma_alloc during the whole lifetime especially on the system
> > > > > > without SMMU, such as DRM driver. In terms of MIGRATE_MOVABLE page
> > > > > > blocks, they also could have compaction path retry for many times
> > > > > > which is common during high-order alloc_pages.
> > > > >
> > > > > But then what's the point of using CMA-eligible memory for the buffer
> > > > > cache, as opposed to just always using !__GFP_MOVEABLE for all buffer
> > > > > cache allocations?  After all, that's what is being proposed for
> > > > > ext4's ext4_getblk().  What's the downside of avoiding the use of
> > > > > CMA-eligible memory for ext4's buffer cache?  Why not do this for
> > > > > *all* buffers in the buffer cache?
> > > > Since migration which arised from alloc_pages or cma_alloc always
> > > > happens, we need appropriate users over MOVABLE pages. AFAIU, buffer
> > > > cache pages under regular files are the best candidate for migration
> > > > as we just need to modify page cache and PTE. Actually, all FSs apply
> > > > GFP_MOVABLE on their regular files via the below functions.
> > > >
> > > > new_inode
> > > >     alloc_inode
> > > >         inode_init_always(struct super_block *sb, struct inode *inode)
> > > >         {
> > > >          ...
> > > >             mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
> > >
> > > Here you speak about data page cache pages. Indeed they can be allocated
> > > from CMA area. But when Ted speaks about "buffer cache" he specifically
> > > means page cache of the block device inode and there I can see:
> > >
> > > struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
> > > {
> > > ...
> > >         mapping_set_gfp_mask(&inode->i_data, GFP_USER);
> > > ...
> > > }
> > >
> > > so at this point I'm confused how come you can see block device pages in
> > > CMA area. Are you using data=journal mode of ext4 in your setup by any
> > > chance? That would explain it but then that is a horrible idea as well...
> > The page of 'fffffffe01a51c00'[1] which has bh attached comes from
> > creating bitmap_blk by ext4_getblk->sb_getblk within process[2] where
> > the gfpmask has GFP_MOVABLE. IMO, GFP_USER is used for regular file
> > pages under the super_block but not available for metadata, right?
>
> Ah, right, __getblk() overrides the GFP mode set in bdev's mapping_gfp_mask
> and sets __GFP_MOVABLE there. This behavior goes way back to 2007
> (769848c03895b ("Add __GFP_MOVABLE for callers to flag allocations from
> high memory that may be migrated")). So another clean and simple way to fix
> this (as Ted suggests) would be to stop setting __GFP_MOVABLE in
> __getblk(). For ext4 there would be no practical difference to current
> situation where metadata pages are practically unmovable due to attached
> jhs, other filesystems can set __GFP_MOVABLE in bdev's mapping_gfp_mask if
> they care (but I don't think there are other big buffer cache users on
> systems which need CMA).
>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Jan Kara Sept. 13, 2024, 10:49 a.m. UTC | #12
On Fri 13-09-24 09:39:57, Zhaoyang Huang wrote:
> On Thu, Sep 12, 2024 at 6:16 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Thu 12-09-24 17:10:44, Zhaoyang Huang wrote:
> > > On Thu, Sep 12, 2024 at 4:41 PM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Wed 04-09-24 14:56:29, Zhaoyang Huang wrote:
> > > > > On Wed, Sep 4, 2024 at 10:44 AM Theodore Ts'o <tytso@mit.edu> wrote:
> > > > > > On Wed, Sep 04, 2024 at 08:49:10AM +0800, Zhaoyang Huang wrote:
> > > > > > > >
> > > > > > > > After all, using GFP_MOVEABLE memory seems to mean that the buffer
> > > > > > > > cache might get thrashed a lot by having a lot of cached disk buffers
> > > > > > > > getting ejected from memory to try to make room for some contiguous
> > > > > > > > frame buffer memory, which means extra I/O overhead.  So what's the
> > > > > > > > upside of using GFP_MOVEABLE for the buffer cache?
> > > > > > >
> > > > > > > To my understanding, NO. using GFP_MOVEABLE memory doesn't introduce
> > > > > > > extra IO as they just be migrated to free pages instead of ejected
> > > > > > > directly when they are the target memory area. In terms of reclaiming,
> > > > > > > all migrate types of page blocks possess the same position.
> > > > > >
> > > > > > Where is that being done?  I don't see any evidence of this kind of
> > > > > > migration in fs/buffer.c.
> > > > > The journaled pages which carry jh->bh are treated as file pages
> > > > > during isolation of a range of PFNs in the callstack below[1]. The bh
> > > > > will be migrated via each aops's migrate_folio and performs what you
> > > > > described below such as copy the content and reattach the bh to a new
> > > > > page. In terms of the journal enabled ext4 partition, the inode is a
> > > > > blockdev inode which applies buffer_migrate_folio_norefs as its
> > > > > migrate_folio[2].
> > > > >
> > > > > [1]
> > > > > cma_alloc/alloc_contig_range
> > > > >     __alloc_contig_migrate_range
> > > > >         migrate_pages
> > > > >             migrate_folio_move
> > > > >                 move_to_new_folio
> > > > >
> > > > > mapping->aops->migrate_folio(buffer_migrate_folio_norefs->__buffer_migrate_folio)
> > > > >
> > > > > [2]
> > > > > static int __buffer_migrate_folio(struct address_space *mapping,
> > > > >                 struct folio *dst, struct folio *src, enum migrate_mode mode,
> > > > >                 bool check_refs)
> > > > > {
> > > > > ...
> > > > >         if (check_refs) {
> > > > >                 bool busy;
> > > > >                 bool invalidated = false;
> > > > >
> > > > > recheck_buffers:
> > > > >                 busy = false;
> > > > >                 spin_lock(&mapping->i_private_lock);
> > > > >                 bh = head;
> > > > >                 do {
> > > > >                         if (atomic_read(&bh->b_count)) {
> > > > >           //My case failed here as bh is referred by a journal head.
> > > > >                                 busy = true;
> > > > >                                 break;
> > > > >                         }
> > > > >                         bh = bh->b_this_page;
> > > > >                 } while (bh != head);
> > > >
> > > > Correct. Currently pages with journal heads attached cannot be migrated
> > > > mostly out of pure caution that the generic code isn't sure what's
> > > > happening with them. As I wrote in [1] we could make pages with jhs on
> > > > checkpoint list only migratable as for them the buffer lock is enough to
> > > > stop anybody from touching the bh data. Bhs which are part of a running /
> > > > committing transaction are not realistically migratable but then these
> > > > states are more shortlived so it shouldn't be a big problem.
> > > By observing from our test case, the jh remains there for a long time
> > > when journal->j_free is bigger than j_max_transaction_buffers which
> > > failed cma_alloc. So you think this is rare or abnormal?
> > >
> > > [6] j_free & j_max_transaction_buffers
> > > crash_arm64_v8.0.4++> struct
> > > journal_t.j_free,j_max_transaction_buffers 0xffffff80e70f3000 -x
> > >   j_free = 0x3f1,
> > >   j_max_transaction_buffers = 0x100,
> >
> > So jh can stay attached to the bh for a very long time (basically only
> > memory pressure will evict it) and this is what blocks migration. But what
> > I meant is that in fact, most of the time we can migrate bh with jh
> > attached just fine. There are only relatively short moments (max 5s) where
> > a buffer (jh) is part of a running or committing transaction and then we
> > cannot really migrate.
> Please correct me if I am wrong. According to __buffer_migrate_folio,
> the bh can not be migrated as long as it has jh attached which could
> remain until the next cp transaction is launched. In my case, the
> jbd2' log space is big enough( j_free = 0x3f1 >
> j_max_transaction_buffers = 0x100) to escape the launch.

Currently yes. My proposal was to make a bh migratable even with a jh
attached (which obviously needs some tweaks to __buffer_migrate_folio()).

								Honza

Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 941c1c0d5c6e..4422246851fe 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -869,7 +869,11 @@  struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,
 	if (nowait)
 		return sb_find_get_block(inode->i_sb, map.m_pblk);
 
+#ifndef CONFIG_CMA
 	bh = sb_getblk(inode->i_sb, map.m_pblk);
+#else
+	bh = sb_getblk_gfp(inode->i_sb, map.m_pblk, 0);
+#endif
 	if (unlikely(!bh))
 		return ERR_PTR(-ENOMEM);
 	if (map.m_flags & EXT4_MAP_NEW) {