diff mbox series

[v6,06/17] fsnotify: generate pre-content permission event on open

Message ID b509ec78c045d67d4d7e31976eba4b708b238b66.1731355931.git.josef@toxicpanda.com
State Not Applicable
Headers show
Series fanotify: add pre-content hooks | expand

Commit Message

Josef Bacik Nov. 11, 2024, 8:17 p.m. UTC
From: Amir Goldstein <amir73il@gmail.com>

Generate pre-content event on open in addition to FS_OPEN_PERM,
but without sb_writers held and after file was truncated
in case file was opened with O_CREAT and/or O_TRUNC.

The event will have a range info of [0..0] to provide an opportunity
to fill entire file content on open.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/namei.c               | 10 +++++++++-
 include/linux/fsnotify.h |  4 +++-
 2 files changed, 12 insertions(+), 2 deletions(-)

Comments

Linus Torvalds Nov. 11, 2024, 9:51 p.m. UTC | #1
On Mon, 11 Nov 2024 at 12:19, Josef Bacik <josef@toxicpanda.com> wrote:
>
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3782,7 +3782,15 @@ static int do_open(struct nameidata *nd,
> +       /*
> +        * This permission hook is different than fsnotify_open_perm() hook.
> +        * This is a pre-content hook that is called without sb_writers held
> +        * and after the file was truncated.
> +        */
> +       return fsnotify_file_area_perm(file, MAY_OPEN, &file->f_pos, 0);
>  }

Stop adding sh*t like this to the VFS layer.

Seriously. I spend time and effort looking at profiles, and then
people who do *not* seem to spend the time and effort just willy nilly
add fsnotify and security events and show down basic paths.

I'm going to NAK any new fsnotify and permission hooks unless people
show that they don't add any overhead.

Because I'm really really tired of having to wade through various
permission hooks in the profiles that I can not fix or optimize,
because those hoosk have no sane defined semantics, just "let user
space know".

Yes, right now it's mostly just the security layer. But this really
looks to me like now fsnotify will be the same kind of pain.

And that location is STUPID. Dammit, it is even *documented* to be
stupid. It's a "pre-content" hook that happens after the contents have
already been truncated. WTF? That's no "pre".

I tried to follow the deep chain of inlines to see what actually ends
up happening, and it looks like if the *whole* filesystem has no
notify events at all, the fsnotify_sb_has_watchers() check will make
this mostly go away, except for all the D$ accesses needed just to
check for it.

But even *one* entirely unrelated event will now force every single
open to typically call __fsnotify_parent() (or possibly "just"
fsnotify), because there's no sane "nobody cares about this dentry"
kind of thing.

So effectively this is a new hook that gets called on every single
open call that nobody else cares about than you, and that people have
lived without for three decades.

Stop it, or at least add the code to not do this all completely pointlessly.

Because otherwise I will not take this kind of stuff any more. I just
spent time trying to figure out how to avoid the pointless cache
misses we did for every path component traversal.

So I *really* don't want to see another pointless stupid fsnotify hook
in my profiles.

                Linus
Josef Bacik Nov. 11, 2024, 10:46 p.m. UTC | #2
On Mon, Nov 11, 2024 at 4:52 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 11 Nov 2024 at 12:19, Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -3782,7 +3782,15 @@ static int do_open(struct nameidata *nd,
> > +       /*
> > +        * This permission hook is different than fsnotify_open_perm() hook.
> > +        * This is a pre-content hook that is called without sb_writers held
> > +        * and after the file was truncated.
> > +        */
> > +       return fsnotify_file_area_perm(file, MAY_OPEN, &file->f_pos, 0);
> >  }
>
> Stop adding sh*t like this to the VFS layer.
>
> Seriously. I spend time and effort looking at profiles, and then
> people who do *not* seem to spend the time and effort just willy nilly
> add fsnotify and security events and show down basic paths.
>
> I'm going to NAK any new fsnotify and permission hooks unless people
> show that they don't add any overhead.
>
> Because I'm really really tired of having to wade through various
> permission hooks in the profiles that I can not fix or optimize,
> because those hoosk have no sane defined semantics, just "let user
> space know".
>
> Yes, right now it's mostly just the security layer. But this really
> looks to me like now fsnotify will be the same kind of pain.
>
> And that location is STUPID. Dammit, it is even *documented* to be
> stupid. It's a "pre-content" hook that happens after the contents have
> already been truncated. WTF? That's no "pre".
>
> I tried to follow the deep chain of inlines to see what actually ends
> up happening, and it looks like if the *whole* filesystem has no
> notify events at all, the fsnotify_sb_has_watchers() check will make
> this mostly go away, except for all the D$ accesses needed just to
> check for it.
>
> But even *one* entirely unrelated event will now force every single
> open to typically call __fsnotify_parent() (or possibly "just"
> fsnotify), because there's no sane "nobody cares about this dentry"
> kind of thing.
>
> So effectively this is a new hook that gets called on every single
> open call that nobody else cares about than you, and that people have
> lived without for three decades.
>
> Stop it, or at least add the code to not do this all completely pointlessly.
>
> Because otherwise I will not take this kind of stuff any more. I just
> spent time trying to figure out how to avoid the pointless cache
> misses we did for every path component traversal.
>
> So I *really* don't want to see another pointless stupid fsnotify hook
> in my profiles.

Did you see the patch that added the
fsnotify_file_has_pre_content_watches() thing?  It'll look at the
inode/sb/mnt to see if there are any watches and carry on if there's
nothing.  This will be the cheapest thing open/write/whatever does.
If there's no watches, nothing happens.  I'm not entirely sure what
else you want me to do here, other than not add the feature, which I
suppose is a position to take.

The overhead here is purely for people who use HSM.  I'm telling you
that in production, with real world workloads, the overhead is far
outweighed by having to copy bytes around we'll never use.  Your
argument is similar to the argument against adding compression to
btrfs, "but it costs CPU!?", sure, but the benefit of writing
significantly less waaaaay outweighed the CPU cost and was a huge net
positive.

This is going to cost something if it's on, it costs nothing if it's
off.  If you want performance numbers that's reasonable, I'll run
fsperf tomorrow and get you a side by side comparison.  Thanks,

Josef
Linus Torvalds Nov. 11, 2024, 11:22 p.m. UTC | #3
On Mon, 11 Nov 2024 at 14:46, Josef Bacik <josef@toxicpanda.com> wrote:
>
> Did you see the patch that added the
> fsnotify_file_has_pre_content_watches() thing?

No, because I had gotten to patch 6/11, and it added this open thing,
and there was no such thing in any of the patches before it.

It looks like you added FSNOTIFY_PRE_CONTENT_EVENTS in 11/17.

However, at no point does it look like you actually test it at open
time, so none of this seems to matter.

As far as I can see, even at the end of the series, you will call the
fsnotify hook at open time even if there are no content watches on the
file.

So apparently the fsnotify_file_has_pre_content_watches() is not
called when it should be, and when it *is* called, it's also doing
completely the wrong thing.

Look, for basic operations THAT DON'T CARE, you now added a function
call to fsnotify_file_has_pre_content_watches(), that function call
looks at inode->i_sb->s_iflags (doing two D$ accesses that shouldn't
be done!), and then after that looks at the i_fsnotify_mask.

THIS IS EXACTLY THE KIND OF GARBAGE I'M TALKING ABOUT.

This code has been written by somebody who NEVER EVER looked at
profiles. You're following chains of pointers when you never should.

Look, here's a very basic example of the kind of complete mis-design
I'm talking about:

 - we're doing a basic read() on a file that isn't being watched.

 - we want to maybe do read-ahead

 - the code does

        if (fsnotify_file_has_pre_content_watches(file))
                return fpin;

   to say that "don't do read-ahead".

Fine, I understand the concept. But keep in mind that the common case
is presumably that there are no content watches.

And even ignoring the "common case" issue, that's the one you want to
OPTIMIZE for. That's the case that matters for performance, because
clearly if there are content watches, you're going to go into "Go
Slow" mode anyway and not do pre-fetching. So even if content watches
are common on some load, they are clearly not the case you should do
performance optimization for.

With me so far?

So if THAT is the case that matters, then dammit, we shouldn't be
calling a function at all.

And when calling the function, we shouldn't start out with this
completely broken logic:

        struct inode *inode = file_inode(file);
        __u32 mnt_mask = real_mount(file->f_path.mnt)->mnt_fsnotify_mask;

        if (!(inode->i_sb->s_iflags & SB_I_ALLOW_HSM))
                return false;

that does random crap and looks up some "mount mask" and looks up the
superblock flags.

Why shouldn't we do this?

BECAUSE NONE OF THIS MATTERS IF THE FILE HASN'T EVEN BEEN MARKED FOR
CONTENT MATCHES!

See why I'm shouting? You're doing insane things, and you're doing
them for all the cases that DO NOT MATTER. You're doing all of this
for the common case that doesn't want to see that kind of mindless
overhead.

You literally check for the "do I even care" *last*, when you finally
do that fsnotify_object_watched() check that looks at the inode. But
by then you have already wasted all that time and effort, and
fsnotify_object_watched() is broken anyway, because it's stupidly
designed to require that mnt_mask that isn't needed if you have
properly marked each object individually.

So what *should* you have?

You should have had a per-file flag saying "Do I need to even call
this crud at all", and have it in a location where you don't need to
look at anything else.

And fsnotify already basically has that flag, except it's mis-designed
too. We have FMODE_NONOTIFY, which is the wrong way around (saying
"don't notify", when that should just be the *default*), and the
fsnotify layer uses it only to mark its own internal files so that it
doesn't get called recursively. So that flag that *looks* sane and is
in the right location is actually doing the wrong thing, because it's
dealing with a rare special case, not the important cases that
actually matter.

So all of this readahead logic - and all of the read and write hooks -
should be behind a simple "oh, this file doesn't have any notification
stuff, so don't bother calling any fsnotify functions".

So I think the pattern should be

    static inline bool fsnotify_file_has_pre_content_watches(struct file *file)
    {
        if (unlikely(file->f_mode & FMODE_NOTIFY))
                return out_of_line_crud(file);
        return false;
    }

so that the *only* thing that gets inlined is "does this file have any
fsnotify stuff at all", and in the common case where that isn't true,
we don't call any fsnotify functions, and we don't start looking at
inodes or superblocks or anything pointless like that.

THAT is the kind of code sequence we should have for unlikely cases.
Not the "call fsnotify code to check for something unlikely". Not
"look at file_inode(file)->i_sb->s_iflags" until it's *required*.

Similarly, the actual open-time code SHOULD NEVER BE CALLED, because
it should have the same kind of simple logic based on some simple flag
either in the dentry or maybe in the inode. So that all those *normal*
dentries and inodes that don't have watches don't get the overhead of
calling into fsnotify code.

Because yes, I do see all those function calls, and all those
unnecessary indirect pointer accesses when I do profiles. And if I see
fsnotify in my profiles when I don't have any notifications enabled,
that really really *annoys* me.

                 Linus
Amir Goldstein Nov. 11, 2024, 11:36 p.m. UTC | #4
On Mon, Nov 11, 2024 at 10:52 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 11 Nov 2024 at 12:19, Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -3782,7 +3782,15 @@ static int do_open(struct nameidata *nd,
> > +       /*
> > +        * This permission hook is different than fsnotify_open_perm() hook.
> > +        * This is a pre-content hook that is called without sb_writers held
> > +        * and after the file was truncated.
> > +        */
> > +       return fsnotify_file_area_perm(file, MAY_OPEN, &file->f_pos, 0);
> >  }
>
> Stop adding sh*t like this to the VFS layer.
>
> Seriously. I spend time and effort looking at profiles, and then
> people who do *not* seem to spend the time and effort just willy nilly
> add fsnotify and security events and show down basic paths.
>
> I'm going to NAK any new fsnotify and permission hooks unless people
> show that they don't add any overhead.
>

FWIW we did work to eliminate overhead long before posting the new hooks.

This prep work has already been merged back in v6.9:
https://lore.kernel.org/linux-fsdevel/20240317184154.1200192-1-amir73il@gmail.com/

After working closely with Oliver and kernel test robot to eliminate
the overhead
of the pre-content hooks and on the way, also reduce overhead of the existing
fanotify permission hooks that were merged back in the day.

> Because I'm really really tired of having to wade through various
> permission hooks in the profiles that I can not fix or optimize,
> because those hoosk have no sane defined semantics, just "let user
> space know".
>
> Yes, right now it's mostly just the security layer. But this really
> looks to me like now fsnotify will be the same kind of pain.
>
> And that location is STUPID. Dammit, it is even *documented* to be
> stupid. It's a "pre-content" hook that happens after the contents have
> already been truncated. WTF? That's no "pre".
>

Yeh, I understand why it seems stupid and it may have been poorly
documented, but there is actually a good reason for the location of
this hook.

The problem with the existing fsnotify_open_perm() hook is that
with O_CREATE, open_last_lookups() takes sb_writers freeze
protection (regardless if file with really created), so we cannot
safely use this hook to start writing to file and fill its content.

So the important part of the comment is:
"This permission hook is different than fsnotify_open_perm() hook.
 This is a pre-content hook that is called without sb_writers held"

The part that follows "and after the file was truncated", simply
means that in case of O_TRUNC we won't need to fill file content,
because the file will have zero size anyway.

And for the record, it is not "pre-open" it is "pre-content-access", so the
name may be confusing but it is not wrong.

> I tried to follow the deep chain of inlines to see what actually ends
> up happening, and it looks like if the *whole* filesystem has no
> notify events at all, the fsnotify_sb_has_watchers() check will make
> this mostly go away, except for all the D$ accesses needed just to
> check for it.
>
> But even *one* entirely unrelated event will now force every single
> open to typically call __fsnotify_parent() (or possibly "just"
> fsnotify), because there's no sane "nobody cares about this dentry"
> kind of thing.
>
> So effectively this is a new hook that gets called on every single
> open call that nobody else cares about than you, and that people have
> lived without for three decades.
>

That's exactly the reason for the commit
a5e57b4d370c fsnotify: optimize the case of no permission event watchers

It is supposed to reduce overhead to bare minimum for hooks of
events from the class "permission" (FSNOTIFY_PRIO_CONTENT), which
are typically only watched for Anti-malware software, so
fsnotify_sb_has_priority_watchers(sb, FSNOTIFY_PRIO_CONTENT)
is typically false on any given fs.

And same for the new hooks for events of class "pre-content".
fsnotify_sb_has_priority_watchers(sb, FSNOTIFY_PRIO_PRE_CONTENT)
will only be true for fs of dedicated systems that run an HSM service, where
the overhead of the hooks is not going to be a concern.

> Stop it, or at least add the code to not do this all completely pointlessly.
>
> Because otherwise I will not take this kind of stuff any more. I just
> spent time trying to figure out how to avoid the pointless cache
> misses we did for every path component traversal.
>
> So I *really* don't want to see another pointless stupid fsnotify hook
> in my profiles.

I understand your frustration from crawling performance regressions, I
really do.
I am personally very grateful for the services of Oliver and his
kernel test robot
who test my development branches to catch regressions very soon in the
development
process.

We may have missed things along the way and you may yet find more issues
that justify another NAK, or more work, but you should know that a lot of care
was taken to try to avoid inflicting pain on the system.

I have been practically doing vfs prep and cleanup work together with
Josef and Jan and Christian for at least a year before we got to post v1 of the
pre-content patches.

So all I ask in return is that you consider the possibility that this is not
utter garbage and try to work with us to address your concerns.

Pre-content hooks (and HSM) is a very powerful feature that many people
are requesting and I believe that we will be able to deliver this
feature without
crapping all over the VFS.

Thanks,
Amir.
Linus Torvalds Nov. 11, 2024, 11:39 p.m. UTC | #5
On Mon, 11 Nov 2024 at 15:22, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> See why I'm shouting? You're doing insane things, and you're doing
> them for all the cases that DO NOT MATTER. You're doing all of this
> for the common case that doesn't want to see that kind of mindless
> overhead.

Side note: I think as filesystem people, you guys are taught to think
"IO is expensive, as long as you can avoid IO, things go fast".

And that's largely true at a filesystem level.

But on the VFS level, the common case is actually "everything is
cached in memory, we're never calling down to the filesystem at all".

And then IO isn't the issue.

So on a VFS level, to a very close approximation, the only thing that
matters is cache misses and mispredicted branches.

(Indirect calls have always had some overhead, and Spectre made it
much worse, so arguably indirect calls have become the third thing
that matters).

So in the VFS layer, we have ridiculous tricks like

        if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) {
                if (likely(inode->i_op->permission))
                        return inode->i_op->permission(idmap, inode, mask);

                /* This gets set once for the inode lifetime */
                spin_lock(&inode->i_lock);
                inode->i_opflags |= IOP_FASTPERM;
                spin_unlock(&inode->i_lock);
        }
        return generic_permission(idmap, inode, mask);

in do_inode_permission, because it turns out that the IOP_FASTPERM
flag means that we literally don't even need to dereference
inode->i_op->permission (nasty chain of D$ accesses), and we can
*only* look at accesses off the 'inode' pointer.

Is this an extreme example? Yes. But the whole i_opflags kind of thing
does end up mattering, exactly because it keeps the D$ footprint
smaller.

                  Linus
Amir Goldstein Nov. 11, 2024, 11:59 p.m. UTC | #6
On Tue, Nov 12, 2024 at 12:22 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 11 Nov 2024 at 14:46, Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > Did you see the patch that added the
> > fsnotify_file_has_pre_content_watches() thing?
>
> No, because I had gotten to patch 6/11, and it added this open thing,
> and there was no such thing in any of the patches before it.
>
> It looks like you added FSNOTIFY_PRE_CONTENT_EVENTS in 11/17.
>
> However, at no point does it look like you actually test it at open
> time, so none of this seems to matter.
>
> As far as I can see, even at the end of the series, you will call the
> fsnotify hook at open time even if there are no content watches on the
> file.
>
> So apparently the fsnotify_file_has_pre_content_watches() is not
> called when it should be, and when it *is* called, it's also doing
> completely the wrong thing.
>
> Look, for basic operations THAT DON'T CARE, you now added a function
> call to fsnotify_file_has_pre_content_watches(), that function call
> looks at inode->i_sb->s_iflags (doing two D$ accesses that shouldn't
> be done!), and then after that looks at the i_fsnotify_mask.
>
> THIS IS EXACTLY THE KIND OF GARBAGE I'M TALKING ABOUT.
>
> This code has been written by somebody who NEVER EVER looked at
> profiles. You're following chains of pointers when you never should.
>
> Look, here's a very basic example of the kind of complete mis-design
> I'm talking about:
>
>  - we're doing a basic read() on a file that isn't being watched.
>
>  - we want to maybe do read-ahead
>
>  - the code does
>
>         if (fsnotify_file_has_pre_content_watches(file))
>                 return fpin;
>
>    to say that "don't do read-ahead".
>
> Fine, I understand the concept. But keep in mind that the common case
> is presumably that there are no content watches.
>
> And even ignoring the "common case" issue, that's the one you want to
> OPTIMIZE for. That's the case that matters for performance, because
> clearly if there are content watches, you're going to go into "Go
> Slow" mode anyway and not do pre-fetching. So even if content watches
> are common on some load, they are clearly not the case you should do
> performance optimization for.
>
> With me so far?
>
> So if THAT is the case that matters, then dammit, we shouldn't be
> calling a function at all.
>
> And when calling the function, we shouldn't start out with this
> completely broken logic:
>
>         struct inode *inode = file_inode(file);
>         __u32 mnt_mask = real_mount(file->f_path.mnt)->mnt_fsnotify_mask;
>
>         if (!(inode->i_sb->s_iflags & SB_I_ALLOW_HSM))
>                 return false;
>
> that does random crap and looks up some "mount mask" and looks up the
> superblock flags.
>
> Why shouldn't we do this?
>
> BECAUSE NONE OF THIS MATTERS IF THE FILE HASN'T EVEN BEEN MARKED FOR
> CONTENT MATCHES!
>
> See why I'm shouting? You're doing insane things, and you're doing
> them for all the cases that DO NOT MATTER. You're doing all of this
> for the common case that doesn't want to see that kind of mindless
> overhead.
>
> You literally check for the "do I even care" *last*, when you finally
> do that fsnotify_object_watched() check that looks at the inode. But
> by then you have already wasted all that time and effort, and
> fsnotify_object_watched() is broken anyway, because it's stupidly
> designed to require that mnt_mask that isn't needed if you have
> properly marked each object individually.
>
> So what *should* you have?
>
> You should have had a per-file flag saying "Do I need to even call
> this crud at all", and have it in a location where you don't need to
> look at anything else.
>
> And fsnotify already basically has that flag, except it's mis-designed
> too. We have FMODE_NONOTIFY, which is the wrong way around (saying
> "don't notify", when that should just be the *default*), and the
> fsnotify layer uses it only to mark its own internal files so that it
> doesn't get called recursively. So that flag that *looks* sane and is
> in the right location is actually doing the wrong thing, because it's
> dealing with a rare special case, not the important cases that
> actually matter.
>
> So all of this readahead logic - and all of the read and write hooks -
> should be behind a simple "oh, this file doesn't have any notification
> stuff, so don't bother calling any fsnotify functions".
>
> So I think the pattern should be
>
>     static inline bool fsnotify_file_has_pre_content_watches(struct file *file)
>     {
>         if (unlikely(file->f_mode & FMODE_NOTIFY))
>                 return out_of_line_crud(file);
>         return false;
>     }
>

I think that's a good idea for pre-content events, because it's fine
to say that if the sb/mount was not watched by a pre-content event listener
at the time of file open, then we do not care.

The problem is that legacy inotify/fanotify watches can be added after
file is open,
so that is allegedly why this optimization was not done for fsnotify
hooks in the past.

Thanks,
Amir.
Linus Torvalds Nov. 12, 2024, 12:37 a.m. UTC | #7
On Mon, 11 Nov 2024 at 16:00, Amir Goldstein <amir73il@gmail.com> wrote:
>
> I think that's a good idea for pre-content events, because it's fine
> to say that if the sb/mount was not watched by a pre-content event listener
> at the time of file open, then we do not care.

Right.

> The problem is that legacy inotify/fanotify watches can be added after
> file is open, so that is allegedly why this optimization was not done for
> fsnotify hooks in the past.

So honestly, even if the legacy fsnotify hooks can't look at the file
flag, they could damn well look at an inode flag.

And I'm not even convinced that we couldn't fix them to just look at a
file flag, and say "tough luck, somebody opened that file before you
started watching, you don't get to see what they did".

So even if we don't look at a file->f_mode flag, the lergacy cases
should look at i_fsnotify_mask, and do that *first*.

IOW, not do it like fsnotify_object_watched() does now, which is just
broken. Again, it looks at inode->i_sb->s_fsnotify_mask completely
pointlessly, but it also does it much too late - it gets called after
we've already called into the fsnotify() code and have messed up the
I$ etc.

The "linode->i_sb->s_fsnotify_mask" is not only an extra indirection,
it should be very *literally* pointless. If some bit isn't set in
i_sb->s_fsnotify_mask, then there should be no way to set that bit in
inode->i_fsnotify_mask. So the only time we should access
i_sb->s_fsnotify_mask is when i_notify_mask gets *modified*, not when
it gets tested.

But even if that silly and pointless i_sb->s_fsnotify_mask thing is
removed, fsnotify_object_watched() is *still* wrong, because it
requires that mnt_mask as an argument, which means that the caller now
has to look it up - all this entirely pointless work that should never
be done if the bit wasn't set in inode->i_fsnotify_mask.

So I really think fsnotify is doing *everything* wrong.

And I most certainly don't want to add more runtime hooks to
*critical* code like open/read/write.

Right now, many of the fsnotify things are for "metadata", ie for
bigger file creation / removal / move etc. And yes, the "don't do this
if there are no fsnotify watchers AT ALL" does actually end up meaning
that most of the time I never see any of it in profiles, because the
fsnotify_sb_has_watchers() culls out that case.

And while the fsnotify_sb_has_watchers() thing is broken garbage and
does too many indirections and is not testing the right thing, at
least it's inlined and you don't get the function calls.

That doesn't make fsnotify "right", but at least it's not in my face.
I see the sb accesses, and I hate them, but it's usually at least
hidden. Admittedly not as well hidden as it *should* be, since it does
the access tests in the wrong order, but the old fsnotify_open()
doesn't strike me as "terminally broken".

It doesn't have a permission test after the open has already done
things, and it's inlined enough that it isn't actively offensive.

And most of the other fsnotify things have the same pattern - not
great, but not actively offensive.

These new patches make it in my face.

So I do require that the *new* cases at least get it right. The fact
that we have old code that is misdesigned and gets it wrong and should
also be improved isn't an excuse to add *more* badly coded stuff.

And yes, if somebody fixes the old fsnotify stuff to check just the
i_fsnotify_mask in the inline function, and moves all the other silly
checks out-of-line, that would be an improvement. I'd very much
applaud that. But it's a separate thing from adding new hooks.

                Linus
Amir Goldstein Nov. 12, 2024, 8:11 a.m. UTC | #8
On Tue, Nov 12, 2024 at 1:37 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 11 Nov 2024 at 16:00, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > I think that's a good idea for pre-content events, because it's fine
> > to say that if the sb/mount was not watched by a pre-content event listener
> > at the time of file open, then we do not care.
>
> Right.
>
> > The problem is that legacy inotify/fanotify watches can be added after
> > file is open, so that is allegedly why this optimization was not done for
> > fsnotify hooks in the past.
>
> So honestly, even if the legacy fsnotify hooks can't look at the file
> flag, they could damn well look at an inode flag.
>

Legacy fanotify has a mount watch (FAN_MARK_MOUNT),
which is the common way for Anti-malware to set watches on
filesystems, so I am not sure what you are saying.

> And I'm not even convinced that we couldn't fix them to just look at a
> file flag, and say "tough luck, somebody opened that file before you
> started watching, you don't get to see what they did".

That would specifically break tail -f (for inotify) and probably many other
tools, but as long as we also look at the inode flags (i_fsnotify_mask)
and the dentry flags (DCACHE_FSNOTIFY_PARENT_WATCHED),
then I think we may be able to get away with changing the semantics
for open files on a fanotify mount watch.

Specifically, I would really like to eliminate completely the cost of
FAN_ACCESS_PERM event, which could be gated on file flag, because
this is only for security/Anti-malware and I don't think this event is
practically
useful and it sure does not need to guarantee permission events to mount
watchers on already open files.

>
> So even if we don't look at a file->f_mode flag, the lergacy cases
> should look at i_fsnotify_mask, and do that *first*.
>
> IOW, not do it like fsnotify_object_watched() does now, which is just
> broken. Again, it looks at inode->i_sb->s_fsnotify_mask completely
> pointlessly, but it also does it much too late - it gets called after
> we've already called into the fsnotify() code and have messed up the
> I$ etc.
>
> The "linode->i_sb->s_fsnotify_mask" is not only an extra indirection,
> it should be very *literally* pointless. If some bit isn't set in
> i_sb->s_fsnotify_mask, then there should be no way to set that bit in
> inode->i_fsnotify_mask. So the only time we should access
> i_sb->s_fsnotify_mask is when i_notify_mask gets *modified*, not when
> it gets tested.
>

i_fsnotify_mask is the cumulative mask of all inode watchers
s_fsnotify_mask is *not* the cumulative of all i_fsnotify_mask
s_fsnotify_mask is the cumulative mask of all sb watchers
mnt_fsnotify_marks is the cumulative mask of all mount watchers

> But even if that silly and pointless i_sb->s_fsnotify_mask thing is
> removed, fsnotify_object_watched() is *still* wrong, because it
> requires that mnt_mask as an argument, which means that the caller now
> has to look it up - all this entirely pointless work that should never
> be done if the bit wasn't set in inode->i_fsnotify_mask.
>
> So I really think fsnotify is doing *everything* wrong.

Note the difference between fsnotify_sb_has_watchers()
and fsnotify_object_watched().

The former is an early optimization gate that checks if there are
any inode/mount/sb watchers (per class) on the filesystem, regardless
of specific events and specific target inode/file.

We could possibly further optimize fsnotify_sb_has_watchers() to avoid
access to ->s_fsnotify_info by setting sb flag (for each priority class).

The latter checks if any inode/mount/sb are interested in a specific
event on the said object.

In upstream code, fsnotify_object_watched() is always gated behind
fsnotify_sb_has_watchers(), which tries to prevent the indirect call.

The new fsnotify_file_has_pre_content_watches() helper could
have checked fsnotify_sb_has_priority_watchers(sb,
                                               FSNOTIFY_PRIO_CONTENT);
but it is better to gate by file flag as you suggested.


>
> And I most certainly don't want to add more runtime hooks to
> *critical* code like open/read/write.
>
> Right now, many of the fsnotify things are for "metadata", ie for
> bigger file creation / removal / move etc. And yes, the "don't do this
> if there are no fsnotify watchers AT ALL" does actually end up meaning
> that most of the time I never see any of it in profiles, because the
> fsnotify_sb_has_watchers() culls out that case.
>
> And while the fsnotify_sb_has_watchers() thing is broken garbage and
> does too many indirections and is not testing the right thing, at
> least it's inlined and you don't get the function calls.
>
> That doesn't make fsnotify "right", but at least it's not in my face.
> I see the sb accesses, and I hate them, but it's usually at least
> hidden. Admittedly not as well hidden as it *should* be, since it does
> the access tests in the wrong order, but the old fsnotify_open()
> doesn't strike me as "terminally broken".
>
> It doesn't have a permission test after the open has already done
> things, and it's inlined enough that it isn't actively offensive.
>
> And most of the other fsnotify things have the same pattern - not
> great, but not actively offensive.
>
> These new patches make it in my face.
>
> So I do require that the *new* cases at least get it right. The fact
> that we have old code that is misdesigned and gets it wrong and should
> also be improved isn't an excuse to add *more* badly coded stuff.
>
> And yes, if somebody fixes the old fsnotify stuff to check just the
> i_fsnotify_mask in the inline function, and moves all the other silly
> checks out-of-line, that would be an improvement. I'd very much
> applaud that. But it's a separate thing from adding new hooks.

We will do both.

Thanks,
Amir.
Jan Kara Nov. 12, 2024, 1:54 p.m. UTC | #9
On Tue 12-11-24 09:11:32, Amir Goldstein wrote:
> On Tue, Nov 12, 2024 at 1:37 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Mon, 11 Nov 2024 at 16:00, Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > I think that's a good idea for pre-content events, because it's fine
> > > to say that if the sb/mount was not watched by a pre-content event listener
> > > at the time of file open, then we do not care.
> >
> > Right.
> >
> > > The problem is that legacy inotify/fanotify watches can be added after
> > > file is open, so that is allegedly why this optimization was not done for
> > > fsnotify hooks in the past.
> >
> > So honestly, even if the legacy fsnotify hooks can't look at the file
> > flag, they could damn well look at an inode flag.
> 
> Legacy fanotify has a mount watch (FAN_MARK_MOUNT),
> which is the common way for Anti-malware to set watches on
> filesystems, so I am not sure what you are saying.
> 
> > And I'm not even convinced that we couldn't fix them to just look at a
> > file flag, and say "tough luck, somebody opened that file before you
> > started watching, you don't get to see what they did".
> 
> That would specifically break tail -f (for inotify) and probably many other
> tools, but as long as we also look at the inode flags (i_fsnotify_mask)
> and the dentry flags (DCACHE_FSNOTIFY_PARENT_WATCHED),
> then I think we may be able to get away with changing the semantics
> for open files on a fanotify mount watch.

Yes, I agree we cannot afford to generate FS_MODIFY event only if the mark
was placed after file open. There's too much stuff in userspace depending
on this since this behavior dates back to inotify interface sometime in
2010 or so.

> Specifically, I would really like to eliminate completely the cost of
> FAN_ACCESS_PERM event, which could be gated on file flag, because
> this is only for security/Anti-malware and I don't think this event is
> practically
> useful and it sure does not need to guarantee permission events to mount
> watchers on already open files.

For traditional fanotify permission events I agree generating them only if
the mark was placed before open is likely fine but we'll have to try and
see whether something breaks. For the new pre-content events I like the
per-file flag as Linus suggested. That should indeed save us some cache
misses in some fast paths.

								Honza
Jan Kara Nov. 12, 2024, 2:28 p.m. UTC | #10
On Mon 11-11-24 16:37:30, Linus Torvalds wrote:
> On Mon, 11 Nov 2024 at 16:00, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > I think that's a good idea for pre-content events, because it's fine
> > to say that if the sb/mount was not watched by a pre-content event listener
> > at the time of file open, then we do not care.
> 
> Right.
> 
> > The problem is that legacy inotify/fanotify watches can be added after
> > file is open, so that is allegedly why this optimization was not done for
> > fsnotify hooks in the past.
> 
> So honestly, even if the legacy fsnotify hooks can't look at the file
> flag, they could damn well look at an inode flag.
> 
> And I'm not even convinced that we couldn't fix them to just look at a
> file flag, and say "tough luck, somebody opened that file before you
> started watching, you don't get to see what they did".
> 
> So even if we don't look at a file->f_mode flag, the lergacy cases
> should look at i_fsnotify_mask, and do that *first*.
> 
> IOW, not do it like fsnotify_object_watched() does now, which is just
> broken. Again, it looks at inode->i_sb->s_fsnotify_mask completely
> pointlessly, but it also does it much too late - it gets called after
> we've already called into the fsnotify() code and have messed up the
> I$ etc.
> 
> The "linode->i_sb->s_fsnotify_mask" is not only an extra indirection,
> it should be very *literally* pointless. If some bit isn't set in
> i_sb->s_fsnotify_mask, then there should be no way to set that bit in
> inode->i_fsnotify_mask. So the only time we should access
> i_sb->s_fsnotify_mask is when i_notify_mask gets *modified*, not when
> it gets tested.

Thanks for explanation but what you write here and below seems to me as if
you are not aware of some features fanotify API provides (for many years).
With fanotify you can place a mark not only on a file / directory but you
can place it also on a mount point or a superblock. In that case any
operation of appropriate type happening through that mount point or on that
superblock should deliver the event to the notification group that placed
the mark. So for example we check inode->i_sb->s_fsnotify_mask on open to
be able to decide whether somebody asked to get notifications about *all*
opens on that superblock and generate appropriate events. These features
are there since day 1 of fanotify (at least for mountpoints, superblocks
were added later) and quite some userspace depends on them for doing
filesystem-wide event monitoring.

We could somehow cache the fact that someone placed a mark on the
superblock in the inode / struct file but I don't see how to keep such
cache consistent with marks that are attached to the superblock without
incurring the very same cache misses we are trying to avoid.

I do like your idea of caching whether somebody wants the notification in
struct file as that way we can avoid the cache misses for the
new pre-content hooks and possibly in hooks for the traditional fanotify
permission events. But I don't see how we could possibly avoid these cache
misses for the classical notification events like FAN_OPEN, FAN_ACCESS,
FAN_MODIFY, ...

								Honza
Amir Goldstein Nov. 12, 2024, 2:42 p.m. UTC | #11
On Tue, Nov 12, 2024 at 2:55 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 12-11-24 09:11:32, Amir Goldstein wrote:
> > On Tue, Nov 12, 2024 at 1:37 AM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > > On Mon, 11 Nov 2024 at 16:00, Amir Goldstein <amir73il@gmail.com> wrote:
> > > >
> > > > I think that's a good idea for pre-content events, because it's fine
> > > > to say that if the sb/mount was not watched by a pre-content event listener
> > > > at the time of file open, then we do not care.
> > >
> > > Right.
> > >
> > > > The problem is that legacy inotify/fanotify watches can be added after
> > > > file is open, so that is allegedly why this optimization was not done for
> > > > fsnotify hooks in the past.
> > >
> > > So honestly, even if the legacy fsnotify hooks can't look at the file
> > > flag, they could damn well look at an inode flag.
> >
> > Legacy fanotify has a mount watch (FAN_MARK_MOUNT),
> > which is the common way for Anti-malware to set watches on
> > filesystems, so I am not sure what you are saying.
> >
> > > And I'm not even convinced that we couldn't fix them to just look at a
> > > file flag, and say "tough luck, somebody opened that file before you
> > > started watching, you don't get to see what they did".
> >
> > That would specifically break tail -f (for inotify) and probably many other
> > tools, but as long as we also look at the inode flags (i_fsnotify_mask)
> > and the dentry flags (DCACHE_FSNOTIFY_PARENT_WATCHED),
> > then I think we may be able to get away with changing the semantics
> > for open files on a fanotify mount watch.
>
> Yes, I agree we cannot afford to generate FS_MODIFY event only if the mark
> was placed after file open. There's too much stuff in userspace depending
> on this since this behavior dates back to inotify interface sometime in
> 2010 or so.
>
> > Specifically, I would really like to eliminate completely the cost of
> > FAN_ACCESS_PERM event, which could be gated on file flag, because
> > this is only for security/Anti-malware and I don't think this event is
> > practically
> > useful and it sure does not need to guarantee permission events to mount
> > watchers on already open files.
>
> For traditional fanotify permission events I agree generating them only if
> the mark was placed before open is likely fine but we'll have to try and
> see whether something breaks. For the new pre-content events I like the
> per-file flag as Linus suggested. That should indeed save us some cache
> misses in some fast paths.

FWIW, attached a patch that implements FMODE_NOTIFY_PERM
I have asked Oliver to run his performance tests to see if we can
observe an improvement with existing workloads, but is sure is going
to be useful for pre-content events.

For example, here is what the pre content helper looks like after
I adapted Josef's patches to use the flag:

static inline bool fsnotify_file_has_pre_content_watches(struct file *file)
{
        if (!(file->f_mode & FMODE_NOTIFY_PERM))
                return false;

        if (!(file_inode(file)->i_sb->s_iflags & SB_I_ALLOW_HSM))
                return false;

        return fsnotify_file_object_watched(file, FSNOTIFY_PRE_CONTENT_EVENTS);
}

Thanks,
Amir.
Josef Bacik Nov. 12, 2024, 3:24 p.m. UTC | #12
On Mon, Nov 11, 2024 at 03:22:06PM -0800, Linus Torvalds wrote:
> On Mon, 11 Nov 2024 at 14:46, Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > Did you see the patch that added the
> > fsnotify_file_has_pre_content_watches() thing?
> 
> No, because I had gotten to patch 6/11, and it added this open thing,
> and there was no such thing in any of the patches before it.
> 
> It looks like you added FSNOTIFY_PRE_CONTENT_EVENTS in 11/17.
> 
> However, at no point does it look like you actually test it at open
> time, so none of this seems to matter.
> 
> As far as I can see, even at the end of the series, you will call the
> fsnotify hook at open time even if there are no content watches on the
> file.
> 
> So apparently the fsnotify_file_has_pre_content_watches() is not
> called when it should be, and when it *is* called, it's also doing
> completely the wrong thing.
> 
> Look, for basic operations THAT DON'T CARE, you now added a function
> call to fsnotify_file_has_pre_content_watches(), that function call
> looks at inode->i_sb->s_iflags (doing two D$ accesses that shouldn't
> be done!), and then after that looks at the i_fsnotify_mask.
> 
> THIS IS EXACTLY THE KIND OF GARBAGE I'M TALKING ABOUT.
> 
> This code has been written by somebody who NEVER EVER looked at
> profiles. You're following chains of pointers when you never should.
> 
> Look, here's a very basic example of the kind of complete mis-design
> I'm talking about:
> 
>  - we're doing a basic read() on a file that isn't being watched.
> 
>  - we want to maybe do read-ahead
> 
>  - the code does
> 
>         if (fsnotify_file_has_pre_content_watches(file))
>                 return fpin;
> 
>    to say that "don't do read-ahead".
> 
> Fine, I understand the concept. But keep in mind that the common case
> is presumably that there are no content watches.
> 
> And even ignoring the "common case" issue, that's the one you want to
> OPTIMIZE for. That's the case that matters for performance, because
> clearly if there are content watches, you're going to go into "Go
> Slow" mode anyway and not do pre-fetching. So even if content watches
> are common on some load, they are clearly not the case you should do
> performance optimization for.
> 
> With me so far?
> 
> So if THAT is the case that matters, then dammit, we shouldn't be
> calling a function at all.
> 
> And when calling the function, we shouldn't start out with this
> completely broken logic:
> 
>         struct inode *inode = file_inode(file);
>         __u32 mnt_mask = real_mount(file->f_path.mnt)->mnt_fsnotify_mask;
> 
>         if (!(inode->i_sb->s_iflags & SB_I_ALLOW_HSM))
>                 return false;
> 
> that does random crap and looks up some "mount mask" and looks up the
> superblock flags.
> 
> Why shouldn't we do this?
> 
> BECAUSE NONE OF THIS MATTERS IF THE FILE HASN'T EVEN BEEN MARKED FOR
> CONTENT MATCHES!
> 
> See why I'm shouting? You're doing insane things, and you're doing
> them for all the cases that DO NOT MATTER. You're doing all of this
> for the common case that doesn't want to see that kind of mindless
> overhead.
> 
> You literally check for the "do I even care" *last*, when you finally
> do that fsnotify_object_watched() check that looks at the inode. But
> by then you have already wasted all that time and effort, and
> fsnotify_object_watched() is broken anyway, because it's stupidly
> designed to require that mnt_mask that isn't needed if you have
> properly marked each object individually.
> 
> So what *should* you have?
> 
> You should have had a per-file flag saying "Do I need to even call
> this crud at all", and have it in a location where you don't need to
> look at anything else.
> 
> And fsnotify already basically has that flag, except it's mis-designed
> too. We have FMODE_NONOTIFY, which is the wrong way around (saying
> "don't notify", when that should just be the *default*), and the
> fsnotify layer uses it only to mark its own internal files so that it
> doesn't get called recursively. So that flag that *looks* sane and is
> in the right location is actually doing the wrong thing, because it's
> dealing with a rare special case, not the important cases that
> actually matter.
> 
> So all of this readahead logic - and all of the read and write hooks -
> should be behind a simple "oh, this file doesn't have any notification
> stuff, so don't bother calling any fsnotify functions".
> 
> So I think the pattern should be
> 
>     static inline bool fsnotify_file_has_pre_content_watches(struct file *file)
>     {
>         if (unlikely(file->f_mode & FMODE_NOTIFY))
>                 return out_of_line_crud(file);
>         return false;
>     }
> 
> so that the *only* thing that gets inlined is "does this file have any
> fsnotify stuff at all", and in the common case where that isn't true,
> we don't call any fsnotify functions, and we don't start looking at
> inodes or superblocks or anything pointless like that.
> 
> THAT is the kind of code sequence we should have for unlikely cases.
> Not the "call fsnotify code to check for something unlikely". Not
> "look at file_inode(file)->i_sb->s_iflags" until it's *required*.
> 
> Similarly, the actual open-time code SHOULD NEVER BE CALLED, because
> it should have the same kind of simple logic based on some simple flag
> either in the dentry or maybe in the inode. So that all those *normal*
> dentries and inodes that don't have watches don't get the overhead of
> calling into fsnotify code.
> 
> Because yes, I do see all those function calls, and all those
> unnecessary indirect pointer accesses when I do profiles. And if I see
> fsnotify in my profiles when I don't have any notifications enabled,
> that really really *annoys* me.
> 

There are good suggestions here, and decent analysis, Amir has followed up with
a patch that will improve things, and that's good.

But this was an entirely inappropriate way to communicate your point, even with
people who have been here a while and are used to being yelled at.

Throughout this email you have suggested that myself, Amir, and Jan have never
looked at profiles.  You go so far as to suggest that we have no idea what we're
doing and don't understand common case optimizations.  These are the sort of
comments that are unhelpful and put most people on the defensive, and make them
unwilling to listen to your suggestions and feedback.  These are the sort of
comments that make people work very hard to exit this community.

Are you wrong?  No.  We all get tunnel vision, I know I was deeply focused on
making the page fault case have as little impact as possible, but I definitely
didn't consider the indirect access case.  I don't run with mitigations on, and
frankly I am a file system person, I know if you're here we're going to do IO
and that's going to be the bad part.  That's why we do code review, because
we're all human and we all miss things.

But being dressed down like this for missing a better way, because lets be
honest what I did wasn't earth shatteringly bad, is not helpful.  It is actively
harmful, and it makes me not want to work on this stuff anymore.

We constantly have discussions about how do we bring in new people and how do we
train up new maintainers, without looking at how do we keep maintainers and how
do we keep developers.  If you're losing people like me, gaining new people is
going to be an even taller order.  Thanks,

Josef
Linus Torvalds Nov. 12, 2024, 5:27 p.m. UTC | #13
On Tue, 12 Nov 2024 at 07:24, Josef Bacik <josef@toxicpanda.com> wrote:
>
> But this was an entirely inappropriate way to communicate your point, even with
> people who have been here a while and are used to being yelled at.

Fair enough.

I was probably way too frustrated, because I spent some of the last
few weeks literally trying to remove two cache misses from the
permission checking path.

And those two cache misses were for features that are in POSIX made
worse because of uid translations infrastructure used for containers.

Those are both things that people actually *use* in major ways.
Admittedly the particular optimization was exactly to _not_ bother
with looking up the exact uid details when we could tell early that it
was all unnecessary, but the point was that this was to optimize
something real, something important, and not some special case.

And here's an example of a profile I've been looking at (this is a
real load: the profile is literally my "make allmodconfig" build with
not a lot of actual rebuilding needed):

  1.30%  avc_has_perm_noaudit
  1.24%  selinux_inode_permission
  1.18%  link_path_walk.part.0.constprop.0
  0.99%  btrfs_getattr
  0.82%  clear_page_rep
  0.82%  security_inode_permission
  0.60%  dput
  0.60%  path_lookupat
  0.60%  __check_object_size
  0.54%  strncpy_from_user
  0.51%  inode_permission
  0.50%  generic_permission
  0.47%  kmem_cache_alloc_noprof
  0.47%  step_into
  0.46%  btrfs_permission
  0.45%  cp_new_stat
  0.44%  filename_lookup

and hey, you go "most of those are just around half a percent". Sure.
But add those things up, and you'll see that they add up to about 12%.

And yes, that 12% is 1/8th of the whole load. So it's not some "just
12% of the kernel footprint". It's literally 12% of one of the main
loads I run day-to-day, which is why I care about these paths so
deeply.

And look at the two top offenders. No, they aren't fsnotify. But guess
what they are? They are hooks in the VFS layer that the VFS code
cannot do anything about and cannot optimize as a result. They do
ABSOLUTELY NOTHING on this load, and they add no actual value.

If they had some kind of "I don't have any special security label"
test, the two top offenders (and down that list you can find a third
one) would likely just not exist at all. But they aren't "real" VFS
functions, they are just hooks into a black hole with random semantics
that is set by user-supplied tables, so we don't have that.

So this is why I'm putting my foot down. No more badly coded and badly
designed hooks that the VFS layer can't just optimize away for the
common case.

I can't fix the existing issues. But I can say "no more". If you want
new hooks, they had better have effectively *ZERO* overhead when they
aren't relevant.

And yes, I was too forceful. Sorry. But my foot stays down.

If you want a specialty hook for specialty uses, that hook needs to be
so *fundamentally* low-cost that I simply don't need to worry about
it.

It needs to have a flag that doesn't take a cache miss that says
"Nobody cares", and doesn't call out to random pointless functions
that cause I$ misses either.

I really wish I had made that requirement for the security people all
those years ago. Because 99% of the time, all that stupid selinux
noise is literally worthless and does nothing. But we don't have the
flag that says "nothing to see".

               Linus
diff mbox series

Patch

diff --git a/fs/namei.c b/fs/namei.c
index 4a4a22a08ac2..b49fb1f80c0c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3782,7 +3782,15 @@  static int do_open(struct nameidata *nd,
 	}
 	if (do_truncate)
 		mnt_drop_write(nd->path.mnt);
-	return error;
+	if (error)
+		return error;
+
+	/*
+	 * This permission hook is different than fsnotify_open_perm() hook.
+	 * This is a pre-content hook that is called without sb_writers held
+	 * and after the file was truncated.
+	 */
+	return fsnotify_file_area_perm(file, MAY_OPEN, &file->f_pos, 0);
 }
 
 /**
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 22150e5797c5..1e87a54b88b6 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -171,6 +171,8 @@  static inline int fsnotify_pre_content(const struct file *file,
 
 /*
  * fsnotify_file_area_perm - permission hook before access of file range
+ *
+ * Called post open with access range [0..0].
  */
 static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
 					  const loff_t *ppos, size_t count)
@@ -185,7 +187,7 @@  static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
 	/*
 	 * read()/write and other types of access generate pre-content events.
 	 */
-	if (perm_mask & (MAY_READ | MAY_WRITE | MAY_ACCESS)) {
+	if (perm_mask & (MAY_READ | MAY_WRITE | MAY_ACCESS | MAY_OPEN)) {
 		int ret = fsnotify_pre_content(file, ppos, count);
 
 		if (ret)