[1/8] networking/fanotify: declare fanotify socket numbers

Message ID	20090911052558.32359.18075.stgit@paris.rdu.redhat.com
State	RFC, archived
Delegated to:	David Miller
Headers	show Return-Path: <netdev-owner@vger.kernel.org> From: Eric Paris <eparis@redhat.com> Subject: [PATCH 1/8] networking/fanotify: declare fanotify socket numbers To: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, netdev@vger.kernel.org Cc: davem@davemloft.net, viro@zeniv.linux.org.uk, alan@linux.intel.com, hch@infradead.org Date: Fri, 11 Sep 2009 01:25:58 -0400 Message-ID: <20090911052558.32359.18075.stgit@paris.rdu.redhat.com> User-Agent: StGIT/0.14.3 MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: netdev-owner@vger.kernel.org Precedence: bulk

Eric Paris Sept. 11, 2009, 5:25 a.m. UTC

fanotify's user interface uses a custom socket (it doesn't use netlink
since work must be done in the context of the receive side of the socket)

This patch simply defines the fanotify socket number declarations.  The
actual implementation of the socket is in a later patch.

Signed-off-by: Eric Paris <eparis@redhat.com>
---

 include/linux/socket.h |    5 ++++-
 net/core/sock.c        |    6 +++---
 2 files changed, 7 insertions(+), 4 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andreas Gruenbacher Sept. 11, 2009, 2:32 p.m. UTC | #1

The patches did apply and build against next-20090910. I wrote a small user-
space utility for testing (attached); see how painless the socket interface 
is. The patches seem to be working well, except that some required 
functionality is missing still.

Currently, the CAP_NET_RAW capability is needed for being able to create 
watches. This seems too strict to me; I don't see why I shouldn't be able to 
watch my own files, or files which I have read access to (like inotify).

There are some actions like creating hardlinks in directories or removing 
files which don't trigger events. From a user point of view, I would prefer to 
receive those events as well. (I notice that it's not easy to to pass file 
descriptors to listeners for those events.)

Thanks,
Andreas

Eric Paris Sept. 11, 2009, 4:04 p.m. UTC | #2

On Fri, 2009-09-11 at 16:32 +0200, Andreas Gruenbacher wrote:
> The patches did apply and build against next-20090910. I wrote a small user-
> space utility for testing (attached); see how painless the socket interface 
> is. The patches seem to be working well, except that some required 
> functionality is missing still.
> 
> Currently, the CAP_NET_RAW capability is needed for being able to create 
> watches. This seems too strict to me; I don't see why I shouldn't be able to 
> watch my own files, or files which I have read access to (like inotify).

I agree completely.  However since I don't yet have limits on how many
marks a user can add, nor the number of events they can hold in their
queue I'm leaving it as root only for the moment.  Patches have started
to flush out usability by non-root users.

> There are some actions like creating hardlinks in directories or removing 
> files which don't trigger events. From a user point of view, I would prefer to 
> receive those events as well. (I notice that it's not easy to to pass file 
> descriptors to listeners for those events.)

Yes, a number of event types hook into the vfs in places where we do not
have at least a struct path.  I'm certainly considering how to move
hoooks to get that ability.

-Eric

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Miller Sept. 11, 2009, 6:46 p.m. UTC | #3

From: Eric Paris <eparis@redhat.com>
Date: Fri, 11 Sep 2009 01:25:58 -0400

> fanotify's user interface uses a custom socket (it doesn't use netlink
> since work must be done in the context of the receive side of the socket)
> 
> This patch simply defines the fanotify socket number declarations.  The
> actual implementation of the socket is in a later patch.
> 
> Signed-off-by: Eric Paris <eparis@redhat.com>

I would really prefer if you worked on eliminating the problem that
prevents you from using netlink instead.

You're just duplicating tons of netlink functionality only because of
this apparent limitation, and that's not good.

If you use netlink you'll be able to define arbitrary attributes,
they'll be optional and thus you can allow notification application to
set filters on those attributes, as well as all sorts of other things.

You can reimplement that, but I really would rather see you make
netlink suit your needs.  This is how we make existing facilities
more powerful and useful to consumers, when someone tries to use
it in a new way and with new requirements.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric Paris Sept. 11, 2009, 7:33 p.m. UTC | #4

On Fri, 2009-09-11 at 11:46 -0700, David Miller wrote:
> From: Eric Paris <eparis@redhat.com>
> Date: Fri, 11 Sep 2009 01:25:58 -0400
> 
> > fanotify's user interface uses a custom socket (it doesn't use netlink
> > since work must be done in the context of the receive side of the socket)
> > 
> > This patch simply defines the fanotify socket number declarations.  The
> > actual implementation of the socket is in a later patch.
> > 
> > Signed-off-by: Eric Paris <eparis@redhat.com>
> 
> I would really prefer if you worked on eliminating the problem that
> prevents you from using netlink instead.

I'm not really sure if I can, although I'd love to hear input from
someone who knows the netlink code on how I can make it do what I need.
I'm really not duplicating much other than the NLMSG_OK and NLMSG_NEXT
macros.  My code doesn't even use skbs and I'm not savy enough to really
know how I could.  I'm more than willing to work on it if someone can
point me to how it might work.

The basic problem is that process (a) opens a file.  In the context of
process(a) we need to take a reference on the struct path, store some
other information about what happened and then wait for the listener,
process(b), to run.  When process (b) runs it takes the struct path,
opens the file it points to, then writes up to userspace the information
collected back when (a) was running and the fd opened in (b)'s context.

Currently (aka this patch set) I do the storage of information and the
waiting for (b) to be ready using the same queuing mechanism as inotify.
the fsnotify backend creates a single 'event' which can be added to the
queues for multiple notification listeners.  So if 50 inotify watches
are on the same directory we only create one event.  I guess this could
be reimplemented to send an skb which contains the information that
fsnotify stores in an event rather than just queuing up the generic
event.  Then I could somehow hook the recv side of netlink to pull the
skb off and instead of dumping it's data to the receiver to actually
open an fd and rewrite the message.  I'm not sure what netlink is buying
me here thought.  I also don't know the code well enough to know
everywhere that an skb can be lost.  Since the skb is going to hold a
reference to a struct path and contain a pointer to a struct path I
certainly can't handle it being dropped.  The other major thing that I
lose or have a much harder time reimplementing is event merging.  After
process (a) drops an event on the queue if it does something else to the
same file fanotify will search the queue and merge events.  I certainly
don't know how to search the netlink queue to find old skbs edit them,
and all the while make sure I'm doing it race free.

I'm willing to try to make this happen, I'm just sure I see the benefit
and I don't know anyone who knows the net/netlink code well enough who
owuld be interested to help.....

-Eric

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jamie Lokier Sept. 11, 2009, 8:46 p.m. UTC | #5

Eric Paris wrote:
> > I would really prefer if you worked on eliminating the problem that
> > prevents you from using netlink instead.
> 
> I'm not really sure if I can, although I'd love to hear input from
> someone who knows the netlink code on how I can make it do what I need.
> I'm really not duplicating much other than the NLMSG_OK and NLMSG_NEXT
> macros.  My code doesn't even use skbs and I'm not savy enough to really
> know how I could.  I'm more than willing to work on it if someone can
> point me to how it might work.

Let's turn the question around.

Since you're doing lots of non-sockety things, and can't tolerate
dropped packets - why isn't it a character device?  What's the reason
for using a socket at all?

(I'm reminded of /dev/poll, /dev/epoll and /dev/inotify :-)

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric Paris Sept. 11, 2009, 9:13 p.m. UTC | #6

On Fri, 2009-09-11 at 21:46 +0100, Jamie Lokier wrote:
> Eric Paris wrote:
> > > I would really prefer if you worked on eliminating the problem that
> > > prevents you from using netlink instead.
> > 
> > I'm not really sure if I can, although I'd love to hear input from
> > someone who knows the netlink code on how I can make it do what I need.
> > I'm really not duplicating much other than the NLMSG_OK and NLMSG_NEXT
> > macros.  My code doesn't even use skbs and I'm not savy enough to really
> > know how I could.  I'm more than willing to work on it if someone can
> > point me to how it might work.
> 
> Let's turn the question around.
> 
> Since you're doing lots of non-sockety things, and can't tolerate
> dropped packets - why isn't it a character device?  What's the reason
> for using a socket at all?
> 
> (I'm reminded of /dev/poll, /dev/epoll and /dev/inotify :-)

Originally it was a char device and I was told to use a socket protocol
so I could use get/set sockopt rather than ioctl, because ioctl is the
devil (even if those aren't THAT much better).

The queuing being done using events instead of skbs was done reusing
inotify code, reusing network code would be just as good with me.  What
I really need is a way to convey a pointer from one process to another.
That's why I claim loss is not an option, since I'm holding a reference
to the pointer I can't have that conveyance disappear under us.

If network people want me to get back out of the network system I can go
back to a char file with lots of ioctls.  I'd love to reuse code, I just
don't know what's possible...

-Eric

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

jamal Sept. 11, 2009, 9:21 p.m. UTC | #7

On Fri, 2009-09-11 at 15:33 -0400, Eric Paris wrote:

> 
> I'm willing to try to make this happen, I'm just sure I see the benefit
> and I don't know anyone who knows the net/netlink code well enough who
> owuld be interested to help.....
> 

I can help you on the netlink side but maybe we need to define the
problem scope properly first. I dont know anything about fanotify, but
i then dont need to. Can I re-phrase your problem as follows:

You have kernel code which wishes to communicate information to multiple
listeners in user space. The listeners register some form of filter
(that you term as a monitor) for an object of interest.  For a specific
filter you wish to send them events when something happens to
the object they registered for.
Would the above be accurate?

One good place to start is the taskstats code in linux. Look at
Documentation/accounting in the kernel code. Ive also CCed Balbir who
was involved in that work since i think it bears similarity to what you
are trying to achieve conceptually.

To your other statements:
0) A single skb can be used to send to all the listners..
1) Netlink messages wont get lost unless the listener is not keeping up
and the kernel sending it messages ends up filling its queues. In such a
case your event message will be delivered to the 49 other users but
not the overloaded one. You can add sequence numbers to the event
messages you send to the listeners and any gaps in sequences on received
events imply lost events. You can add a mechanism to query your user
space kernel when something like that gets lost.
2) Your architecture has to take care of maintaining the state of what
you want to deliver. So your editing has nothing to do with skbs.
i.e an event happens, you update your state. If you need to send the
event to the listeners, you alloc an skb - populate it with the info;
multicast it to all the listeners. If something else happens, i would
suggest for sake of simplicity you rinse and repeat. Sure, the listener
may get contradicting events - but they should be able to handle it.

I hope this helps...

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jamie Lokier Sept. 11, 2009, 9:27 p.m. UTC | #8

Eric Paris wrote:
> On Fri, 2009-09-11 at 21:46 +0100, Jamie Lokier wrote:
> > Eric Paris wrote:
> > > > I would really prefer if you worked on eliminating the problem that
> > > > prevents you from using netlink instead.
> > > 
> > > I'm not really sure if I can, although I'd love to hear input from
> > > someone who knows the netlink code on how I can make it do what I need.
> > > I'm really not duplicating much other than the NLMSG_OK and NLMSG_NEXT
> > > macros.  My code doesn't even use skbs and I'm not savy enough to really
> > > know how I could.  I'm more than willing to work on it if someone can
> > > point me to how it might work.
> > 
> > Let's turn the question around.
> > 
> > Since you're doing lots of non-sockety things, and can't tolerate
> > dropped packets - why isn't it a character device?  What's the reason
> > for using a socket at all?
> > 
> > (I'm reminded of /dev/poll, /dev/epoll and /dev/inotify :-)
> 
> Originally it was a char device and I was told to use a socket protocol
> so I could use get/set sockopt rather than ioctl, because ioctl is the
> devil (even if those aren't THAT much better).
> 
> The queuing being done using events instead of skbs was done reusing
> inotify code, reusing network code would be just as good with me.  What
> I really need is a way to convey a pointer from one process to another.
> That's why I claim loss is not an option, since I'm holding a reference
> to the pointer I can't have that conveyance disappear under us.

It's fine as long as the disappearing knows to releas the reference.
But I suspect fanotify would be awfully hard to use if messages were
unreliable.

> If network people want me to get back out of the network system I can go
> back to a char file with lots of ioctls.  I'd love to reuse code, I just
> don't know what's possible...

Ok.  I understand you're pushed in different directions by different
schools of thought.

Let's look at some history.  What happened to /dev/epoll.  It worked
very well (and several OSes have /dev/poll which is similar).  There
was no technical reason to change the interface.

But when it came to mainlining it, Linus objected, and forced it to
become a small set of system calls.  It's quite a nice interface to
use now.

Then /dev/inotify.  You know what happened.  The history was similar:
Linux objected to the device, and forced it to use a few system calls.

More recently, people skipped over the /dev path, having seen how it
went before, and just implemented things like timerfd, eventfd and
signalfd system calls.

That seems to be the Linux way - if the interface can be exposed as a
small set of sensible system calls, and it's really a core kernel
facility.

Does fanotify need "lots of ioctls", or could it fit comfortably into
say 2-5 strongly typed system calls, like inotify and epoll do?

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jamie Lokier Sept. 11, 2009, 9:42 p.m. UTC | #9

jamal wrote:
> 1) Netlink messages wont get lost unless the listener is not keeping up
> and the kernel sending it messages ends up filling its queues. In such a
> case your event message will be delivered to the 49 other users but
> not the overloaded one. You can add sequence numbers to the event
> messages you send to the listeners and any gaps in sequences on received
> events imply lost events. You can add a mechanism to query your user
> space kernel when something like that gets lost.

One of the uses of fanotify is as a security or auditing mechanism.
That can't tolerate gaps.

It's fundemantally different from inotify in one important respect:
inotify apps can recover from losing events by checking what they are
watching.

The fanotify application will know that it missed events, but what
happens to the other application which _caused_ those events?  Does it
get to do things it shouldn't, or hide them from the fanotify app, by
simply overloading the system?  Or the opposite, does it get access
denied - spurious file errors when the system is overloaded?

There's no way to handle that by dropping events.  A transport
mechanism can be dropped (say skbs), but the event itself has to be
kept, and then retried.

Since you have to keep an event object around until it's handled,
there's no point tying it to an unreliable delivery mechanism which
you'd have to wrap a retry mechanism around.

In other words, that part of netlink is a poor match.  It would match
inotify much better.

> 2) Your architecture has to take care of maintaining the state of what
> you want to deliver. So your editing has nothing to do with skbs.
> i.e an event happens, you update your state. If you need to send the
> event to the listeners, you alloc an skb - populate it with the info;
> multicast it to all the listeners. If something else happens, i would
> suggest for sake of simplicity you rinse and repeat. Sure, the listener
> may get contradicting events - but they should be able to handle it.

Speaking of skbs, how fast and compact are they for this?

Eric's explained that it would be normal for _every_ file operation on
some systems to trigger a fanotify event and possibly wait on the
response, or at least in major directory trees on the filesystem.
Even if it's just for the fanotify app to say "oh I don't care about
that file, carry on".

File performance is one of those things which really needs to be fast
for a good user experience - and it's not unusual to grep the odd
10,000 files here or there (just think of what a kernel developer
does), or to replace a few thousand quickly (rpm/dpkg) and things like
that.

While skbs and netlink aren't that slow, I suspect they're an order of
magnitude or two slower than, say, epoll or inotify at passing events
around.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric Paris Sept. 11, 2009, 9:51 p.m. UTC | #10

On Fri, 2009-09-11 at 22:27 +0100, Jamie Lokier wrote:
> Eric Paris wrote:
> > On Fri, 2009-09-11 at 21:46 +0100, Jamie Lokier wrote:
> > > Eric Paris wrote:
> > > > > I would really prefer if you worked on eliminating the problem that
> > > > > prevents you from using netlink instead.
> > > > 
> > > > I'm not really sure if I can, although I'd love to hear input from
> > > > someone who knows the netlink code on how I can make it do what I need.
> > > > I'm really not duplicating much other than the NLMSG_OK and NLMSG_NEXT
> > > > macros.  My code doesn't even use skbs and I'm not savy enough to really
> > > > know how I could.  I'm more than willing to work on it if someone can
> > > > point me to how it might work.
> > > 
> > > Let's turn the question around.
> > > 
> > > Since you're doing lots of non-sockety things, and can't tolerate
> > > dropped packets - why isn't it a character device?  What's the reason
> > > for using a socket at all?
> > > 
> > > (I'm reminded of /dev/poll, /dev/epoll and /dev/inotify :-)
> > 
> > Originally it was a char device and I was told to use a socket protocol
> > so I could use get/set sockopt rather than ioctl, because ioctl is the
> > devil (even if those aren't THAT much better).
> > 
> > The queuing being done using events instead of skbs was done reusing
> > inotify code, reusing network code would be just as good with me.  What
> > I really need is a way to convey a pointer from one process to another.
> > That's why I claim loss is not an option, since I'm holding a reference
> > to the pointer I can't have that conveyance disappear under us.
> 
> It's fine as long as the disappearing knows to releas the reference.

Absolutely.

> But I suspect fanotify would be awfully hard to use if messages were
> unreliable.

For some things yes, some things no.  I'd have to understand where loss
can happen to know if it's feasible.  If I know loss happens in the
sender context that's great.  If it's somewhere in the middle and the
sender doesn't immediately know it'll never be delivered, yes, I don't
think it can solve all my needs.  How many places can and skb get lost
between the sender and the receiver?

> Does fanotify need "lots of ioctls", or could it fit comfortably into
> say 2-5 strongly typed system calls, like inotify and epoll do?

Certainly not a lot.  I do it now with 5 setsockopt calls.  I could do
it with a limited number of syscalls (what I have today in about 4) but
I still don't believe I know all the future use cases or how to define
them in syscalls.  People are still coming up with things they want
inotify to do but we can't implement them in the syscalls we have.  I
think we all know it's a hell of a lot easier to implement a new
setsockopt call than a new syscall.

I don't think I know everything everyone is going to want out of a new
notification system yet, so I'm reticent to go down that path.  I know
how to solve some of the suck of inotify, I know how to solve a number
of people's problems (and have all the patches to do it) but when it
comes to the organic nature of things and what I hope to be the growing
future of fanotify to solve everything for everyone, I don't think I can
define it all yet.  I just don't want to end up with another inotify, it
works, the syscalls do what they do very very well, but people want new
things out of it, and we can't reasonably give it to them.

Maybe syscalls are the right thing, and I know I can solve a lot of use
cases with syscalls, but I don't even know if I know all the use cases.

-Eric

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

jamal Sept. 11, 2009, 10:52 p.m. UTC | #11

On Fri, 2009-09-11 at 22:42 +0100, Jamie Lokier wrote:

> One of the uses of fanotify is as a security or auditing mechanism.
> That can't tolerate gaps.
> 
> It's fundemantally different from inotify in one important respect:
> inotify apps can recover from losing events by checking what they are
> watching.
> 
> The fanotify application will know that it missed events, but what
> happens to the other application which _caused_ those events?  Does it
> get to do things it shouldn't, or hide them from the fanotify app, by
> simply overloading the system?  Or the opposite, does it get access
> denied - spurious file errors when the system is overloaded?
> 
> There's no way to handle that by dropping events.  A transport
> mechanism can be dropped (say skbs), but the event itself has to be
> kept, and then retried.
> 
>
> Since you have to keep an event object around until it's handled,
> there's no point tying it to an unreliable delivery mechanism which
> you'd have to wrap a retry mechanism around.
> 
> In other words, that part of netlink is a poor match.  It would match
> inotify much better.
> 

Reliability is something that you should build in. Netlink provides you
all the necessary tools. What you are asking for here is essentially 
reliable multicasting.  You dont have infinite memory, therefore there
will be times when you will overload one of the users, and they wont
have sufficient buffer space and then you have to retransmit. You have
to factor in processing speed mismatch between the different listeners.
As an example, you could ensure that all users receive the message and
if user #49 didnt, then you wait until they do before multicasting the
next message to all 50 listeners.
Is the current proposed mechanism capable of reliably multicasting
without need for retransmit?

> Speaking of skbs, how fast and compact are they for this?

They are largish relative to say if you trimmed down to basic necessity.
But then you get a lot of the buffer management aspects for free.
In this case, the concept of multicasting is built in so for one event
to be sent to X users - you only need one skb.

> Eric's explained that it would be normal for _every_ file operation on
> some systems to trigger a fanotify event and possibly wait on the
> response, or at least in major directory trees on the filesystem.
> Even if it's just for the fanotify app to say "oh I don't care about
> that file, carry on".
> 

That doesnt sound very scalable. Should it not be you get nothing unless
you register for interest in something?

> File performance is one of those things which really needs to be fast
> for a good user experience - and it's not unusual to grep the odd
> 10,000 files here or there (just think of what a kernel developer
> does), or to replace a few thousand quickly (rpm/dpkg) and things like
> that.
> 

So grepping 10000 files would cause 10000 events? I am not sure how the
scheme works; filtering of what events get delivered sounds more
reasonable if it happens in the kernel.

> While skbs and netlink aren't that slow, I suspect they're an order of
> magnitude or two slower than, say, epoll or inotify at passing events
> around.
> 

not familiar with inotify. Theres a difference between events which are
abbreviated in the form "hey some read happened on fd you are listening
on" vs "hey a read of file X for 16 bytes at offset 200 by process Y
just occured while at the same time process Z was writting at offset
2000". The later (which netlink will give you) includes a lot more
attribute details which could be filtered or can be extended to include
a lot more. The former(what epoll will give you) is merely a signal.

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Evgeniy Polyakov Sept. 12, 2009, 9:41 a.m. UTC | #12

On Fri, Sep 11, 2009 at 05:51:42PM -0400, Eric Paris (eparis@redhat.com) wrote:
> For some things yes, some things no.  I'd have to understand where loss
> can happen to know if it's feasible.  If I know loss happens in the
> sender context that's great.  If it's somewhere in the middle and the
> sender doesn't immediately know it'll never be delivered, yes, I don't
> think it can solve all my needs.  How many places can and skb get lost
> between the sender and the receiver?

When queue is full or you do not have enough RAM. Both are reported at
'sending' time.

As of your description of netlink/socket usage - you will have to peek
skb queue, which is rather error-prone operation. Also you will have to
implement own skb destructor to mess with private reference counters and
netlink bits.

Evgeniy Polyakov Sept. 12, 2009, 9:47 a.m. UTC | #13

On Fri, Sep 11, 2009 at 10:42:54PM +0100, Jamie Lokier (jamie@shareable.org) wrote:
> While skbs and netlink aren't that slow, I suspect they're an order of
> magnitude or two slower than, say, epoll or inotify at passing events
> around.

Skbs are those tiny bits which allow 100 Gbit/s forwarding between multiple
interfaces. But of course it can not be as fast as plain memory copy :)

Having one skb allocation per IO syscall will be challenging but after
all we have this for send/recv calls and achieve high performance
numbers. Idea with merging events in the same skb will be also very non-trivial.

Jamie Lokier Sept. 14, 2009, 12:03 a.m. UTC | #14

jamal wrote:
> On Fri, 2009-09-11 at 22:42 +0100, Jamie Lokier wrote:
> 
> > One of the uses of fanotify is as a security or auditing mechanism.
> > That can't tolerate gaps.
> > 
> > It's fundemantally different from inotify in one important respect:
> > inotify apps can recover from losing events by checking what they are
> > watching.
> > 
> > The fanotify application will know that it missed events, but what
> > happens to the other application which _caused_ those events?  Does it
> > get to do things it shouldn't, or hide them from the fanotify app, by
> > simply overloading the system?  Or the opposite, does it get access
> > denied - spurious file errors when the system is overloaded?
> > 
> > There's no way to handle that by dropping events.  A transport
> > mechanism can be dropped (say skbs), but the event itself has to be
> > kept, and then retried.
> > 
> >
> > Since you have to keep an event object around until it's handled,
> > there's no point tying it to an unreliable delivery mechanism which
> > you'd have to wrap a retry mechanism around.
> > 
> > In other words, that part of netlink is a poor match.  It would match
> > inotify much better.
> 
> Reliability is something that you should build in. Netlink provides you
> all the necessary tools. What you are asking for here is essentially 
> reliable multicasting.

Almost.  It's reliable multicasting plus unicast responses which must
be waited for.  That changes things.

> You dont have infinite memory, therefore there
> will be times when you will overload one of the users, and they wont
> have sufficient buffer space and then you have to retransmit.

If you have enough memory to remember _what_ to retransmit, then you
have enough memory to buffer a fixed-size message.  It just depends on
how you do the buffering.  To say netlink drops the message and you
can retry is just saying that the buffering is happening one step
earlier, before netlink.  That's what I mean by netlink being a
pointless complication for this, because you can just as easily write
code which gets to the message to userspace without going through
netlink and with no chance of it being dropped.

> Is the current proposed mechanism capable of reliably multicasting
> without need for retransmit?

Yes.  It uses positive acknowledge and flow control, because these
match naturally with what fanotify does at the next higher level.

The process generating the multicast (e.g. trying to write a file) is
blocked until the receiver gets the message, handles it and
acknowledges with a "yes you can" or "no you can't" response.

That's part of fanotify's design.  The pattern conveniently has no
issues with using unbounded memory for message, because the sending
process is blocked.

> > Speaking of skbs, how fast and compact are they for this?
> 
> They are largish relative to say if you trimmed down to basic necessity.
> But then you get a lot of the buffer management aspects for free.
> In this case, the concept of multicasting is built in so for one event
> to be sent to X users - you only need one skb.

True you only need one skb.  But netlink doesn't handle waiting for
positive acknowledge responses from every receiver, and combining
their value, does it?  You can't really take advantage of netlink's
built in multicast, because to known when it has all the responses,
the fanotify layer has to track the subscriber list itself anyway.

What I'm saying is perhaps skbs are useful for fanotify, but I don't
know that netlink's multicasting is useful.  But storing the messages
in skbs for transmission, and using parts of netlink to manage them,
and to provide some of the API, that might be useful.

> > Eric's explained that it would be normal for _every_ file operation on
> > some systems to trigger a fanotify event and possibly wait on the
> > response, or at least in major directory trees on the filesystem.
> > Even if it's just for the fanotify app to say "oh I don't care about
> > that file, carry on".
> > 
> 
> That doesnt sound very scalable. Should it not be you get nothing unless
> you register for interest in something?

You do get nothing unless you register interest.  The problem is
there's no way to register interest on just a subtree, so the fanotify
approach is let you register for events on the whole filesystem, and
let the userspace daemon filter paths.  At least it's decisions can be
cached, although I'm not sure how that works when multiple processes
want to monitor overlapping parts of the filesystem.

It doesn't sound scalable to me, either, and that's why I don't like
this part, and described a solution to monitoring subtrees - which
would also solve the problem for inotify.  (Both use fsnotify under
the hood, and that's where subtree notification would go).

Eric's mentioned interest in a way to monitor subtrees, but that
hasn't gone anywhere as far as I know.  He doesn't seem convinced by
my solution - or even that scalability will be an issue.  I think
there's a bit of vision lacking here, and I'll admit I'm more
interested in the inotify uses of fsnotify (being able to detect
changes) than the fanotify uses (being able to _block_ or _modify_
changes).  I think both inotify and fanotify ought to benefit from the
same improvements to file monitoring.

> > File performance is one of those things which really needs to be fast
> > for a good user experience - and it's not unusual to grep the odd
> > 10,000 files here or there (just think of what a kernel developer
> > does), or to replace a few thousand quickly (rpm/dpkg) and things like
> > that.
> > 
> 
> So grepping 10000 files would cause 10000 events? I am not sure how the
> scheme works; filtering of what events get delivered sounds more
> reasonable if it happens in the kernel.

I believe it would cause 10000 events, yes, even if they are files
that userspace policy is not interested in.  Eric, is that right?

However I believe after the first grep, subsequent greps' decisions
would be cached by marking the inodes.  I'm not sure what happens if
two fanotify monitors both try marking the inodes.

Arguably if a fanotify monitor is running before those files are in
page cache anyway, then I/O may dominate, and when the files are
cached, fanotify has already cached it's decisions in the kernel.
However fanotify is synchronous: each new file access involves a round
trip to the fanotify userspace and back before it can proceed, so
there's quite a lot of IPC and scheduling too.  Without testing, it's
hard to guess how it'll really perform.

> > While skbs and netlink aren't that slow, I suspect they're an order of
> > magnitude or two slower than, say, epoll or inotify at passing events
> > around.
> 
> not familiar with inotify.

inotify is like dnotify, and like a signal or epoll: a message that
something happened.  You register interest in individual files or
directories only, and inotify does not (yet) provide a way to monitor
the whole filesystem or a subtree.

fanotify is different: it provides access control, and can _refuse_
attempts to read file X, or even modify the file before permitting the
file to be read.

> Theres a difference between events which are abbreviated in the form
> "hey some read happened on fd you are listening on" vs "hey a read
> of file X for 16 bytes at offset 200 by process Y just occured while
> at the same time process Z was writting at offset 2000". The later
> (which netlink will give you) includes a lot more attribute details
> which could be filtered or can be extended to include a lot
> more. The former(what epoll will give you) is merely a signal.

Firstly, it's really hard to retain the ordering of userspace events
like that in a useful way, given the non-determinstic parallelism
going on with multiple processes doing I/O do the same file :-)

Second, you can't really pump messages with that much detail into
netlink and let _it_ filter them to userspace; that would be too much
processing.  You'd have to have some way of not generating that much
detail except when it's been requested, and preferably only for files
you want it for.

But this part is irrelevant to fanotify, because there's no plan or
intention to provide that much detail about I/O.

If you want, feel free to provide a stracenotify subsystem to track
everything in detail :-)

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jamie Lokier Sept. 14, 2009, 12:17 a.m. UTC | #15

Evgeniy Polyakov wrote:
> On Fri, Sep 11, 2009 at 05:51:42PM -0400, Eric Paris (eparis@redhat.com) wrote:
> > For some things yes, some things no.  I'd have to understand where loss
> > can happen to know if it's feasible.  If I know loss happens in the
> > sender context that's great.  If it's somewhere in the middle and the
> > sender doesn't immediately know it'll never be delivered, yes, I don't
> > think it can solve all my needs.  How many places can and skb get lost
> > between the sender and the receiver?
> 
> When queue is full or you do not have enough RAM. Both are reported at
> 'sending' time.

Can you ->poll() and wait reliably until the queue will accept an skb?
(A few spurious EAGAINs/ENOBUFs is ok, as long as it's not the norm).

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric Paris Sept. 14, 2009, 1:26 a.m. UTC | #16

On Mon, 2009-09-14 at 01:03 +0100, Jamie Lokier wrote:
> jamal wrote:
> > On Fri, 2009-09-11 at 22:42 +0100, Jamie Lokier wrote:

First let me state the 2 main new things fanotify gives so neither is
lost.

#1 fanotify implements the same basic thing as inotify except rather
than an arbitrary number (inotify speak a watch descriptor) which
userspace has to somehow convert back into a file, fanotify gives
userspace an open file descriptor to that object.  (this is the part
that requires recv side processing)

#2 fanotify allows the userspace 'listener' or 'group' (I use group to
describe it's actions in kernel and listener to describe it's action in
userspace) to request that it be allowed to arbitrate open and access
(read) security decisions.

> > > Eric's explained that it would be normal for _every_ file operation on
> > > some systems to trigger a fanotify event and possibly wait on the
> > > response, or at least in major directory trees on the filesystem.
> > > Even if it's just for the fanotify app to say "oh I don't care about
> > > that file, carry on".
> > > 
> > 
> > That doesnt sound very scalable. Should it not be you get nothing unless
> > you register for interest in something?
> 
> You do get nothing unless you register interest.  The problem is
> there's no way to register interest on just a subtree, so the fanotify
> approach is let you register for events on the whole filesystem, and
> let the userspace daemon filter paths.  At least it's decisions can be
> cached, although I'm not sure how that works when multiple processes
> want to monitor overlapping parts of the filesystem.

fanotify provides 3 options to register.
1) this inode
2) this dir and it's children
3) all files on the whole fscking system

this patch only does #1 and #2.  After it's in I'm going to take a
serious look at #4 subtrees.

Responses to access decisions are cached and checked in kernel per
fanotify listener.  So listener 1 can ignore requests for a given inode
while listener 2 still gets notification and forces the original process
to block.

> It doesn't sound scalable to me, either, and that's why I don't like
> this part, and described a solution to monitoring subtrees - which
> would also solve the problem for inotify.  (Both use fsnotify under
> the hood, and that's where subtree notification would go).

Subtree checking hasn't seen and work from me but it is something I plan
to work on.  And one thing that makes me scared to tie myself to
syscalls when I already have something that works relatively cleanly and
easily.

> Eric's mentioned interest in a way to monitor subtrees, but that
> hasn't gone anywhere as far as I know.  He doesn't seem convinced by
> my solution - or even that scalability will be an issue.  I think
> there's a bit of vision lacking here, and I'll admit I'm more
> interested in the inotify uses of fsnotify (being able to detect
> changes) than the fanotify uses (being able to _block_ or _modify_
> changes).  I think both inotify and fanotify ought to benefit from the
> same improvements to file monitoring.

sort of agree with you here.  anything that gets added to support
subtrees would have to be in the generic code.  Although I question how
inotify could be used, as a wd is not (in my mind) a reasonable way to
tell userspace about files.  (and with subtrees it would be a wd and a
pathname....)    I think fanotify with notification only (what I'm
giving in this patch series is a much better fit for subtree
notification.

> > > File performance is one of those things which really needs to be fast
> > > for a good user experience - and it's not unusual to grep the odd
> > > 10,000 files here or there (just think of what a kernel developer
> > > does), or to replace a few thousand quickly (rpm/dpkg) and things like
> > > that.
> > > 
> > 
> > So grepping 10000 files would cause 10000 events? I am not sure how the
> > scheme works; filtering of what events get delivered sounds more
> > reasonable if it happens in the kernel.
> 
> I believe it would cause 10000 events, yes, even if they are files
> that userspace policy is not interested in.  Eric, is that right?

If fanotify wants it, yes, that's exactly what happens.

> However I believe after the first grep, subsequent greps' decisions
> would be cached by marking the inodes.  I'm not sure what happens if
> two fanotify monitors both try marking the inodes.

Each can mark individually.

> Arguably if a fanotify monitor is running before those files are in
> page cache anyway, then I/O may dominate, and when the files are
> cached, fanotify has already cached it's decisions in the kernel.
> However fanotify is synchronous: each new file access involves a round
> trip to the fanotify userspace and back before it can proceed, so
> there's quite a lot of IPC and scheduling too.  Without testing, it's
> hard to guess how it'll really perform.

As I recall my old old tests on a 32 way system showed around a 10%
performance penalty when building a kernel when making userspace
arbitrate decisions and the cache was blank.  So yes, there is a serious
performance hit to making a userspace application control access
decisions.  Then again, I'd rather not have those people who need this
system wide access controls to do it in the kernel (anti-malware
vendors)

I believe that people who chose to use this interface will have to
realize there is a severe up front performance penalty.  On a steady
state system like a web server you'd see near 0% performance (a new srcu
lock, inode->i_lock, and running a short list)  But yes, controling
access to every file on a system eats performance, that's the nature of
the beast.

> > Theres a difference between events which are abbreviated in the form
> > "hey some read happened on fd you are listening on" vs "hey a read
> > of file X for 16 bytes at offset 200 by process Y just occured while
> > at the same time process Z was writting at offset 2000". The later
> > (which netlink will give you) includes a lot more attribute details
> > which could be filtered or can be extended to include a lot
> > more. The former(what epoll will give you) is merely a signal.

> But this part is irrelevant to fanotify, because there's no plan or
> intention to provide that much detail about I/O.

We have ZERO plan to include ordering.  ZERO.  inotify sorta pretends it
deals with ordering by only dropping a notification if it is the same as
the last one in the queue.  fanotify will gladly merge events which
exist anywhere in the queue clearly throwing ordering to the wind.

We do plan to include the pid, uid, and gid of the process making the
original request.  We also plan to include the f_flags of the file in
the original process when possible.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

jamal Sept. 14, 2009, 1:15 p.m. UTC | #17

On Mon, 2009-09-14 at 01:03 +0100, Jamie Lokier wrote:

> If you have enough memory to remember _what_ to retransmit, then you
> have enough memory to buffer a fixed-size message.  It just depends on
> how you do the buffering.  To say netlink drops the message and you
> can retry is just saying that the buffering is happening one step
> earlier, before netlink.  

it is the receiver that drops the message because of overruns
e.g. when receiver doesnt keep up..

> That's what I mean by netlink being a
> pointless complication for this, because you can just as easily write
> code which gets to the message to userspace without going through
> netlink and with no chance of it being dropped.
> 

Sure you can do that with netlink too. Whether it is overcomplicated
needs to be weighed out.

> Yes.  It uses positive acknowledge and flow control, because these
> match naturally with what fanotify does at the next higher level.
> 
> The process generating the multicast (e.g. trying to write a file) is
> blocked until the receiver gets the message, handles it and
> acknowledges with a "yes you can" or "no you can't" response.
> 
> That's part of fanotify's design.  The pattern conveniently has no
> issues with using unbounded memory for message, because the sending
> process is blocked.
> 

Ok, I understand better i think;-> So it is a synchronous type of
operation whereas in netlink type multicast, the optimization is to make
the operation async.

> True you only need one skb.  But netlink doesn't handle waiting for
> positive acknowledge responses from every receiver, and combining
> their value, does it?  

It is not netlink perse. It is how you use netlink for your app.
Classical one-to-many operations have the sender (kernel mostly)
do async sends to the listeners. It is up to the listener to catch
up if there are any holes. But this seems not what you want for
fanotify.

> You can't really take advantage of netlink's
> built in multicast, because to known when it has all the responses,
> the fanotify layer has to track the subscriber list itself anyway.

True, given my understanding so far fanotify has to track the subscriber
list. i.e something along the lines of:
- send a single multicast message to a set of listeners
- wait for response from all subscribers
- if no response from all subscribers given timeout then retransmit upto
max retransmit times

The chance of loosing a message in such a case is zero if the socket
buffer on each listener/receiver is larger than one fanotify event
message. You still have to alloc the message - and that may fail.

> What I'm saying is perhaps skbs are useful for fanotify, but I don't
> know that netlink's multicasting is useful.  But storing the messages
> in skbs for transmission, and using parts of netlink to manage them,
> and to provide some of the API, that might be useful.

The multicast-a-single-skb part is useful. Of course, its usefulness
diminishes as the number of listeners per subtree goes down (because it
reduces to one skb per listener). So all this depends on how fanotify is
going to be used.
The one thing i am not sure of is how you map a multicast group to a
subtree. In netlink groups to which multiple listeners subscribe to
are 32-bit identifiers. I suppose, one approach could be to register
for the event of interest, get an ID then use the ID to listen to a
multicast group of that ID. This way whoever is issuing the ID can also
factor in permissions and subtree overlap of the listener (and whether
the events are already being listened to in a known ID). 
Alternatively, to your statement above, if fanotify is keeping track of
all subsribers then it can replicast a single event instead and just
bump the refcount on the skb for each sent-to-user (and still use one
skb).. 

> You do get nothing unless you register interest.  The problem is
> there's no way to register interest on just a subtree, so the fanotify
> approach is let you register for events on the whole filesystem, and
> let the userspace daemon filter paths.  At least it's decisions can be
> cached, although I'm not sure how that works when multiple processes
> want to monitor overlapping parts of the filesystem.

I guess if the non-optimal part happens only once and subsequent cached
filters happen faster, then one could look at that as cost of setup.
I think, given that you are capable of creating such a cache, seems that
it would be cheaper to make such decision at registration time.

> It doesn't sound scalable to me, either, and that's why I don't like
> this part, and described a solution to monitoring subtrees - which
> would also solve the problem for inotify.  (Both use fsnotify under
> the hood, and that's where subtree notification would go).
>
> Eric's mentioned interest in a way to monitor subtrees, but that
> hasn't gone anywhere as far as I know.  He doesn't seem convinced by
> my solution - or even that scalability will be an issue.  I think
> there's a bit of vision lacking here, and I'll admit I'm more
> interested in the inotify uses of fsnotify (being able to detect
> changes) than the fanotify uses (being able to _block_ or _modify_
> changes).  I think both inotify and fanotify ought to benefit from the
> same improvements to file monitoring.
> 

The subtree overlap problem seems to invoke some well known computer
science algorithms, no? i.e tell me oracle given the event on nodeX of
this tree, which subscriber needs to be notified?

> I believe it would cause 10000 events, yes, even if they are files
> that userspace policy is not interested in.  Eric, is that right?
> 
> However I believe after the first grep, subsequent greps' decisions
> would be cached by marking the inodes.  I'm not sure what happens if
> two fanotify monitors both try marking the inodes.
>
> Arguably if a fanotify monitor is running before those files are in
> page cache anyway, then I/O may dominate, and when the files are
> cached, fanotify has already cached it's decisions in the kernel.
> However fanotify is synchronous: each new file access involves a round
> trip to the fanotify userspace and back before it can proceed, so
> there's quite a lot of IPC and scheduling too.  Without testing, it's
> hard to guess how it'll really perform.
> 

So if you can mark inodes, why not do it at register time?

> > > While skbs and netlink aren't that slow, I suspect they're an order of
> > > magnitude or two slower than, say, epoll or inotify at passing events
> > > around.
> > 
> > not familiar with inotify.
> 
> inotify is like dnotify, and like a signal or epoll: a message that
> something happened.  You register interest in individual files or
> directories only, and inotify does not (yet) provide a way to monitor
> the whole filesystem or a subtree.
> 
> fanotify is different: it provides access control, and can _refuse_
> attempts to read file X, or even modify the file before permitting the
> file to be read.
> 

Ok, I think i understood more about fanotify now. It is more of an
access control than a mass notification scheme (which is what i thought
of earlier).
Hrm, it does sound like something closer to selinux if it is simple
enough to require answers to simple questions like "should this
operation continue?"

> > Theres a difference between events which are abbreviated in the form
> > "hey some read happened on fd you are listening on" vs "hey a read
> > of file X for 16 bytes at offset 200 by process Y just occured while
> > at the same time process Z was writting at offset 2000". The later
> > (which netlink will give you) includes a lot more attribute details
> > which could be filtered or can be extended to include a lot
> > more. The former(what epoll will give you) is merely a signal.
> 
> Firstly, it's really hard to retain the ordering of userspace events
> like that in a useful way, given the non-determinstic parallelism
> going on with multiple processes doing I/O do the same file :-)
> 

Bad example ;->
That was not meant to be anything clever - rather to demonstrate that
netlink allows you to send many attributes with events and that you can
add as many as you want over a period of time (instead of hardcoding it
at design/coding time).

On a tangent: I would love to get more than simple events
(read/write/exception) on a file. Probably more on the writes
than on reads; example "offset X, length Y has been deleted" etc.
I would still love the option to exercise my rights to simple
events like read/write/exception

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Evgeniy Polyakov Sept. 14, 2009, 2:07 p.m. UTC | #18

On Mon, Sep 14, 2009 at 01:17:59AM +0100, Jamie Lokier (jamie@shareable.org) wrote:
> > When queue is full or you do not have enough RAM. Both are reported at
> > 'sending' time.
> 
> Can you ->poll() and wait reliably until the queue will accept an skb?
> (A few spurious EAGAINs/ENOBUFs is ok, as long as it's not the norm).

Not that simple and for memory allocation error just can't.

There is no direct access to remote peer sockets, i.e. to userspace
ones, so one will have to lock netlink table and run over listeners and
check whether they can accept the message and wait/poll for queue size
to become big enough. Netlink table and its locking is not exported,
so effectively there is no simple way to do this.

Eric Paris Sept. 14, 2009, 7:08 p.m. UTC | #19

Long ago I implemented fanotify as basically a /dev interface using
ioctls().  Alan suggested I use a socket protocol and could then make
use of get/setsockopt() which although still not great is light years
better than ioctl.  Currently the fanotify interface as I want to push
it to Linus and as I've been requesting comments on for the last 1.25
years is just that.  It really makes no use of the networking system
other than bind() and setsockopt() and everyone tends to agree the
things I want to do can't reasonably be done using network hooks and a
'real' socket protocol.  I like this interface, setsockopt() makes it so
easy to add new functionality as we flush out other users.  fanotify as
it stands today has a number of groups who will port to it, has a nubmer
of advantages over inotify, and I have been told privately meets the
needs of the original group of people who paid me to work on it (two
very large anti-malware companies who currently unprotect and hack the
syscall table of their users)

Just this week I got another request to look at syscalls.  So I did, I
haven't prototyped it, but I can do it with syscalls, they would look
like this:

int fanotify_init(int flags, int f_flags, __u64 mask, unsigned int priority);

int fanotify_add_mark(int fanotify_fd, char *path, __u64 mask, __u64 ignored_mask);
int fanotify_add_mark_fd(int fanotify_fd, int fd, __u64 mask, __u64 ignored_mask);
int fanotify_rm_mark(int fanotify_fd, char *path, __u64 mask);
int fanotify_rm_mark_fd(int fanotify_fd, int fd, __u64 mask);
	Those above 4 could probably be squashed into 2 syscalls with an extra
flags field.

int fanotify_clear_marks(int fanotify_fd);

int fanotify_perm_response(int fanotify_fd, __u64 cookie, int response);

int fanotify_ignore_sb(int fanotify_fd, long f_type);
int fanotify_ignore_fsid(int fanotify_fd, fsid_t f_fsid);
	These 2 are the most questionable, they would honestly only be used for
things that wanted system wide notification, I can't imagine that being
many things other than AV vendors.  But they really need a way to
exclude notification when people open/close/read/write to /proc (which
is the point of the ignore_sb.)

Since I don't have a solution to subtree notification I don't know if it
will work in this syscall framework.  I know people want subtree
notification and I'm willing to take a stab at it after the fscking all
notification is accepted.  That's one of the main reasons I like
setsockopt over tons of syscalls.  I can add a new one very easily.  I
also can easily expand arguments by just creating a new sockopt.  No
userspace headaches.

Are there demands that I convert to syscalls?  Do I really gain anything
using 9 new inextensible syscalls over socket(), bind(), and 8
setsockopt() calls?

I'd like to send these patches along, so a ruling from on high would be
great....

-Eric

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Evgeniy Polyakov Sept. 15, 2009, 8:16 p.m. UTC | #20

On Mon, Sep 14, 2009 at 03:08:15PM -0400, Eric Paris (eparis@redhat.com) wrote:
> Just this week I got another request to look at syscalls.  So I did, I
> haven't prototyped it, but I can do it with syscalls, they would look
> like this:
> 
> int fanotify_init(int flags, int f_flags, __u64 mask, unsigned int priority);

...

> Are there demands that I convert to syscalls?  Do I really gain anything
> using 9 new inextensible syscalls over socket(), bind(), and 8
> setsockopt() calls?

In my personal opinion sockets are way much simpler and obvious than
syscalls. Also one should not edit hundred of arch-dependant headers
conflicting with other pity syscallers.

But implementing af_fanotify opens a door for zillions of other
af_something which can be implemented using existing infrastructure
namely netlink will solve likely 99% of potential usage cases.

And frankly I did not find it way too convincing that using netlink is
impossible in your scenario if some things will be simplified, namely
event merging. Moreover you can implement a pool of working threads and
postpone all the work to them and appropriate event queues, which will
allow to use rlimits for the listeners and open files 'kind of' on
behalf of those processes.

But it is quite diferent from the approach you selected and which is
more obvious indeed. So if you ask a question whether fanotify should
use sockets or syscalls, I would prefer sockets, but I still recommend
to rethink your decision to move away from netlink to be 100% sure that
it will not solve your needs.

Eric Paris Sept. 15, 2009, 9:54 p.m. UTC | #21

On Wed, 2009-09-16 at 00:16 +0400, Evgeniy Polyakov wrote:
> On Mon, Sep 14, 2009 at 03:08:15PM -0400, Eric Paris (eparis@redhat.com) wrote:
> > Just this week I got another request to look at syscalls.  So I did, I
> > haven't prototyped it, but I can do it with syscalls, they would look
> > like this:
> > 
> > int fanotify_init(int flags, int f_flags, __u64 mask, unsigned int priority);
> 
> ...
> 
> > Are there demands that I convert to syscalls?  Do I really gain anything
> > using 9 new inextensible syscalls over socket(), bind(), and 8
> > setsockopt() calls?
> 
> In my personal opinion sockets are way much simpler and obvious than
> syscalls. Also one should not edit hundred of arch-dependant headers
> conflicting with other pity syscallers.
> 
> But implementing af_fanotify opens a door for zillions of other
> af_something which can be implemented using existing infrastructure
> namely netlink will solve likely 99% of potential usage cases.
> 
> And frankly I did not find it way too convincing that using netlink is
> impossible in your scenario if some things will be simplified, namely
> event merging.

Nothing's impossible, but is netlink a square peg for this round hole?
One of the great benefits of netlink, the attribute matching and
filtering, although possibly useful isn't some panacea as we have to do
that well before netlink to have anything like decent performance.
Imagine every single fs event creating an skb and sending it with
netlink only to have most of them dropped.

The only other benefit to netlink that I know of is the reasonable,
easy, and clean addition of information later in time with backwards
compatibility as needed.  That's really cool, I admit, but with the
limited amount of additional info that users have wanted out of inotify
I think my data type extensibility should be enough.

>  Moreover you can implement a pool of working threads and
> postpone all the work to them and appropriate event queues, which will
> allow to use rlimits for the listeners and open files 'kind of' on
> behalf of those processes.

I'm sorry, I don't userstand.  I don't see how worker threads help
anything here.  Can you explain what you are thinking?

> But it is quite diferent from the approach you selected and which is
> more obvious indeed. So if you ask a question whether fanotify should
> use sockets or syscalls, I would prefer sockets

I've heard someone else off list say this as well.  I'm not certain why.
I actually spent the day yesterday and have fanotify working over 5 new
syscalls (good thing I wrote the code with separate back and and front
ends for just this purpose)  And I really don't hate it.  I think 3
might be enough.

fanotify_init() ---- very much like inotify_init
fanotify_modify_mark_at() --- like inotify_add_watch and rm_watch
fanotify_modify_mark_fd() --- same but with an fd instead of a path
fanotify_response() --- userspace tells the kernel what to do if requested/allowed
   (could probably be done using write() to the fanotify fd)
fanotify_exclude() --- a horrid syscall which might be better as an ioctl since it isn't strongly typed....

If there is a strong need for syscalls I'm ready and willing.  I'd love
to hear Linus say socket+setsockopt() is a merge blocker and I have to
go to syscalls if he sees it that way.  If davem and friends say I'm not
networky enough to use socket()+setsockopt() I guess I have to look at
netlink (which I'm still not convinced buys us anything vs the crazy
complexity) or go with syscalls.  I'm perfectly willing to admit this is
a /dev+ioctl type interface only implemented on socket+setsockopt().  If
that's a no go, someone say it now please.

> but I still recommend
> to rethink your decision to move away from netlink to be 100% sure that
> it will not solve your needs.

I don't see what's gained using netlink.  I am already reusing the
fsnotify code to do all my queuing.  Someone help me understand the
benefit of netlink and help me understand how we can reasonably meet the
needs and I'll try to prototype it.

1) fd's must be opened in the recv process
2) reliability, if loss must know on the send side

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Linus Torvalds Sept. 15, 2009, 11:49 p.m. UTC | #22

On Tue, 15 Sep 2009, Eric Paris wrote:
> 
> I don't see what's gained using netlink.

I'm personally not a big believer in netlink. What's the point, really? If 
you are sending datagrams back-and-forth, go wild. But if it's more 
structured than that, netlink has no actual upsides as far as I can tell.

Same goes for sockets in this case, actually. What's the upside?

I'll throw out a couple of upsides of actual system calls, people can feel 
free to comment:

 - things like 'strace' _work_ and the traces make sense, and you 
   generally see what the app is trying to do from the traces (sure, it 
   takes some time for strace to learn new system calls, but even when it 
   only gives a system call number, it's never any worse than some 
   "made-up packet interface".

 - if you have a system call definition, it tends to be a much stricter 
   interface than "let's send some packets around with a network 
   interface".

 - No unnecessary infrastructure.

That said, maybe the netlink/socket people can argue for their 
standpoints.

(And btw, I still want to know what's so wonderful about fanotify that we 
would actually want yet-another-filesystem-notification-interface. So I'm 
not sayying that I'll take a system call interface. I just don't think 
that hiding interfaces behind some random packet interface is at all any 
better)

		Linus
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric Paris Sept. 16, 2009, 1:26 a.m. UTC | #23

On Tue, 2009-09-15 at 16:49 -0700, Linus Torvalds wrote:

> And btw, I still want to know what's so wonderful about fanotify that we 
> would actually want yet-another-filesystem-notification-interface. So I'm 
> not sayying that I'll take a system call interface.

The real thing that fanotify provides is an open fd with the event
rather than some arbitrary 'watch descriptor' that userspace must
somehow magically map back to data on disk.  This means that it could be
used to provide subtree notification, which inotify is completely
incapable of doing.  And it can be used to provide system wide
notification.  We all know who wants that.

It provides an extensible data format which allows growth impossible in
inotify.  I don't know if anyone remember the inotify patches which
wanted to overload the inotify cookie field for some other information,
but inotify information extension is not reasonable or backwards
compatible.

fanotify also allows userspace to make access decisions and cache those
in the kernel.  Useful for integrity checkers (anti-malware people) and
for hierarchical storage management people.

I've got private commitments for two very large anti malware companies,
both of which unprotect and hack syscall tables in their customer's
kernels, that they would like to move to an fanotify interface.  Both
Red Hat and Suse have expressed interest in these patches and have
contributed to the patch set.

The patch set is actually rather small (entire set of about 20 patches
is 1800 lines) as it builds on the fsnotify work already in 2.6.31 to
reuse code from inotify rather than reimplement the same things over and
over (like we previously had with inotify and dnotify)

Don't know what else to say.....

-Eric

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jamie Lokier Sept. 16, 2009, 7:52 a.m. UTC | #24

Eric Paris wrote:
> On Tue, 2009-09-15 at 16:49 -0700, Linus Torvalds wrote:
> > And btw, I still want to know what's so wonderful about fanotify that we 
> > would actually want yet-another-filesystem-notification-interface. So I'm 
> > not sayying that I'll take a system call interface.
> 
> The real thing that fanotify provides is an open fd with the event
> rather than some arbitrary 'watch descriptor' that userspace must
> somehow magically map back to data on disk.  This means that it could be
> used to provide subtree notification, which inotify is completely
> incapable of doing.

That's a bit of a spurious claim.

- fanotify does not provide subtree notification in it's
  present form.  When it is extended to do that, why wouldn't
  inotify be as well?  That's an fsnotify feature, common to both.

- fanotify does not provide notification at all for some events that
  you get with inotify.  It is not a superset, so you can't use
  fanotify to provide a subtree-capable equivalent to inotify.  What
  a mess when you need the combination of both features!

- fanotify requires you call readlink(/proc/fd/N) for every event to
  get the path.  It's not a particularly efficient way to get it,
  especially when an apps wants to know if it's something in it's
  region of interest but doesn't care about the actual path.
  When an apps knows it needs the map back to to path, why make it
  slow to get it?  That "extensible data format" is being
  underutilised...

- fanotify's descriptor may be race-prone as a way to get the subtree
  used for access, because any of the parent directories could have
  moved and even been deleted before the app calls
  readlink(/proc/fd/N).  I don't know if a _reliable_ way to track
  changes in a subtree can be built on it.  Maybe it can but it
  appears this hasn't been analysed.  It depends on
  readlink(/proc/fd/N)'s behaviour when the dentry's have been
  changed, among other things.

- Does the descriptor cause umount to fail when user does "do some
  stuff in baz; umount baz", or does it serialise nicely?  That's one
  of inotify's nice features - it doesn't cause umounts to fail.

> And it can be used to provide system wide notification.  We all know
> who wants that.

People who want to break out of chroot/namespace jails using the
conveniently provided open file descriptor? :-)

Seriously, what does system-wide fanotify do when run from a
chroot/namespace/cgroup, and a file outside them is accessed?

If the event is delivered with file desciptor, that's a security hole.
If it's not delivered, that sounds like working subtree support?

I'd expect anti-malware to want to be run inside VMs quite often...

Note that there's no such thing as "the real system root" any more.

> It provides an extensible data format which allows growth impossible in
> inotify.  I don't know if anyone remember the inotify patches which
> wanted to overload the inotify cookie field for some other information,
> but inotify information extension is not reasonable or backwards
> compatible.

I agree with this (although that's what flags are for -- see clone).

I don't have a problem with the next interface being fanotify (despite
arguing a lot); I just want to see the next one being useful for the
things I would otherwise be proposing my own yet-another-interface
for.  So we don't need a fourth one soon after the third due to
easily foreseen limitations.

> I've got private commitments for two very large anti malware companies,
> both of which unprotect and hack syscall tables in their customer's
> kernels, that they would like to move to an fanotify interface.  Both
> Red Hat and Suse have expressed interest in these patches and have
> contributed to the patch set.
> 
> The patch set is actually rather small (entire set of about 20 patches
> is 1800 lines) as it builds on the fsnotify work already in 2.6.31 to
> reuse code from inotify rather than reimplement the same things over and
> over (like we previously had with inotify and dnotify)

I don't have any problem with either of these, and _fs_notify
generally seems like an improvement.  I don't have a problem with
fanotify either.  For what it does, it's ok.

> Don't know what else to say.....

Answer questions about use-cases that you're not interested in?  Why
block them?  What about Evigny's request for an event without an open
fd - because he needs the pid information (inotify doesn't provide)
but not the fd?

Sorry to be so harsh.  I'm really trying to make sure we don't repeat
the mistakes of dnotify and inotify, and end up with a third interface
which also is too restrictive (because it's good enough for your
anti-malware and HSM customers) so that a fourth interface will be
needed soon after.

I'd like to be able to use it from some applications to accelerate
userspace caching of things (faster Make, faster Samba) without
penalising all other applications touching unrelated parts of the
filesystem.  The attitude "you can live with 10% slowdown" worries me.
I'm sure that can be fixed with a bit of care.

If the intention is to maintain fanotify and inotify side-by-side for
different uses (because fanotify returns open descriptors and blocks
the accessing process until acked), that's ok with me.  It makes
sense.  But then it's messy that neither offers a superset of the
other regarding which files and events are tracked.

If it's right that inotify has no room for extensibility (I'm not sure
about this), than it appears we already made a mess with dnotify and
inotify, so it would be a shame to repeat the same mistakes again.
Let's get the next one right, even it takes a bit longer, ok?

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric Paris Sept. 16, 2009, 9:48 a.m. UTC | #25

On Wed, 2009-09-16 at 08:52 +0100, Jamie Lokier wrote:
> Eric Paris wrote:
> > On Tue, 2009-09-15 at 16:49 -0700, Linus Torvalds wrote:
> > > And btw, I still want to know what's so wonderful about fanotify that we 
> > > would actually want yet-another-filesystem-notification-interface. So I'm 
> > > not sayying that I'll take a system call interface.
> > 
> > The real thing that fanotify provides is an open fd with the event
> > rather than some arbitrary 'watch descriptor' that userspace must
> > somehow magically map back to data on disk.  This means that it could be
> > used to provide subtree notification, which inotify is completely
> > incapable of doing.
> 
> That's a bit of a spurious claim.

My claim that a watch descriptor plus pathname segment sucks is
spurious?  You've got to be kidding me.  You think that a number which
represents the pathname of an object at some point in the past is a
reasonable piece of information?  If a watch descriptor actually
provided any information about the object on which an event just
happened it would be useful.  Sadly, it doesn't.

> - fanotify does not provide subtree notification in it's
>   present form.  When it is extended to do that, why wouldn't
>   inotify be as well?  That's an fsnotify feature, common to both.

Because I don't believe inotify can be reasonably extended in this way.
It's already clear that an arbitrary watch descriptor which userspace
has to somehow know how to correctly map back to an object (impossible
task) is difficult to use and I personally don't see how watch
descriptor + long path name component is somehow better or even
reasonable.  Path names are such crap and passing a pathname to
userspace is really just telling userspace, something happened to
something that used to be at this location but is possibly long since
gone.  I don't believe that's a good interface or one we should be
allowing to be {ab,}used.

> - fanotify does not provide notification at all for some events that
>   you get with inotify.  It is not a superset, so you can't use
>   fanotify to provide a subtree-capable equivalent to inotify.  What
>   a mess when you need the combination of both features!

This is true, although where possible I plan to rectify this situation
on an ongoing basis.  Absolutely nothing preventing that from happening.

> - fanotify requires you call readlink(/proc/fd/N) for every event to
>   get the path.

And inotify requires magic.  Which one sounds better?

>   It's not a particularly efficient way to get it,

What is?

>   especially when an apps wants to know if it's something in it's
>   region of interest but doesn't care about the actual path.
>   When an apps knows it needs the map back to to path, why make it
>   slow to get it?  That "extensible data format" is being
>   underutilised...

You convince Al Viro that the vfs should give us a path name for an
arbitrary object that honestly might not have one and I'll consider
giving it to userspace in the event notification.  Probably should read
some of the AppArmour arguments before you do though.  You're asking for
something that's impossible and is at best incredibly race prone crap.
At worst is a total lie.

> - fanotify's descriptor may be race-prone as a way to get the subtree
>   used for access, because any of the parent directories could have
>   moved and even been deleted before the app calls
>   readlink(/proc/fd/N).  I don't know if a _reliable_ way to track
>   changes in a subtree can be built on it.  Maybe it can but it
>   appears this hasn't been analysed.  It depends on
>   readlink(/proc/fd/N)'s behaviour when the dentry's have been
>   changed, among other things.

Huh?  Your saying that by the time userspace deals with an event it
might have moved?  True.  So what?  Spurious and irrelevant.  You got an
event that happened to this object.  If subtree notification is added at
some later point you may get an event for an object that was in the
subtree and is now not in the subtree.  That sounds fine to me....

> - Does the descriptor cause umount to fail when user does "do some
>   stuff in baz; umount baz", or does it serialise nicely?  That's one
>   of inotify's nice features - it doesn't cause umounts to fail.

Ah yes, you didn't read the code.  Seeing as how that's an fsnotify
property and fanotify notify is built on fsnotify....

> Seriously, what does system-wide fanotify do when run from a
> chroot/namespace/cgroup, and a file outside them is accessed?

At the moment an fanotify global listener is system wide.  Truely system
wide.  A gentleman from suse is looking rectify the problem so that if
run inside a namespace it stays inside the namespace.  Note that this
particular little tidbit is not in the 8 patches I proposed.  At the
moment those just include the UI and basic notification.

> If the event is delivered with file desciptor, that's a security hole.
> If it's not delivered, that sounds like working subtree support?

What?

> I'd expect anti-malware to want to be run inside VMs quite often...

What?

> Answer questions about use-cases that you're not interested in?  Why
> block them?  What about Evigny's request for an event without an open
> fd - because he needs the pid information (inotify doesn't provide)
> but not the fd?

It's not blocked, it's on the list of things to look at.  It's a
relatively simple change to have the mark add code return a watch
descriptor and use that value instead of an fd.  I believe I already
said that.  It certainly doesn't seem like a merge blocker that some
other feature might come along one day.  There's nothing preventing it
from coming out.  "Here is a patch which implements TCP."  "No, don't
merge that, it doesn't implement UDP as well."  "WTF?"

> I'd like to be able to use it from some applications to accelerate
> userspace caching of things (faster Make, faster Samba) without
> penalising all other applications touching unrelated parts of the
> filesystem.  The attitude "you can live with 10% slowdown" worries me.
> I'm sure that can be fixed with a bit of care.

I've said that using ANYTHING which allows userspace to arbitrate access
decisions for an entries system is going to eat performance.  If you
can't handle that, don't build that damn userspace decision making
portion into your kernel!  Don't buy anti-malware snake oil.  

fanotify as a notification system isn't some performance monster here to
eat your babies.  fanotify as an access control system is, and I can't
really do anything about that, it's the nature of that part of the
horrible beast.  Lets not confuse things...

> If the intention is to maintain fanotify and inotify side-by-side for
> different uses (because fanotify returns open descriptors and blocks
> the accessing process until acked)

I know I've told you this before.  fanotify provides TWO separate
things.  Notification with a fd.  A method to provide access control and
blocking.  Lets not lead people to believe you can't you one without the
other.  They are very distinct.

> , that's ok with me.  It makes
> sense.  But then it's messy that neither offers a superset of the
> other regarding which files and events are tracked.

inotify isn't going away any time soon....

> If it's right that inotify has no room for extensibility (I'm not sure
> about this), than it appears we already made a mess with dnotify and
> inotify, so it would be a shame to repeat the same mistakes again.
> Let's get the next one right, even it takes a bit longer, ok?

Calling inotify a "mess" or "completely inextensible" would probably be
too harsh.  There are ways with inotify_init1() that we could provide a
better or more forward looking data format.  But you are still left with
inotify's fatal flaw: userspace is required to map a random number to an
object in the filesystem.  inotify starts with a reasonably impossible
premise and builds from there.

You show me anything about my proposal that limits any use case you have
in mind and I'll fix it.  I'll fix it today.  My proposal does what it
does very well and leaves the door open to build upon it for whatever
has been mentioned over the last 1.25 years I've been trying to get
comments.  I'm certainly not going to wait until fanotify can leap out
of my computer and make a pot of coffee before I ask for it to be merged
(and no, it doesn't have that forward looking extensibility, I'll add a
syscall flag for it right now, O_COFFEE)

-Eric

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Alan Cox Sept. 16, 2009, 10:41 a.m. UTC | #26

> - fanotify does not provide subtree notification in it's
>   present form.  When it is extended to do that, why wouldn't
>   inotify be as well?  That's an fsnotify feature, common to both.

Because inotify gives you no reliable access to the object monitored as
the name passed back is not an object reference and is racy. Inotify is
fine for making pretty icons pop up on desktops and making file
selectors update, but it is somewhat inadequate for indexers and
completely useless for stuff like HSM.

> - fanotify requires you call readlink(/proc/fd/N) for every event to
>   get the path.  It's not a particularly efficient way to get it,

IFF you want the path, but the path isn't usually the most valuable bit.
Plus you'll find the readlink is extremely quick anyway.

> People who want to break out of chroot/namespace jails using the
> conveniently provided open file descriptor? :-)

chroot isn't a security model. You can already do this with AF_UNIX
sockets (and there are apps that intentionally use fchdir that way)

> I'd expect anti-malware to want to be run inside VMs quite often...

Inside of containers - unlikely. Inside of guests sure but thats not
going to relevant to fanotify()

> the accessing process until acked), that's ok with me.  It makes
> sense.  But then it's messy that neither offers a superset of the
> other regarding which files and events are tracked.

Agreed.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Arnd Bergmann Sept. 16, 2009, 11:30 a.m. UTC | #27

On Tuesday 15 September 2009, Eric Paris wrote:
> fanotify_modify_mark_at() --- like inotify_add_watch and rm_watch
> fanotify_modify_mark_fd() --- same but with an fd instead of a path

I think these two can be merged into one without adding complexity,
in the same way that sys_utimensat can take a file descriptor or
a path or both.

> fanotify_response() --- userspace tells the kernel what to do if requested/allowed
>    (could probably be done using write() to the fanotify fd)
> fanotify_exclude() --- a horrid syscall which might be better as an ioctl since it isn't strongly typed....

Please don't use an ioctl here. While ioctl is fine for character devices
and sort of fine for sockets, I think it would be very bad style to use
it on file descriptors that you get back from specialized syscalls like
fanotify_init. Do one or the other, but do it consistently.

Why is it not strongly typed anyway? Something like

int fanotify_ignore_sb(int fanotify_fd, unsigned int flag,
                       long f_type, fsid_t f_fsid);

would be type safe, although I think it would be better to only handle
one of the two cases. Can you think of a case that you can't handle if
you have to decide between them and only do one interface (f_type or fsid)?

	Arnd <><
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jamie Lokier Sept. 16, 2009, 11:41 a.m. UTC | #28

Alan Cox wrote:
> > - fanotify does not provide subtree notification in it's
> >   present form.  When it is extended to do that, why wouldn't
> >   inotify be as well?  That's an fsnotify feature, common to both.
> 
> Because inotify gives you no reliable access to the object monitored as
> the name passed back is not an object reference and is racy. Inotify is
> fine for making pretty icons pop up on desktops and making file
> selectors update, but it is somewhat inadequate for indexers and
> completely useless for stuff like HSM.

That was my point.  (Why do people keep not getting it?)

You can't rely on the name being non-racy, but you _can_ reliably
invalidate application-level caches from the sequence of events
including file writes, creates, renames, links, unlinks, mounts.  And
revalidate such caches by the absence of pending events.

(There is one obscure case which inotify is missing, though, which
means it cannot detect file changes in certain cases with hard links.
I intend to fix that one.)

For that, an inode isn't useful, a descriptor isn't useful, a
directory descriptor/inode and pathname isn't useful, and file write
events by themselves aren't useful.  None of them quite do it by
themselves.

But with the correct combination of events, you can maintain very
efficient application-level caching of file data / directory listing
and lookups / stat results you have previously read from the
filesystem.  That's because the information you have previously
depended upon, including path lookups, are all notified as one sort of
inotify event or another when changed.

Which doesn't sound all that special until you realise you can very
quickly revalidate application-caches of any data structure calculated
from reading things from the filesystem, no matter how many
prerequisites or how complex the data structures, in a single system
call.  Amortised over many revalidations if you have them in parallel.

That can apply to things like git, make, ccache, samba, rsync, httpd
path walks, and virtually any "web templating" framework.  Of course
it takes userspace support as well, but that's where I'm coming from
regarding "acceleration" and the essential kernel infrastructure.

Clearly, I'm going to have to explain with working code :-)

> but it is somewhat inadequate for indexers

For indexers, the real inadequacy is the need to attach inotify
watches to every directory at system startup, and to stat() everything
to check it hasn't changed since the indexer was last running.  Both
are very slow on a large directory tree.  The former can be dealt with
using subtree watches (yes, even with hard links - I have proposed an
algorithm for this but I think nobody understood it ;-).  The latter
needs filesystem support for a persistent change attribute.

> > - fanotify requires you call readlink(/proc/fd/N) for every event to
> >   get the path.  It's not a particularly efficient way to get it,
> 
> IFF you want the path, but the path isn't usually the most valuable bit.
> Plus you'll find the readlink is extremely quick anyway.

I agree, you don't usually want the whole path.

So what was the point about fanotify making subtree tracking possible
with it's file descriptor, if not by readlink(/proc/fd/N)?
Descriptors don't tell you which subtree a file is in any better than
inotify watches.  I.e. they do, if you track them and their containing
directories all individually.

> > People who want to break out of chroot/namespace jails using the
> > conveniently provided open file descriptor? :-)
> 
> chroot isn't a security model. You can already do this with AF_UNIX
> sockets (and there are apps that intentionally use fchdir that way)

Ah, no.  AF_UNIX works with explicit sender cooperation.

fanotify gives you access to files without sender cooperation, as it
intercepts every open().

> > I'd expect anti-malware to want to be run inside VMs quite often...
> 
> Inside of containers - unlikely.

Why not?  Some people run entire distributions in containiners, and
present them as VMs to the world for other people to admin.

> > the accessing process until acked), that's ok with me.  It makes
> > sense.  But then it's messy that neither offers a superset of the
> > other regarding which files and events are tracked.
> 
> Agreed.

In the end this is my main gripe.

-- Jamie

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Alan Cox Sept. 16, 2009, 12:01 p.m. UTC | #29

> You can't rely on the name being non-racy, but you _can_ reliably
> invalidate application-level caches from the sequence of events
> including file writes, creates, renames, links, unlinks, mounts.  And
> revalidate such caches by the absence of pending events.

You can't however create the caches reliably because you've no idea if
you are referencing the right object in the first place - which is why
you want a handle in these cases. I see fanotify as a handle producing
addition to inotify, not as a replacement (plus some other bits around
open blocking for HSM etc)

> Clearly, I'm going to have to explain with working code :-)

Always a good demo

> > but it is somewhat inadequate for indexers
> 
> For indexers, the real inadequacy is the need to attach inotify
> watches to every directory at system startup, and to stat() everything
> to check it hasn't changed since the indexer was last running.  Both

stat doesn't help you - inode numbers are only guaranteed unqiue (and
constant) while a reference to the object is held.

> Descriptors don't tell you which subtree a file is in any better than
> inotify watches.  I.e. they do, if you track them and their containing
> directories all individually.

Don't get me wrong - I don't think fanotify is sufficient on its own -
and this is one reason. Some things care about the namespace, some about
getting the exact content.

> > chroot isn't a security model. You can already do this with AF_UNIX
> > sockets (and there are apps that intentionally use fchdir that way)
> 
> Ah, no.  AF_UNIX works with explicit sender cooperation.
> 
> fanotify gives you access to files without sender cooperation, as it
> intercepts every open().

and is currently not general user accessible for this reason.

> > Inside of containers - unlikely.
> 
> Why not?  Some people run entire distributions in containiners, and
> present them as VMs to the world for other people to admin.

In a word - performance. In two words performance and security. It isn't
a sensible setup because you want to scan the most efficient way possible
and you want to keep your malware scan as far away from attackers, so it
makes sense to keep it outside of the containers and do the job once
- one scan, one database of things to keep current.

Alan
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Evgeniy Polyakov Sept. 16, 2009, 12:05 p.m. UTC | #30

On Tue, Sep 15, 2009 at 05:54:59PM -0400, Eric Paris (eparis@redhat.com) wrote:
> Nothing's impossible, but is netlink a square peg for this round hole?
> One of the great benefits of netlink, the attribute matching and
> filtering, although possibly useful isn't some panacea as we have to do
> that well before netlink to have anything like decent performance.
> Imagine every single fs event creating an skb and sending it with
> netlink only to have most of them dropped.

There is no problem with performance even with single IO per skb.
Consider usual send/recv calls which may end up with the same skb per
syscall - most of the overhead comes from data copy or syscall machinery
(for small writes) and not from allocation path.

I have a 3.5 years old performance graph at
http://www.ioremap.net/gallery/netlink_perf.png

which shows 400 MB/s of bandwidth for 4k writes, I'm pretty sure it is
limited by copy performance only.

> The only other benefit to netlink that I know of is the reasonable,
> easy, and clean addition of information later in time with backwards
> compatibility as needed.  That's really cool, I admit, but with the
> limited amount of additional info that users have wanted out of inotify
> I think my data type extensibility should be enough.

I want alot from inotify which I'm afraid will not be easy with fanotify
either, but its existing model just does not allow its extension. I
would not be 100% sure that there will be no additional needs in a year
or so for fanotify.

> >  Moreover you can implement a pool of working threads and
> > postpone all the work to them and appropriate event queues, which will
> > allow to use rlimits for the listeners and open files 'kind of' on
> > behalf of those processes.
> 
> I'm sorry, I don't userstand.  I don't see how worker threads help
> anything here.  Can you explain what you are thinking?

I meant that it could be possible to postpone all the work of queueing,
event allocation, fd opening and population all be done on behalf of
some other threads in the system and only original process credentials
would be checked to satisfy various limits. In this case there will be
no questions in which context given fd was created and it is possible to
use async netlink nature.

I do not force you to do this of course, but there is already quite huge
infrastructure for similar tasks and it could be worth to
change/reconsider things to use existing models and not invent own.
Of course this is a matter of overall benefit.

> > But it is quite diferent from the approach you selected and which is
> > more obvious indeed. So if you ask a question whether fanotify should
> > use sockets or syscalls, I would prefer sockets
> 
> I've heard someone else off list say this as well.  I'm not certain why.
> I actually spent the day yesterday and have fanotify working over 5 new
> syscalls (good thing I wrote the code with separate back and and front
> ends for just this purpose)  And I really don't hate it.  I think 3
> might be enough.
> 
> fanotify_init() ---- very much like inotify_init
> fanotify_modify_mark_at() --- like inotify_add_watch and rm_watch
> fanotify_modify_mark_fd() --- same but with an fd instead of a path

Those two can be combined I think.

> fanotify_response() --- userspace tells the kernel what to do if requested/allowed
>    (could probably be done using write() to the fanotify fd)
> fanotify_exclude() --- a horrid syscall which might be better as an ioctl since it isn't strongly typed....

It all sounds good and simple, but what if you will need modify command
with new arguments? Instead of adding new typed option you will need to
add another syscall. I already did that for inotify but via ioctl and
pretty sure there will be such need for much wider fanotify some time in
the future.

> I don't see what's gained using netlink.  I am already reusing the
> fsnotify code to do all my queuing.  Someone help me understand the
> benefit of netlink and help me understand how we can reasonably meet the
> needs and I'll try to prototype it.
> 
> 1) fd's must be opened in the recv process

Or just injected into registered process' fd table with appropriate
limit checks? In this case it can be done on behalf of whatever other
worker.

> 2) reliability, if loss must know on the send side

You have this knowledge at netlink sending time, but there is no way to
wait until 'fail' condition is removed like when you can block writing
into socket waiting for buffer space to become large enough.

And there is no way to tell how many listeners got message and how many
was dropped in multicast deliver except that there were drops.
This can be trivially extended though.

Jamie Lokier Sept. 16, 2009, 12:17 p.m. UTC | #31

Eric Paris wrote:
> On Wed, 2009-09-16 at 08:52 +0100, Jamie Lokier wrote:
> > Eric Paris wrote:
> > > On Tue, 2009-09-15 at 16:49 -0700, Linus Torvalds wrote:
> > > rather than some arbitrary 'watch descriptor' that userspace must
> > > somehow magically map back to data on disk.  This means that it could be
> > > used to provide subtree notification, which inotify is completely
> > > incapable of doing.
> > 
> > That's a bit of a spurious claim.
> 
> My claim that a watch descriptor plus pathname segment sucks is
> spurious?  You've got to be kidding me.  You think that a number which
> represents the pathname of an object at some point in the past is a
> reasonable piece of information?  If a watch descriptor actually
> provided any information about the object on which an event just
> happened it would be useful.  Sadly, it doesn't.

It's subtler than that.  Too subtle, apparently.

For directory changes, the event represents the path in that
directory, which is useful, yes, even though the path may not be valid
by the time you receive the message.

It's useful because it tells userspace to invalidate any cached
information about that path, and because the _lack_ of such a message
after reading the inotify descriptor tells you that the path was valid
before the point where you started the read (if not memory-coherent,
then good enough for some applications).

I know, I know, it's sounding a bit obscure and questionable.

This is going to need explaining with real userspace code, so I'll
hack on that (when I have time - next few days) and get back to you.

I see now that my arguments aren't helping without clear working code,
mainly because you believe I'm talking poo and haven't a clue what I'm
talking about.

> It's already clear that an arbitrary watch descriptor which userspace
> has to somehow know how to correctly map back to an object (impossible
> task) is difficult to use and I personally don't see how watch
> descriptor + long path name component is somehow better or even
> reasonable.  Path names are such crap and passing a pathname to
> userspace is really just telling userspace, something happened to
> something that used to be at this location but is possibly long since
> gone.  I don't believe that's a good interface or one we should be
> allowing to be {ab,}used.

It's subtler than that.  Application caches depend on lookups as well
as inode operations.  A simple stat("/foo/bar") does.  Those path
components in events are the only way to invalidate/revalidate data
dependent on lookups.  Inodes (fstat->st_ino from the descriptor
returned by fanotify) are insufficient and cannot be used to
revalidate path lookups or data dependent on them.

Yes, it works even though the paths in events are not valid by the
time you get them.  In fact it _depends_ on getting those paths which
aren't valid by the time you get them.

> >   especially when an apps wants to know if it's something in it's
> >   region of interest but doesn't care about the actual path.
> >   When an apps knows it needs the map back to to path, why make it
> >   slow to get it?  That "extensible data format" is being
> >   underutilised...
> 
> You convince Al Viro that the vfs should give us a path name for an
> arbitrary object that honestly might not have one and I'll consider
> giving it to userspace in the event notification.  Probably should read
> some of the AppArmour arguments before you do though.  You're asking for
> something that's impossible and is at best incredibly race prone crap.
> At worst is a total lie.

No, I'm asking for something that clearly is not being understood here

I'm well aware of AppArmour and it's races etc.; it doesn't apply.

> > Seriously, what does system-wide fanotify do when run from a
> > chroot/namespace/cgroup, and a file outside them is accessed?
> 
> At the moment an fanotify global listener is system wide.  Truely system
> wide.  A gentleman from suse is looking rectify the problem so that if
> run inside a namespace it stays inside the namespace.  Note that this
> particular little tidbit is not in the 8 patches I proposed.  At the
> moment those just include the UI and basic notification.

I'll be really interested in the gentleman's solution.

In general bind mounts complicate cache maintenance with inotify
rather a lot.  That's another corner case to tidy up.  Bind mounts and
namespaces have a lot in common.

> Because I don't believe inotify can be reasonably extended in this way.

I do, so give me a few days to code something and explain / settle it.

I don't think I can code it using fanotify - the descriptor doesn't
provide enough context.  I think that's why we see them so differently
- the different ways to use the events means that sometimes the
inotify info does not work, and sometimes the fanotify descriptor does
not work, because each provides some critical information (subtly)
that the other does not.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jamie Lokier Sept. 16, 2009, 12:27 p.m. UTC | #32

Evgeniy Polyakov wrote:
> It all sounds good and simple, but what if you will need modify command
> with new arguments? Instead of adding new typed option you will need to
> add another syscall. I already did that for inotify but via ioctl and
> pretty sure there will be such need for much wider fanotify some time in
> the future.

Ew, that's unpleasant (adding to inotify via ioctl).

You're right, it's not easy to add new syscalls, but it's not that
hard either.

I'd forgotten about Linus' strace argument.  That's a good one.

Of course everything should be a syscall by that argument :-)

And strace can trace some ioctls and setsockopts.  (But it's never
pretty to see isatty() showing in strace as SNDCTL_TMR_TIMEBASE :-)

Strace never shows structure of reads and writes to devices, so
although efficient (you can batch), it's not nice for tracing.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jamie Lokier Sept. 16, 2009, 12:56 p.m. UTC | #33

Alan Cox wrote:
> > You can't rely on the name being non-racy, but you _can_ reliably
> > invalidate application-level caches from the sequence of events
> > including file writes, creates, renames, links, unlinks, mounts.  And
> > revalidate such caches by the absence of pending events.
> 
> You can't however create the caches reliably because you've no idea if
> you are referencing the right object in the first place - which is why
> you want a handle in these cases. I see fanotify as a handle producing
> addition to inotify, not as a replacement (plus some other bits around
> open blocking for HSM etc)

There are two sets of events getting mixed up here.  Inode events -
reads, writes, truncates, chmods; and directory events - renames,
links, creates, unlinks.

Inode events alone _not enough_ to maintain caches, and here's why.

With a file descriptor for an _inode_ event, that's fine.  If you have
{ int fd1 = open("/foo/bar"), fd2 = open("/foo/baz"); } early in your
program, and later cached_file_read(fd1) and cached_file_read(fd2),
you have to recognise the inode number and invalidate both.

You have to call fstat() on the event's descriptor and then look up a
device+inode number in your own table.  (The inotify way doesn't need
the fstat() but is otherwise the same).

That's fine for files you're keeping open and only want to know if the
content changes _of an open file_.

But that's not so useful.

More often, you want to validate cached_file_read("/foo/bar").  That
is, validate what you'd get if you opened that path _now_ and read it.
Same for cached_stat("/foo/bar") to cache permissions, and other
things like that.

That needs to validate the path lookup _and_ the inode state.

For that, we need directory events, and they must include the name in
the directory that's affected.  If you receive a directory event
involving name "bar" in directory (identified by inode) "/foo", you
invalidate cached_file_read("/foo/bar") and cached_stat("/foo/bar").

Oh, but wait, how do we know the inode for the directory in our event
still refers to "/foo"?  Answer: We're also watching it's parent
directory "/".  Assuming no reordering of certain events, that's ok.

That way, by watching "/", "/foo" and "/foo/bar", when you receive no
events you validate the results of cached_file_read("/foo/bar") and
cached_stat("/foo/bar").  A lot to set up, but fast to check.  Worth
it if you're checking a lot of things that rarely change.

If you receive inode events while watching the parent directory of the
path used to access the inode, then you can avoid watching "/foo/bar",
and just watch the path of parent directories.  That saves an order of
magnitude of watches typically.  fanotify offers something similar,
and in this case the event is probably more useful than inotify's.

(The above is even hard-link-safe, if you do it right.  I won't
complicate the explanation with details).

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric Paris Sept. 16, 2009, 3:53 p.m. UTC | #34

On Wed, 2009-09-16 at 13:56 +0100, Jamie Lokier wrote:
> Alan Cox wrote:
> > > You can't rely on the name being non-racy, but you _can_ reliably
> > > invalidate application-level caches from the sequence of events
> > > including file writes, creates, renames, links, unlinks, mounts.  And
> > > revalidate such caches by the absence of pending events.
> > 
> > You can't however create the caches reliably because you've no idea if
> > you are referencing the right object in the first place - which is why
> > you want a handle in these cases. I see fanotify as a handle producing
> > addition to inotify, not as a replacement (plus some other bits around
> > open blocking for HSM etc)
> 
> There are two sets of events getting mixed up here.  Inode events -
> reads, writes, truncates, chmods; and directory events - renames,
> links, creates, unlinks.

My understanding of you argument is that fanotify does not yet provide
all inotify events, namely those of directories operations and thus is
not suitable to wholesale replace everything inotify can do.  I've
already said that working towards that goal is something I plan to
pursue, but for now, you still have inotify.

The mlocate/updatedb people ask me about fanotify and it's on the todo
list to allow global reception of of such events.  The fd you get would
be of the dir where the event happened.  They didn't care, and I haven't
decided if we would provide the path component like inotify does.  Most
users are perfectly happy to stat everything in the dir.

It's hopefully feasible, but it's going to take some fsnotify hook
movements and possibly so arguments with Al to get the information I
want where I want it.  But there is nothing about the interface that
precludes it and it has been discussed and considered.

Am I still missing it?

-Eric

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jamie Lokier Sept. 16, 2009, 9:49 p.m. UTC | #35

Eric Paris wrote:
> On Wed, 2009-09-16 at 13:56 +0100, Jamie Lokier wrote:
> > Alan Cox wrote:
> > > > You can't rely on the name being non-racy, but you _can_ reliably
> > > > invalidate application-level caches from the sequence of events
> > > > including file writes, creates, renames, links, unlinks, mounts.  And
> > > > revalidate such caches by the absence of pending events.
> > > 
> > > You can't however create the caches reliably because you've no idea if
> > > you are referencing the right object in the first place - which is why
> > > you want a handle in these cases. I see fanotify as a handle producing
> > > addition to inotify, not as a replacement (plus some other bits around
> > > open blocking for HSM etc)
> > 
> > There are two sets of events getting mixed up here.  Inode events -
> > reads, writes, truncates, chmods; and directory events - renames,
> > links, creates, unlinks.
> 
> My understanding of you argument is that fanotify does not yet provide
> all inotify events, namely those of directories operations and thus is
> not suitable to wholesale replace everything inotify can do.

Largely that, plus seeing fanotify look like it'll acquire some
capabilities that would be useful with those inotify events and
inotify not getting them.  Bothered by the apparent direction of
development, really.

Btw, I'm not sure you can use inotify+fanotify together simultaneously
in this way, which may be of benefit - caching might help the
anti-malware-style access controls.  I'll have to think carefully
about ordering of some events, and using fanotify and inotify
independently at the same time loses that ordering.

> I've already said that working towards that goal is something I plan
> to pursue,

Sorry, I missed that, just as I didn't find a reply to Evigny's "I
need pids".  And from another mail, I thought you were stopping at the
things with file descriptors.

> but for now, you still have inotify.

That's right.  And it sucks for subtrees, so that's why I'd like to
absorb improvements on subtree inclusions, and exclusion nodes look
useful too.

> The mlocate/updatedb people ask me about fanotify and it's on the todo
> list to allow global reception of of such events.  The fd you get would
> be of the dir where the event happened.  They didn't care, and I haven't
> decided if we would provide the path component like inotify does.  Most
> users are perfectly happy to stat everything in the dir.

mlocate/updatedb is relatively low performance and of course wants to
be system-wide.  It's not looking so good if a user wants an indexer
just on their /home, and the administrator does not want everyone else
to pay the cost.

But I think we're quite agreed on how useful subtrees would be.
System-wide events won't be needed if we can monitor the / subtree
to get the same effect, and that'll also sort out namespaces and chroots.

Stat'ing every entry in a dir event.  Thinking out loud:

   1. Stat'ing everything in a dir just to find out which 1 file was
      deleted can be quite expensive for some uses (if you have a
      large dir and it happens a lot), and is unpleasant just because
      each change _should_ result in about O(1) work.  Taste, style,
      elegance ;-)

      For POSIX filesystems, I don't see any logical problem with
      this, actually.  You don't need to call stat()!  It's enough to
      call readdir() and look at d_ino to track
      renames/links/creates/unlinks - assuming only directory-change
      events are relevant here.

      Just an unpleasant O(n) scaling with directory size.

      (Note that I ignore mount points not returning the correct
      d_ino, because apps can track the mount list and
      compensate; they should be doing this anyway).

   2. updatedb-style indexing apps don't care about the
      readdir/stat-all-entries cost, because they don't need to read
      the directory after every change, they only need to do it once
      every 24 hours if any events were received in that interval!

      (Obviously this isn't the same for pseudo-real-time indexers.)

      For Samba-style caching, on the other hand, the cost of
      rescanning a large directory when one file is being read often
      and another file in it is changing often might be prohibitive,
      forcing it to use heuristics to decide when to monitor a
      directory and when not to to cache it, depending on directory
      size.  I'd rather avoid that.

   3. Non-POSIX filesystems don't always have stable inode numbers.
      You can't tell that foo was renamed to bar by reading the
      directory and looking at d_ino, or by calling stat on each entry.

      You can assume stable inode numbers for inodes where there's an
      open file descriptor; that *might* be just enough to squeeze
      through the logic of a cache.  I'm not sure right now.

   4. You can't tell when file contents are changed from stat info.

      That means you have to receive an inode event, not a directory
      event for data changes, but that's not a problem of course - the
      name-used-for-access isn't useful for data changes anyway
      (except for debugging perhaps).

   5. stat() doesn't tell you about xattr and ACL changes.  xattrs can
      be large and slow to read on a whole directory.  But as point 4,
      if attribute changes count as inode changes, there's no problem.

   6. Calling stat() pulls a lot into cache that doesn't need to be in
      cache: all those inodes.  But as I mentioned in points 1, 4 and
      5, provided only directory name operations pass the directory to
      be scanned, and inode operations always pass the inode, you can
      use readdir() and avoid stat(), so the inodes don't have to be
      pulled into cache after all.

      Except for non-POSIX inode instability.  Would be good to work
      out if that breaks the algorithm.

In summary, calling readdir() and maybe stat/getxattr on each entry in
a directory might be workable, but I'd rather it was avoidable.
Simple apps may prefer to do it anyway - and let multiple events in a
directory be merged as a result.

While I'm here it would be nice to receive one event instead of two
for operations which involve two paths: link, rename and bind mount.
Having to pair up two events from inotify isn't helpful in any way.

Imho an API that satisfies everything we've talking about would let
you specify which fields you want to receive in the event when you
bind a listener.  Not _everything_ is selectable of course, but
whether you want:

    For inode events (data read/write, attribute/ACL/xattr changes):

    - Open file descriptor of the affected file [Optional].
    - The inode number and device number (always?).
    - A way to identify the vfsmount (due to bind mounts making the
      device number insufficient to identify a directory; always?).

    For directory events (create/unlink/link/rename/reflink/mkdir/rmdir
    /mount/umount):

    - Same as inode above, for the object created/linked/deleted.

    - Same as inode above, for the directory containing the source name.
    - Source name [Optional].
    - Same as above, for the directory containing the target name
    - Target name [Optional]

    Source and target are the two names for
    rename/link/reflink/bind-mount operations.  Otherwise there is
    only one name to include.

Ironically, it begins to look a bit like netlink ;-)

As you can see, I've made the open descriptors optional, and the names
for directory events optional.  For directory events, the object
descriptor option should be independent from the source/target
directory descriptor option.

Add one more option: wait for Ack before file accessing process can
proceed, or don't require Ack.  That basically distinguishes inotify
behaviour from fsnotify behaviour.

It's not obvious, but that option's useful for directory events too,
if you think about it: Think like an anti-malware or other access
control manager, and ask: what if I have to block something which
depends on the layout of files?  Just as directory events are enough
for caching, they are enough for complete access control of
layout-dependent state too.  For example, some line of text is no
problem in a random file, but might be forbidden by the access manager
from appearing in .bash_login, including by "mv harmless .bash_login".

The above is not a final proposal, but I'd be delighted if you'd take
a look at whether it's suitable.  I realise some things may not work
out for implementation reasons.

Meanwhile, I'll take a look at userspace code for my caching algorithm
and see how well that works out.  I think we'll get subtree monitors
out of this before the month is over...

> It's hopefully feasible, but it's going to take some fsnotify hook
> movements and possibly so arguments with Al to get the information I
> want where I want it.

That may, indeed, be a sticking point :-)

> But there is nothing about the interface that
> precludes it and it has been discussed and considered.
> 
> Am I still missing it?

No I think we're on the same wavelength now.  Thanks for being
patient.  (And thanks, Alan, for stepping in and making me describe
what I had in mind better).

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric Paris Sept. 16, 2009, 10:33 p.m. UTC | #36

On Wed, 2009-09-16 at 22:49 +0100, Jamie Lokier wrote:
> Eric Paris wrote:
> > On Wed, 2009-09-16 at 13:56 +0100, Jamie Lokier wrote:
> > > Alan Cox wrote:
> > > > > You can't rely on the name being non-racy, but you _can_ reliably
> > > > > invalidate application-level caches from the sequence of events
> > > > > including file writes, creates, renames, links, unlinks, mounts.  And
> > > > > revalidate such caches by the absence of pending events.
> > > > 
> > > > You can't however create the caches reliably because you've no idea if
> > > > you are referencing the right object in the first place - which is why
> > > > you want a handle in these cases. I see fanotify as a handle producing
> > > > addition to inotify, not as a replacement (plus some other bits around
> > > > open blocking for HSM etc)
> > > 
> > > There are two sets of events getting mixed up here.  Inode events -
> > > reads, writes, truncates, chmods; and directory events - renames,
> > > links, creates, unlinks.
> > 
> > My understanding of you argument is that fanotify does not yet provide
> > all inotify events, namely those of directories operations and thus is
> > not suitable to wholesale replace everything inotify can do.
> 
> Largely that, plus seeing fanotify look like it'll acquire some
> capabilities that would be useful with those inotify events and
> inotify not getting them.  Bothered by the apparent direction of
> development, really.
> 
> Btw, I'm not sure you can use inotify+fanotify together simultaneously
> in this way, which may be of benefit - caching might help the
> anti-malware-style access controls.  I'll have to think carefully
> about ordering of some events, and using fanotify and inotify
> independently at the same time loses that ordering.

As soon as you say "ordering" you lose   :)

> 
> > I've already said that working towards that goal is something I plan
> > to pursue,
> 
> Sorry, I missed that, just as I didn't find a reply to Evigny's "I
> need pids".  And from another mail, I thought you were stopping at the
> things with file descriptors.
> 
> > but for now, you still have inotify.
> 
> That's right.  And it sucks for subtrees, so that's why I'd like to
> absorb improvements on subtree inclusions, and exclusion nodes look
> useful too.
> 
> > The mlocate/updatedb people ask me about fanotify and it's on the todo
> > list to allow global reception of of such events.  The fd you get would
> > be of the dir where the event happened.  They didn't care, and I haven't
> > decided if we would provide the path component like inotify does.  Most
> > users are perfectly happy to stat everything in the dir.
> 
> mlocate/updatedb is relatively low performance and of course wants to
> be system-wide.  It's not looking so good if a user wants an indexer
> just on their /home, and the administrator does not want everyone else
> to pay the cost.
> 
> But I think we're quite agreed on how useful subtrees would be.
> System-wide events won't be needed if we can monitor the / subtree
> to get the same effect, and that'll also sort out namespaces and chroots.
> 
> Stat'ing every entry in a dir event.  Thinking out loud:
> 
>    1. Stat'ing everything in a dir just to find out which 1 file was
>       deleted can be quite expensive for some uses (if you have a
>       large dir and it happens a lot), and is unpleasant just because
>       each change _should_ result in about O(1) work.  Taste, style,
>       elegance ;-)
>
>       For POSIX filesystems, I don't see any logical problem with
>       this, actually.  You don't need to call stat()!  It's enough to
>       call readdir() and look at d_ino to track
>       renames/links/creates/unlinks - assuming only directory-change
>       events are relevant here.
> 
>       Just an unpleasant O(n) scaling with directory size.
> 
>       (Note that I ignore mount points not returning the correct
>       d_ino, because apps can track the mount list and
>       compensate; they should be doing this anyway).

I think we generally agree here.

> 
>    2. updatedb-style indexing apps don't care about the
>       readdir/stat-all-entries cost, because they don't need to read
>       the directory after every change, they only need to do it once
>       every 24 hours if any events were received in that interval!
> 
>       (Obviously this isn't the same for pseudo-real-time indexers.)

See, that's why the mlocate/updatedb contacted me.  They don't want it
to be a 24 hour thing.  They was pseudo real time without thrashing the
hd.  Turns out, the interface is there, and the backend just needs
struct path information available to give it to them.

>       For Samba-style caching, on the other hand, the cost of
>       rescanning a large directory when one file is being read often
>       and another file in it is changing often might be prohibitive,
>       forcing it to use heuristics to decide when to monitor a
>       directory and when not to to cache it, depending on directory
>       size.  I'd rather avoid that.

This seems like you are confusing the events that happen to a directory
(mv/rename) and the events that happen to a file (read/write.)   (The
former I do not have code for, the latter I do) Which is surprising
since you so clearly delineate them later, so maybe I'm
misunderstanding.

>    3. Non-POSIX filesystems don't always have stable inode numbers.
>       You can't tell that foo was renamed to bar by reading the
>       directory and looking at d_ino, or by calling stat on each entry.
> 
>       You can assume stable inode numbers for inodes where there's an
>       open file descriptor; that *might* be just enough to squeeze
>       through the logic of a cache.  I'm not sure right now.
> 
>    4. You can't tell when file contents are changed from stat info.
> 
>       That means you have to receive an inode event, not a directory
>       event for data changes, but that's not a problem of course - the
>       name-used-for-access isn't useful for data changes anyway
>       (except for debugging perhaps).
> 
>    5. stat() doesn't tell you about xattr and ACL changes.  xattrs can
>       be large and slow to read on a whole directory.  But as point 4,
>       if attribute changes count as inode changes, there's no problem.
> 
>    6. Calling stat() pulls a lot into cache that doesn't need to be in
>       cache: all those inodes.  But as I mentioned in points 1, 4 and
>       5, provided only directory name operations pass the directory to
>       be scanned, and inode operations always pass the inode, you can
>       use readdir() and avoid stat(), so the inodes don't have to be
>       pulled into cache after all.
> 
>       Except for non-POSIX inode instability.  Would be good to work
>       out if that breaks the algorithm.
> 
> In summary, calling readdir() and maybe stat/getxattr on each entry in
> a directory might be workable, but I'd rather it was avoidable.
> Simple apps may prefer to do it anyway - and let multiple events in a
> directory be merged as a result.
> 
> While I'm here it would be nice to receive one event instead of two
> for operations which involve two paths: link, rename and bind mount.
> Having to pair up two events from inotify isn't helpful in any way.

What do you propose the format of the event should be.  Is this
precluded in what's been proposed?

> Imho an API that satisfies everything we've talking about would let
> you specify which fields you want to receive in the event when you
> bind a listener.  Not _everything_ is selectable of course, but
> whether you want:
> 
>     For inode events (data read/write, attribute/ACL/xattr changes):
> 
>     - Open file descriptor of the affected file [Optional].

Could be added and I've agreed to take a look.  I'm just not sure
bringing back the major flaw of inotify is really moving us forward.

>     - The inode number and device number (always?).

hmmm, if you have the fd you have both, if you choose to get a wd like
inotify, I say you're still own your own to do the magic map.  I don't
want to copy tons of almost always useless data into userspace.

>     - A way to identify the vfsmount (due to bind mounts making the
>       device number insufficient to identify a directory; always?).

Now you want reliable path names?  I need a vfsmount in kernel to open
the fd for userspace, but I don't see how that's translatable to
anything even remotely useable by userspace....

>     For directory events (create/unlink/link/rename/reflink/mkdir/rmdir
>     /mount/umount):
> 
>     - Same as inode above, for the object created/linked/deleted.
> 
>     - Same as inode above, for the directory containing the source name.
>     - Source name [Optional].
>     - Same as above, for the directory containing the target name
>     - Target name [Optional]
> 
>     Source and target are the two names for
>     rename/link/reflink/bind-mount operations.  Otherwise there is
>     only one name to include.
> 
> Ironically, it begins to look a bit like netlink ;-)
> 
> As you can see, I've made the open descriptors optional, and the names
> for directory events optional.  For directory events, the object
> descriptor option should be independent from the source/target
> directory descriptor option.
> 
> Add one more option: wait for Ack before file accessing process can
> proceed, or don't require Ack.  That basically distinguishes inotify
> behaviour from fsnotify behaviour.

Those 2 things are completely independent.  If you request read with
blocking you get read with blocking.  If you request just read, you get
just read.

> It's not obvious, but that option's useful for directory events too,
> if you think about it: Think like an anti-malware or other access
> control manager, and ask: what if I have to block something which
> depends on the layout of files?  Just as directory events are enough
> for caching, they are enough for complete access control of
> layout-dependent state too.  For example, some line of text is no
> problem in a random file, but might be forbidden by the access manager
> from appearing in .bash_login, including by "mv harmless .bash_login".

I'd say they can realize the bad data when .bash_login is next opened
and deny access then  :)

No, it's honestly a good idea, and one that is going to likely take
serious hook movement.  A lot of these things can go on the todo list,
but don't sound like show stoppers to me....

> > It's hopefully feasible, but it's going to take some fsnotify hook
> > movements and possibly so arguments with Al to get the information I
> > want where I want it.
> 
> That may, indeed, be a sticking point :-)

My silent comrade from suse is looking at moving some hooks as we speak
so hopefully directory events can get added shortly after a merge.  I
don't think we should wait until every conceived (but not necessarily
needed) possibility is coded.  We have things that work, meet needs, and
hopefully you'll agree leave us with a place to go in the future.  Do
you?

-Eric

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Linus Torvalds Sept. 17, 2009, 4:40 p.m. UTC | #37

On Wed, 16 Sep 2009, Jamie Lokier wrote:
> 
> I'd forgotten about Linus' strace argument.  That's a good one.
> 
> Of course everything should be a syscall by that argument :-)

Oh yes, everything _should_ be a syscall.

The problem is that many things are too "amorphous" to be system calls, 
and don't have any sane generic semantics (ie they only act on a specific 
device node). So we have ioctl's etc for those things.

And then we have page faults. I've long wished that from a system call 
tracing standpoint we could show page faults as pseudo-system-calls (at 
least as long as they happen from user space - trying to handle nesting is 
not worth it). It would make it _so_ much more obvious what the 
performance patterns are if you could just do

	strace -ttT firefox

for the cold-cache case and you'd see where the time is really spent.

(yeah, yeah, you can get that kind of information other ways, but it's a 
hell of a lot less convenient than just getting a nice trace with 
timestamps).

> And strace can trace some ioctls and setsockopts.  (But it's never
> pretty to see isatty() showing in strace as SNDCTL_TMR_TIMEBASE :-)

Yes, strace can fix things up, and show "send a packet" as "fanotify". But 
it's nasty and hard. 

Quite frankly, I have _never_ever_ seen a good reason for talking to the 
kernel with some idiotic packet interface. It's just a fancy way to do 
ioctl's, and everybody knows that ioctl's are bad and evil. Why are fancy 
packet interfaces suddenly much better?

			Linus
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Arjan van de Ven Sept. 17, 2009, 5:35 p.m. UTC | #38

On Thu, 17 Sep 2009 09:40:16 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> And then we have page faults. I've long wished that from a system
> call tracing standpoint we could show page faults as
> pseudo-system-calls (at least as long as they happen from user space
> - trying to handle nesting is not worth it). It would make it _so_
> much more obvious what the performance patterns are if you could just
> do
> 
> 	strace -ttT firefox
> 
> for the cold-cache case and you'd see where the time is really spent.
> 
> (yeah, yeah, you can get that kind of information other ways, but
> it's a hell of a lot less convenient than just getting a nice trace
> with timestamps).

ohhh I should add pagefaults to timechart
good one.

Eric Paris Sept. 17, 2009, 6:53 p.m. UTC | #39

On Thu, 2009-09-17 at 09:40 -0700, Linus Torvalds wrote:
> 
> On Wed, 16 Sep 2009, Jamie Lokier wrote:
> > 
> > I'd forgotten about Linus' strace argument.  That's a good one.
> > 
> > Of course everything should be a syscall by that argument :-)
> 
> Oh yes, everything _should_ be a syscall.

I rewrote the interface and prototyped out a working fanotify like so:

SYSCALL_DEFINE4(fanotify_init, unsigned int, flags, int, event_f_flags,
		__u64, mask, int, priority)

int flags indicates - things like global or directed, fd's or wd's,
could include fail allow vs fail deny, O_CLOEXEC, O_NONBLOCK, etc
int event_f_flags - flags used when opening an fd for the listener
__u64 mask - in global mode the events of interest
int priority - the order fanotify listeners should be checked (so HSM
		can be before AV scanners)

Do we need a timeout for access decisions?  I left room for that in the
bind address, but we can't just leave room to spare with a syscall...

SYSCALL_DEFINE6(fanotify_modify_mark, int, fanotify_fd,
		unsigned int, flags, int, fd,
		const char  __user *, pathname, __u64, mask,
		__u64, ignored_mask)

int fanotify_fd - duh
int flags - add, remove, flush, events on child, event on subtree?
int fd - either fd to object or fd to dir for relative pathname
const char __user * pathname - either pathname or null if only use fd
__u64 mask - events this inode cares about
__u64 ignored_mask - events this inode does NOT care about

(not yet done, would someone like to comment?)
fanotify_response(int fanotify_fd, __u64 cookie, __u32 response);
__u64 cookie - which of our permission requests we are waiting on
__u32 response - allow, deny, wait longer

Could be done using write(), but I think the strace argument clearly
says that this should be a syscall that can be easily found and reported

(not settled in my mind)
int fanotify_ignore_sb(int fanotify_fd, unsigned int flags,
                       long f_type, fsid_t f_fsid)
int fanotify_fd - duh
unsigned int flags - f_type or fsid?
long f_type - statfs(2) f_type
fsid_t f_fsid - statfs(2) f_fsid

Reads from the fd would return data of this structure:

struct fanotify_event_metadata {
	__u32 event_len;
	__u32 vers;
	__s32 fd;
	__u32 mask;
	__u32 f_flags;
	__s32 pid;
	__s32 uid;
	__s32 tgid;
	__u64 cookie;
}  __attribute__((packed));

Thanks to event_len and vers, we could extend it to include

__u32 filename1_len,
char filename1[filename1_len]
__u32 filename2_len,
char filename2[filename2_len]

This can all take shape as that work is completed and I don't believe
should block merging.

Do my syscalls look pretty enough?  I'm down to 3 or 4.
Jamie, you tend to agree that the interface and the event types are nice
enough that we can build out (if we get the right hooks in the right
places) everywhere we need to go?

-Eric

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andreas Gruenbacher Sept. 17, 2009, 8:07 p.m. UTC | #40

On Wednesday, 16 September 2009 14:17:08 Jamie Lokier wrote:
> Eric Paris wrote:
> > On Wed, 2009-09-16 at 08:52 +0100, Jamie Lokier wrote:
> > > Seriously, what does system-wide fanotify do when run from a
> > > chroot/namespace/cgroup, and a file outside them is accessed?
> >
> > At the moment an fanotify global listener is system wide.  Truely system
> > wide.  A gentleman from suse is looking rectify the problem so that if
> > run inside a namespace it stays inside the namespace.  Note that this
> > particular little tidbit is not in the 8 patches I proposed.  At the
> > moment those just include the UI and basic notification.
>
> I'll be really interested in the gentleman's solution.

I guess Eric meant me.

From my point of view, "global" events make no sense, and fanotify listeners 
should register which directories they are interested in (e.g., include "/", 
exclude "/proc"). This takes care of chroots and namespaces as well.

I think we want to register for events on objects rather than in the 
namespace, i.e., for inodes visible in multiple places because of hardlinks 
or bind mounts, we get the same kinds of events no matter which path is used. 
(The path actually used would still show up in /proc/self/fd/x.) When moving 
registered inodes, the registrations would move with them. This is how 
inotify works, except that inotify watches are not recursive.

The difficulty with this is that in the worst case, this would require walking 
the entire namespace and all cached inodes. I don't see how this could be 
done for two reasons:

 * First, we can't take the vfsmount_lock and dcache_lock for the entire time.

 * Second, we would need to pin almost all the inodes, which is a clear no-go.

   [Why pin?  At least we would need to remember which objects a listener has
    registered interest in, so we need to pin the inodes.  We could still
    allow unregistered directory inodes to be thrown out because we can
    recreate their registration status from the parent. We can't recreate the
    registration status of non-directories because of hardlinks, though.]

The only other idea I could come up with is to only allow recursive 
registrations at mount points: instead of inodes, the vfsmounts would be 
included or excluded (probably automatically including bind mounts). This has 
one big drawback though: users would no longer be able to watch arbitrary 
subtrees anymore. Privileged users could still arrange to watch almost all 
subtrees with bind mounts (mount --bind /foo/bar /foo/bar).

Any ideas?

Thanks,
Andreas
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric Paris Sept. 18, 2009, 8:52 p.m. UTC | #41

On Thu, 2009-09-17 at 22:07 +0200, Andreas Gruenbacher wrote:

> From my point of view, "global" events make no sense, and fanotify listeners 
> should register which directories they are interested in (e.g., include "/", 
> exclude "/proc"). This takes care of chroots and namespaces as well.

While I completely agree that most users don't want global events, the
antimalware vendors who today, unprotect and hack the syscall table on
their unsuspecting customer's machines to intercept every read, write,
open, close, mmap, etc syscall want EXACTLY that.  They'd been asking
for a way to get this information for quite some time now.  The largest
vendors in this market have agreed the interface (well, when it was a
socket interface that I talked about for so long) should meet their
needs.

Subtree watching / isn't any different or better, just harder and more
complex to implement.  You still have to exclude /proc and /sys and
everything else.  Just like one must with a global listener.  Still
though, this sounds like an issue for the f_type and f_fsid exclusion
syscall I say I'm still not settled on.  Not and issue with the basis of
fanotify or with the 3 proposed syscalls.

Jamie, do you see a problem with what I have been asking for review on
or see a problem with extending it moving forward?

Linus, do you see the value of 'yet another notification scheme' ?

-Eric

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andreas Gruenbacher Sept. 18, 2009, 10 p.m. UTC | #42

On Friday, 18 September 2009 22:52:08 Eric Paris wrote:
> On Thu, 2009-09-17 at 22:07 +0200, Andreas Gruenbacher wrote:
> > From my point of view, "global" events make no sense, and fanotify
> > listeners should register which directories they are interested in (e.g.,
> > include "/", exclude "/proc"). This takes care of chroots and namespaces
> > as well.
>
> While I completely agree that most users don't want global events, the
> antimalware vendors who today, unprotect and hack the syscall table on
> their unsuspecting customer's machines to intercept every read, write,
> open, close, mmap, etc syscall want EXACTLY that.

I understand that "global" is what those guys get today for lack of a 
reasonable mechanism, but it's not what anybody can ge given by fanotify: it 
conflicts with filesystem namespaces.

Consider running several "virtual machines" in separate namespaces on the same 
kernel. With "global" you are forced to run the same global fanotify 
listeners everywhere; with per-mount-point listeners, you can choose 
between "global" and something more fine-grained by identifying which 
vfsmounts you are interested in. (Filesystem namespaces correspond to 
vfsmount hierarchies.)

> [...] You still have to exclude /proc and /sys and everything else.

Those are mount points, and so convenient to handle with a per-mount-point 
mechanism. No additional kernel code needed.

> [...] Still though, this sounds like an issue for the f_type and f_fsid
> exclusion syscall I say I'm still not settled on.

Those are also obsolete with a per-mount-point mechanism.

Thanks,
Andreas
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric Paris Sept. 19, 2009, 3:04 a.m. UTC | #43

On Sat, 2009-09-19 at 00:00 +0200, Andreas Gruenbacher wrote:
> On Friday, 18 September 2009 22:52:08 Eric Paris wrote:
> > On Thu, 2009-09-17 at 22:07 +0200, Andreas Gruenbacher wrote:
> > > From my point of view, "global" events make no sense, and fanotify
> > > listeners should register which directories they are interested in (e.g.,
> > > include "/", exclude "/proc"). This takes care of chroots and namespaces
> > > as well.
> >
> > While I completely agree that most users don't want global events, the
> > antimalware vendors who today, unprotect and hack the syscall table on
> > their unsuspecting customer's machines to intercept every read, write,
> > open, close, mmap, etc syscall want EXACTLY that.
> 
> I understand that "global" is what those guys get today for lack of a 
> reasonable mechanism, but it's not what anybody can ge given by fanotify: it 
> conflicts with filesystem namespaces.
> 
> Consider running several "virtual machines" in separate namespaces on the same 
> kernel. With "global" you are forced to run the same global fanotify 
> listeners everywhere; with per-mount-point listeners, you can choose 
> between "global" and something more fine-grained by identifying which 
> vfsmounts you are interested in. (Filesystem namespaces correspond to 
> vfsmount hierarchies.)

Let me start by saying I am agreeing I should pursue subtree
notification.  It's what I think everyone really wants.  It's a great
idea, and I think you might have a simple way to get close.  Clearly
these are avenues I'm willing and hoping to pursue.  Also I say it
again, I believe the interface as proposed (except maybe some of my
exclusion stuff) is flexible enough to implement any of these ideas.
Does anyone disagree?

BUT to solve one of the main problems fanotify is intending to solve it
needs a way to be the 'fscking all notifier.'  It needs to be the whole
damn system.  I totally agree that what I have in my tree today (yet
unposted) restricting global notification (CAP_SYS_ADMIN) is highly
inadequate.  If any root task in any namespace could easily hop on out
of it's namespace using fanotify, that's a problem.  No arguments with
me.

But there must be a way for fanotify to globally get everything.  That's
one of the main points of fanotify.  It needs to be a fscking all
notifier, even of things in a completely detached namespace.  AV vendors
are going to get it.  Their customers our users are going to load kernel
modules that do horrible things.  These are the realities of the world
in which we live.  Do we really throw 10's or 100's of thousands of our
users under the bus because we don't like the software they are using on
philosophical grounds?

I'm sure namespace people are calling me an idiot and tell me to stay in
my namespace.  I want to stay in my namespace for 'most' root users, but
I need a way to get a global scanner.  I want to know what is the sanest
way?  And for people who feel it's insane, just don't compile it in.
I'll make global listeners a build option.  But global listeners is an
absolute requirement.  I was considering saying you needed cap_sys_admin
and you needed current->ns_proxy->mnt_ns == the original init task's
mnt_ns.  Maybe this isn't a great way to determine if a task should be
allowed to use global listeners.  Is there a better way to restrict it?

Think about your web hosting company.  They sell 'cheap' vm's to
customers in a private name.  The web hosting company want to run an AV
scanner that scans every file on the computer, their files, their
customer's files, everything.  Certainly we don't want the customer to
break out of their namespace.  So, what is the sanest, even if you hate
the idea so much you compile it out, way to let the hosting company get
information about files in their customer's detached namespace which not
letting their customers get information about each other?

-Eric

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andreas Gruenbacher Sept. 21, 2009, 8:04 p.m. UTC | #44

On Saturday, 19 September 2009 5:04:31 Eric Paris wrote:
> Let me start by saying I am agreeing I should pursue subtree
> notification.  It's what I think everyone really wants.  It's a great
> idea, and I think you might have a simple way to get close.  Clearly
> these are avenues I'm willing and hoping to pursue.  Also I say it
> again, I believe the interface as proposed (except maybe some of my
> exclusion stuff) is flexible enough to implement any of these ideas.
> Does anyone disagree?

It does seem flexible enough. However, the current interface assumes "global" 
listeners (the mask argument of fanotify_init):

  int fanotify_init(int flags, int f_flags, __u64 mask,
		    unsigned int priority);

Once subtree support is added, this parameter becomes obsolete. That's pretty 
broken for a syscall yet to be introduced.

> BUT to solve one of the main problems fanotify is intending to solve it
> needs a way to be the 'fscking all notifier.'  It needs to be the whole
> damn system.

Think of a system after boot, with a single global namespace. Whatever you 
access by filename is reachable from the namespace root. At this point, 
nothing more global exists. A listener can watch the mount points of 
interest, and everything's fine.

What's a bit more tricky is to ensure that this listener will continue to 
receive all events from whatever else is mounted anywhere, irrespective of 
namespaces. I think we can get there.

By the way, Documentation/filesystems/sharedsubtree.txt describes how 
filesystem namespaces work.

Thanks,
Andreas
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jamie Lokier Sept. 21, 2009, 8:28 p.m. UTC | #45

Andreas Gruenbacher wrote:
> On Saturday, 19 September 2009 5:04:31 Eric Paris wrote:
> > Let me start by saying I am agreeing I should pursue subtree
> > notification.  It's what I think everyone really wants.  It's a great
> > idea, and I think you might have a simple way to get close.  Clearly
> > these are avenues I'm willing and hoping to pursue.  Also I say it
> > again, I believe the interface as proposed (except maybe some of my
> > exclusion stuff) is flexible enough to implement any of these ideas.
> > Does anyone disagree?
> 
> It does seem flexible enough. However, the current interface assumes "global" 
> listeners (the mask argument of fanotify_init):
> 
>   int fanotify_init(int flags, int f_flags, __u64 mask,
> 		    unsigned int priority);
> 
> Once subtree support is added, this parameter becomes obsolete. That's pretty 
> broken for a syscall yet to be introduced.
> 
> > BUT to solve one of the main problems fanotify is intending to solve it
> > needs a way to be the 'fscking all notifier.'  It needs to be the whole
> > damn system.
> 
> Think of a system after boot, with a single global namespace. Whatever you 
> access by filename is reachable from the namespace root. At this point, 
> nothing more global exists. A listener can watch the mount points of 
> interest, and everything's fine.
> 
> What's a bit more tricky is to ensure that this listener will continue to 
> receive all events from whatever else is mounted anywhere, irrespective of 
> namespaces. I think we can get there.

I think so to, and that'd be a great all round solution.

We _have_ to receive mount & umount events to do this.  But even
inotify-style tracking needs those if it's to be accurate, so it's not
an additional burden.

It would be logical if fanotify could block and ack those in the same
way as it can block and ack other accesses (with the usual filtering
rules on which inodes trigger events, and which don't or are cached).

As in to prevent: mount --bind innocent .bash_login, but also to
ensure it always knows what's mounted when another event occurs.

> By the way, Documentation/filesystems/sharedsubtree.txt describes how 
> filesystem namespaces work.

Fortunately, after making a new namespace you can read the mounts in
the new namespace from /proc/self/mount* (I think) without having to
know anything about the shared subtree rules.

So to follow monitoring/checking across all namespaces, it would (I
think) be enough to receive a fanotify "new namespace" event, and Ack
that event to allow the CLONE_NS to proceed.  It's still tricky stuff
though.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andreas Gruenbacher Sept. 21, 2009, 9:27 p.m. UTC | #46

On Monday, 21 September 2009 22:28:23 Jamie Lokier wrote:
> It would be logical if fanotify could block and ack those [mount & umount
> events] in the same way as it can block and ack other accesses (with the
> usual filtering rules on which inodes trigger events, and which don't or are
> cached).

Hmm. To me, fanotify is about file contents first of all: this is what 
fanotify wants to be able to veto. Directory events seem reasonable to add 
for inotify compatibility, but I see no need for access decisions on them. 
Even less so for mounts and unmounts. (Besides, we can't hold any vfs locks 
while asking fanotify so those operations wouldn't be atomic, anyway.)

Thanks,
Andreas
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jamie Lokier Sept. 21, 2009, 10 p.m. UTC | #47

Andreas Gruenbacher wrote:
> On Monday, 21 September 2009 22:28:23 Jamie Lokier wrote:
> > It would be logical if fanotify could block and ack those [mount & umount
> > events] in the same way as it can block and ack other accesses (with the
> > usual filtering rules on which inodes trigger events, and which don't or are
> > cached).
> 
> Hmm. To me, fanotify is about file contents first of all: this is what 
> fanotify wants to be able to veto.

Surely you don't assume that what constitutes malicious content is
independent of it's location and/or name?

(See also "echo 'run_virus&' >>.bash_login).

Wait a minute.  You don't assume that, otherwise why the interest in
subtrees? :-)

> Directory events seem reasonable to add for inotify compatibility,

Did you see may point about userspace caches and how directory events
are fundamental to that - there's no way to build a cache without them?

> but I see no need for access decisions on them. 

Please excuse me; I'm a bit confused.  Is fanotify intended just for
use by access decision programs, or is the plan now for it to also be
a replacement for inotify?  I'm getting conflicting signals about
that.

If it's just for access decision programs, and if those aren't going
to care about location, then there's no need to add directory events
to fanotify at all.  But then I'll be demanding subtree support in
inotify, please :-)

> Even less so for mounts and unmounts.

   (as root) mkdir foo; mount dodgy foo -oloop; mount --bind foo/cat /bin/cat

If fanotify doesn't react to that, which is just a fancy way of saying
"zcat virus.gz >/bin/cat" in a way which doesn't cause any writes or
opens, what's the point in it?  Is fanotify only for checking files
written by non-root users?

> (Besides, we can't hold any vfs locks 
> while asking fanotify so those operations wouldn't be atomic, anyway.)

Indeed, good point.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Davide Libenzi Sept. 21, 2009, 10:18 p.m. UTC | #48

On Mon, 21 Sep 2009, Jamie Lokier wrote:

> I think so to, and that'd be a great all round solution.

If this is for anti-malware vendors to intercept userspace accesses 
they're currently doing it by hacking the syscall table, why don't we 
offer a way to monitor syscalls (kernel side) in a non racy way?
Modules can [un]register themselves for syscall intercaption, and receive 
the syscall number and parameters. They'd be able to change paramters, 
return error codes, and so on.
The cost of the check in the syscall path could even be under an 
alternative-like patching, if really neeeded.
The Pros of this would be:

- The kernel code to implement this would be trivially small, with no 
  I-need-this-feature-too growth potential

- There won't be any externally visible API to maintain (and its kernel 
  counter part) and expand

- Any system call can be intercepted, allowing it to be flexible while 
  leaving the burden of the interception handling, and communication with 
  userspace policy enforcers, to the anti-malware (or whoever really) 
  companies modules

The anti-malware are already doing this (intercepting syscall), they 
already have code for it, and they always did (writing kernel 
modules/drivers, that is) for Windows.



- Davide


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andreas Gruenbacher Sept. 21, 2009, 11:09 p.m. UTC | #49

On Tuesday, 22 September 2009 0:00:02 Jamie Lokier wrote:
> Andreas Gruenbacher wrote:
> > On Monday, 21 September 2009 22:28:23 Jamie Lokier wrote:
> > > It would be logical if fanotify could block and ack those [mount &
> > > umount events] in the same way as it can block and ack other accesses
> > > (with the usual filtering rules on which inodes trigger events, and
> > > which don't or are cached).
> >
> > Hmm. To me, fanotify is about file contents first of all: this is what
> > fanotify wants to be able to veto.
>
> Surely you don't assume that what constitutes malicious content is
> independent of it's location and/or name?

If the antimalware vendors want to base their decisions on pathnames then 
that's their decision, and they can check /proc/self/fd/N. We should be able 
to treat directory events the same.

> (See also "echo 'run_virus&' >>.bash_login).
>
> Wait a minute.  You don't assume that, otherwise why the interest in
> subtrees? :-)
>
> > Directory events seem reasonable to add for inotify compatibility,
>
> Did you see may point about userspace caches and how directory events
> are fundamental to that - there's no way to build a cache without them?

Yes, there were some doubts about this appoach. Waiting for your code to 
demonstrate; an object based cache (e.g., st_dev + st_ino) rather than a 
pathname based cache would seem more reasonable.

> > but I see no need for access decisions on them.
>
> Please excuse me; I'm a bit confused.  Is fanotify intended just for
> use by access decision programs, or is the plan now for it to also be
> a replacement for inotify?  I'm getting conflicting signals about
> that.

Inotify doesn't support access decisions. So where's the problem with 
having "notify only" events for directory / mount / unmount events?

> If it's just for access decision programs, and if those aren't going
> to care about location, then there's no need to add directory events
> to fanotify at all.  But then I'll be demanding subtree support in
> inotify, please :-)
>
> > Even less so for mounts and unmounts.
>
>    (as root) mkdir foo; mount dodgy foo -oloop; mount --bind foo/cat
> /bin/cat

... and then someone accesses /bin/cat, which triggers a fanotify access 
decision.

Thanks,
Andreas
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jamie Lokier Sept. 21, 2009, 11:12 p.m. UTC | #50

Davide Libenzi wrote:
> On Mon, 21 Sep 2009, Jamie Lokier wrote:
> 
> > I think so to, and that'd be a great all round solution.
> 
> If this is for anti-malware vendors

Personally I'm not interested in anti-malware, and am simply
interested in leveraging fsnotify improvements to accelerate userspace
caches of information which depends on files (indexes, templates,
compiler caches, stat caches etc.).  Basically make inotify better,
and sufficiently correct for that purpose.

My sticking my oar in lately is to ensure the fsnotify improvements
are going in the (imho) right direction.  There's a lot of interesting
apps waiting in the wings on this.  It doesn't have to be complicated,
just... sensible.

> to intercept userspace accesses 
> they're currently doing it by hacking the syscall table, why don't we 
> offer a way to monitor syscalls (kernel side) in a non racy way?
> Modules can [un]register themselves for syscall intercaption, and receive 
> the syscall number and parameters. They'd be able to change paramters, 
> return error codes, and so on.
> The cost of the check in the syscall path could even be under an 
> alternative-like patching, if really neeeded.
> The Pros of this would be:
> 
> - The kernel code to implement this would be trivially small, with no 
>   I-need-this-feature-too growth potential

(Fwiw, the {fa,fs,i}notify thing looks to me like it's getting simpler
as we go.  Good design = decrease complexity + increase versatility.
E.g. see epoll.)

> - There won't be any externally visible API to maintain (and its kernel 
>   counter part) and expand
> 
> - Any system call can be intercepted, allowing it to be flexible while 
>   leaving the burden of the interception handling, and communication with 
>   userspace policy enforcers, to the anti-malware (or whoever really) 
>   companies modules
> 
> The anti-malware are already doing this (intercepting syscall), they 
> already have code for it, and they always did (writing kernel 
> modules/drivers, that is) for Windows.

I don't mind at all if fanotify is replaced by a general purpose "take
over the system call table" solution for anti-malware, and I still get
to keep the fsnotify improvements :-)

But I can't help noticing that we _already_ have quite well placed
hooks for intercepting system calls, called security_this and
security_that (SELinux etc), albeit they can't redirect things so much.

However, being a little kinder, I suspect even the anti-malware
vendors would rather not slow down everything with race-prone
complicated tracking of everything every process does...  which is why
fanotify allows it's "interest set" to be reduced from everything to a
subset of files, and it's results to be cached, and let the races be
handled in the normal way by VFS.

Once you have an "interest set" and focus on files, it looks somewhat
reasonable to use the fsnotify hooks.

...That is, if you believe monitoring files is the best approach to
anti-malware.  I can't help noticing that on (ahem) Windows, running
just a "virus checker" which generically scans every file independent
of it's location looking for signatures and keeping up with patches is
no longer considered good enough.

-- Jamie

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jamie Lokier Sept. 21, 2009, 11:56 p.m. UTC | #51

Andreas Gruenbacher wrote:
> If the antimalware vendors want to base their decisions on pathnames then 
> that's their decision, and they can check /proc/self/fd/N.

Race hazards and loopholes.  It doesn't work.

> Waiting for your code to demonstrate; an object based cache (e.g.,
> st_dev + st_ino) rather than a pathname based cache would seem more
> reasonable.

Nearly everything that people do with files involves paths.  The point
is to cache what people (or their programs) do.  Apache does not
consult inodes by number, and rsync does not write inodes by number :-)
Yes, to the code...

> > > but I see no need for access decisions on them.
> >
> > Please excuse me; I'm a bit confused.  Is fanotify intended just for
> > use by access decision programs, or is the plan now for it to also be
> > a replacement for inotify?  I'm getting conflicting signals about
> > that.
> 
> Inotify doesn't support access decisions. So where's the problem with 
> having "notify only" events for directory / mount / unmount events?

No problem here.

You seemed to be saying you want to add directory events to fanotify.
But if fanotify is only intended for access decisions?  Something I
must have misunderstood in that.

> > If it's just for access decision programs, and if those aren't going
> > to care about location, then there's no need to add directory events
> > to fanotify at all.  But then I'll be demanding subtree support in
> > inotify, please :-)
> >
> > > Even less so for mounts and unmounts.
> >
> >    (as root) mkdir foo; mount dodgy foo -oloop; mount --bind foo/cat
> > /bin/cat
> 
> ... and then someone accesses /bin/cat, which triggers a fanotify access 
> decision.

That's fine as long as there was no location-awareness in the logic
which checked foo/innocent.txt and set that inode's "read-ok,cache-me" bit.

Mount only matters if you're sensitive to location.  If you think
location-independent checks make good anti-malware
I_have_a_bridge_to_sell^H^H^H^H^H^H^H^H^H^H^Hfine with me :-)

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric W. Biederman Sept. 22, 2009, 12:15 a.m. UTC | #52

Linus Torvalds <torvalds@linux-foundation.org> writes:

> Quite frankly, I have _never_ever_ seen a good reason for talking to the 
> kernel with some idiotic packet interface. It's just a fancy way to do 
> ioctl's, and everybody knows that ioctl's are bad and evil. Why are fancy 
> packet interfaces suddenly much better?

For working with the networking stack there are a lot of advantages because
netlink is the interface to everything in the network stack.

There are nice things like the packet to create a new interface is the same
packet the kernel sends everyone to report a new interface etc.

netlink also seems to get the structured data thing right.  You can
parse the packet even if you don't understand everything.  Each tag is
well defined like a syscall, taking exactly one kind of argument.
Which avoids the worst failure of ioctl in that you can't even parse
everything, and the argument may be a linked list in the calling
process or something else atrocious.

All of that said syscalls are good, and I would not recommend netlink
to anything not in the network stack.

Eric
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Randy Dunlap Sept. 22, 2009, 12:22 a.m. UTC | #53

On Mon, 21 Sep 2009 17:15:28 -0700 Eric W. Biederman wrote:

> Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
> > Quite frankly, I have _never_ever_ seen a good reason for talking to the 
> > kernel with some idiotic packet interface. It's just a fancy way to do 
> > ioctl's, and everybody knows that ioctl's are bad and evil. Why are fancy 
> > packet interfaces suddenly much better?
> 
> For working with the networking stack there are a lot of advantages because
> netlink is the interface to everything in the network stack.
> 
> There are nice things like the packet to create a new interface is the same
> packet the kernel sends everyone to report a new interface etc.
> 
> netlink also seems to get the structured data thing right.  You can
> parse the packet even if you don't understand everything.  Each tag is
> well defined like a syscall, taking exactly one kind of argument.
> Which avoids the worst failure of ioctl in that you can't even parse
> everything, and the argument may be a linked list in the calling
> process or something else atrocious.
> 
> All of that said syscalls are good, and I would not recommend netlink
> to anything not in the network stack.

like CONFIG_SCSI_NETLINK and CONFIG_QUOTA_NETLINK_INTERFACE  :(


---
~Randy
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Davide Libenzi Sept. 22, 2009, 2:51 p.m. UTC | #54

On Tue, 22 Sep 2009, Jamie Lokier wrote:

> I don't mind at all if fanotify is replaced by a general purpose "take
> over the system call table" solution ...

That was not what I meant ;)
You'd register/unregister as syscall interceptor, receiving syscall number 
and parameters, you'd be able to return status/error codes directly, and 
you'd have the ability to eventually change the parameters. All this 
should be pretty trivial code, and at the same time give full syscall 
visibility to the modules.
The complexity would be left to the interceptors, as they already do it 
today.

> But I can't help noticing that we _already_ have quite well placed
> hooks for intercepting system calls, called security_this and
> security_that (SELinux etc), ...

That has "some" limits WRT non-GPL modules and relative static linkage.

> However, being a little kinder, I suspect even the anti-malware
> vendors would rather not slow down everything with race-prone
> complicated tracking of everything every process does...  which is why
> fanotify allows it's "interest set" to be reduced from everything to a
> subset of files, and it's results to be cached, and let the races be
> handled in the normal way by VFS.

They are already doing it today, since they are forced to literally find 
and hack the syscall table.
They already have things like process whitelists, path whitelists, scan 
caches, and all the whistles, in their code.
Of course, some of them might be interested in pushing given complexity 
inside the kernel, since they won't have to maintain it.
Some other OTOH, might be interested in keeping a syscall-based access, 
since they already have working code based on that abstraction.
The good part of this would be that all the userspace communication API, 
whitelists, caches, etc...  would be left to the module implementors, and 
not pushed inside the kernel.
That, and the flexibility of being able to intercept all the userspace 
entrances into the kernel.

- Davide

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andreas Gruenbacher Sept. 22, 2009, 3:31 p.m. UTC | #55

On Tuesday, 22 September 2009 16:51:39 Davide Libenzi wrote:
> On Tue, 22 Sep 2009, Jamie Lokier wrote:
> > I don't mind at all if fanotify is replaced by a general purpose "take
> > over the system call table" solution ...
>
> That was not what I meant ;)
> You'd register/unregister as syscall interceptor, receiving syscall number
> and parameters, you'd be able to return status/error codes directly, and
> you'd have the ability to eventually change the parameters. All this
> should be pretty trivial code, and at the same time give full syscall
> visibility to the modules.

The fatal flaw of syscall interception is race conditions: you look up a 
pathname in your interception layer; then when you call into the proper 
syscall, the kernel again looks up the same pathname. There is no way to 
guarantee that you end up at the same object in both lookups. The security 
and fsnotify hooks are placed in the appropriate spots to avoid exactly that.

Andreas
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Davide Libenzi Sept. 22, 2009, 4:04 p.m. UTC | #56

On Tue, 22 Sep 2009, Andreas Gruenbacher wrote:

> The fatal flaw of syscall interception is race conditions: you look up a 
> pathname in your interception layer; then when you call into the proper 
> syscall, the kernel again looks up the same pathname. There is no way to 
> guarantee that you end up at the same object in both lookups. The security 
> and fsnotify hooks are placed in the appropriate spots to avoid exactly that.

Fatal? You mean, for this corner case that the anti-malware industry lived 
with for so much time (in Linux and Windows), you're prepared in pushing 
all the logic that is currently implemented into their modules, into the 
kernel?
This includes process whitelisting, path whitelisting, caches, userspace 
access API definition, and so on? On top of providing a generally more 
limited interception.
Why don't we instead offer a lower and broader level of interception, 
letting the users decide if such fatal flaw needs to be addressed or 
not, in their modules?
They get a broader inteception layer, with the option to decide if or if 
not address certain scenarios, and we get less code inside the kernel.
A win/win situation, if you ask me.

- Davide

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric Paris Sept. 22, 2009, 4:11 p.m. UTC | #57

On Tue, 2009-09-22 at 17:31 +0200, Andreas Gruenbacher wrote:
> On Tuesday, 22 September 2009 16:51:39 Davide Libenzi wrote:
> > On Tue, 22 Sep 2009, Jamie Lokier wrote:
> > > I don't mind at all if fanotify is replaced by a general purpose "take
> > > over the system call table" solution ...
> >
> > That was not what I meant ;)
> > You'd register/unregister as syscall interceptor, receiving syscall number
> > and parameters, you'd be able to return status/error codes directly, and
> > you'd have the ability to eventually change the parameters. All this
> > should be pretty trivial code, and at the same time give full syscall
> > visibility to the modules.
> 
> The fatal flaw of syscall interception is race conditions: 

That's not the fatal flaw.  The fatal flaw is that I am not going to
write 90% of a rootkit and make it easy to use.  Not going to happen.
There's a reason we went to the trouble to mark the syscall call RO, we
don't export it, and we don't want people playing with it.  It clearly
would have been the quickest, easiest, and fastest way to make
anti-virus companies happy, but it doesn't really solve a good problem
and it leaves all of us in a worse position than we are today.  Easy !=
Good.

-Eric 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jamie Lokier Sept. 22, 2009, 4:27 p.m. UTC | #58

Eric Paris wrote:
> That's not the fatal flaw.  The fatal flaw is that I am not going to
> write 90% of a rootkit and make it easy to use.

I hate to point out the obvious, but fanotify's ability to intercept
every file access and rewrite the file before the access proceeds is
also 90% of a rootkit...

But fortunately both fanotify and syscall rewriting require root in
the first place.

I think that makes the rootkit argument moot.  As long as fanotify
doesn't have a non-root flavour... which really would be handy for
rootkits :-)

> Easy != Good.

I agree.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric Paris Sept. 22, 2009, 9:06 p.m. UTC | #59

On Mon, 2009-09-21 at 22:04 +0200, Andreas Gruenbacher wrote:
> On Saturday, 19 September 2009 5:04:31 Eric Paris wrote:
> > Let me start by saying I am agreeing I should pursue subtree
> > notification.  It's what I think everyone really wants.  It's a great
> > idea, and I think you might have a simple way to get close.  Clearly
> > these are avenues I'm willing and hoping to pursue.  Also I say it
> > again, I believe the interface as proposed (except maybe some of my
> > exclusion stuff) is flexible enough to implement any of these ideas.
> > Does anyone disagree?
> 
> It does seem flexible enough. However, the current interface assumes "global" 
> listeners (the mask argument of fanotify_init):
> 
>   int fanotify_init(int flags, int f_flags, __u64 mask,
> 		    unsigned int priority);
> 
> Once subtree support is added, this parameter becomes obsolete. That's pretty 
> broken for a syscall yet to be introduced.

Absolutely not obsolete.  Subtree notification cannot do fscking all
notification.  

> > BUT to solve one of the main problems fanotify is intending to solve it
> > needs a way to be the 'fscking all notifier.'  It needs to be the whole
> > damn system.
> 
> Think of a system after boot, with a single global namespace. Whatever you 
> access by filename is reachable from the namespace root. At this point, 
> nothing more global exists. A listener can watch the mount points of 
> interest, and everything's fine.

this is true, if there is only one namespace subtree notification works
the same as global notification.

> What's a bit more tricky is to ensure that this listener will continue to 
> receive all events from whatever else is mounted anywhere, irrespective of 
> namespaces. I think we can get there.

Lets say I want the subtree under / to get every event on the system.  A
process comes along and clones the namespace.  Then lets say that
process mounts something inside his new namespace.  There is absolutely
no path between my / and that new mount.  How can subtree checking
possibly find and indicate it wants notification about this mount?  I
don't see how subtree checking could do it.  There can be completely
disjoint trees with no overlap.

mount -t tmpfs none /to_umount
clone namespace
mount -t tmpfs none /to_umount/private
pivot_root /tmp_umount/private
Something else umounts /to_umount

That process is in an completely detached namespace?  right?

Heck, there could be operations on files that aren't in ANY namespace.

a = open(/path/to/dir/);
umount -l /path/to/
openat(a, "filename");

I don't see how subtree notification can possibly solve the global
notification problem.

I've been thinking that checking CAP_SYS_RAWIO as well as CAP_SYS_ADMIN
might be reasonable when trying to use a global listener.  If you can
CAP_SYS_RAWIO I sorta feel like you can break out of a namespace anyway,
right?

-Eric

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andreas Gruenbacher Sept. 22, 2009, 9:38 p.m. UTC | #60

On Tuesday, 22 September 2009 23:06:16 Eric Paris wrote:
> this is true, if there is only one namespace subtree notification works
> the same as global notification.
>
> [...]
>
> I don't see how subtree notification can possibly solve the global
> notification problem.

I'm thinking of is something like this: A listener registers interest in "/", 
recursively. The kernel sets a FSNOTIFY_WATCH_RECURSIVE flag on "/" and each 
mount point below. Afterwards when something is mounted anywhere, same 
namespace or not, the kernel sets the new mount's FSNOTIFY_WATCH_RECURSIVE 
flag if the parent mount has this flag set.

(Of course we need per fsnotify_group flags and not global ones, but this 
doesn't change the principle.)

Andreas
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Davide Libenzi Sept. 22, 2009, 11:43 p.m. UTC | #61

On Tue, 22 Sep 2009, Jamie Lokier wrote:

> Eric Paris wrote:
> > That's not the fatal flaw.  The fatal flaw is that I am not going to
> > write 90% of a rootkit and make it easy to use.
> 
> I hate to point out the obvious, but fanotify's ability to intercept
> every file access and rewrite the file before the access proceeds is
> also 90% of a rootkit...

Obvious, but worth noticing.
Indeed, the syscall table has been RO to make it harder for RKs to 
exploit it, not to make it impossible. RO syscall table makes perfect 
sense.
But, once you are root, with very few lines of code, you can find, 
prot-fix the page, and patch the table.

> But fortunately both fanotify and syscall rewriting require root in
> the first place.                          ^^^^^^^^^

Again, maybe I wasn't clear about how this would work, but the syscall 
table would continue to remain RO ;)
And as I said before, if we want to bring the cost of the interception for 
non-users pretty close to zero (a few NOPs to run onto), we could even 
adopt an alternative-like patching triggered by a kernel boot option.

- Davide

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Tvrtko Ursulin Sept. 23, 2009, 8:39 a.m. UTC | #62

On Tuesday 22 September 2009 17:04:44 Davide Libenzi wrote:
> On Tue, 22 Sep 2009, Andreas Gruenbacher wrote:
> > The fatal flaw of syscall interception is race conditions: you look up a
> > pathname in your interception layer; then when you call into the proper
> > syscall, the kernel again looks up the same pathname. There is no way to
> > guarantee that you end up at the same object in both lookups. The
> > security and fsnotify hooks are placed in the appropriate spots to avoid
> > exactly that.
>
> Fatal? You mean, for this corner case that the anti-malware industry lived
> with for so much time (in Linux and Windows), you're prepared in pushing
> all the logic that is currently implemented into their modules, into the
> kernel?

Lived with it because there was no other option. We used LSM while it was 
available for modules but then it was taken away. 

And not all vendors even use syscall interception, not even across platforms, 
of which you sound so sure about. You can't even scan something which is not 
in your namespace if you are at the syscall level. And you can't catch things 
like kernel nfsd. No, syscall interception is not really appropriate at all.

Tvrtko
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig Sept. 23, 2009, 11:20 a.m. UTC | #63

On Wed, Sep 23, 2009 at 09:39:33AM +0100, Tvrtko Ursulin wrote:
> Lived with it because there was no other option. We used LSM while it was 
> available for modules but then it was taken away. 
> 
> And not all vendors even use syscall interception, not even across platforms, 
> of which you sound so sure about. You can't even scan something which is not 
> in your namespace if you are at the syscall level. And you can't catch things 
> like kernel nfsd. No, syscall interception is not really appropriate at all.

The "Anti-Malware" industry is just snake oil anyway.  I think the
proper approach to support it is just to add various no-op exports claim
to do something and all the people requiring anti-virus on Linux will be
just as happy with it.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Arjan van de Ven Sept. 23, 2009, 11:32 a.m. UTC | #64

On Wed, 23 Sep 2009 09:39:33 +0100
Tvrtko Ursulin <tvrtko.ursulin@sophos.com> wrote:

> Lived with it because there was no other option. We used LSM while it
> was available for modules but then it was taken away. 

... at which point you could have submitted your LSM module for
inclusion... you'd be the first (and only?) Anti Virus vendor that
would be in the mainline kernel.. speaking of competitive advantage,
coming out of the box in all distributions.

sadly this road hasn't been chosen....

Davide Libenzi Sept. 23, 2009, 3:26 p.m. UTC | #65

On Wed, 23 Sep 2009, Tvrtko Ursulin wrote:

> Lived with it because there was no other option. We used LSM while it was 
> available for modules but then it was taken away. 
> 
> And not all vendors even use syscall interception, not even across platforms, 
> of which you sound so sure about. You can't even scan something which is not 
> in your namespace if you are at the syscall level. And you can't catch things 
> like kernel nfsd. No, syscall interception is not really appropriate at all.

Really?
And *if* namespaces were the problem for the devices you were targeting, 
what prevented you to resolving the object and offering a stream to 
userspace?
In *your* module, hosting at the same time all the other logic required 
for it (caches, whitelists, etc...), instead of pushing this stuff into 
the kernel.
WRT to the "other" system, never said they were using syscall 
interception, if you read carefully. I said that minifilters typically 
sends path names to userspace, which might drive you in the pitfall 
Andreas was describing.

- Davide

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Davide Libenzi Sept. 23, 2009, 3:35 p.m. UTC | #66

On Wed, 23 Sep 2009, hch@infradead.org wrote:

> On Wed, Sep 23, 2009 at 09:39:33AM +0100, Tvrtko Ursulin wrote:
> > Lived with it because there was no other option. We used LSM while it was 
> > available for modules but then it was taken away. 
> > 
> > And not all vendors even use syscall interception, not even across platforms, 
> > of which you sound so sure about. You can't even scan something which is not 
> > in your namespace if you are at the syscall level. And you can't catch things 
> > like kernel nfsd. No, syscall interception is not really appropriate at all.
> 
> The "Anti-Malware" industry is just snake oil anyway.  I think the
> proper approach to support it is just to add various no-op exports claim
> to do something and all the people requiring anti-virus on Linux will be
> just as happy with it.

The fear is that this becomes a trojan horse (no pun intended) for more 
and more hooks and "stuff", driven by we-really-need-this-too and 
we-really-need-that-too. And once something it's in, it's harder to say no, 
under the pressure of offering a "limited solution".
This ws the reason I threw the syscall tracing thing in, so they have a 
low level generic hook, and they cam knock themselves out in their module 
(might need a few exports, but that's about it).

- Davide

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Tvrtko Ursulin Sept. 23, 2009, 3:42 p.m. UTC | #67

On Wednesday 23 September 2009 12:32:32 Arjan van de Ven wrote:
> On Wed, 23 Sep 2009 09:39:33 +0100
>
> Tvrtko Ursulin <tvrtko.ursulin@sophos.com> wrote:
> > Lived with it because there was no other option. We used LSM while it
> > was available for modules but then it was taken away.
>
> ... at which point you could have submitted your LSM module for
> inclusion... you'd be the first (and only?) Anti Virus vendor that
> would be in the mainline kernel.. speaking of competitive advantage,
> coming out of the box in all distributions.
>
> sadly this road hasn't been chosen....

It has, but since what we had wasn't acceptable the road was long and it 
turned into fanotify. Which is not only about anti-malware (hello Christoph) 
so I think this sub-thread is going off-topic.

Tvrtko
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Tvrtko Ursulin Sept. 23, 2009, 3:45 p.m. UTC | #68

On Wednesday 23 September 2009 16:26:49 Davide Libenzi wrote:
> On Wed, 23 Sep 2009, Tvrtko Ursulin wrote:
> > Lived with it because there was no other option. We used LSM while it was
> > available for modules but then it was taken away.
> >
> > And not all vendors even use syscall interception, not even across
> > platforms, of which you sound so sure about. You can't even scan
> > something which is not in your namespace if you are at the syscall level.
> > And you can't catch things like kernel nfsd. No, syscall interception is
> > not really appropriate at all.
>
> Really?
> And *if* namespaces were the problem for the devices you were targeting,
> what prevented you to resolving the object and offering a stream to
> userspace?

You are right, nothing really, we even do it like that today. But what about 
other interested users?

> In *your* module, hosting at the same time all the other logic required
> for it (caches, whitelists, etc...), instead of pushing this stuff into
> the kernel.
> WRT to the "other" system, never said they were using syscall
> interception, if you read carefully. I said that minifilters typically
> sends path names to userspace, which might drive you in the pitfall
> Andreas was describing.

Yeah, you could do something like kauth on OSX, which is I guess similar to 
LSM, which was turned off for out of tree. And now you want to push users of 
fanotify out of tree, so what should it be? In tree bad, out of tree bad? 

Tvrtko
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric Paris Sept. 23, 2009, 3:51 p.m. UTC | #69

On Wed, 2009-09-23 at 13:32 +0200, Arjan van de Ven wrote:
> On Wed, 23 Sep 2009 09:39:33 +0100
> Tvrtko Ursulin <tvrtko.ursulin@sophos.com> wrote:
> 
> > Lived with it because there was no other option. We used LSM while it
> > was available for modules but then it was taken away. 
> 
> ... at which point you could have submitted your LSM module for
> inclusion... you'd be the first (and only?) Anti Virus vendor that
> would be in the mainline kernel.. speaking of competitive advantage,
> coming out of the box in all distributions.

And users would be left in a situation between choosing an LSM which
actually does in provable ways increase security and using an AV
scanner.  Until magic solves the LSM stacking problem (it's been tried,
no magic) I don't think any distro wants AV vendors as LSMs.

-Eric

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Davide Libenzi Sept. 23, 2009, 5:31 p.m. UTC | #70

On Wed, 23 Sep 2009, Tvrtko Ursulin wrote:

> Yeah, you could do something like kauth on OSX, which is I guess similar to 
> LSM, which was turned off for out of tree. And now you want to push users of 
> fanotify out of tree, so what should it be? In tree bad, out of tree bad? 

As I said before, the good of a syscall tracing approach, is that it is a 
completely generic mechanism (extensible for other kind of hooks too), 
with minimal kernel impact, while allowing its module-users to stuff all 
the code they want in the part that it's their responsibility.
So that a "we need this too" gets translated to "just do it in your code", 
instead of a request to add more stuff into the kernel, and maybe altering 
the userspace access interface (which is always painful).

- Davide

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig Sept. 23, 2009, 9:56 p.m. UTC | #71

On Wed, Sep 23, 2009 at 11:51:37AM -0400, Eric Paris wrote:
> And users would be left in a situation between choosing an LSM which
> actually does in provable ways increase security and using an AV
> scanner.

Sounds like a good thing, no?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig Sept. 23, 2009, 9:58 p.m. UTC | #72

On Wed, Sep 23, 2009 at 08:35:18AM -0700, Davide Libenzi wrote:
> The fear is that this becomes a trojan horse (no pun intended) for more 
> and more hooks and "stuff", driven by we-really-need-this-too and 
> we-really-need-that-too. And once something it's in, it's harder to say no, 
> under the pressure of offering a "limited solution".
> This ws the reason I threw the syscall tracing thing in, so they have a 
> low level generic hook, and they cam knock themselves out in their module 
> (might need a few exports, but that's about it).

Replacing idiotify with a saner interface is a good goal.  I just don't
think we should take the stakes of the snake oil industry too serious in
it.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[1/8] networking/fanotify: declare fanotify socket numbers

Commit Message

Comments

Patch