Message ID | 20090911052558.32359.18075.stgit@paris.rdu.redhat.com |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
The patches did apply and build against next-20090910. I wrote a small user- space utility for testing (attached); see how painless the socket interface is. The patches seem to be working well, except that some required functionality is missing still. Currently, the CAP_NET_RAW capability is needed for being able to create watches. This seems too strict to me; I don't see why I shouldn't be able to watch my own files, or files which I have read access to (like inotify). There are some actions like creating hardlinks in directories or removing files which don't trigger events. From a user point of view, I would prefer to receive those events as well. (I notice that it's not easy to to pass file descriptors to listeners for those events.) Thanks, Andreas
On Fri, 2009-09-11 at 16:32 +0200, Andreas Gruenbacher wrote: > The patches did apply and build against next-20090910. I wrote a small user- > space utility for testing (attached); see how painless the socket interface > is. The patches seem to be working well, except that some required > functionality is missing still. > > Currently, the CAP_NET_RAW capability is needed for being able to create > watches. This seems too strict to me; I don't see why I shouldn't be able to > watch my own files, or files which I have read access to (like inotify). I agree completely. However since I don't yet have limits on how many marks a user can add, nor the number of events they can hold in their queue I'm leaving it as root only for the moment. Patches have started to flush out usability by non-root users. > There are some actions like creating hardlinks in directories or removing > files which don't trigger events. From a user point of view, I would prefer to > receive those events as well. (I notice that it's not easy to to pass file > descriptors to listeners for those events.) Yes, a number of event types hook into the vfs in places where we do not have at least a struct path. I'm certainly considering how to move hoooks to get that ability. -Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Eric Paris <eparis@redhat.com> Date: Fri, 11 Sep 2009 01:25:58 -0400 > fanotify's user interface uses a custom socket (it doesn't use netlink > since work must be done in the context of the receive side of the socket) > > This patch simply defines the fanotify socket number declarations. The > actual implementation of the socket is in a later patch. > > Signed-off-by: Eric Paris <eparis@redhat.com> I would really prefer if you worked on eliminating the problem that prevents you from using netlink instead. You're just duplicating tons of netlink functionality only because of this apparent limitation, and that's not good. If you use netlink you'll be able to define arbitrary attributes, they'll be optional and thus you can allow notification application to set filters on those attributes, as well as all sorts of other things. You can reimplement that, but I really would rather see you make netlink suit your needs. This is how we make existing facilities more powerful and useful to consumers, when someone tries to use it in a new way and with new requirements. Thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2009-09-11 at 11:46 -0700, David Miller wrote: > From: Eric Paris <eparis@redhat.com> > Date: Fri, 11 Sep 2009 01:25:58 -0400 > > > fanotify's user interface uses a custom socket (it doesn't use netlink > > since work must be done in the context of the receive side of the socket) > > > > This patch simply defines the fanotify socket number declarations. The > > actual implementation of the socket is in a later patch. > > > > Signed-off-by: Eric Paris <eparis@redhat.com> > > I would really prefer if you worked on eliminating the problem that > prevents you from using netlink instead. I'm not really sure if I can, although I'd love to hear input from someone who knows the netlink code on how I can make it do what I need. I'm really not duplicating much other than the NLMSG_OK and NLMSG_NEXT macros. My code doesn't even use skbs and I'm not savy enough to really know how I could. I'm more than willing to work on it if someone can point me to how it might work. The basic problem is that process (a) opens a file. In the context of process(a) we need to take a reference on the struct path, store some other information about what happened and then wait for the listener, process(b), to run. When process (b) runs it takes the struct path, opens the file it points to, then writes up to userspace the information collected back when (a) was running and the fd opened in (b)'s context. Currently (aka this patch set) I do the storage of information and the waiting for (b) to be ready using the same queuing mechanism as inotify. the fsnotify backend creates a single 'event' which can be added to the queues for multiple notification listeners. So if 50 inotify watches are on the same directory we only create one event. I guess this could be reimplemented to send an skb which contains the information that fsnotify stores in an event rather than just queuing up the generic event. Then I could somehow hook the recv side of netlink to pull the skb off and instead of dumping it's data to the receiver to actually open an fd and rewrite the message. I'm not sure what netlink is buying me here thought. I also don't know the code well enough to know everywhere that an skb can be lost. Since the skb is going to hold a reference to a struct path and contain a pointer to a struct path I certainly can't handle it being dropped. The other major thing that I lose or have a much harder time reimplementing is event merging. After process (a) drops an event on the queue if it does something else to the same file fanotify will search the queue and merge events. I certainly don't know how to search the netlink queue to find old skbs edit them, and all the while make sure I'm doing it race free. I'm willing to try to make this happen, I'm just sure I see the benefit and I don't know anyone who knows the net/netlink code well enough who owuld be interested to help..... -Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Eric Paris wrote: > > I would really prefer if you worked on eliminating the problem that > > prevents you from using netlink instead. > > I'm not really sure if I can, although I'd love to hear input from > someone who knows the netlink code on how I can make it do what I need. > I'm really not duplicating much other than the NLMSG_OK and NLMSG_NEXT > macros. My code doesn't even use skbs and I'm not savy enough to really > know how I could. I'm more than willing to work on it if someone can > point me to how it might work. Let's turn the question around. Since you're doing lots of non-sockety things, and can't tolerate dropped packets - why isn't it a character device? What's the reason for using a socket at all? (I'm reminded of /dev/poll, /dev/epoll and /dev/inotify :-) -- Jamie -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2009-09-11 at 21:46 +0100, Jamie Lokier wrote: > Eric Paris wrote: > > > I would really prefer if you worked on eliminating the problem that > > > prevents you from using netlink instead. > > > > I'm not really sure if I can, although I'd love to hear input from > > someone who knows the netlink code on how I can make it do what I need. > > I'm really not duplicating much other than the NLMSG_OK and NLMSG_NEXT > > macros. My code doesn't even use skbs and I'm not savy enough to really > > know how I could. I'm more than willing to work on it if someone can > > point me to how it might work. > > Let's turn the question around. > > Since you're doing lots of non-sockety things, and can't tolerate > dropped packets - why isn't it a character device? What's the reason > for using a socket at all? > > (I'm reminded of /dev/poll, /dev/epoll and /dev/inotify :-) Originally it was a char device and I was told to use a socket protocol so I could use get/set sockopt rather than ioctl, because ioctl is the devil (even if those aren't THAT much better). The queuing being done using events instead of skbs was done reusing inotify code, reusing network code would be just as good with me. What I really need is a way to convey a pointer from one process to another. That's why I claim loss is not an option, since I'm holding a reference to the pointer I can't have that conveyance disappear under us. If network people want me to get back out of the network system I can go back to a char file with lots of ioctls. I'd love to reuse code, I just don't know what's possible... -Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2009-09-11 at 15:33 -0400, Eric Paris wrote: > > I'm willing to try to make this happen, I'm just sure I see the benefit > and I don't know anyone who knows the net/netlink code well enough who > owuld be interested to help..... > I can help you on the netlink side but maybe we need to define the problem scope properly first. I dont know anything about fanotify, but i then dont need to. Can I re-phrase your problem as follows: You have kernel code which wishes to communicate information to multiple listeners in user space. The listeners register some form of filter (that you term as a monitor) for an object of interest. For a specific filter you wish to send them events when something happens to the object they registered for. Would the above be accurate? One good place to start is the taskstats code in linux. Look at Documentation/accounting in the kernel code. Ive also CCed Balbir who was involved in that work since i think it bears similarity to what you are trying to achieve conceptually. To your other statements: 0) A single skb can be used to send to all the listners.. 1) Netlink messages wont get lost unless the listener is not keeping up and the kernel sending it messages ends up filling its queues. In such a case your event message will be delivered to the 49 other users but not the overloaded one. You can add sequence numbers to the event messages you send to the listeners and any gaps in sequences on received events imply lost events. You can add a mechanism to query your user space kernel when something like that gets lost. 2) Your architecture has to take care of maintaining the state of what you want to deliver. So your editing has nothing to do with skbs. i.e an event happens, you update your state. If you need to send the event to the listeners, you alloc an skb - populate it with the info; multicast it to all the listeners. If something else happens, i would suggest for sake of simplicity you rinse and repeat. Sure, the listener may get contradicting events - but they should be able to handle it. I hope this helps... cheers, jamal -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Eric Paris wrote: > On Fri, 2009-09-11 at 21:46 +0100, Jamie Lokier wrote: > > Eric Paris wrote: > > > > I would really prefer if you worked on eliminating the problem that > > > > prevents you from using netlink instead. > > > > > > I'm not really sure if I can, although I'd love to hear input from > > > someone who knows the netlink code on how I can make it do what I need. > > > I'm really not duplicating much other than the NLMSG_OK and NLMSG_NEXT > > > macros. My code doesn't even use skbs and I'm not savy enough to really > > > know how I could. I'm more than willing to work on it if someone can > > > point me to how it might work. > > > > Let's turn the question around. > > > > Since you're doing lots of non-sockety things, and can't tolerate > > dropped packets - why isn't it a character device? What's the reason > > for using a socket at all? > > > > (I'm reminded of /dev/poll, /dev/epoll and /dev/inotify :-) > > Originally it was a char device and I was told to use a socket protocol > so I could use get/set sockopt rather than ioctl, because ioctl is the > devil (even if those aren't THAT much better). > > The queuing being done using events instead of skbs was done reusing > inotify code, reusing network code would be just as good with me. What > I really need is a way to convey a pointer from one process to another. > That's why I claim loss is not an option, since I'm holding a reference > to the pointer I can't have that conveyance disappear under us. It's fine as long as the disappearing knows to releas the reference. But I suspect fanotify would be awfully hard to use if messages were unreliable. > If network people want me to get back out of the network system I can go > back to a char file with lots of ioctls. I'd love to reuse code, I just > don't know what's possible... Ok. I understand you're pushed in different directions by different schools of thought. Let's look at some history. What happened to /dev/epoll. It worked very well (and several OSes have /dev/poll which is similar). There was no technical reason to change the interface. But when it came to mainlining it, Linus objected, and forced it to become a small set of system calls. It's quite a nice interface to use now. Then /dev/inotify. You know what happened. The history was similar: Linux objected to the device, and forced it to use a few system calls. More recently, people skipped over the /dev path, having seen how it went before, and just implemented things like timerfd, eventfd and signalfd system calls. That seems to be the Linux way - if the interface can be exposed as a small set of sensible system calls, and it's really a core kernel facility. Does fanotify need "lots of ioctls", or could it fit comfortably into say 2-5 strongly typed system calls, like inotify and epoll do? -- Jamie -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
jamal wrote: > 1) Netlink messages wont get lost unless the listener is not keeping up > and the kernel sending it messages ends up filling its queues. In such a > case your event message will be delivered to the 49 other users but > not the overloaded one. You can add sequence numbers to the event > messages you send to the listeners and any gaps in sequences on received > events imply lost events. You can add a mechanism to query your user > space kernel when something like that gets lost. One of the uses of fanotify is as a security or auditing mechanism. That can't tolerate gaps. It's fundemantally different from inotify in one important respect: inotify apps can recover from losing events by checking what they are watching. The fanotify application will know that it missed events, but what happens to the other application which _caused_ those events? Does it get to do things it shouldn't, or hide them from the fanotify app, by simply overloading the system? Or the opposite, does it get access denied - spurious file errors when the system is overloaded? There's no way to handle that by dropping events. A transport mechanism can be dropped (say skbs), but the event itself has to be kept, and then retried. Since you have to keep an event object around until it's handled, there's no point tying it to an unreliable delivery mechanism which you'd have to wrap a retry mechanism around. In other words, that part of netlink is a poor match. It would match inotify much better. > 2) Your architecture has to take care of maintaining the state of what > you want to deliver. So your editing has nothing to do with skbs. > i.e an event happens, you update your state. If you need to send the > event to the listeners, you alloc an skb - populate it with the info; > multicast it to all the listeners. If something else happens, i would > suggest for sake of simplicity you rinse and repeat. Sure, the listener > may get contradicting events - but they should be able to handle it. Speaking of skbs, how fast and compact are they for this? Eric's explained that it would be normal for _every_ file operation on some systems to trigger a fanotify event and possibly wait on the response, or at least in major directory trees on the filesystem. Even if it's just for the fanotify app to say "oh I don't care about that file, carry on". File performance is one of those things which really needs to be fast for a good user experience - and it's not unusual to grep the odd 10,000 files here or there (just think of what a kernel developer does), or to replace a few thousand quickly (rpm/dpkg) and things like that. While skbs and netlink aren't that slow, I suspect they're an order of magnitude or two slower than, say, epoll or inotify at passing events around. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2009-09-11 at 22:27 +0100, Jamie Lokier wrote: > Eric Paris wrote: > > On Fri, 2009-09-11 at 21:46 +0100, Jamie Lokier wrote: > > > Eric Paris wrote: > > > > > I would really prefer if you worked on eliminating the problem that > > > > > prevents you from using netlink instead. > > > > > > > > I'm not really sure if I can, although I'd love to hear input from > > > > someone who knows the netlink code on how I can make it do what I need. > > > > I'm really not duplicating much other than the NLMSG_OK and NLMSG_NEXT > > > > macros. My code doesn't even use skbs and I'm not savy enough to really > > > > know how I could. I'm more than willing to work on it if someone can > > > > point me to how it might work. > > > > > > Let's turn the question around. > > > > > > Since you're doing lots of non-sockety things, and can't tolerate > > > dropped packets - why isn't it a character device? What's the reason > > > for using a socket at all? > > > > > > (I'm reminded of /dev/poll, /dev/epoll and /dev/inotify :-) > > > > Originally it was a char device and I was told to use a socket protocol > > so I could use get/set sockopt rather than ioctl, because ioctl is the > > devil (even if those aren't THAT much better). > > > > The queuing being done using events instead of skbs was done reusing > > inotify code, reusing network code would be just as good with me. What > > I really need is a way to convey a pointer from one process to another. > > That's why I claim loss is not an option, since I'm holding a reference > > to the pointer I can't have that conveyance disappear under us. > > It's fine as long as the disappearing knows to releas the reference. Absolutely. > But I suspect fanotify would be awfully hard to use if messages were > unreliable. For some things yes, some things no. I'd have to understand where loss can happen to know if it's feasible. If I know loss happens in the sender context that's great. If it's somewhere in the middle and the sender doesn't immediately know it'll never be delivered, yes, I don't think it can solve all my needs. How many places can and skb get lost between the sender and the receiver? > Does fanotify need "lots of ioctls", or could it fit comfortably into > say 2-5 strongly typed system calls, like inotify and epoll do? Certainly not a lot. I do it now with 5 setsockopt calls. I could do it with a limited number of syscalls (what I have today in about 4) but I still don't believe I know all the future use cases or how to define them in syscalls. People are still coming up with things they want inotify to do but we can't implement them in the syscalls we have. I think we all know it's a hell of a lot easier to implement a new setsockopt call than a new syscall. I don't think I know everything everyone is going to want out of a new notification system yet, so I'm reticent to go down that path. I know how to solve some of the suck of inotify, I know how to solve a number of people's problems (and have all the patches to do it) but when it comes to the organic nature of things and what I hope to be the growing future of fanotify to solve everything for everyone, I don't think I can define it all yet. I just don't want to end up with another inotify, it works, the syscalls do what they do very very well, but people want new things out of it, and we can't reasonably give it to them. Maybe syscalls are the right thing, and I know I can solve a lot of use cases with syscalls, but I don't even know if I know all the use cases. -Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2009-09-11 at 22:42 +0100, Jamie Lokier wrote: > One of the uses of fanotify is as a security or auditing mechanism. > That can't tolerate gaps. > > It's fundemantally different from inotify in one important respect: > inotify apps can recover from losing events by checking what they are > watching. > > The fanotify application will know that it missed events, but what > happens to the other application which _caused_ those events? Does it > get to do things it shouldn't, or hide them from the fanotify app, by > simply overloading the system? Or the opposite, does it get access > denied - spurious file errors when the system is overloaded? > > There's no way to handle that by dropping events. A transport > mechanism can be dropped (say skbs), but the event itself has to be > kept, and then retried. > > > Since you have to keep an event object around until it's handled, > there's no point tying it to an unreliable delivery mechanism which > you'd have to wrap a retry mechanism around. > > In other words, that part of netlink is a poor match. It would match > inotify much better. > Reliability is something that you should build in. Netlink provides you all the necessary tools. What you are asking for here is essentially reliable multicasting. You dont have infinite memory, therefore there will be times when you will overload one of the users, and they wont have sufficient buffer space and then you have to retransmit. You have to factor in processing speed mismatch between the different listeners. As an example, you could ensure that all users receive the message and if user #49 didnt, then you wait until they do before multicasting the next message to all 50 listeners. Is the current proposed mechanism capable of reliably multicasting without need for retransmit? > Speaking of skbs, how fast and compact are they for this? They are largish relative to say if you trimmed down to basic necessity. But then you get a lot of the buffer management aspects for free. In this case, the concept of multicasting is built in so for one event to be sent to X users - you only need one skb. > Eric's explained that it would be normal for _every_ file operation on > some systems to trigger a fanotify event and possibly wait on the > response, or at least in major directory trees on the filesystem. > Even if it's just for the fanotify app to say "oh I don't care about > that file, carry on". > That doesnt sound very scalable. Should it not be you get nothing unless you register for interest in something? > File performance is one of those things which really needs to be fast > for a good user experience - and it's not unusual to grep the odd > 10,000 files here or there (just think of what a kernel developer > does), or to replace a few thousand quickly (rpm/dpkg) and things like > that. > So grepping 10000 files would cause 10000 events? I am not sure how the scheme works; filtering of what events get delivered sounds more reasonable if it happens in the kernel. > While skbs and netlink aren't that slow, I suspect they're an order of > magnitude or two slower than, say, epoll or inotify at passing events > around. > not familiar with inotify. Theres a difference between events which are abbreviated in the form "hey some read happened on fd you are listening on" vs "hey a read of file X for 16 bytes at offset 200 by process Y just occured while at the same time process Z was writting at offset 2000". The later (which netlink will give you) includes a lot more attribute details which could be filtered or can be extended to include a lot more. The former(what epoll will give you) is merely a signal. cheers, jamal -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Sep 11, 2009 at 05:51:42PM -0400, Eric Paris (eparis@redhat.com) wrote: > For some things yes, some things no. I'd have to understand where loss > can happen to know if it's feasible. If I know loss happens in the > sender context that's great. If it's somewhere in the middle and the > sender doesn't immediately know it'll never be delivered, yes, I don't > think it can solve all my needs. How many places can and skb get lost > between the sender and the receiver? When queue is full or you do not have enough RAM. Both are reported at 'sending' time. As of your description of netlink/socket usage - you will have to peek skb queue, which is rather error-prone operation. Also you will have to implement own skb destructor to mess with private reference counters and netlink bits.
On Fri, Sep 11, 2009 at 10:42:54PM +0100, Jamie Lokier (jamie@shareable.org) wrote: > While skbs and netlink aren't that slow, I suspect they're an order of > magnitude or two slower than, say, epoll or inotify at passing events > around. Skbs are those tiny bits which allow 100 Gbit/s forwarding between multiple interfaces. But of course it can not be as fast as plain memory copy :) Having one skb allocation per IO syscall will be challenging but after all we have this for send/recv calls and achieve high performance numbers. Idea with merging events in the same skb will be also very non-trivial.
jamal wrote: > On Fri, 2009-09-11 at 22:42 +0100, Jamie Lokier wrote: > > > One of the uses of fanotify is as a security or auditing mechanism. > > That can't tolerate gaps. > > > > It's fundemantally different from inotify in one important respect: > > inotify apps can recover from losing events by checking what they are > > watching. > > > > The fanotify application will know that it missed events, but what > > happens to the other application which _caused_ those events? Does it > > get to do things it shouldn't, or hide them from the fanotify app, by > > simply overloading the system? Or the opposite, does it get access > > denied - spurious file errors when the system is overloaded? > > > > There's no way to handle that by dropping events. A transport > > mechanism can be dropped (say skbs), but the event itself has to be > > kept, and then retried. > > > > > > Since you have to keep an event object around until it's handled, > > there's no point tying it to an unreliable delivery mechanism which > > you'd have to wrap a retry mechanism around. > > > > In other words, that part of netlink is a poor match. It would match > > inotify much better. > > Reliability is something that you should build in. Netlink provides you > all the necessary tools. What you are asking for here is essentially > reliable multicasting. Almost. It's reliable multicasting plus unicast responses which must be waited for. That changes things. > You dont have infinite memory, therefore there > will be times when you will overload one of the users, and they wont > have sufficient buffer space and then you have to retransmit. If you have enough memory to remember _what_ to retransmit, then you have enough memory to buffer a fixed-size message. It just depends on how you do the buffering. To say netlink drops the message and you can retry is just saying that the buffering is happening one step earlier, before netlink. That's what I mean by netlink being a pointless complication for this, because you can just as easily write code which gets to the message to userspace without going through netlink and with no chance of it being dropped. > Is the current proposed mechanism capable of reliably multicasting > without need for retransmit? Yes. It uses positive acknowledge and flow control, because these match naturally with what fanotify does at the next higher level. The process generating the multicast (e.g. trying to write a file) is blocked until the receiver gets the message, handles it and acknowledges with a "yes you can" or "no you can't" response. That's part of fanotify's design. The pattern conveniently has no issues with using unbounded memory for message, because the sending process is blocked. > > Speaking of skbs, how fast and compact are they for this? > > They are largish relative to say if you trimmed down to basic necessity. > But then you get a lot of the buffer management aspects for free. > In this case, the concept of multicasting is built in so for one event > to be sent to X users - you only need one skb. True you only need one skb. But netlink doesn't handle waiting for positive acknowledge responses from every receiver, and combining their value, does it? You can't really take advantage of netlink's built in multicast, because to known when it has all the responses, the fanotify layer has to track the subscriber list itself anyway. What I'm saying is perhaps skbs are useful for fanotify, but I don't know that netlink's multicasting is useful. But storing the messages in skbs for transmission, and using parts of netlink to manage them, and to provide some of the API, that might be useful. > > Eric's explained that it would be normal for _every_ file operation on > > some systems to trigger a fanotify event and possibly wait on the > > response, or at least in major directory trees on the filesystem. > > Even if it's just for the fanotify app to say "oh I don't care about > > that file, carry on". > > > > That doesnt sound very scalable. Should it not be you get nothing unless > you register for interest in something? You do get nothing unless you register interest. The problem is there's no way to register interest on just a subtree, so the fanotify approach is let you register for events on the whole filesystem, and let the userspace daemon filter paths. At least it's decisions can be cached, although I'm not sure how that works when multiple processes want to monitor overlapping parts of the filesystem. It doesn't sound scalable to me, either, and that's why I don't like this part, and described a solution to monitoring subtrees - which would also solve the problem for inotify. (Both use fsnotify under the hood, and that's where subtree notification would go). Eric's mentioned interest in a way to monitor subtrees, but that hasn't gone anywhere as far as I know. He doesn't seem convinced by my solution - or even that scalability will be an issue. I think there's a bit of vision lacking here, and I'll admit I'm more interested in the inotify uses of fsnotify (being able to detect changes) than the fanotify uses (being able to _block_ or _modify_ changes). I think both inotify and fanotify ought to benefit from the same improvements to file monitoring. > > File performance is one of those things which really needs to be fast > > for a good user experience - and it's not unusual to grep the odd > > 10,000 files here or there (just think of what a kernel developer > > does), or to replace a few thousand quickly (rpm/dpkg) and things like > > that. > > > > So grepping 10000 files would cause 10000 events? I am not sure how the > scheme works; filtering of what events get delivered sounds more > reasonable if it happens in the kernel. I believe it would cause 10000 events, yes, even if they are files that userspace policy is not interested in. Eric, is that right? However I believe after the first grep, subsequent greps' decisions would be cached by marking the inodes. I'm not sure what happens if two fanotify monitors both try marking the inodes. Arguably if a fanotify monitor is running before those files are in page cache anyway, then I/O may dominate, and when the files are cached, fanotify has already cached it's decisions in the kernel. However fanotify is synchronous: each new file access involves a round trip to the fanotify userspace and back before it can proceed, so there's quite a lot of IPC and scheduling too. Without testing, it's hard to guess how it'll really perform. > > While skbs and netlink aren't that slow, I suspect they're an order of > > magnitude or two slower than, say, epoll or inotify at passing events > > around. > > not familiar with inotify. inotify is like dnotify, and like a signal or epoll: a message that something happened. You register interest in individual files or directories only, and inotify does not (yet) provide a way to monitor the whole filesystem or a subtree. fanotify is different: it provides access control, and can _refuse_ attempts to read file X, or even modify the file before permitting the file to be read. > Theres a difference between events which are abbreviated in the form > "hey some read happened on fd you are listening on" vs "hey a read > of file X for 16 bytes at offset 200 by process Y just occured while > at the same time process Z was writting at offset 2000". The later > (which netlink will give you) includes a lot more attribute details > which could be filtered or can be extended to include a lot > more. The former(what epoll will give you) is merely a signal. Firstly, it's really hard to retain the ordering of userspace events like that in a useful way, given the non-determinstic parallelism going on with multiple processes doing I/O do the same file :-) Second, you can't really pump messages with that much detail into netlink and let _it_ filter them to userspace; that would be too much processing. You'd have to have some way of not generating that much detail except when it's been requested, and preferably only for files you want it for. But this part is irrelevant to fanotify, because there's no plan or intention to provide that much detail about I/O. If you want, feel free to provide a stracenotify subsystem to track everything in detail :-) -- Jamie -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov wrote: > On Fri, Sep 11, 2009 at 05:51:42PM -0400, Eric Paris (eparis@redhat.com) wrote: > > For some things yes, some things no. I'd have to understand where loss > > can happen to know if it's feasible. If I know loss happens in the > > sender context that's great. If it's somewhere in the middle and the > > sender doesn't immediately know it'll never be delivered, yes, I don't > > think it can solve all my needs. How many places can and skb get lost > > between the sender and the receiver? > > When queue is full or you do not have enough RAM. Both are reported at > 'sending' time. Can you ->poll() and wait reliably until the queue will accept an skb? (A few spurious EAGAINs/ENOBUFs is ok, as long as it's not the norm). -- Jamie -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2009-09-14 at 01:03 +0100, Jamie Lokier wrote: > jamal wrote: > > On Fri, 2009-09-11 at 22:42 +0100, Jamie Lokier wrote: First let me state the 2 main new things fanotify gives so neither is lost. #1 fanotify implements the same basic thing as inotify except rather than an arbitrary number (inotify speak a watch descriptor) which userspace has to somehow convert back into a file, fanotify gives userspace an open file descriptor to that object. (this is the part that requires recv side processing) #2 fanotify allows the userspace 'listener' or 'group' (I use group to describe it's actions in kernel and listener to describe it's action in userspace) to request that it be allowed to arbitrate open and access (read) security decisions. > > > Eric's explained that it would be normal for _every_ file operation on > > > some systems to trigger a fanotify event and possibly wait on the > > > response, or at least in major directory trees on the filesystem. > > > Even if it's just for the fanotify app to say "oh I don't care about > > > that file, carry on". > > > > > > > That doesnt sound very scalable. Should it not be you get nothing unless > > you register for interest in something? > > You do get nothing unless you register interest. The problem is > there's no way to register interest on just a subtree, so the fanotify > approach is let you register for events on the whole filesystem, and > let the userspace daemon filter paths. At least it's decisions can be > cached, although I'm not sure how that works when multiple processes > want to monitor overlapping parts of the filesystem. fanotify provides 3 options to register. 1) this inode 2) this dir and it's children 3) all files on the whole fscking system this patch only does #1 and #2. After it's in I'm going to take a serious look at #4 subtrees. Responses to access decisions are cached and checked in kernel per fanotify listener. So listener 1 can ignore requests for a given inode while listener 2 still gets notification and forces the original process to block. > It doesn't sound scalable to me, either, and that's why I don't like > this part, and described a solution to monitoring subtrees - which > would also solve the problem for inotify. (Both use fsnotify under > the hood, and that's where subtree notification would go). Subtree checking hasn't seen and work from me but it is something I plan to work on. And one thing that makes me scared to tie myself to syscalls when I already have something that works relatively cleanly and easily. > Eric's mentioned interest in a way to monitor subtrees, but that > hasn't gone anywhere as far as I know. He doesn't seem convinced by > my solution - or even that scalability will be an issue. I think > there's a bit of vision lacking here, and I'll admit I'm more > interested in the inotify uses of fsnotify (being able to detect > changes) than the fanotify uses (being able to _block_ or _modify_ > changes). I think both inotify and fanotify ought to benefit from the > same improvements to file monitoring. sort of agree with you here. anything that gets added to support subtrees would have to be in the generic code. Although I question how inotify could be used, as a wd is not (in my mind) a reasonable way to tell userspace about files. (and with subtrees it would be a wd and a pathname....) I think fanotify with notification only (what I'm giving in this patch series is a much better fit for subtree notification. > > > File performance is one of those things which really needs to be fast > > > for a good user experience - and it's not unusual to grep the odd > > > 10,000 files here or there (just think of what a kernel developer > > > does), or to replace a few thousand quickly (rpm/dpkg) and things like > > > that. > > > > > > > So grepping 10000 files would cause 10000 events? I am not sure how the > > scheme works; filtering of what events get delivered sounds more > > reasonable if it happens in the kernel. > > I believe it would cause 10000 events, yes, even if they are files > that userspace policy is not interested in. Eric, is that right? If fanotify wants it, yes, that's exactly what happens. > However I believe after the first grep, subsequent greps' decisions > would be cached by marking the inodes. I'm not sure what happens if > two fanotify monitors both try marking the inodes. Each can mark individually. > Arguably if a fanotify monitor is running before those files are in > page cache anyway, then I/O may dominate, and when the files are > cached, fanotify has already cached it's decisions in the kernel. > However fanotify is synchronous: each new file access involves a round > trip to the fanotify userspace and back before it can proceed, so > there's quite a lot of IPC and scheduling too. Without testing, it's > hard to guess how it'll really perform. As I recall my old old tests on a 32 way system showed around a 10% performance penalty when building a kernel when making userspace arbitrate decisions and the cache was blank. So yes, there is a serious performance hit to making a userspace application control access decisions. Then again, I'd rather not have those people who need this system wide access controls to do it in the kernel (anti-malware vendors) I believe that people who chose to use this interface will have to realize there is a severe up front performance penalty. On a steady state system like a web server you'd see near 0% performance (a new srcu lock, inode->i_lock, and running a short list) But yes, controling access to every file on a system eats performance, that's the nature of the beast. > > Theres a difference between events which are abbreviated in the form > > "hey some read happened on fd you are listening on" vs "hey a read > > of file X for 16 bytes at offset 200 by process Y just occured while > > at the same time process Z was writting at offset 2000". The later > > (which netlink will give you) includes a lot more attribute details > > which could be filtered or can be extended to include a lot > > more. The former(what epoll will give you) is merely a signal. > But this part is irrelevant to fanotify, because there's no plan or > intention to provide that much detail about I/O. We have ZERO plan to include ordering. ZERO. inotify sorta pretends it deals with ordering by only dropping a notification if it is the same as the last one in the queue. fanotify will gladly merge events which exist anywhere in the queue clearly throwing ordering to the wind. We do plan to include the pid, uid, and gid of the process making the original request. We also plan to include the f_flags of the file in the original process when possible. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2009-09-14 at 01:03 +0100, Jamie Lokier wrote: > If you have enough memory to remember _what_ to retransmit, then you > have enough memory to buffer a fixed-size message. It just depends on > how you do the buffering. To say netlink drops the message and you > can retry is just saying that the buffering is happening one step > earlier, before netlink. it is the receiver that drops the message because of overruns e.g. when receiver doesnt keep up.. > That's what I mean by netlink being a > pointless complication for this, because you can just as easily write > code which gets to the message to userspace without going through > netlink and with no chance of it being dropped. > Sure you can do that with netlink too. Whether it is overcomplicated needs to be weighed out. > Yes. It uses positive acknowledge and flow control, because these > match naturally with what fanotify does at the next higher level. > > The process generating the multicast (e.g. trying to write a file) is > blocked until the receiver gets the message, handles it and > acknowledges with a "yes you can" or "no you can't" response. > > That's part of fanotify's design. The pattern conveniently has no > issues with using unbounded memory for message, because the sending > process is blocked. > Ok, I understand better i think;-> So it is a synchronous type of operation whereas in netlink type multicast, the optimization is to make the operation async. > True you only need one skb. But netlink doesn't handle waiting for > positive acknowledge responses from every receiver, and combining > their value, does it? It is not netlink perse. It is how you use netlink for your app. Classical one-to-many operations have the sender (kernel mostly) do async sends to the listeners. It is up to the listener to catch up if there are any holes. But this seems not what you want for fanotify. > You can't really take advantage of netlink's > built in multicast, because to known when it has all the responses, > the fanotify layer has to track the subscriber list itself anyway. True, given my understanding so far fanotify has to track the subscriber list. i.e something along the lines of: - send a single multicast message to a set of listeners - wait for response from all subscribers - if no response from all subscribers given timeout then retransmit upto max retransmit times The chance of loosing a message in such a case is zero if the socket buffer on each listener/receiver is larger than one fanotify event message. You still have to alloc the message - and that may fail. > What I'm saying is perhaps skbs are useful for fanotify, but I don't > know that netlink's multicasting is useful. But storing the messages > in skbs for transmission, and using parts of netlink to manage them, > and to provide some of the API, that might be useful. The multicast-a-single-skb part is useful. Of course, its usefulness diminishes as the number of listeners per subtree goes down (because it reduces to one skb per listener). So all this depends on how fanotify is going to be used. The one thing i am not sure of is how you map a multicast group to a subtree. In netlink groups to which multiple listeners subscribe to are 32-bit identifiers. I suppose, one approach could be to register for the event of interest, get an ID then use the ID to listen to a multicast group of that ID. This way whoever is issuing the ID can also factor in permissions and subtree overlap of the listener (and whether the events are already being listened to in a known ID). Alternatively, to your statement above, if fanotify is keeping track of all subsribers then it can replicast a single event instead and just bump the refcount on the skb for each sent-to-user (and still use one skb).. > You do get nothing unless you register interest. The problem is > there's no way to register interest on just a subtree, so the fanotify > approach is let you register for events on the whole filesystem, and > let the userspace daemon filter paths. At least it's decisions can be > cached, although I'm not sure how that works when multiple processes > want to monitor overlapping parts of the filesystem. I guess if the non-optimal part happens only once and subsequent cached filters happen faster, then one could look at that as cost of setup. I think, given that you are capable of creating such a cache, seems that it would be cheaper to make such decision at registration time. > It doesn't sound scalable to me, either, and that's why I don't like > this part, and described a solution to monitoring subtrees - which > would also solve the problem for inotify. (Both use fsnotify under > the hood, and that's where subtree notification would go). > > Eric's mentioned interest in a way to monitor subtrees, but that > hasn't gone anywhere as far as I know. He doesn't seem convinced by > my solution - or even that scalability will be an issue. I think > there's a bit of vision lacking here, and I'll admit I'm more > interested in the inotify uses of fsnotify (being able to detect > changes) than the fanotify uses (being able to _block_ or _modify_ > changes). I think both inotify and fanotify ought to benefit from the > same improvements to file monitoring. > The subtree overlap problem seems to invoke some well known computer science algorithms, no? i.e tell me oracle given the event on nodeX of this tree, which subscriber needs to be notified? > I believe it would cause 10000 events, yes, even if they are files > that userspace policy is not interested in. Eric, is that right? > > However I believe after the first grep, subsequent greps' decisions > would be cached by marking the inodes. I'm not sure what happens if > two fanotify monitors both try marking the inodes. > > Arguably if a fanotify monitor is running before those files are in > page cache anyway, then I/O may dominate, and when the files are > cached, fanotify has already cached it's decisions in the kernel. > However fanotify is synchronous: each new file access involves a round > trip to the fanotify userspace and back before it can proceed, so > there's quite a lot of IPC and scheduling too. Without testing, it's > hard to guess how it'll really perform. > So if you can mark inodes, why not do it at register time? > > > While skbs and netlink aren't that slow, I suspect they're an order of > > > magnitude or two slower than, say, epoll or inotify at passing events > > > around. > > > > not familiar with inotify. > > inotify is like dnotify, and like a signal or epoll: a message that > something happened. You register interest in individual files or > directories only, and inotify does not (yet) provide a way to monitor > the whole filesystem or a subtree. > > fanotify is different: it provides access control, and can _refuse_ > attempts to read file X, or even modify the file before permitting the > file to be read. > Ok, I think i understood more about fanotify now. It is more of an access control than a mass notification scheme (which is what i thought of earlier). Hrm, it does sound like something closer to selinux if it is simple enough to require answers to simple questions like "should this operation continue?" > > Theres a difference between events which are abbreviated in the form > > "hey some read happened on fd you are listening on" vs "hey a read > > of file X for 16 bytes at offset 200 by process Y just occured while > > at the same time process Z was writting at offset 2000". The later > > (which netlink will give you) includes a lot more attribute details > > which could be filtered or can be extended to include a lot > > more. The former(what epoll will give you) is merely a signal. > > Firstly, it's really hard to retain the ordering of userspace events > like that in a useful way, given the non-determinstic parallelism > going on with multiple processes doing I/O do the same file :-) > Bad example ;-> That was not meant to be anything clever - rather to demonstrate that netlink allows you to send many attributes with events and that you can add as many as you want over a period of time (instead of hardcoding it at design/coding time). On a tangent: I would love to get more than simple events (read/write/exception) on a file. Probably more on the writes than on reads; example "offset X, length Y has been deleted" etc. I would still love the option to exercise my rights to simple events like read/write/exception cheers, jamal -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Sep 14, 2009 at 01:17:59AM +0100, Jamie Lokier (jamie@shareable.org) wrote: > > When queue is full or you do not have enough RAM. Both are reported at > > 'sending' time. > > Can you ->poll() and wait reliably until the queue will accept an skb? > (A few spurious EAGAINs/ENOBUFs is ok, as long as it's not the norm). Not that simple and for memory allocation error just can't. There is no direct access to remote peer sockets, i.e. to userspace ones, so one will have to lock netlink table and run over listeners and check whether they can accept the message and wait/poll for queue size to become big enough. Netlink table and its locking is not exported, so effectively there is no simple way to do this.
Long ago I implemented fanotify as basically a /dev interface using ioctls(). Alan suggested I use a socket protocol and could then make use of get/setsockopt() which although still not great is light years better than ioctl. Currently the fanotify interface as I want to push it to Linus and as I've been requesting comments on for the last 1.25 years is just that. It really makes no use of the networking system other than bind() and setsockopt() and everyone tends to agree the things I want to do can't reasonably be done using network hooks and a 'real' socket protocol. I like this interface, setsockopt() makes it so easy to add new functionality as we flush out other users. fanotify as it stands today has a number of groups who will port to it, has a nubmer of advantages over inotify, and I have been told privately meets the needs of the original group of people who paid me to work on it (two very large anti-malware companies who currently unprotect and hack the syscall table of their users) Just this week I got another request to look at syscalls. So I did, I haven't prototyped it, but I can do it with syscalls, they would look like this: int fanotify_init(int flags, int f_flags, __u64 mask, unsigned int priority); int fanotify_add_mark(int fanotify_fd, char *path, __u64 mask, __u64 ignored_mask); int fanotify_add_mark_fd(int fanotify_fd, int fd, __u64 mask, __u64 ignored_mask); int fanotify_rm_mark(int fanotify_fd, char *path, __u64 mask); int fanotify_rm_mark_fd(int fanotify_fd, int fd, __u64 mask); Those above 4 could probably be squashed into 2 syscalls with an extra flags field. int fanotify_clear_marks(int fanotify_fd); int fanotify_perm_response(int fanotify_fd, __u64 cookie, int response); int fanotify_ignore_sb(int fanotify_fd, long f_type); int fanotify_ignore_fsid(int fanotify_fd, fsid_t f_fsid); These 2 are the most questionable, they would honestly only be used for things that wanted system wide notification, I can't imagine that being many things other than AV vendors. But they really need a way to exclude notification when people open/close/read/write to /proc (which is the point of the ignore_sb.) Since I don't have a solution to subtree notification I don't know if it will work in this syscall framework. I know people want subtree notification and I'm willing to take a stab at it after the fscking all notification is accepted. That's one of the main reasons I like setsockopt over tons of syscalls. I can add a new one very easily. I also can easily expand arguments by just creating a new sockopt. No userspace headaches. Are there demands that I convert to syscalls? Do I really gain anything using 9 new inextensible syscalls over socket(), bind(), and 8 setsockopt() calls? I'd like to send these patches along, so a ruling from on high would be great.... -Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Sep 14, 2009 at 03:08:15PM -0400, Eric Paris (eparis@redhat.com) wrote: > Just this week I got another request to look at syscalls. So I did, I > haven't prototyped it, but I can do it with syscalls, they would look > like this: > > int fanotify_init(int flags, int f_flags, __u64 mask, unsigned int priority); ... > Are there demands that I convert to syscalls? Do I really gain anything > using 9 new inextensible syscalls over socket(), bind(), and 8 > setsockopt() calls? In my personal opinion sockets are way much simpler and obvious than syscalls. Also one should not edit hundred of arch-dependant headers conflicting with other pity syscallers. But implementing af_fanotify opens a door for zillions of other af_something which can be implemented using existing infrastructure namely netlink will solve likely 99% of potential usage cases. And frankly I did not find it way too convincing that using netlink is impossible in your scenario if some things will be simplified, namely event merging. Moreover you can implement a pool of working threads and postpone all the work to them and appropriate event queues, which will allow to use rlimits for the listeners and open files 'kind of' on behalf of those processes. But it is quite diferent from the approach you selected and which is more obvious indeed. So if you ask a question whether fanotify should use sockets or syscalls, I would prefer sockets, but I still recommend to rethink your decision to move away from netlink to be 100% sure that it will not solve your needs.
On Wed, 2009-09-16 at 00:16 +0400, Evgeniy Polyakov wrote: > On Mon, Sep 14, 2009 at 03:08:15PM -0400, Eric Paris (eparis@redhat.com) wrote: > > Just this week I got another request to look at syscalls. So I did, I > > haven't prototyped it, but I can do it with syscalls, they would look > > like this: > > > > int fanotify_init(int flags, int f_flags, __u64 mask, unsigned int priority); > > ... > > > Are there demands that I convert to syscalls? Do I really gain anything > > using 9 new inextensible syscalls over socket(), bind(), and 8 > > setsockopt() calls? > > In my personal opinion sockets are way much simpler and obvious than > syscalls. Also one should not edit hundred of arch-dependant headers > conflicting with other pity syscallers. > > But implementing af_fanotify opens a door for zillions of other > af_something which can be implemented using existing infrastructure > namely netlink will solve likely 99% of potential usage cases. > > And frankly I did not find it way too convincing that using netlink is > impossible in your scenario if some things will be simplified, namely > event merging. Nothing's impossible, but is netlink a square peg for this round hole? One of the great benefits of netlink, the attribute matching and filtering, although possibly useful isn't some panacea as we have to do that well before netlink to have anything like decent performance. Imagine every single fs event creating an skb and sending it with netlink only to have most of them dropped. The only other benefit to netlink that I know of is the reasonable, easy, and clean addition of information later in time with backwards compatibility as needed. That's really cool, I admit, but with the limited amount of additional info that users have wanted out of inotify I think my data type extensibility should be enough. > Moreover you can implement a pool of working threads and > postpone all the work to them and appropriate event queues, which will > allow to use rlimits for the listeners and open files 'kind of' on > behalf of those processes. I'm sorry, I don't userstand. I don't see how worker threads help anything here. Can you explain what you are thinking? > But it is quite diferent from the approach you selected and which is > more obvious indeed. So if you ask a question whether fanotify should > use sockets or syscalls, I would prefer sockets I've heard someone else off list say this as well. I'm not certain why. I actually spent the day yesterday and have fanotify working over 5 new syscalls (good thing I wrote the code with separate back and and front ends for just this purpose) And I really don't hate it. I think 3 might be enough. fanotify_init() ---- very much like inotify_init fanotify_modify_mark_at() --- like inotify_add_watch and rm_watch fanotify_modify_mark_fd() --- same but with an fd instead of a path fanotify_response() --- userspace tells the kernel what to do if requested/allowed (could probably be done using write() to the fanotify fd) fanotify_exclude() --- a horrid syscall which might be better as an ioctl since it isn't strongly typed.... If there is a strong need for syscalls I'm ready and willing. I'd love to hear Linus say socket+setsockopt() is a merge blocker and I have to go to syscalls if he sees it that way. If davem and friends say I'm not networky enough to use socket()+setsockopt() I guess I have to look at netlink (which I'm still not convinced buys us anything vs the crazy complexity) or go with syscalls. I'm perfectly willing to admit this is a /dev+ioctl type interface only implemented on socket+setsockopt(). If that's a no go, someone say it now please. > but I still recommend > to rethink your decision to move away from netlink to be 100% sure that > it will not solve your needs. I don't see what's gained using netlink. I am already reusing the fsnotify code to do all my queuing. Someone help me understand the benefit of netlink and help me understand how we can reasonably meet the needs and I'll try to prototype it. 1) fd's must be opened in the recv process 2) reliability, if loss must know on the send side -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 15 Sep 2009, Eric Paris wrote: > > I don't see what's gained using netlink. I'm personally not a big believer in netlink. What's the point, really? If you are sending datagrams back-and-forth, go wild. But if it's more structured than that, netlink has no actual upsides as far as I can tell. Same goes for sockets in this case, actually. What's the upside? I'll throw out a couple of upsides of actual system calls, people can feel free to comment: - things like 'strace' _work_ and the traces make sense, and you generally see what the app is trying to do from the traces (sure, it takes some time for strace to learn new system calls, but even when it only gives a system call number, it's never any worse than some "made-up packet interface". - if you have a system call definition, it tends to be a much stricter interface than "let's send some packets around with a network interface". - No unnecessary infrastructure. That said, maybe the netlink/socket people can argue for their standpoints. (And btw, I still want to know what's so wonderful about fanotify that we would actually want yet-another-filesystem-notification-interface. So I'm not sayying that I'll take a system call interface. I just don't think that hiding interfaces behind some random packet interface is at all any better) Linus -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2009-09-15 at 16:49 -0700, Linus Torvalds wrote: > And btw, I still want to know what's so wonderful about fanotify that we > would actually want yet-another-filesystem-notification-interface. So I'm > not sayying that I'll take a system call interface. The real thing that fanotify provides is an open fd with the event rather than some arbitrary 'watch descriptor' that userspace must somehow magically map back to data on disk. This means that it could be used to provide subtree notification, which inotify is completely incapable of doing. And it can be used to provide system wide notification. We all know who wants that. It provides an extensible data format which allows growth impossible in inotify. I don't know if anyone remember the inotify patches which wanted to overload the inotify cookie field for some other information, but inotify information extension is not reasonable or backwards compatible. fanotify also allows userspace to make access decisions and cache those in the kernel. Useful for integrity checkers (anti-malware people) and for hierarchical storage management people. I've got private commitments for two very large anti malware companies, both of which unprotect and hack syscall tables in their customer's kernels, that they would like to move to an fanotify interface. Both Red Hat and Suse have expressed interest in these patches and have contributed to the patch set. The patch set is actually rather small (entire set of about 20 patches is 1800 lines) as it builds on the fsnotify work already in 2.6.31 to reuse code from inotify rather than reimplement the same things over and over (like we previously had with inotify and dnotify) Don't know what else to say..... -Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Eric Paris wrote: > On Tue, 2009-09-15 at 16:49 -0700, Linus Torvalds wrote: > > And btw, I still want to know what's so wonderful about fanotify that we > > would actually want yet-another-filesystem-notification-interface. So I'm > > not sayying that I'll take a system call interface. > > The real thing that fanotify provides is an open fd with the event > rather than some arbitrary 'watch descriptor' that userspace must > somehow magically map back to data on disk. This means that it could be > used to provide subtree notification, which inotify is completely > incapable of doing. That's a bit of a spurious claim. - fanotify does not provide subtree notification in it's present form. When it is extended to do that, why wouldn't inotify be as well? That's an fsnotify feature, common to both. - fanotify does not provide notification at all for some events that you get with inotify. It is not a superset, so you can't use fanotify to provide a subtree-capable equivalent to inotify. What a mess when you need the combination of both features! - fanotify requires you call readlink(/proc/fd/N) for every event to get the path. It's not a particularly efficient way to get it, especially when an apps wants to know if it's something in it's region of interest but doesn't care about the actual path. When an apps knows it needs the map back to to path, why make it slow to get it? That "extensible data format" is being underutilised... - fanotify's descriptor may be race-prone as a way to get the subtree used for access, because any of the parent directories could have moved and even been deleted before the app calls readlink(/proc/fd/N). I don't know if a _reliable_ way to track changes in a subtree can be built on it. Maybe it can but it appears this hasn't been analysed. It depends on readlink(/proc/fd/N)'s behaviour when the dentry's have been changed, among other things. - Does the descriptor cause umount to fail when user does "do some stuff in baz; umount baz", or does it serialise nicely? That's one of inotify's nice features - it doesn't cause umounts to fail. > And it can be used to provide system wide notification. We all know > who wants that. People who want to break out of chroot/namespace jails using the conveniently provided open file descriptor? :-) Seriously, what does system-wide fanotify do when run from a chroot/namespace/cgroup, and a file outside them is accessed? If the event is delivered with file desciptor, that's a security hole. If it's not delivered, that sounds like working subtree support? I'd expect anti-malware to want to be run inside VMs quite often... Note that there's no such thing as "the real system root" any more. > It provides an extensible data format which allows growth impossible in > inotify. I don't know if anyone remember the inotify patches which > wanted to overload the inotify cookie field for some other information, > but inotify information extension is not reasonable or backwards > compatible. I agree with this (although that's what flags are for -- see clone). I don't have a problem with the next interface being fanotify (despite arguing a lot); I just want to see the next one being useful for the things I would otherwise be proposing my own yet-another-interface for. So we don't need a fourth one soon after the third due to easily foreseen limitations. > I've got private commitments for two very large anti malware companies, > both of which unprotect and hack syscall tables in their customer's > kernels, that they would like to move to an fanotify interface. Both > Red Hat and Suse have expressed interest in these patches and have > contributed to the patch set. > > The patch set is actually rather small (entire set of about 20 patches > is 1800 lines) as it builds on the fsnotify work already in 2.6.31 to > reuse code from inotify rather than reimplement the same things over and > over (like we previously had with inotify and dnotify) I don't have any problem with either of these, and _fs_notify generally seems like an improvement. I don't have a problem with fanotify either. For what it does, it's ok. > Don't know what else to say..... Answer questions about use-cases that you're not interested in? Why block them? What about Evigny's request for an event without an open fd - because he needs the pid information (inotify doesn't provide) but not the fd? Sorry to be so harsh. I'm really trying to make sure we don't repeat the mistakes of dnotify and inotify, and end up with a third interface which also is too restrictive (because it's good enough for your anti-malware and HSM customers) so that a fourth interface will be needed soon after. I'd like to be able to use it from some applications to accelerate userspace caching of things (faster Make, faster Samba) without penalising all other applications touching unrelated parts of the filesystem. The attitude "you can live with 10% slowdown" worries me. I'm sure that can be fixed with a bit of care. If the intention is to maintain fanotify and inotify side-by-side for different uses (because fanotify returns open descriptors and blocks the accessing process until acked), that's ok with me. It makes sense. But then it's messy that neither offers a superset of the other regarding which files and events are tracked. If it's right that inotify has no room for extensibility (I'm not sure about this), than it appears we already made a mess with dnotify and inotify, so it would be a shame to repeat the same mistakes again. Let's get the next one right, even it takes a bit longer, ok? -- Jamie -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2009-09-16 at 08:52 +0100, Jamie Lokier wrote: > Eric Paris wrote: > > On Tue, 2009-09-15 at 16:49 -0700, Linus Torvalds wrote: > > > And btw, I still want to know what's so wonderful about fanotify that we > > > would actually want yet-another-filesystem-notification-interface. So I'm > > > not sayying that I'll take a system call interface. > > > > The real thing that fanotify provides is an open fd with the event > > rather than some arbitrary 'watch descriptor' that userspace must > > somehow magically map back to data on disk. This means that it could be > > used to provide subtree notification, which inotify is completely > > incapable of doing. > > That's a bit of a spurious claim. My claim that a watch descriptor plus pathname segment sucks is spurious? You've got to be kidding me. You think that a number which represents the pathname of an object at some point in the past is a reasonable piece of information? If a watch descriptor actually provided any information about the object on which an event just happened it would be useful. Sadly, it doesn't. > - fanotify does not provide subtree notification in it's > present form. When it is extended to do that, why wouldn't > inotify be as well? That's an fsnotify feature, common to both. Because I don't believe inotify can be reasonably extended in this way. It's already clear that an arbitrary watch descriptor which userspace has to somehow know how to correctly map back to an object (impossible task) is difficult to use and I personally don't see how watch descriptor + long path name component is somehow better or even reasonable. Path names are such crap and passing a pathname to userspace is really just telling userspace, something happened to something that used to be at this location but is possibly long since gone. I don't believe that's a good interface or one we should be allowing to be {ab,}used. > - fanotify does not provide notification at all for some events that > you get with inotify. It is not a superset, so you can't use > fanotify to provide a subtree-capable equivalent to inotify. What > a mess when you need the combination of both features! This is true, although where possible I plan to rectify this situation on an ongoing basis. Absolutely nothing preventing that from happening. > - fanotify requires you call readlink(/proc/fd/N) for every event to > get the path. And inotify requires magic. Which one sounds better? > It's not a particularly efficient way to get it, What is? > especially when an apps wants to know if it's something in it's > region of interest but doesn't care about the actual path. > When an apps knows it needs the map back to to path, why make it > slow to get it? That "extensible data format" is being > underutilised... You convince Al Viro that the vfs should give us a path name for an arbitrary object that honestly might not have one and I'll consider giving it to userspace in the event notification. Probably should read some of the AppArmour arguments before you do though. You're asking for something that's impossible and is at best incredibly race prone crap. At worst is a total lie. > - fanotify's descriptor may be race-prone as a way to get the subtree > used for access, because any of the parent directories could have > moved and even been deleted before the app calls > readlink(/proc/fd/N). I don't know if a _reliable_ way to track > changes in a subtree can be built on it. Maybe it can but it > appears this hasn't been analysed. It depends on > readlink(/proc/fd/N)'s behaviour when the dentry's have been > changed, among other things. Huh? Your saying that by the time userspace deals with an event it might have moved? True. So what? Spurious and irrelevant. You got an event that happened to this object. If subtree notification is added at some later point you may get an event for an object that was in the subtree and is now not in the subtree. That sounds fine to me.... > - Does the descriptor cause umount to fail when user does "do some > stuff in baz; umount baz", or does it serialise nicely? That's one > of inotify's nice features - it doesn't cause umounts to fail. Ah yes, you didn't read the code. Seeing as how that's an fsnotify property and fanotify notify is built on fsnotify.... > Seriously, what does system-wide fanotify do when run from a > chroot/namespace/cgroup, and a file outside them is accessed? At the moment an fanotify global listener is system wide. Truely system wide. A gentleman from suse is looking rectify the problem so that if run inside a namespace it stays inside the namespace. Note that this particular little tidbit is not in the 8 patches I proposed. At the moment those just include the UI and basic notification. > If the event is delivered with file desciptor, that's a security hole. > If it's not delivered, that sounds like working subtree support? What? > I'd expect anti-malware to want to be run inside VMs quite often... What? > Answer questions about use-cases that you're not interested in? Why > block them? What about Evigny's request for an event without an open > fd - because he needs the pid information (inotify doesn't provide) > but not the fd? It's not blocked, it's on the list of things to look at. It's a relatively simple change to have the mark add code return a watch descriptor and use that value instead of an fd. I believe I already said that. It certainly doesn't seem like a merge blocker that some other feature might come along one day. There's nothing preventing it from coming out. "Here is a patch which implements TCP." "No, don't merge that, it doesn't implement UDP as well." "WTF?" > I'd like to be able to use it from some applications to accelerate > userspace caching of things (faster Make, faster Samba) without > penalising all other applications touching unrelated parts of the > filesystem. The attitude "you can live with 10% slowdown" worries me. > I'm sure that can be fixed with a bit of care. I've said that using ANYTHING which allows userspace to arbitrate access decisions for an entries system is going to eat performance. If you can't handle that, don't build that damn userspace decision making portion into your kernel! Don't buy anti-malware snake oil. fanotify as a notification system isn't some performance monster here to eat your babies. fanotify as an access control system is, and I can't really do anything about that, it's the nature of that part of the horrible beast. Lets not confuse things... > If the intention is to maintain fanotify and inotify side-by-side for > different uses (because fanotify returns open descriptors and blocks > the accessing process until acked) I know I've told you this before. fanotify provides TWO separate things. Notification with a fd. A method to provide access control and blocking. Lets not lead people to believe you can't you one without the other. They are very distinct. > , that's ok with me. It makes > sense. But then it's messy that neither offers a superset of the > other regarding which files and events are tracked. inotify isn't going away any time soon.... > If it's right that inotify has no room for extensibility (I'm not sure > about this), than it appears we already made a mess with dnotify and > inotify, so it would be a shame to repeat the same mistakes again. > Let's get the next one right, even it takes a bit longer, ok? Calling inotify a "mess" or "completely inextensible" would probably be too harsh. There are ways with inotify_init1() that we could provide a better or more forward looking data format. But you are still left with inotify's fatal flaw: userspace is required to map a random number to an object in the filesystem. inotify starts with a reasonably impossible premise and builds from there. You show me anything about my proposal that limits any use case you have in mind and I'll fix it. I'll fix it today. My proposal does what it does very well and leaves the door open to build upon it for whatever has been mentioned over the last 1.25 years I've been trying to get comments. I'm certainly not going to wait until fanotify can leap out of my computer and make a pot of coffee before I ask for it to be merged (and no, it doesn't have that forward looking extensibility, I'll add a syscall flag for it right now, O_COFFEE) -Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> - fanotify does not provide subtree notification in it's > present form. When it is extended to do that, why wouldn't > inotify be as well? That's an fsnotify feature, common to both. Because inotify gives you no reliable access to the object monitored as the name passed back is not an object reference and is racy. Inotify is fine for making pretty icons pop up on desktops and making file selectors update, but it is somewhat inadequate for indexers and completely useless for stuff like HSM. > - fanotify requires you call readlink(/proc/fd/N) for every event to > get the path. It's not a particularly efficient way to get it, IFF you want the path, but the path isn't usually the most valuable bit. Plus you'll find the readlink is extremely quick anyway. > People who want to break out of chroot/namespace jails using the > conveniently provided open file descriptor? :-) chroot isn't a security model. You can already do this with AF_UNIX sockets (and there are apps that intentionally use fchdir that way) > I'd expect anti-malware to want to be run inside VMs quite often... Inside of containers - unlikely. Inside of guests sure but thats not going to relevant to fanotify() > the accessing process until acked), that's ok with me. It makes > sense. But then it's messy that neither offers a superset of the > other regarding which files and events are tracked. Agreed. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tuesday 15 September 2009, Eric Paris wrote: > fanotify_modify_mark_at() --- like inotify_add_watch and rm_watch > fanotify_modify_mark_fd() --- same but with an fd instead of a path I think these two can be merged into one without adding complexity, in the same way that sys_utimensat can take a file descriptor or a path or both. > fanotify_response() --- userspace tells the kernel what to do if requested/allowed > (could probably be done using write() to the fanotify fd) > fanotify_exclude() --- a horrid syscall which might be better as an ioctl since it isn't strongly typed.... Please don't use an ioctl here. While ioctl is fine for character devices and sort of fine for sockets, I think it would be very bad style to use it on file descriptors that you get back from specialized syscalls like fanotify_init. Do one or the other, but do it consistently. Why is it not strongly typed anyway? Something like int fanotify_ignore_sb(int fanotify_fd, unsigned int flag, long f_type, fsid_t f_fsid); would be type safe, although I think it would be better to only handle one of the two cases. Can you think of a case that you can't handle if you have to decide between them and only do one interface (f_type or fsid)? Arnd <>< -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Alan Cox wrote: > > - fanotify does not provide subtree notification in it's > > present form. When it is extended to do that, why wouldn't > > inotify be as well? That's an fsnotify feature, common to both. > > Because inotify gives you no reliable access to the object monitored as > the name passed back is not an object reference and is racy. Inotify is > fine for making pretty icons pop up on desktops and making file > selectors update, but it is somewhat inadequate for indexers and > completely useless for stuff like HSM. That was my point. (Why do people keep not getting it?) You can't rely on the name being non-racy, but you _can_ reliably invalidate application-level caches from the sequence of events including file writes, creates, renames, links, unlinks, mounts. And revalidate such caches by the absence of pending events. (There is one obscure case which inotify is missing, though, which means it cannot detect file changes in certain cases with hard links. I intend to fix that one.) For that, an inode isn't useful, a descriptor isn't useful, a directory descriptor/inode and pathname isn't useful, and file write events by themselves aren't useful. None of them quite do it by themselves. But with the correct combination of events, you can maintain very efficient application-level caching of file data / directory listing and lookups / stat results you have previously read from the filesystem. That's because the information you have previously depended upon, including path lookups, are all notified as one sort of inotify event or another when changed. Which doesn't sound all that special until you realise you can very quickly revalidate application-caches of any data structure calculated from reading things from the filesystem, no matter how many prerequisites or how complex the data structures, in a single system call. Amortised over many revalidations if you have them in parallel. That can apply to things like git, make, ccache, samba, rsync, httpd path walks, and virtually any "web templating" framework. Of course it takes userspace support as well, but that's where I'm coming from regarding "acceleration" and the essential kernel infrastructure. Clearly, I'm going to have to explain with working code :-) > but it is somewhat inadequate for indexers For indexers, the real inadequacy is the need to attach inotify watches to every directory at system startup, and to stat() everything to check it hasn't changed since the indexer was last running. Both are very slow on a large directory tree. The former can be dealt with using subtree watches (yes, even with hard links - I have proposed an algorithm for this but I think nobody understood it ;-). The latter needs filesystem support for a persistent change attribute. > > - fanotify requires you call readlink(/proc/fd/N) for every event to > > get the path. It's not a particularly efficient way to get it, > > IFF you want the path, but the path isn't usually the most valuable bit. > Plus you'll find the readlink is extremely quick anyway. I agree, you don't usually want the whole path. So what was the point about fanotify making subtree tracking possible with it's file descriptor, if not by readlink(/proc/fd/N)? Descriptors don't tell you which subtree a file is in any better than inotify watches. I.e. they do, if you track them and their containing directories all individually. > > People who want to break out of chroot/namespace jails using the > > conveniently provided open file descriptor? :-) > > chroot isn't a security model. You can already do this with AF_UNIX > sockets (and there are apps that intentionally use fchdir that way) Ah, no. AF_UNIX works with explicit sender cooperation. fanotify gives you access to files without sender cooperation, as it intercepts every open(). > > I'd expect anti-malware to want to be run inside VMs quite often... > > Inside of containers - unlikely. Why not? Some people run entire distributions in containiners, and present them as VMs to the world for other people to admin. > > the accessing process until acked), that's ok with me. It makes > > sense. But then it's messy that neither offers a superset of the > > other regarding which files and events are tracked. > > Agreed. In the end this is my main gripe. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> You can't rely on the name being non-racy, but you _can_ reliably > invalidate application-level caches from the sequence of events > including file writes, creates, renames, links, unlinks, mounts. And > revalidate such caches by the absence of pending events. You can't however create the caches reliably because you've no idea if you are referencing the right object in the first place - which is why you want a handle in these cases. I see fanotify as a handle producing addition to inotify, not as a replacement (plus some other bits around open blocking for HSM etc) > Clearly, I'm going to have to explain with working code :-) Always a good demo > > but it is somewhat inadequate for indexers > > For indexers, the real inadequacy is the need to attach inotify > watches to every directory at system startup, and to stat() everything > to check it hasn't changed since the indexer was last running. Both stat doesn't help you - inode numbers are only guaranteed unqiue (and constant) while a reference to the object is held. > Descriptors don't tell you which subtree a file is in any better than > inotify watches. I.e. they do, if you track them and their containing > directories all individually. Don't get me wrong - I don't think fanotify is sufficient on its own - and this is one reason. Some things care about the namespace, some about getting the exact content. > > chroot isn't a security model. You can already do this with AF_UNIX > > sockets (and there are apps that intentionally use fchdir that way) > > Ah, no. AF_UNIX works with explicit sender cooperation. > > fanotify gives you access to files without sender cooperation, as it > intercepts every open(). and is currently not general user accessible for this reason. > > Inside of containers - unlikely. > > Why not? Some people run entire distributions in containiners, and > present them as VMs to the world for other people to admin. In a word - performance. In two words performance and security. It isn't a sensible setup because you want to scan the most efficient way possible and you want to keep your malware scan as far away from attackers, so it makes sense to keep it outside of the containers and do the job once - one scan, one database of things to keep current. Alan -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Sep 15, 2009 at 05:54:59PM -0400, Eric Paris (eparis@redhat.com) wrote: > Nothing's impossible, but is netlink a square peg for this round hole? > One of the great benefits of netlink, the attribute matching and > filtering, although possibly useful isn't some panacea as we have to do > that well before netlink to have anything like decent performance. > Imagine every single fs event creating an skb and sending it with > netlink only to have most of them dropped. There is no problem with performance even with single IO per skb. Consider usual send/recv calls which may end up with the same skb per syscall - most of the overhead comes from data copy or syscall machinery (for small writes) and not from allocation path. I have a 3.5 years old performance graph at http://www.ioremap.net/gallery/netlink_perf.png which shows 400 MB/s of bandwidth for 4k writes, I'm pretty sure it is limited by copy performance only. > The only other benefit to netlink that I know of is the reasonable, > easy, and clean addition of information later in time with backwards > compatibility as needed. That's really cool, I admit, but with the > limited amount of additional info that users have wanted out of inotify > I think my data type extensibility should be enough. I want alot from inotify which I'm afraid will not be easy with fanotify either, but its existing model just does not allow its extension. I would not be 100% sure that there will be no additional needs in a year or so for fanotify. > > Moreover you can implement a pool of working threads and > > postpone all the work to them and appropriate event queues, which will > > allow to use rlimits for the listeners and open files 'kind of' on > > behalf of those processes. > > I'm sorry, I don't userstand. I don't see how worker threads help > anything here. Can you explain what you are thinking? I meant that it could be possible to postpone all the work of queueing, event allocation, fd opening and population all be done on behalf of some other threads in the system and only original process credentials would be checked to satisfy various limits. In this case there will be no questions in which context given fd was created and it is possible to use async netlink nature. I do not force you to do this of course, but there is already quite huge infrastructure for similar tasks and it could be worth to change/reconsider things to use existing models and not invent own. Of course this is a matter of overall benefit. > > But it is quite diferent from the approach you selected and which is > > more obvious indeed. So if you ask a question whether fanotify should > > use sockets or syscalls, I would prefer sockets > > I've heard someone else off list say this as well. I'm not certain why. > I actually spent the day yesterday and have fanotify working over 5 new > syscalls (good thing I wrote the code with separate back and and front > ends for just this purpose) And I really don't hate it. I think 3 > might be enough. > > fanotify_init() ---- very much like inotify_init > fanotify_modify_mark_at() --- like inotify_add_watch and rm_watch > fanotify_modify_mark_fd() --- same but with an fd instead of a path Those two can be combined I think. > fanotify_response() --- userspace tells the kernel what to do if requested/allowed > (could probably be done using write() to the fanotify fd) > fanotify_exclude() --- a horrid syscall which might be better as an ioctl since it isn't strongly typed.... It all sounds good and simple, but what if you will need modify command with new arguments? Instead of adding new typed option you will need to add another syscall. I already did that for inotify but via ioctl and pretty sure there will be such need for much wider fanotify some time in the future. > I don't see what's gained using netlink. I am already reusing the > fsnotify code to do all my queuing. Someone help me understand the > benefit of netlink and help me understand how we can reasonably meet the > needs and I'll try to prototype it. > > 1) fd's must be opened in the recv process Or just injected into registered process' fd table with appropriate limit checks? In this case it can be done on behalf of whatever other worker. > 2) reliability, if loss must know on the send side You have this knowledge at netlink sending time, but there is no way to wait until 'fail' condition is removed like when you can block writing into socket waiting for buffer space to become large enough. And there is no way to tell how many listeners got message and how many was dropped in multicast deliver except that there were drops. This can be trivially extended though.
Eric Paris wrote: > On Wed, 2009-09-16 at 08:52 +0100, Jamie Lokier wrote: > > Eric Paris wrote: > > > On Tue, 2009-09-15 at 16:49 -0700, Linus Torvalds wrote: > > > rather than some arbitrary 'watch descriptor' that userspace must > > > somehow magically map back to data on disk. This means that it could be > > > used to provide subtree notification, which inotify is completely > > > incapable of doing. > > > > That's a bit of a spurious claim. > > My claim that a watch descriptor plus pathname segment sucks is > spurious? You've got to be kidding me. You think that a number which > represents the pathname of an object at some point in the past is a > reasonable piece of information? If a watch descriptor actually > provided any information about the object on which an event just > happened it would be useful. Sadly, it doesn't. It's subtler than that. Too subtle, apparently. For directory changes, the event represents the path in that directory, which is useful, yes, even though the path may not be valid by the time you receive the message. It's useful because it tells userspace to invalidate any cached information about that path, and because the _lack_ of such a message after reading the inotify descriptor tells you that the path was valid before the point where you started the read (if not memory-coherent, then good enough for some applications). I know, I know, it's sounding a bit obscure and questionable. This is going to need explaining with real userspace code, so I'll hack on that (when I have time - next few days) and get back to you. I see now that my arguments aren't helping without clear working code, mainly because you believe I'm talking poo and haven't a clue what I'm talking about. > It's already clear that an arbitrary watch descriptor which userspace > has to somehow know how to correctly map back to an object (impossible > task) is difficult to use and I personally don't see how watch > descriptor + long path name component is somehow better or even > reasonable. Path names are such crap and passing a pathname to > userspace is really just telling userspace, something happened to > something that used to be at this location but is possibly long since > gone. I don't believe that's a good interface or one we should be > allowing to be {ab,}used. It's subtler than that. Application caches depend on lookups as well as inode operations. A simple stat("/foo/bar") does. Those path components in events are the only way to invalidate/revalidate data dependent on lookups. Inodes (fstat->st_ino from the descriptor returned by fanotify) are insufficient and cannot be used to revalidate path lookups or data dependent on them. Yes, it works even though the paths in events are not valid by the time you get them. In fact it _depends_ on getting those paths which aren't valid by the time you get them. > > especially when an apps wants to know if it's something in it's > > region of interest but doesn't care about the actual path. > > When an apps knows it needs the map back to to path, why make it > > slow to get it? That "extensible data format" is being > > underutilised... > > You convince Al Viro that the vfs should give us a path name for an > arbitrary object that honestly might not have one and I'll consider > giving it to userspace in the event notification. Probably should read > some of the AppArmour arguments before you do though. You're asking for > something that's impossible and is at best incredibly race prone crap. > At worst is a total lie. No, I'm asking for something that clearly is not being understood here I'm well aware of AppArmour and it's races etc.; it doesn't apply. > > Seriously, what does system-wide fanotify do when run from a > > chroot/namespace/cgroup, and a file outside them is accessed? > > At the moment an fanotify global listener is system wide. Truely system > wide. A gentleman from suse is looking rectify the problem so that if > run inside a namespace it stays inside the namespace. Note that this > particular little tidbit is not in the 8 patches I proposed. At the > moment those just include the UI and basic notification. I'll be really interested in the gentleman's solution. In general bind mounts complicate cache maintenance with inotify rather a lot. That's another corner case to tidy up. Bind mounts and namespaces have a lot in common. > Because I don't believe inotify can be reasonably extended in this way. I do, so give me a few days to code something and explain / settle it. I don't think I can code it using fanotify - the descriptor doesn't provide enough context. I think that's why we see them so differently - the different ways to use the events means that sometimes the inotify info does not work, and sometimes the fanotify descriptor does not work, because each provides some critical information (subtly) that the other does not. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov wrote: > It all sounds good and simple, but what if you will need modify command > with new arguments? Instead of adding new typed option you will need to > add another syscall. I already did that for inotify but via ioctl and > pretty sure there will be such need for much wider fanotify some time in > the future. Ew, that's unpleasant (adding to inotify via ioctl). You're right, it's not easy to add new syscalls, but it's not that hard either. I'd forgotten about Linus' strace argument. That's a good one. Of course everything should be a syscall by that argument :-) And strace can trace some ioctls and setsockopts. (But it's never pretty to see isatty() showing in strace as SNDCTL_TMR_TIMEBASE :-) Strace never shows structure of reads and writes to devices, so although efficient (you can batch), it's not nice for tracing. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Alan Cox wrote: > > You can't rely on the name being non-racy, but you _can_ reliably > > invalidate application-level caches from the sequence of events > > including file writes, creates, renames, links, unlinks, mounts. And > > revalidate such caches by the absence of pending events. > > You can't however create the caches reliably because you've no idea if > you are referencing the right object in the first place - which is why > you want a handle in these cases. I see fanotify as a handle producing > addition to inotify, not as a replacement (plus some other bits around > open blocking for HSM etc) There are two sets of events getting mixed up here. Inode events - reads, writes, truncates, chmods; and directory events - renames, links, creates, unlinks. Inode events alone _not enough_ to maintain caches, and here's why. With a file descriptor for an _inode_ event, that's fine. If you have { int fd1 = open("/foo/bar"), fd2 = open("/foo/baz"); } early in your program, and later cached_file_read(fd1) and cached_file_read(fd2), you have to recognise the inode number and invalidate both. You have to call fstat() on the event's descriptor and then look up a device+inode number in your own table. (The inotify way doesn't need the fstat() but is otherwise the same). That's fine for files you're keeping open and only want to know if the content changes _of an open file_. But that's not so useful. More often, you want to validate cached_file_read("/foo/bar"). That is, validate what you'd get if you opened that path _now_ and read it. Same for cached_stat("/foo/bar") to cache permissions, and other things like that. That needs to validate the path lookup _and_ the inode state. For that, we need directory events, and they must include the name in the directory that's affected. If you receive a directory event involving name "bar" in directory (identified by inode) "/foo", you invalidate cached_file_read("/foo/bar") and cached_stat("/foo/bar"). Oh, but wait, how do we know the inode for the directory in our event still refers to "/foo"? Answer: We're also watching it's parent directory "/". Assuming no reordering of certain events, that's ok. That way, by watching "/", "/foo" and "/foo/bar", when you receive no events you validate the results of cached_file_read("/foo/bar") and cached_stat("/foo/bar"). A lot to set up, but fast to check. Worth it if you're checking a lot of things that rarely change. If you receive inode events while watching the parent directory of the path used to access the inode, then you can avoid watching "/foo/bar", and just watch the path of parent directories. That saves an order of magnitude of watches typically. fanotify offers something similar, and in this case the event is probably more useful than inotify's. (The above is even hard-link-safe, if you do it right. I won't complicate the explanation with details). -- Jamie -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2009-09-16 at 13:56 +0100, Jamie Lokier wrote: > Alan Cox wrote: > > > You can't rely on the name being non-racy, but you _can_ reliably > > > invalidate application-level caches from the sequence of events > > > including file writes, creates, renames, links, unlinks, mounts. And > > > revalidate such caches by the absence of pending events. > > > > You can't however create the caches reliably because you've no idea if > > you are referencing the right object in the first place - which is why > > you want a handle in these cases. I see fanotify as a handle producing > > addition to inotify, not as a replacement (plus some other bits around > > open blocking for HSM etc) > > There are two sets of events getting mixed up here. Inode events - > reads, writes, truncates, chmods; and directory events - renames, > links, creates, unlinks. My understanding of you argument is that fanotify does not yet provide all inotify events, namely those of directories operations and thus is not suitable to wholesale replace everything inotify can do. I've already said that working towards that goal is something I plan to pursue, but for now, you still have inotify. The mlocate/updatedb people ask me about fanotify and it's on the todo list to allow global reception of of such events. The fd you get would be of the dir where the event happened. They didn't care, and I haven't decided if we would provide the path component like inotify does. Most users are perfectly happy to stat everything in the dir. It's hopefully feasible, but it's going to take some fsnotify hook movements and possibly so arguments with Al to get the information I want where I want it. But there is nothing about the interface that precludes it and it has been discussed and considered. Am I still missing it? -Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Eric Paris wrote: > On Wed, 2009-09-16 at 13:56 +0100, Jamie Lokier wrote: > > Alan Cox wrote: > > > > You can't rely on the name being non-racy, but you _can_ reliably > > > > invalidate application-level caches from the sequence of events > > > > including file writes, creates, renames, links, unlinks, mounts. And > > > > revalidate such caches by the absence of pending events. > > > > > > You can't however create the caches reliably because you've no idea if > > > you are referencing the right object in the first place - which is why > > > you want a handle in these cases. I see fanotify as a handle producing > > > addition to inotify, not as a replacement (plus some other bits around > > > open blocking for HSM etc) > > > > There are two sets of events getting mixed up here. Inode events - > > reads, writes, truncates, chmods; and directory events - renames, > > links, creates, unlinks. > > My understanding of you argument is that fanotify does not yet provide > all inotify events, namely those of directories operations and thus is > not suitable to wholesale replace everything inotify can do. Largely that, plus seeing fanotify look like it'll acquire some capabilities that would be useful with those inotify events and inotify not getting them. Bothered by the apparent direction of development, really. Btw, I'm not sure you can use inotify+fanotify together simultaneously in this way, which may be of benefit - caching might help the anti-malware-style access controls. I'll have to think carefully about ordering of some events, and using fanotify and inotify independently at the same time loses that ordering. > I've already said that working towards that goal is something I plan > to pursue, Sorry, I missed that, just as I didn't find a reply to Evigny's "I need pids". And from another mail, I thought you were stopping at the things with file descriptors. > but for now, you still have inotify. That's right. And it sucks for subtrees, so that's why I'd like to absorb improvements on subtree inclusions, and exclusion nodes look useful too. > The mlocate/updatedb people ask me about fanotify and it's on the todo > list to allow global reception of of such events. The fd you get would > be of the dir where the event happened. They didn't care, and I haven't > decided if we would provide the path component like inotify does. Most > users are perfectly happy to stat everything in the dir. mlocate/updatedb is relatively low performance and of course wants to be system-wide. It's not looking so good if a user wants an indexer just on their /home, and the administrator does not want everyone else to pay the cost. But I think we're quite agreed on how useful subtrees would be. System-wide events won't be needed if we can monitor the / subtree to get the same effect, and that'll also sort out namespaces and chroots. Stat'ing every entry in a dir event. Thinking out loud: 1. Stat'ing everything in a dir just to find out which 1 file was deleted can be quite expensive for some uses (if you have a large dir and it happens a lot), and is unpleasant just because each change _should_ result in about O(1) work. Taste, style, elegance ;-) For POSIX filesystems, I don't see any logical problem with this, actually. You don't need to call stat()! It's enough to call readdir() and look at d_ino to track renames/links/creates/unlinks - assuming only directory-change events are relevant here. Just an unpleasant O(n) scaling with directory size. (Note that I ignore mount points not returning the correct d_ino, because apps can track the mount list and compensate; they should be doing this anyway). 2. updatedb-style indexing apps don't care about the readdir/stat-all-entries cost, because they don't need to read the directory after every change, they only need to do it once every 24 hours if any events were received in that interval! (Obviously this isn't the same for pseudo-real-time indexers.) For Samba-style caching, on the other hand, the cost of rescanning a large directory when one file is being read often and another file in it is changing often might be prohibitive, forcing it to use heuristics to decide when to monitor a directory and when not to to cache it, depending on directory size. I'd rather avoid that. 3. Non-POSIX filesystems don't always have stable inode numbers. You can't tell that foo was renamed to bar by reading the directory and looking at d_ino, or by calling stat on each entry. You can assume stable inode numbers for inodes where there's an open file descriptor; that *might* be just enough to squeeze through the logic of a cache. I'm not sure right now. 4. You can't tell when file contents are changed from stat info. That means you have to receive an inode event, not a directory event for data changes, but that's not a problem of course - the name-used-for-access isn't useful for data changes anyway (except for debugging perhaps). 5. stat() doesn't tell you about xattr and ACL changes. xattrs can be large and slow to read on a whole directory. But as point 4, if attribute changes count as inode changes, there's no problem. 6. Calling stat() pulls a lot into cache that doesn't need to be in cache: all those inodes. But as I mentioned in points 1, 4 and 5, provided only directory name operations pass the directory to be scanned, and inode operations always pass the inode, you can use readdir() and avoid stat(), so the inodes don't have to be pulled into cache after all. Except for non-POSIX inode instability. Would be good to work out if that breaks the algorithm. In summary, calling readdir() and maybe stat/getxattr on each entry in a directory might be workable, but I'd rather it was avoidable. Simple apps may prefer to do it anyway - and let multiple events in a directory be merged as a result. While I'm here it would be nice to receive one event instead of two for operations which involve two paths: link, rename and bind mount. Having to pair up two events from inotify isn't helpful in any way. Imho an API that satisfies everything we've talking about would let you specify which fields you want to receive in the event when you bind a listener. Not _everything_ is selectable of course, but whether you want: For inode events (data read/write, attribute/ACL/xattr changes): - Open file descriptor of the affected file [Optional]. - The inode number and device number (always?). - A way to identify the vfsmount (due to bind mounts making the device number insufficient to identify a directory; always?). For directory events (create/unlink/link/rename/reflink/mkdir/rmdir /mount/umount): - Same as inode above, for the object created/linked/deleted. - Same as inode above, for the directory containing the source name. - Source name [Optional]. - Same as above, for the directory containing the target name - Target name [Optional] Source and target are the two names for rename/link/reflink/bind-mount operations. Otherwise there is only one name to include. Ironically, it begins to look a bit like netlink ;-) As you can see, I've made the open descriptors optional, and the names for directory events optional. For directory events, the object descriptor option should be independent from the source/target directory descriptor option. Add one more option: wait for Ack before file accessing process can proceed, or don't require Ack. That basically distinguishes inotify behaviour from fsnotify behaviour. It's not obvious, but that option's useful for directory events too, if you think about it: Think like an anti-malware or other access control manager, and ask: what if I have to block something which depends on the layout of files? Just as directory events are enough for caching, they are enough for complete access control of layout-dependent state too. For example, some line of text is no problem in a random file, but might be forbidden by the access manager from appearing in .bash_login, including by "mv harmless .bash_login". The above is not a final proposal, but I'd be delighted if you'd take a look at whether it's suitable. I realise some things may not work out for implementation reasons. Meanwhile, I'll take a look at userspace code for my caching algorithm and see how well that works out. I think we'll get subtree monitors out of this before the month is over... > It's hopefully feasible, but it's going to take some fsnotify hook > movements and possibly so arguments with Al to get the information I > want where I want it. That may, indeed, be a sticking point :-) > But there is nothing about the interface that > precludes it and it has been discussed and considered. > > Am I still missing it? No I think we're on the same wavelength now. Thanks for being patient. (And thanks, Alan, for stepping in and making me describe what I had in mind better). -- Jamie -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2009-09-16 at 22:49 +0100, Jamie Lokier wrote: > Eric Paris wrote: > > On Wed, 2009-09-16 at 13:56 +0100, Jamie Lokier wrote: > > > Alan Cox wrote: > > > > > You can't rely on the name being non-racy, but you _can_ reliably > > > > > invalidate application-level caches from the sequence of events > > > > > including file writes, creates, renames, links, unlinks, mounts. And > > > > > revalidate such caches by the absence of pending events. > > > > > > > > You can't however create the caches reliably because you've no idea if > > > > you are referencing the right object in the first place - which is why > > > > you want a handle in these cases. I see fanotify as a handle producing > > > > addition to inotify, not as a replacement (plus some other bits around > > > > open blocking for HSM etc) > > > > > > There are two sets of events getting mixed up here. Inode events - > > > reads, writes, truncates, chmods; and directory events - renames, > > > links, creates, unlinks. > > > > My understanding of you argument is that fanotify does not yet provide > > all inotify events, namely those of directories operations and thus is > > not suitable to wholesale replace everything inotify can do. > > Largely that, plus seeing fanotify look like it'll acquire some > capabilities that would be useful with those inotify events and > inotify not getting them. Bothered by the apparent direction of > development, really. > > Btw, I'm not sure you can use inotify+fanotify together simultaneously > in this way, which may be of benefit - caching might help the > anti-malware-style access controls. I'll have to think carefully > about ordering of some events, and using fanotify and inotify > independently at the same time loses that ordering. As soon as you say "ordering" you lose :) > > > I've already said that working towards that goal is something I plan > > to pursue, > > Sorry, I missed that, just as I didn't find a reply to Evigny's "I > need pids". And from another mail, I thought you were stopping at the > things with file descriptors. > > > but for now, you still have inotify. > > That's right. And it sucks for subtrees, so that's why I'd like to > absorb improvements on subtree inclusions, and exclusion nodes look > useful too. > > > The mlocate/updatedb people ask me about fanotify and it's on the todo > > list to allow global reception of of such events. The fd you get would > > be of the dir where the event happened. They didn't care, and I haven't > > decided if we would provide the path component like inotify does. Most > > users are perfectly happy to stat everything in the dir. > > mlocate/updatedb is relatively low performance and of course wants to > be system-wide. It's not looking so good if a user wants an indexer > just on their /home, and the administrator does not want everyone else > to pay the cost. > > But I think we're quite agreed on how useful subtrees would be. > System-wide events won't be needed if we can monitor the / subtree > to get the same effect, and that'll also sort out namespaces and chroots. > > Stat'ing every entry in a dir event. Thinking out loud: > > 1. Stat'ing everything in a dir just to find out which 1 file was > deleted can be quite expensive for some uses (if you have a > large dir and it happens a lot), and is unpleasant just because > each change _should_ result in about O(1) work. Taste, style, > elegance ;-) > > For POSIX filesystems, I don't see any logical problem with > this, actually. You don't need to call stat()! It's enough to > call readdir() and look at d_ino to track > renames/links/creates/unlinks - assuming only directory-change > events are relevant here. > > Just an unpleasant O(n) scaling with directory size. > > (Note that I ignore mount points not returning the correct > d_ino, because apps can track the mount list and > compensate; they should be doing this anyway). I think we generally agree here. > > 2. updatedb-style indexing apps don't care about the > readdir/stat-all-entries cost, because they don't need to read > the directory after every change, they only need to do it once > every 24 hours if any events were received in that interval! > > (Obviously this isn't the same for pseudo-real-time indexers.) See, that's why the mlocate/updatedb contacted me. They don't want it to be a 24 hour thing. They was pseudo real time without thrashing the hd. Turns out, the interface is there, and the backend just needs struct path information available to give it to them. > For Samba-style caching, on the other hand, the cost of > rescanning a large directory when one file is being read often > and another file in it is changing often might be prohibitive, > forcing it to use heuristics to decide when to monitor a > directory and when not to to cache it, depending on directory > size. I'd rather avoid that. This seems like you are confusing the events that happen to a directory (mv/rename) and the events that happen to a file (read/write.) (The former I do not have code for, the latter I do) Which is surprising since you so clearly delineate them later, so maybe I'm misunderstanding. > 3. Non-POSIX filesystems don't always have stable inode numbers. > You can't tell that foo was renamed to bar by reading the > directory and looking at d_ino, or by calling stat on each entry. > > You can assume stable inode numbers for inodes where there's an > open file descriptor; that *might* be just enough to squeeze > through the logic of a cache. I'm not sure right now. > > 4. You can't tell when file contents are changed from stat info. > > That means you have to receive an inode event, not a directory > event for data changes, but that's not a problem of course - the > name-used-for-access isn't useful for data changes anyway > (except for debugging perhaps). > > 5. stat() doesn't tell you about xattr and ACL changes. xattrs can > be large and slow to read on a whole directory. But as point 4, > if attribute changes count as inode changes, there's no problem. > > 6. Calling stat() pulls a lot into cache that doesn't need to be in > cache: all those inodes. But as I mentioned in points 1, 4 and > 5, provided only directory name operations pass the directory to > be scanned, and inode operations always pass the inode, you can > use readdir() and avoid stat(), so the inodes don't have to be > pulled into cache after all. > > Except for non-POSIX inode instability. Would be good to work > out if that breaks the algorithm. > > In summary, calling readdir() and maybe stat/getxattr on each entry in > a directory might be workable, but I'd rather it was avoidable. > Simple apps may prefer to do it anyway - and let multiple events in a > directory be merged as a result. > > While I'm here it would be nice to receive one event instead of two > for operations which involve two paths: link, rename and bind mount. > Having to pair up two events from inotify isn't helpful in any way. What do you propose the format of the event should be. Is this precluded in what's been proposed? > Imho an API that satisfies everything we've talking about would let > you specify which fields you want to receive in the event when you > bind a listener. Not _everything_ is selectable of course, but > whether you want: > > For inode events (data read/write, attribute/ACL/xattr changes): > > - Open file descriptor of the affected file [Optional]. Could be added and I've agreed to take a look. I'm just not sure bringing back the major flaw of inotify is really moving us forward. > - The inode number and device number (always?). hmmm, if you have the fd you have both, if you choose to get a wd like inotify, I say you're still own your own to do the magic map. I don't want to copy tons of almost always useless data into userspace. > - A way to identify the vfsmount (due to bind mounts making the > device number insufficient to identify a directory; always?). Now you want reliable path names? I need a vfsmount in kernel to open the fd for userspace, but I don't see how that's translatable to anything even remotely useable by userspace.... > For directory events (create/unlink/link/rename/reflink/mkdir/rmdir > /mount/umount): > > - Same as inode above, for the object created/linked/deleted. > > - Same as inode above, for the directory containing the source name. > - Source name [Optional]. > - Same as above, for the directory containing the target name > - Target name [Optional] > > Source and target are the two names for > rename/link/reflink/bind-mount operations. Otherwise there is > only one name to include. > > Ironically, it begins to look a bit like netlink ;-) > > As you can see, I've made the open descriptors optional, and the names > for directory events optional. For directory events, the object > descriptor option should be independent from the source/target > directory descriptor option. > > Add one more option: wait for Ack before file accessing process can > proceed, or don't require Ack. That basically distinguishes inotify > behaviour from fsnotify behaviour. Those 2 things are completely independent. If you request read with blocking you get read with blocking. If you request just read, you get just read. > It's not obvious, but that option's useful for directory events too, > if you think about it: Think like an anti-malware or other access > control manager, and ask: what if I have to block something which > depends on the layout of files? Just as directory events are enough > for caching, they are enough for complete access control of > layout-dependent state too. For example, some line of text is no > problem in a random file, but might be forbidden by the access manager > from appearing in .bash_login, including by "mv harmless .bash_login". I'd say they can realize the bad data when .bash_login is next opened and deny access then :) No, it's honestly a good idea, and one that is going to likely take serious hook movement. A lot of these things can go on the todo list, but don't sound like show stoppers to me.... > > It's hopefully feasible, but it's going to take some fsnotify hook > > movements and possibly so arguments with Al to get the information I > > want where I want it. > > That may, indeed, be a sticking point :-) My silent comrade from suse is looking at moving some hooks as we speak so hopefully directory events can get added shortly after a merge. I don't think we should wait until every conceived (but not necessarily needed) possibility is coded. We have things that work, meet needs, and hopefully you'll agree leave us with a place to go in the future. Do you? -Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 16 Sep 2009, Jamie Lokier wrote: > > I'd forgotten about Linus' strace argument. That's a good one. > > Of course everything should be a syscall by that argument :-) Oh yes, everything _should_ be a syscall. The problem is that many things are too "amorphous" to be system calls, and don't have any sane generic semantics (ie they only act on a specific device node). So we have ioctl's etc for those things. And then we have page faults. I've long wished that from a system call tracing standpoint we could show page faults as pseudo-system-calls (at least as long as they happen from user space - trying to handle nesting is not worth it). It would make it _so_ much more obvious what the performance patterns are if you could just do strace -ttT firefox for the cold-cache case and you'd see where the time is really spent. (yeah, yeah, you can get that kind of information other ways, but it's a hell of a lot less convenient than just getting a nice trace with timestamps). > And strace can trace some ioctls and setsockopts. (But it's never > pretty to see isatty() showing in strace as SNDCTL_TMR_TIMEBASE :-) Yes, strace can fix things up, and show "send a packet" as "fanotify". But it's nasty and hard. Quite frankly, I have _never_ever_ seen a good reason for talking to the kernel with some idiotic packet interface. It's just a fancy way to do ioctl's, and everybody knows that ioctl's are bad and evil. Why are fancy packet interfaces suddenly much better? Linus -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 17 Sep 2009 09:40:16 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote: > > And then we have page faults. I've long wished that from a system > call tracing standpoint we could show page faults as > pseudo-system-calls (at least as long as they happen from user space > - trying to handle nesting is not worth it). It would make it _so_ > much more obvious what the performance patterns are if you could just > do > > strace -ttT firefox > > for the cold-cache case and you'd see where the time is really spent. > > (yeah, yeah, you can get that kind of information other ways, but > it's a hell of a lot less convenient than just getting a nice trace > with timestamps). ohhh I should add pagefaults to timechart good one.
On Thu, 2009-09-17 at 09:40 -0700, Linus Torvalds wrote: > > On Wed, 16 Sep 2009, Jamie Lokier wrote: > > > > I'd forgotten about Linus' strace argument. That's a good one. > > > > Of course everything should be a syscall by that argument :-) > > Oh yes, everything _should_ be a syscall. I rewrote the interface and prototyped out a working fanotify like so: SYSCALL_DEFINE4(fanotify_init, unsigned int, flags, int, event_f_flags, __u64, mask, int, priority) int flags indicates - things like global or directed, fd's or wd's, could include fail allow vs fail deny, O_CLOEXEC, O_NONBLOCK, etc int event_f_flags - flags used when opening an fd for the listener __u64 mask - in global mode the events of interest int priority - the order fanotify listeners should be checked (so HSM can be before AV scanners) Do we need a timeout for access decisions? I left room for that in the bind address, but we can't just leave room to spare with a syscall... SYSCALL_DEFINE6(fanotify_modify_mark, int, fanotify_fd, unsigned int, flags, int, fd, const char __user *, pathname, __u64, mask, __u64, ignored_mask) int fanotify_fd - duh int flags - add, remove, flush, events on child, event on subtree? int fd - either fd to object or fd to dir for relative pathname const char __user * pathname - either pathname or null if only use fd __u64 mask - events this inode cares about __u64 ignored_mask - events this inode does NOT care about (not yet done, would someone like to comment?) fanotify_response(int fanotify_fd, __u64 cookie, __u32 response); __u64 cookie - which of our permission requests we are waiting on __u32 response - allow, deny, wait longer Could be done using write(), but I think the strace argument clearly says that this should be a syscall that can be easily found and reported (not settled in my mind) int fanotify_ignore_sb(int fanotify_fd, unsigned int flags, long f_type, fsid_t f_fsid) int fanotify_fd - duh unsigned int flags - f_type or fsid? long f_type - statfs(2) f_type fsid_t f_fsid - statfs(2) f_fsid Reads from the fd would return data of this structure: struct fanotify_event_metadata { __u32 event_len; __u32 vers; __s32 fd; __u32 mask; __u32 f_flags; __s32 pid; __s32 uid; __s32 tgid; __u64 cookie; } __attribute__((packed)); Thanks to event_len and vers, we could extend it to include __u32 filename1_len, char filename1[filename1_len] __u32 filename2_len, char filename2[filename2_len] This can all take shape as that work is completed and I don't believe should block merging. Do my syscalls look pretty enough? I'm down to 3 or 4. Jamie, you tend to agree that the interface and the event types are nice enough that we can build out (if we get the right hooks in the right places) everywhere we need to go? -Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wednesday, 16 September 2009 14:17:08 Jamie Lokier wrote: > Eric Paris wrote: > > On Wed, 2009-09-16 at 08:52 +0100, Jamie Lokier wrote: > > > Seriously, what does system-wide fanotify do when run from a > > > chroot/namespace/cgroup, and a file outside them is accessed? > > > > At the moment an fanotify global listener is system wide. Truely system > > wide. A gentleman from suse is looking rectify the problem so that if > > run inside a namespace it stays inside the namespace. Note that this > > particular little tidbit is not in the 8 patches I proposed. At the > > moment those just include the UI and basic notification. > > I'll be really interested in the gentleman's solution. I guess Eric meant me. From my point of view, "global" events make no sense, and fanotify listeners should register which directories they are interested in (e.g., include "/", exclude "/proc"). This takes care of chroots and namespaces as well. I think we want to register for events on objects rather than in the namespace, i.e., for inodes visible in multiple places because of hardlinks or bind mounts, we get the same kinds of events no matter which path is used. (The path actually used would still show up in /proc/self/fd/x.) When moving registered inodes, the registrations would move with them. This is how inotify works, except that inotify watches are not recursive. The difficulty with this is that in the worst case, this would require walking the entire namespace and all cached inodes. I don't see how this could be done for two reasons: * First, we can't take the vfsmount_lock and dcache_lock for the entire time. * Second, we would need to pin almost all the inodes, which is a clear no-go. [Why pin? At least we would need to remember which objects a listener has registered interest in, so we need to pin the inodes. We could still allow unregistered directory inodes to be thrown out because we can recreate their registration status from the parent. We can't recreate the registration status of non-directories because of hardlinks, though.] The only other idea I could come up with is to only allow recursive registrations at mount points: instead of inodes, the vfsmounts would be included or excluded (probably automatically including bind mounts). This has one big drawback though: users would no longer be able to watch arbitrary subtrees anymore. Privileged users could still arrange to watch almost all subtrees with bind mounts (mount --bind /foo/bar /foo/bar). Any ideas? Thanks, Andreas -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2009-09-17 at 22:07 +0200, Andreas Gruenbacher wrote: > From my point of view, "global" events make no sense, and fanotify listeners > should register which directories they are interested in (e.g., include "/", > exclude "/proc"). This takes care of chroots and namespaces as well. While I completely agree that most users don't want global events, the antimalware vendors who today, unprotect and hack the syscall table on their unsuspecting customer's machines to intercept every read, write, open, close, mmap, etc syscall want EXACTLY that. They'd been asking for a way to get this information for quite some time now. The largest vendors in this market have agreed the interface (well, when it was a socket interface that I talked about for so long) should meet their needs. Subtree watching / isn't any different or better, just harder and more complex to implement. You still have to exclude /proc and /sys and everything else. Just like one must with a global listener. Still though, this sounds like an issue for the f_type and f_fsid exclusion syscall I say I'm still not settled on. Not and issue with the basis of fanotify or with the 3 proposed syscalls. Jamie, do you see a problem with what I have been asking for review on or see a problem with extending it moving forward? Linus, do you see the value of 'yet another notification scheme' ? -Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Friday, 18 September 2009 22:52:08 Eric Paris wrote: > On Thu, 2009-09-17 at 22:07 +0200, Andreas Gruenbacher wrote: > > From my point of view, "global" events make no sense, and fanotify > > listeners should register which directories they are interested in (e.g., > > include "/", exclude "/proc"). This takes care of chroots and namespaces > > as well. > > While I completely agree that most users don't want global events, the > antimalware vendors who today, unprotect and hack the syscall table on > their unsuspecting customer's machines to intercept every read, write, > open, close, mmap, etc syscall want EXACTLY that. I understand that "global" is what those guys get today for lack of a reasonable mechanism, but it's not what anybody can ge given by fanotify: it conflicts with filesystem namespaces. Consider running several "virtual machines" in separate namespaces on the same kernel. With "global" you are forced to run the same global fanotify listeners everywhere; with per-mount-point listeners, you can choose between "global" and something more fine-grained by identifying which vfsmounts you are interested in. (Filesystem namespaces correspond to vfsmount hierarchies.) > [...] You still have to exclude /proc and /sys and everything else. Those are mount points, and so convenient to handle with a per-mount-point mechanism. No additional kernel code needed. > [...] Still though, this sounds like an issue for the f_type and f_fsid > exclusion syscall I say I'm still not settled on. Those are also obsolete with a per-mount-point mechanism. Thanks, Andreas -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, 2009-09-19 at 00:00 +0200, Andreas Gruenbacher wrote: > On Friday, 18 September 2009 22:52:08 Eric Paris wrote: > > On Thu, 2009-09-17 at 22:07 +0200, Andreas Gruenbacher wrote: > > > From my point of view, "global" events make no sense, and fanotify > > > listeners should register which directories they are interested in (e.g., > > > include "/", exclude "/proc"). This takes care of chroots and namespaces > > > as well. > > > > While I completely agree that most users don't want global events, the > > antimalware vendors who today, unprotect and hack the syscall table on > > their unsuspecting customer's machines to intercept every read, write, > > open, close, mmap, etc syscall want EXACTLY that. > > I understand that "global" is what those guys get today for lack of a > reasonable mechanism, but it's not what anybody can ge given by fanotify: it > conflicts with filesystem namespaces. > > Consider running several "virtual machines" in separate namespaces on the same > kernel. With "global" you are forced to run the same global fanotify > listeners everywhere; with per-mount-point listeners, you can choose > between "global" and something more fine-grained by identifying which > vfsmounts you are interested in. (Filesystem namespaces correspond to > vfsmount hierarchies.) Let me start by saying I am agreeing I should pursue subtree notification. It's what I think everyone really wants. It's a great idea, and I think you might have a simple way to get close. Clearly these are avenues I'm willing and hoping to pursue. Also I say it again, I believe the interface as proposed (except maybe some of my exclusion stuff) is flexible enough to implement any of these ideas. Does anyone disagree? BUT to solve one of the main problems fanotify is intending to solve it needs a way to be the 'fscking all notifier.' It needs to be the whole damn system. I totally agree that what I have in my tree today (yet unposted) restricting global notification (CAP_SYS_ADMIN) is highly inadequate. If any root task in any namespace could easily hop on out of it's namespace using fanotify, that's a problem. No arguments with me. But there must be a way for fanotify to globally get everything. That's one of the main points of fanotify. It needs to be a fscking all notifier, even of things in a completely detached namespace. AV vendors are going to get it. Their customers our users are going to load kernel modules that do horrible things. These are the realities of the world in which we live. Do we really throw 10's or 100's of thousands of our users under the bus because we don't like the software they are using on philosophical grounds? I'm sure namespace people are calling me an idiot and tell me to stay in my namespace. I want to stay in my namespace for 'most' root users, but I need a way to get a global scanner. I want to know what is the sanest way? And for people who feel it's insane, just don't compile it in. I'll make global listeners a build option. But global listeners is an absolute requirement. I was considering saying you needed cap_sys_admin and you needed current->ns_proxy->mnt_ns == the original init task's mnt_ns. Maybe this isn't a great way to determine if a task should be allowed to use global listeners. Is there a better way to restrict it? Think about your web hosting company. They sell 'cheap' vm's to customers in a private name. The web hosting company want to run an AV scanner that scans every file on the computer, their files, their customer's files, everything. Certainly we don't want the customer to break out of their namespace. So, what is the sanest, even if you hate the idea so much you compile it out, way to let the hosting company get information about files in their customer's detached namespace which not letting their customers get information about each other? -Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Saturday, 19 September 2009 5:04:31 Eric Paris wrote: > Let me start by saying I am agreeing I should pursue subtree > notification. It's what I think everyone really wants. It's a great > idea, and I think you might have a simple way to get close. Clearly > these are avenues I'm willing and hoping to pursue. Also I say it > again, I believe the interface as proposed (except maybe some of my > exclusion stuff) is flexible enough to implement any of these ideas. > Does anyone disagree? It does seem flexible enough. However, the current interface assumes "global" listeners (the mask argument of fanotify_init): int fanotify_init(int flags, int f_flags, __u64 mask, unsigned int priority); Once subtree support is added, this parameter becomes obsolete. That's pretty broken for a syscall yet to be introduced. > BUT to solve one of the main problems fanotify is intending to solve it > needs a way to be the 'fscking all notifier.' It needs to be the whole > damn system. Think of a system after boot, with a single global namespace. Whatever you access by filename is reachable from the namespace root. At this point, nothing more global exists. A listener can watch the mount points of interest, and everything's fine. What's a bit more tricky is to ensure that this listener will continue to receive all events from whatever else is mounted anywhere, irrespective of namespaces. I think we can get there. By the way, Documentation/filesystems/sharedsubtree.txt describes how filesystem namespaces work. Thanks, Andreas -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Andreas Gruenbacher wrote: > On Saturday, 19 September 2009 5:04:31 Eric Paris wrote: > > Let me start by saying I am agreeing I should pursue subtree > > notification. It's what I think everyone really wants. It's a great > > idea, and I think you might have a simple way to get close. Clearly > > these are avenues I'm willing and hoping to pursue. Also I say it > > again, I believe the interface as proposed (except maybe some of my > > exclusion stuff) is flexible enough to implement any of these ideas. > > Does anyone disagree? > > It does seem flexible enough. However, the current interface assumes "global" > listeners (the mask argument of fanotify_init): > > int fanotify_init(int flags, int f_flags, __u64 mask, > unsigned int priority); > > Once subtree support is added, this parameter becomes obsolete. That's pretty > broken for a syscall yet to be introduced. > > > BUT to solve one of the main problems fanotify is intending to solve it > > needs a way to be the 'fscking all notifier.' It needs to be the whole > > damn system. > > Think of a system after boot, with a single global namespace. Whatever you > access by filename is reachable from the namespace root. At this point, > nothing more global exists. A listener can watch the mount points of > interest, and everything's fine. > > What's a bit more tricky is to ensure that this listener will continue to > receive all events from whatever else is mounted anywhere, irrespective of > namespaces. I think we can get there. I think so to, and that'd be a great all round solution. We _have_ to receive mount & umount events to do this. But even inotify-style tracking needs those if it's to be accurate, so it's not an additional burden. It would be logical if fanotify could block and ack those in the same way as it can block and ack other accesses (with the usual filtering rules on which inodes trigger events, and which don't or are cached). As in to prevent: mount --bind innocent .bash_login, but also to ensure it always knows what's mounted when another event occurs. > By the way, Documentation/filesystems/sharedsubtree.txt describes how > filesystem namespaces work. Fortunately, after making a new namespace you can read the mounts in the new namespace from /proc/self/mount* (I think) without having to know anything about the shared subtree rules. So to follow monitoring/checking across all namespaces, it would (I think) be enough to receive a fanotify "new namespace" event, and Ack that event to allow the CLONE_NS to proceed. It's still tricky stuff though. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Monday, 21 September 2009 22:28:23 Jamie Lokier wrote: > It would be logical if fanotify could block and ack those [mount & umount > events] in the same way as it can block and ack other accesses (with the > usual filtering rules on which inodes trigger events, and which don't or are > cached). Hmm. To me, fanotify is about file contents first of all: this is what fanotify wants to be able to veto. Directory events seem reasonable to add for inotify compatibility, but I see no need for access decisions on them. Even less so for mounts and unmounts. (Besides, we can't hold any vfs locks while asking fanotify so those operations wouldn't be atomic, anyway.) Thanks, Andreas -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Andreas Gruenbacher wrote: > On Monday, 21 September 2009 22:28:23 Jamie Lokier wrote: > > It would be logical if fanotify could block and ack those [mount & umount > > events] in the same way as it can block and ack other accesses (with the > > usual filtering rules on which inodes trigger events, and which don't or are > > cached). > > Hmm. To me, fanotify is about file contents first of all: this is what > fanotify wants to be able to veto. Surely you don't assume that what constitutes malicious content is independent of it's location and/or name? (See also "echo 'run_virus&' >>.bash_login). Wait a minute. You don't assume that, otherwise why the interest in subtrees? :-) > Directory events seem reasonable to add for inotify compatibility, Did you see may point about userspace caches and how directory events are fundamental to that - there's no way to build a cache without them? > but I see no need for access decisions on them. Please excuse me; I'm a bit confused. Is fanotify intended just for use by access decision programs, or is the plan now for it to also be a replacement for inotify? I'm getting conflicting signals about that. If it's just for access decision programs, and if those aren't going to care about location, then there's no need to add directory events to fanotify at all. But then I'll be demanding subtree support in inotify, please :-) > Even less so for mounts and unmounts. (as root) mkdir foo; mount dodgy foo -oloop; mount --bind foo/cat /bin/cat If fanotify doesn't react to that, which is just a fancy way of saying "zcat virus.gz >/bin/cat" in a way which doesn't cause any writes or opens, what's the point in it? Is fanotify only for checking files written by non-root users? > (Besides, we can't hold any vfs locks > while asking fanotify so those operations wouldn't be atomic, anyway.) Indeed, good point. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 21 Sep 2009, Jamie Lokier wrote:
> I think so to, and that'd be a great all round solution.
If this is for anti-malware vendors to intercept userspace accesses
they're currently doing it by hacking the syscall table, why don't we
offer a way to monitor syscalls (kernel side) in a non racy way?
Modules can [un]register themselves for syscall intercaption, and receive
the syscall number and parameters. They'd be able to change paramters,
return error codes, and so on.
The cost of the check in the syscall path could even be under an
alternative-like patching, if really neeeded.
The Pros of this would be:
- The kernel code to implement this would be trivially small, with no
I-need-this-feature-too growth potential
- There won't be any externally visible API to maintain (and its kernel
counter part) and expand
- Any system call can be intercepted, allowing it to be flexible while
leaving the burden of the interception handling, and communication with
userspace policy enforcers, to the anti-malware (or whoever really)
companies modules
The anti-malware are already doing this (intercepting syscall), they
already have code for it, and they always did (writing kernel
modules/drivers, that is) for Windows.
- Davide
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tuesday, 22 September 2009 0:00:02 Jamie Lokier wrote: > Andreas Gruenbacher wrote: > > On Monday, 21 September 2009 22:28:23 Jamie Lokier wrote: > > > It would be logical if fanotify could block and ack those [mount & > > > umount events] in the same way as it can block and ack other accesses > > > (with the usual filtering rules on which inodes trigger events, and > > > which don't or are cached). > > > > Hmm. To me, fanotify is about file contents first of all: this is what > > fanotify wants to be able to veto. > > Surely you don't assume that what constitutes malicious content is > independent of it's location and/or name? If the antimalware vendors want to base their decisions on pathnames then that's their decision, and they can check /proc/self/fd/N. We should be able to treat directory events the same. > (See also "echo 'run_virus&' >>.bash_login). > > Wait a minute. You don't assume that, otherwise why the interest in > subtrees? :-) > > > Directory events seem reasonable to add for inotify compatibility, > > Did you see may point about userspace caches and how directory events > are fundamental to that - there's no way to build a cache without them? Yes, there were some doubts about this appoach. Waiting for your code to demonstrate; an object based cache (e.g., st_dev + st_ino) rather than a pathname based cache would seem more reasonable. > > but I see no need for access decisions on them. > > Please excuse me; I'm a bit confused. Is fanotify intended just for > use by access decision programs, or is the plan now for it to also be > a replacement for inotify? I'm getting conflicting signals about > that. Inotify doesn't support access decisions. So where's the problem with having "notify only" events for directory / mount / unmount events? > If it's just for access decision programs, and if those aren't going > to care about location, then there's no need to add directory events > to fanotify at all. But then I'll be demanding subtree support in > inotify, please :-) > > > Even less so for mounts and unmounts. > > (as root) mkdir foo; mount dodgy foo -oloop; mount --bind foo/cat > /bin/cat ... and then someone accesses /bin/cat, which triggers a fanotify access decision. Thanks, Andreas -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Davide Libenzi wrote: > On Mon, 21 Sep 2009, Jamie Lokier wrote: > > > I think so to, and that'd be a great all round solution. > > If this is for anti-malware vendors Personally I'm not interested in anti-malware, and am simply interested in leveraging fsnotify improvements to accelerate userspace caches of information which depends on files (indexes, templates, compiler caches, stat caches etc.). Basically make inotify better, and sufficiently correct for that purpose. My sticking my oar in lately is to ensure the fsnotify improvements are going in the (imho) right direction. There's a lot of interesting apps waiting in the wings on this. It doesn't have to be complicated, just... sensible. > to intercept userspace accesses > they're currently doing it by hacking the syscall table, why don't we > offer a way to monitor syscalls (kernel side) in a non racy way? > Modules can [un]register themselves for syscall intercaption, and receive > the syscall number and parameters. They'd be able to change paramters, > return error codes, and so on. > The cost of the check in the syscall path could even be under an > alternative-like patching, if really neeeded. > The Pros of this would be: > > - The kernel code to implement this would be trivially small, with no > I-need-this-feature-too growth potential (Fwiw, the {fa,fs,i}notify thing looks to me like it's getting simpler as we go. Good design = decrease complexity + increase versatility. E.g. see epoll.) > - There won't be any externally visible API to maintain (and its kernel > counter part) and expand > > - Any system call can be intercepted, allowing it to be flexible while > leaving the burden of the interception handling, and communication with > userspace policy enforcers, to the anti-malware (or whoever really) > companies modules > > The anti-malware are already doing this (intercepting syscall), they > already have code for it, and they always did (writing kernel > modules/drivers, that is) for Windows. I don't mind at all if fanotify is replaced by a general purpose "take over the system call table" solution for anti-malware, and I still get to keep the fsnotify improvements :-) But I can't help noticing that we _already_ have quite well placed hooks for intercepting system calls, called security_this and security_that (SELinux etc), albeit they can't redirect things so much. However, being a little kinder, I suspect even the anti-malware vendors would rather not slow down everything with race-prone complicated tracking of everything every process does... which is why fanotify allows it's "interest set" to be reduced from everything to a subset of files, and it's results to be cached, and let the races be handled in the normal way by VFS. Once you have an "interest set" and focus on files, it looks somewhat reasonable to use the fsnotify hooks. ...That is, if you believe monitoring files is the best approach to anti-malware. I can't help noticing that on (ahem) Windows, running just a "virus checker" which generically scans every file independent of it's location looking for signatures and keeping up with patches is no longer considered good enough. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Andreas Gruenbacher wrote: > If the antimalware vendors want to base their decisions on pathnames then > that's their decision, and they can check /proc/self/fd/N. Race hazards and loopholes. It doesn't work. > Waiting for your code to demonstrate; an object based cache (e.g., > st_dev + st_ino) rather than a pathname based cache would seem more > reasonable. Nearly everything that people do with files involves paths. The point is to cache what people (or their programs) do. Apache does not consult inodes by number, and rsync does not write inodes by number :-) Yes, to the code... > > > but I see no need for access decisions on them. > > > > Please excuse me; I'm a bit confused. Is fanotify intended just for > > use by access decision programs, or is the plan now for it to also be > > a replacement for inotify? I'm getting conflicting signals about > > that. > > Inotify doesn't support access decisions. So where's the problem with > having "notify only" events for directory / mount / unmount events? No problem here. You seemed to be saying you want to add directory events to fanotify. But if fanotify is only intended for access decisions? Something I must have misunderstood in that. > > If it's just for access decision programs, and if those aren't going > > to care about location, then there's no need to add directory events > > to fanotify at all. But then I'll be demanding subtree support in > > inotify, please :-) > > > > > Even less so for mounts and unmounts. > > > > (as root) mkdir foo; mount dodgy foo -oloop; mount --bind foo/cat > > /bin/cat > > ... and then someone accesses /bin/cat, which triggers a fanotify access > decision. That's fine as long as there was no location-awareness in the logic which checked foo/innocent.txt and set that inode's "read-ok,cache-me" bit. Mount only matters if you're sensitive to location. If you think location-independent checks make good anti-malware I_have_a_bridge_to_sell^H^H^H^H^H^H^H^H^H^H^Hfine with me :-) -- Jamie -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Linus Torvalds <torvalds@linux-foundation.org> writes: > Quite frankly, I have _never_ever_ seen a good reason for talking to the > kernel with some idiotic packet interface. It's just a fancy way to do > ioctl's, and everybody knows that ioctl's are bad and evil. Why are fancy > packet interfaces suddenly much better? For working with the networking stack there are a lot of advantages because netlink is the interface to everything in the network stack. There are nice things like the packet to create a new interface is the same packet the kernel sends everyone to report a new interface etc. netlink also seems to get the structured data thing right. You can parse the packet even if you don't understand everything. Each tag is well defined like a syscall, taking exactly one kind of argument. Which avoids the worst failure of ioctl in that you can't even parse everything, and the argument may be a linked list in the calling process or something else atrocious. All of that said syscalls are good, and I would not recommend netlink to anything not in the network stack. Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 21 Sep 2009 17:15:28 -0700 Eric W. Biederman wrote: > Linus Torvalds <torvalds@linux-foundation.org> writes: > > > Quite frankly, I have _never_ever_ seen a good reason for talking to the > > kernel with some idiotic packet interface. It's just a fancy way to do > > ioctl's, and everybody knows that ioctl's are bad and evil. Why are fancy > > packet interfaces suddenly much better? > > For working with the networking stack there are a lot of advantages because > netlink is the interface to everything in the network stack. > > There are nice things like the packet to create a new interface is the same > packet the kernel sends everyone to report a new interface etc. > > netlink also seems to get the structured data thing right. You can > parse the packet even if you don't understand everything. Each tag is > well defined like a syscall, taking exactly one kind of argument. > Which avoids the worst failure of ioctl in that you can't even parse > everything, and the argument may be a linked list in the calling > process or something else atrocious. > > All of that said syscalls are good, and I would not recommend netlink > to anything not in the network stack. like CONFIG_SCSI_NETLINK and CONFIG_QUOTA_NETLINK_INTERFACE :( --- ~Randy -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 22 Sep 2009, Jamie Lokier wrote: > I don't mind at all if fanotify is replaced by a general purpose "take > over the system call table" solution ... That was not what I meant ;) You'd register/unregister as syscall interceptor, receiving syscall number and parameters, you'd be able to return status/error codes directly, and you'd have the ability to eventually change the parameters. All this should be pretty trivial code, and at the same time give full syscall visibility to the modules. The complexity would be left to the interceptors, as they already do it today. > But I can't help noticing that we _already_ have quite well placed > hooks for intercepting system calls, called security_this and > security_that (SELinux etc), ... That has "some" limits WRT non-GPL modules and relative static linkage. > However, being a little kinder, I suspect even the anti-malware > vendors would rather not slow down everything with race-prone > complicated tracking of everything every process does... which is why > fanotify allows it's "interest set" to be reduced from everything to a > subset of files, and it's results to be cached, and let the races be > handled in the normal way by VFS. They are already doing it today, since they are forced to literally find and hack the syscall table. They already have things like process whitelists, path whitelists, scan caches, and all the whistles, in their code. Of course, some of them might be interested in pushing given complexity inside the kernel, since they won't have to maintain it. Some other OTOH, might be interested in keeping a syscall-based access, since they already have working code based on that abstraction. The good part of this would be that all the userspace communication API, whitelists, caches, etc... would be left to the module implementors, and not pushed inside the kernel. That, and the flexibility of being able to intercept all the userspace entrances into the kernel. - Davide -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tuesday, 22 September 2009 16:51:39 Davide Libenzi wrote: > On Tue, 22 Sep 2009, Jamie Lokier wrote: > > I don't mind at all if fanotify is replaced by a general purpose "take > > over the system call table" solution ... > > That was not what I meant ;) > You'd register/unregister as syscall interceptor, receiving syscall number > and parameters, you'd be able to return status/error codes directly, and > you'd have the ability to eventually change the parameters. All this > should be pretty trivial code, and at the same time give full syscall > visibility to the modules. The fatal flaw of syscall interception is race conditions: you look up a pathname in your interception layer; then when you call into the proper syscall, the kernel again looks up the same pathname. There is no way to guarantee that you end up at the same object in both lookups. The security and fsnotify hooks are placed in the appropriate spots to avoid exactly that. Andreas -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 22 Sep 2009, Andreas Gruenbacher wrote: > The fatal flaw of syscall interception is race conditions: you look up a > pathname in your interception layer; then when you call into the proper > syscall, the kernel again looks up the same pathname. There is no way to > guarantee that you end up at the same object in both lookups. The security > and fsnotify hooks are placed in the appropriate spots to avoid exactly that. Fatal? You mean, for this corner case that the anti-malware industry lived with for so much time (in Linux and Windows), you're prepared in pushing all the logic that is currently implemented into their modules, into the kernel? This includes process whitelisting, path whitelisting, caches, userspace access API definition, and so on? On top of providing a generally more limited interception. Why don't we instead offer a lower and broader level of interception, letting the users decide if such fatal flaw needs to be addressed or not, in their modules? They get a broader inteception layer, with the option to decide if or if not address certain scenarios, and we get less code inside the kernel. A win/win situation, if you ask me. - Davide -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2009-09-22 at 17:31 +0200, Andreas Gruenbacher wrote: > On Tuesday, 22 September 2009 16:51:39 Davide Libenzi wrote: > > On Tue, 22 Sep 2009, Jamie Lokier wrote: > > > I don't mind at all if fanotify is replaced by a general purpose "take > > > over the system call table" solution ... > > > > That was not what I meant ;) > > You'd register/unregister as syscall interceptor, receiving syscall number > > and parameters, you'd be able to return status/error codes directly, and > > you'd have the ability to eventually change the parameters. All this > > should be pretty trivial code, and at the same time give full syscall > > visibility to the modules. > > The fatal flaw of syscall interception is race conditions: That's not the fatal flaw. The fatal flaw is that I am not going to write 90% of a rootkit and make it easy to use. Not going to happen. There's a reason we went to the trouble to mark the syscall call RO, we don't export it, and we don't want people playing with it. It clearly would have been the quickest, easiest, and fastest way to make anti-virus companies happy, but it doesn't really solve a good problem and it leaves all of us in a worse position than we are today. Easy != Good. -Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Eric Paris wrote: > That's not the fatal flaw. The fatal flaw is that I am not going to > write 90% of a rootkit and make it easy to use. I hate to point out the obvious, but fanotify's ability to intercept every file access and rewrite the file before the access proceeds is also 90% of a rootkit... But fortunately both fanotify and syscall rewriting require root in the first place. I think that makes the rootkit argument moot. As long as fanotify doesn't have a non-root flavour... which really would be handy for rootkits :-) > Easy != Good. I agree. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2009-09-21 at 22:04 +0200, Andreas Gruenbacher wrote: > On Saturday, 19 September 2009 5:04:31 Eric Paris wrote: > > Let me start by saying I am agreeing I should pursue subtree > > notification. It's what I think everyone really wants. It's a great > > idea, and I think you might have a simple way to get close. Clearly > > these are avenues I'm willing and hoping to pursue. Also I say it > > again, I believe the interface as proposed (except maybe some of my > > exclusion stuff) is flexible enough to implement any of these ideas. > > Does anyone disagree? > > It does seem flexible enough. However, the current interface assumes "global" > listeners (the mask argument of fanotify_init): > > int fanotify_init(int flags, int f_flags, __u64 mask, > unsigned int priority); > > Once subtree support is added, this parameter becomes obsolete. That's pretty > broken for a syscall yet to be introduced. Absolutely not obsolete. Subtree notification cannot do fscking all notification. > > BUT to solve one of the main problems fanotify is intending to solve it > > needs a way to be the 'fscking all notifier.' It needs to be the whole > > damn system. > > Think of a system after boot, with a single global namespace. Whatever you > access by filename is reachable from the namespace root. At this point, > nothing more global exists. A listener can watch the mount points of > interest, and everything's fine. this is true, if there is only one namespace subtree notification works the same as global notification. > What's a bit more tricky is to ensure that this listener will continue to > receive all events from whatever else is mounted anywhere, irrespective of > namespaces. I think we can get there. Lets say I want the subtree under / to get every event on the system. A process comes along and clones the namespace. Then lets say that process mounts something inside his new namespace. There is absolutely no path between my / and that new mount. How can subtree checking possibly find and indicate it wants notification about this mount? I don't see how subtree checking could do it. There can be completely disjoint trees with no overlap. mount -t tmpfs none /to_umount clone namespace mount -t tmpfs none /to_umount/private pivot_root /tmp_umount/private Something else umounts /to_umount That process is in an completely detached namespace? right? Heck, there could be operations on files that aren't in ANY namespace. a = open(/path/to/dir/); umount -l /path/to/ openat(a, "filename"); I don't see how subtree notification can possibly solve the global notification problem. I've been thinking that checking CAP_SYS_RAWIO as well as CAP_SYS_ADMIN might be reasonable when trying to use a global listener. If you can CAP_SYS_RAWIO I sorta feel like you can break out of a namespace anyway, right? -Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tuesday, 22 September 2009 23:06:16 Eric Paris wrote: > this is true, if there is only one namespace subtree notification works > the same as global notification. > > [...] > > I don't see how subtree notification can possibly solve the global > notification problem. I'm thinking of is something like this: A listener registers interest in "/", recursively. The kernel sets a FSNOTIFY_WATCH_RECURSIVE flag on "/" and each mount point below. Afterwards when something is mounted anywhere, same namespace or not, the kernel sets the new mount's FSNOTIFY_WATCH_RECURSIVE flag if the parent mount has this flag set. (Of course we need per fsnotify_group flags and not global ones, but this doesn't change the principle.) Andreas -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 22 Sep 2009, Jamie Lokier wrote: > Eric Paris wrote: > > That's not the fatal flaw. The fatal flaw is that I am not going to > > write 90% of a rootkit and make it easy to use. > > I hate to point out the obvious, but fanotify's ability to intercept > every file access and rewrite the file before the access proceeds is > also 90% of a rootkit... Obvious, but worth noticing. Indeed, the syscall table has been RO to make it harder for RKs to exploit it, not to make it impossible. RO syscall table makes perfect sense. But, once you are root, with very few lines of code, you can find, prot-fix the page, and patch the table. > But fortunately both fanotify and syscall rewriting require root in > the first place. ^^^^^^^^^ Again, maybe I wasn't clear about how this would work, but the syscall table would continue to remain RO ;) And as I said before, if we want to bring the cost of the interception for non-users pretty close to zero (a few NOPs to run onto), we could even adopt an alternative-like patching triggered by a kernel boot option. - Davide -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tuesday 22 September 2009 17:04:44 Davide Libenzi wrote: > On Tue, 22 Sep 2009, Andreas Gruenbacher wrote: > > The fatal flaw of syscall interception is race conditions: you look up a > > pathname in your interception layer; then when you call into the proper > > syscall, the kernel again looks up the same pathname. There is no way to > > guarantee that you end up at the same object in both lookups. The > > security and fsnotify hooks are placed in the appropriate spots to avoid > > exactly that. > > Fatal? You mean, for this corner case that the anti-malware industry lived > with for so much time (in Linux and Windows), you're prepared in pushing > all the logic that is currently implemented into their modules, into the > kernel? Lived with it because there was no other option. We used LSM while it was available for modules but then it was taken away. And not all vendors even use syscall interception, not even across platforms, of which you sound so sure about. You can't even scan something which is not in your namespace if you are at the syscall level. And you can't catch things like kernel nfsd. No, syscall interception is not really appropriate at all. Tvrtko -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Sep 23, 2009 at 09:39:33AM +0100, Tvrtko Ursulin wrote: > Lived with it because there was no other option. We used LSM while it was > available for modules but then it was taken away. > > And not all vendors even use syscall interception, not even across platforms, > of which you sound so sure about. You can't even scan something which is not > in your namespace if you are at the syscall level. And you can't catch things > like kernel nfsd. No, syscall interception is not really appropriate at all. The "Anti-Malware" industry is just snake oil anyway. I think the proper approach to support it is just to add various no-op exports claim to do something and all the people requiring anti-virus on Linux will be just as happy with it. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 23 Sep 2009 09:39:33 +0100 Tvrtko Ursulin <tvrtko.ursulin@sophos.com> wrote: > Lived with it because there was no other option. We used LSM while it > was available for modules but then it was taken away. ... at which point you could have submitted your LSM module for inclusion... you'd be the first (and only?) Anti Virus vendor that would be in the mainline kernel.. speaking of competitive advantage, coming out of the box in all distributions. sadly this road hasn't been chosen....
On Wed, 23 Sep 2009, Tvrtko Ursulin wrote: > Lived with it because there was no other option. We used LSM while it was > available for modules but then it was taken away. > > And not all vendors even use syscall interception, not even across platforms, > of which you sound so sure about. You can't even scan something which is not > in your namespace if you are at the syscall level. And you can't catch things > like kernel nfsd. No, syscall interception is not really appropriate at all. Really? And *if* namespaces were the problem for the devices you were targeting, what prevented you to resolving the object and offering a stream to userspace? In *your* module, hosting at the same time all the other logic required for it (caches, whitelists, etc...), instead of pushing this stuff into the kernel. WRT to the "other" system, never said they were using syscall interception, if you read carefully. I said that minifilters typically sends path names to userspace, which might drive you in the pitfall Andreas was describing. - Davide -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 23 Sep 2009, hch@infradead.org wrote: > On Wed, Sep 23, 2009 at 09:39:33AM +0100, Tvrtko Ursulin wrote: > > Lived with it because there was no other option. We used LSM while it was > > available for modules but then it was taken away. > > > > And not all vendors even use syscall interception, not even across platforms, > > of which you sound so sure about. You can't even scan something which is not > > in your namespace if you are at the syscall level. And you can't catch things > > like kernel nfsd. No, syscall interception is not really appropriate at all. > > The "Anti-Malware" industry is just snake oil anyway. I think the > proper approach to support it is just to add various no-op exports claim > to do something and all the people requiring anti-virus on Linux will be > just as happy with it. The fear is that this becomes a trojan horse (no pun intended) for more and more hooks and "stuff", driven by we-really-need-this-too and we-really-need-that-too. And once something it's in, it's harder to say no, under the pressure of offering a "limited solution". This ws the reason I threw the syscall tracing thing in, so they have a low level generic hook, and they cam knock themselves out in their module (might need a few exports, but that's about it). - Davide -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wednesday 23 September 2009 12:32:32 Arjan van de Ven wrote: > On Wed, 23 Sep 2009 09:39:33 +0100 > > Tvrtko Ursulin <tvrtko.ursulin@sophos.com> wrote: > > Lived with it because there was no other option. We used LSM while it > > was available for modules but then it was taken away. > > ... at which point you could have submitted your LSM module for > inclusion... you'd be the first (and only?) Anti Virus vendor that > would be in the mainline kernel.. speaking of competitive advantage, > coming out of the box in all distributions. > > sadly this road hasn't been chosen.... It has, but since what we had wasn't acceptable the road was long and it turned into fanotify. Which is not only about anti-malware (hello Christoph) so I think this sub-thread is going off-topic. Tvrtko -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wednesday 23 September 2009 16:26:49 Davide Libenzi wrote: > On Wed, 23 Sep 2009, Tvrtko Ursulin wrote: > > Lived with it because there was no other option. We used LSM while it was > > available for modules but then it was taken away. > > > > And not all vendors even use syscall interception, not even across > > platforms, of which you sound so sure about. You can't even scan > > something which is not in your namespace if you are at the syscall level. > > And you can't catch things like kernel nfsd. No, syscall interception is > > not really appropriate at all. > > Really? > And *if* namespaces were the problem for the devices you were targeting, > what prevented you to resolving the object and offering a stream to > userspace? You are right, nothing really, we even do it like that today. But what about other interested users? > In *your* module, hosting at the same time all the other logic required > for it (caches, whitelists, etc...), instead of pushing this stuff into > the kernel. > WRT to the "other" system, never said they were using syscall > interception, if you read carefully. I said that minifilters typically > sends path names to userspace, which might drive you in the pitfall > Andreas was describing. Yeah, you could do something like kauth on OSX, which is I guess similar to LSM, which was turned off for out of tree. And now you want to push users of fanotify out of tree, so what should it be? In tree bad, out of tree bad? Tvrtko -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2009-09-23 at 13:32 +0200, Arjan van de Ven wrote: > On Wed, 23 Sep 2009 09:39:33 +0100 > Tvrtko Ursulin <tvrtko.ursulin@sophos.com> wrote: > > > Lived with it because there was no other option. We used LSM while it > > was available for modules but then it was taken away. > > ... at which point you could have submitted your LSM module for > inclusion... you'd be the first (and only?) Anti Virus vendor that > would be in the mainline kernel.. speaking of competitive advantage, > coming out of the box in all distributions. And users would be left in a situation between choosing an LSM which actually does in provable ways increase security and using an AV scanner. Until magic solves the LSM stacking problem (it's been tried, no magic) I don't think any distro wants AV vendors as LSMs. -Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 23 Sep 2009, Tvrtko Ursulin wrote: > Yeah, you could do something like kauth on OSX, which is I guess similar to > LSM, which was turned off for out of tree. And now you want to push users of > fanotify out of tree, so what should it be? In tree bad, out of tree bad? As I said before, the good of a syscall tracing approach, is that it is a completely generic mechanism (extensible for other kind of hooks too), with minimal kernel impact, while allowing its module-users to stuff all the code they want in the part that it's their responsibility. So that a "we need this too" gets translated to "just do it in your code", instead of a request to add more stuff into the kernel, and maybe altering the userspace access interface (which is always painful). - Davide -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Sep 23, 2009 at 11:51:37AM -0400, Eric Paris wrote: > And users would be left in a situation between choosing an LSM which > actually does in provable ways increase security and using an AV > scanner. Sounds like a good thing, no? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Sep 23, 2009 at 08:35:18AM -0700, Davide Libenzi wrote: > The fear is that this becomes a trojan horse (no pun intended) for more > and more hooks and "stuff", driven by we-really-need-this-too and > we-really-need-that-too. And once something it's in, it's harder to say no, > under the pressure of offering a "limited solution". > This ws the reason I threw the syscall tracing thing in, so they have a > low level generic hook, and they cam knock themselves out in their module > (might need a few exports, but that's about it). Replacing idiotify with a saner interface is a good goal. I just don't think we should take the stakes of the snake oil industry too serious in it. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/include/linux/socket.h b/include/linux/socket.h index 3b461df..e03f47b 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -195,7 +195,8 @@ struct ucred { #define AF_ISDN 34 /* mISDN sockets */ #define AF_PHONET 35 /* Phonet sockets */ #define AF_IEEE802154 36 /* IEEE802154 sockets */ -#define AF_MAX 37 /* For now.. */ +#define AF_FANOTIFY 37 /* fscking all access sockets */ +#define AF_MAX 38 /* For now.. */ /* Protocol families, same as address families. */ #define PF_UNSPEC AF_UNSPEC @@ -235,6 +236,7 @@ struct ucred { #define PF_ISDN AF_ISDN #define PF_PHONET AF_PHONET #define PF_IEEE802154 AF_IEEE802154 +#define PF_FANOTIFY AF_FANOTIFY #define PF_MAX AF_MAX /* Maximum queue length specifiable by listen. */ @@ -306,6 +308,7 @@ struct ucred { #define SOL_PNPIPE 275 #define SOL_RDS 276 #define SOL_IUCV 277 +#define SOL_FANOTIFY 278 /* IPX options */ #define IPX_TYPE 1 diff --git a/net/core/sock.c b/net/core/sock.c index 30d5446..28a99bf 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -155,7 +155,7 @@ static const char *const af_family_key_strings[AF_MAX+1] = { "sk_lock-27" , "sk_lock-28" , "sk_lock-AF_CAN" , "sk_lock-AF_TIPC" , "sk_lock-AF_BLUETOOTH", "sk_lock-IUCV" , "sk_lock-AF_RXRPC" , "sk_lock-AF_ISDN" , "sk_lock-AF_PHONET" , - "sk_lock-AF_IEEE802154", + "sk_lock-AF_IEEE802154", "sk_lock-AF_FANOTIFY", "sk_lock-AF_MAX" }; static const char *const af_family_slock_key_strings[AF_MAX+1] = { @@ -171,7 +171,7 @@ static const char *const af_family_slock_key_strings[AF_MAX+1] = { "slock-27" , "slock-28" , "slock-AF_CAN" , "slock-AF_TIPC" , "slock-AF_BLUETOOTH", "slock-AF_IUCV" , "slock-AF_RXRPC" , "slock-AF_ISDN" , "slock-AF_PHONET" , - "slock-AF_IEEE802154", + "slock-AF_IEEE802154", "slock=AF_FANOTIFY", "slock-AF_MAX" }; static const char *const af_family_clock_key_strings[AF_MAX+1] = { @@ -187,7 +187,7 @@ static const char *const af_family_clock_key_strings[AF_MAX+1] = { "clock-27" , "clock-28" , "clock-AF_CAN" , "clock-AF_TIPC" , "clock-AF_BLUETOOTH", "clock-AF_IUCV" , "clock-AF_RXRPC" , "clock-AF_ISDN" , "clock-AF_PHONET" , - "clock-AF_IEEE802154", + "clock-AF_IEEE802154", "clock-AF_FANOTIFY", "clock-AF_MAX" };
fanotify's user interface uses a custom socket (it doesn't use netlink since work must be done in the context of the receive side of the socket) This patch simply defines the fanotify socket number declarations. The actual implementation of the socket is in a later patch. Signed-off-by: Eric Paris <eparis@redhat.com> --- include/linux/socket.h | 5 ++++- net/core/sock.c | 6 +++--- 2 files changed, 7 insertions(+), 4 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html