Message ID:    20130408154519.18177.57709.stgit@localhost
State:         Rejected, archived
Delegated to:  David Miller
From: Paul Moore <pmoore@redhat.com>
Date: Mon, 08 Apr 2013 11:45:19 -0400

> Commit 90ba9b1986b5ac4b2d184575847147ea7c4280a2 converted
> tcp_make_synack() to use alloc_skb() directly instead of calling
> sock_wmalloc(), the goal being the elimination of two atomic
> operations. Unfortunately, in doing so the change broke certain
> SELinux/NetLabel configurations by no longer correctly assigning
> the sock to the outgoing packet.
>
> This patch fixes this regression by doing the skb->sk assignment
> directly inside tcp_make_synack().
>
> Reported-by: Miroslav Vadkerti <mvadkert@redhat.com>
> Signed-off-by: Paul Moore <pmoore@redhat.com>

Setting skb->sk without the destructor results in an SKB that can live
potentially forever with a stale reference to a destroyed socket.

You cannot fix the problem in this way.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2013-04-08 at 11:45 -0400, Paul Moore wrote:
> Commit 90ba9b1986b5ac4b2d184575847147ea7c4280a2 converted
> tcp_make_synack() to use alloc_skb() directly instead of calling
> sock_wmalloc(), the goal being the elimination of two atomic
> operations. Unfortunately, in doing so the change broke certain
> SELinux/NetLabel configurations by no longer correctly assigning
> the sock to the outgoing packet.
>
> This patch fixes this regression by doing the skb->sk assignment
> directly inside tcp_make_synack().
>
> Reported-by: Miroslav Vadkerti <mvadkert@redhat.com>
> Signed-off-by: Paul Moore <pmoore@redhat.com>
> ---
>  net/ipv4/tcp_output.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 5d0b438..23cc295 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2705,6 +2705,8 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
>  		dst_release(dst);
>  		return NULL;
>  	}
> +	skb->sk = sk;
> +

Hmm... Keeping a pointer on a socket without taking a refcount is not
going to work.

We are trying to make the stack scale, so you need to add a SELinux call
to take a refcount only if needed.

That is: if SELinux is not used, we don't need to slow down the stack.
On Monday, April 08, 2013 12:14:34 PM David Miller wrote:
> From: Paul Moore <pmoore@redhat.com>
> Date: Mon, 08 Apr 2013 11:45:19 -0400
>
> > Commit 90ba9b1986b5ac4b2d184575847147ea7c4280a2 converted
> > tcp_make_synack() to use alloc_skb() directly instead of calling
> > sock_wmalloc(), the goal being the elimination of two atomic
> > operations. Unfortunately, in doing so the change broke certain
> > SELinux/NetLabel configurations by no longer correctly assigning
> > the sock to the outgoing packet.
> >
> > This patch fixes this regression by doing the skb->sk assignment
> > directly inside tcp_make_synack().
> >
> > Reported-by: Miroslav Vadkerti <mvadkert@redhat.com>
> > Signed-off-by: Paul Moore <pmoore@redhat.com>
>
> Setting skb->sk without the destructor results in an SKB that can live
> potentially forever with a stale reference to a destroyed socket.
>
> You cannot fix the problem in this way.

Okay, no worries, I'll work on v2. For some reason I missed the
destructor assignment in skb_set_owner_w(); I guess I was spending so
much time hunting around looking for the missing skb->sk assignment that
once I found it I declared victory ... a bit too soon.

Looking at the code again, I think the right solution is to call
skb_set_owner_w() instead of doing the assignment directly, but that
starts to bring us back to sock_wmalloc(force == 1), which gets back to
Eric's comments ... (below) ...

On Monday, April 08, 2013 09:19:23 AM Eric Dumazet wrote:
> Keeping a pointer on a socket without taking a refcount is not going to
> work.
>
> We are trying to make the stack scale, so you need to add a selinux call
> to take a ref count only if needed.
>
> That is: if SELinux is not used, we don't need to slow down the stack.

Contrary to popular belief, my goal is not to destroy the scalability
and/or performance of our network stack; I just want to make sure we
have a quality network stack that is not only fast and scalable, but
also preserves the security functionality that makes Linux attractive to
a number of users. To that end, we could put an #ifdef in the middle of
tcp_make_synack(), but that seems very ugly to me and I think it sets a
bad precedent for the network stack and the kernel as a whole.

So a question for Dave, et al. - would you prefer that I fix this by:

1. Restore the original sock_wmalloc() call?
2. Keep things as-is with alloc_skb() but add skb_set_owner_w()?
3. Add an #ifdef depending on SELinux (probably the LSM in general to be
   safe) and use sock_wmalloc() if enabled, alloc_skb() if not?

I guess I'm leaning towards #1 for the sake of simplicity, but I'd be
happy with either #1 or #2. The #3 option seems like a hack and makes me
a bit afraid for the future. I am also open to suggestions; to me, the
most important thing is that we fix this regression, I'm less concerned
about how we do it.
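For readers following along, here is a deliberately simplified user-space model (mine, not kernel code) of the ownership issue David raised: a bare `skb->sk = sk` installs no destructor, so nothing pins the socket or releases its write-memory charge, whereas a skb_set_owner_w()-style helper charges sk_wmem_alloc and installs sock_wfree() to uncharge it when the skb dies. The struct layouts below are toy stand-ins.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model: in the real kernel, sk->sk_wmem_alloc acts as the
 * reference that keeps the socket alive while skbs it owns are
 * still in flight; sock_wfree() drops that charge at skb free. */

struct sock {
    int wmem_alloc;     /* stand-in for sk->sk_wmem_alloc */
};

struct sk_buff {
    struct sock *sk;
    int truesize;
    void (*destructor)(struct sk_buff *);
};

static void sock_wfree(struct sk_buff *skb)
{
    skb->sk->wmem_alloc -= skb->truesize;   /* uncharge the socket */
}

static void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
{
    skb->sk = sk;
    skb->destructor = sock_wfree;
    sk->wmem_alloc += skb->truesize;        /* the charge pins sk */
}

static void kfree_skb(struct sk_buff *skb)
{
    if (skb->destructor)                    /* bare skb->sk = sk would */
        skb->destructor(skb);               /* skip this step entirely */
    free(skb);
}
```

The point of the model: with only `skb->sk = sk`, `kfree_skb()` has no destructor to run, so the socket's accounting never balances and the pointer can outlive the socket.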
On Mon, 2013-04-08 at 13:22 -0400, Paul Moore wrote:
> On Monday, April 08, 2013 12:14:34 PM David Miller wrote:
> > Setting skb->sk without the destructor results in an SKB that can live
> > potentially forever with a stale reference to a destroyed socket.
> >
> > You cannot fix the problem in this way.
>
> [...]
>
> So a question for Dave, et al. - would you prefer that I fix this by:
>
> 1. Restore the original sock_wmalloc() call?
> 2. Keep things as-is with alloc_skb() but add skb_set_owner_w()?
> 3. Add an #ifdef depending on SELinux (probably the LSM in general to
>    be safe) and use sock_wmalloc() if enabled, alloc_skb() if not?

Didn't we have the same issue with RST packets?

Please take a look at commit 3a7c384ffd57ef5fbd95f48edaa2ca4eb3d9f2ee
On Monday, April 08, 2013 10:36:26 AM Eric Dumazet wrote:
> On Mon, 2013-04-08 at 13:22 -0400, Paul Moore wrote:
> > [...]
> >
> > So a question for Dave, et al. - would you prefer that I fix this by:
> >
> > 1. Restore the original sock_wmalloc() call?
> > 2. Keep things as-is with alloc_skb() but add skb_set_owner_w()?
> > 3. Add an #ifdef depending on SELinux (probably the LSM in general to
> >    be safe) and use sock_wmalloc() if enabled, alloc_skb() if not?
>
> Didn't we have the same issue with RST packets?
>
> Please take a look at commit 3a7c384ffd57ef5fbd95f48edaa2ca4eb3d9f2ee

Sort of a similar problem, but not really the same. Also, arguably,
there is no real associated sock/socket for a RST, so orphaning the
packet makes sense. In the case of a SYNACK we can, and should,
associate the packet with a sock/socket.
On Mon, 2013-04-08 at 13:40 -0400, Paul Moore wrote:
> Sort of a similar problem, but not really the same. Also, arguably,
> there is no real associated sock/socket for a RST, so orphaning the
> packet makes sense. In the case of a SYNACK we can, and should,
> associate the packet with a sock/socket.

What is the intent?

This kind of requirement has a huge cost, and that's why I want a hook
instead of a 'generic thing'.

Please read:
http://workshop.netfilter.org/2013/wiki/images/3/3a/NFWS-TCP.pdf
On Mon, 2013-04-08 at 10:47 -0700, Eric Dumazet wrote:
> This kind of requirement has a huge cost, and that's why I want a hook
> instead of a 'generic thing'

I meant "I would like ..."

We for example have security_inet_csk_clone().

We could have security_skb_owned_by(skb, sk).

I probably can send a patch, it seems quite easy.
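As a sketch of what Eric is proposing (the names and wiring here are my guess, modeled on security_inet_csk_clone(); the real patch may differ): the core stack makes one cheap call at the tcp_make_synack() ownership point, which is a no-op unless an LSM that actually needs skb->sk registers an implementation. A compilable toy model:

```c
#include <assert.h>
#include <stddef.h>

struct sock { int refcnt; };
struct sk_buff { struct sock *sk; };

/* Default implementation (no LSM that cares): do nothing, so the
 * fast path pays only the cost of an indirect call. */
static void cap_skb_owned_by(struct sk_buff *skb, struct sock *sk)
{
    (void)skb; (void)sk;
}

/* A SELinux-style implementation: record the owner and pin it.
 * refcnt++ is a stand-in for taking a real sock reference. */
static void selinux_skb_owned_by(struct sk_buff *skb, struct sock *sk)
{
    skb->sk = sk;
    sk->refcnt++;
}

/* Stand-in for the security_ops dispatch in the LSM framework. */
static void (*security_skb_owned_by)(struct sk_buff *, struct sock *) =
    cap_skb_owned_by;

/* What the tcp_make_synack() call site would look like: */
static void make_synack(struct sk_buff *skb, struct sock *sk)
{
    security_skb_owned_by(skb, sk);
}
```

This is the shape of Eric's trade-off: configurations without the LSM see essentially zero cost, while SELinux/NetLabel users pay for the reference only when they need the sock association.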
Hello.

On 04/08/2013 07:45 PM, Paul Moore wrote:
> Commit 90ba9b1986b5ac4b2d184575847147ea7c4280a2

Please also specify that commit's summary line in parens (DaveM seems to
also require double quotes inside the parens).

> converted
> tcp_make_synack() to use alloc_skb() directly instead of calling
> sock_wmalloc(), the goal being the elimination of two atomic
> operations. Unfortunately, in doing so the change broke certain
> SELinux/NetLabel configurations by no longer correctly assigning
> the sock to the outgoing packet.
>
> [...]

WBR, Sergei
On Monday, April 08, 2013 10:47:47 AM Eric Dumazet wrote:
> On Mon, 2013-04-08 at 13:40 -0400, Paul Moore wrote:
> > Sort of a similar problem, but not really the same. Also, arguably,
> > there is no real associated sock/socket for a RST, so orphaning the
> > packet makes sense. In the case of a SYNACK we can, and should,
> > associate the packet with a sock/socket.
>
> What is the intent?

We have to do a number of painful things in SELinux because we aren't
allowed a proper security blob (void *security) in a sk_buff. One of
those things is using the security blob in the originating sock as a
stand-in for the packet's own security blob; as a result, when skb->sk
is not set we have to make some guesses about the security attributes of
the packet. We do have methods to handle packets without a valid sock
pointer, but those are used primarily for non-local packets (e.g.
forwarded traffic) and some limited local use cases (e.g. TCP RST
packets).

Also, don't mention skb->secmark; unfortunately that was used for
something else and doesn't apply to this conversation.

> On Mon, 2013-04-08 at 10:47 -0700, Eric Dumazet wrote:
> > This kind of requirement has a huge cost, and that's why I want a
> > hook instead of a 'generic thing'
>
> I meant "I would like ..."
>
> We for example have security_inet_csk_clone()
>
> We could have security_skb_owned_by(skb, sk)
>
> I probably can send a patch, it seems quite easy.

It seems a bit fragile to me, perhaps even hacky, but in some ways I
guess it isn't any more fragile than relying on skb->sk - as this
problem demonstrates. My other concern is that adding this hook
*correctly* is likely to touch a lot of files and may be a bit much so
late in the 3.9 cycle; Dave, what say you?

Assuming that this is the preferred option, Eric, would you be open to
reverting your patch for 3.9 with the assumption that I promise to add
the hook for 3.10? You've got my word that I'll have it done ASAP.
On Monday, April 08, 2013 10:03:33 PM Sergei Shtylyov wrote:
> Hello.
>
> On 04/08/2013 07:45 PM, Paul Moore wrote:
> > Commit 90ba9b1986b5ac4b2d184575847147ea7c4280a2
>
> Please also specify that commit's summary line in parens (DaveM seems
> to also require double quotes inside the parens).

Noted for future reference.
On Mon, 2013-04-08 at 14:12 -0400, Paul Moore wrote:
> It seems a bit fragile to me, perhaps even hacky, but in some ways I
> guess it isn't any more fragile than relying on skb->sk - as this
> problem demonstrates. My other concern is that adding this hook
> *correctly* is likely to touch a lot of files and may be a bit much so
> late in the 3.9 cycle; Dave, what say you?

I don't get it, 90ba9b1986b5ac4b2d18 was in 3.6, why do you care about
3.9?

I am preparing a fix right now. Not a revert, thank you.
On Monday, April 08, 2013 11:21:43 AM Eric Dumazet wrote:
> I don't get it, 90ba9b1986b5ac4b2d18 was in 3.6, why do you care about
> 3.9?

I wasn't made aware of the particular bug until just recently. When I
see a regression I generally try to get it fixed as soon as possible.

> I am preparing a fix right now. Not a revert, thank you.

I guess we'll have to wait and see then; the more I think about the new
hook you proposed, the less enthused I am about it.

I'm still curious to hear what Dave has to say on this.
On Monday, April 08, 2013 02:12:01 PM Paul Moore wrote:
> On Monday, April 08, 2013 10:47:47 AM Eric Dumazet wrote:
> > What is the intent?
>
> We have to do a number of painful things in SELinux because we aren't
> allowed a proper security blob (void *security) in a sk_buff. One of
> those things ...

Actually, I wonder if this problem means it is a good time to revisit
the no-security-blob-in-sk_buff decision? The management of the blob
would be hidden behind the LSM hooks like everything else, and it would
have a number of advantages, including making problems like the one we
are seeing here easier to fix or avoid entirely. It would also make
life much easier for those of us working on LSM stuff, and it would pave
the way for including network access controls in the stacked-LSM stuff
Casey is kicking around.

I'm aware of all the arguments against, but thought it would be worth
bringing it up again, if for no other reason than I haven't heard enough
shouting yet today :)
On Mon, 2013-04-08 at 14:26 -0400, Paul Moore wrote:
> I guess we'll have to wait and see then; the more I think about the new
> hook you proposed, the less enthused I am about it.
>
> I'm still curious to hear what Dave has to say on this.

90ba9b1986b5ac4b2 is 10 months old, and nobody complained until today?
This sounds like a very small issue to me; a revert is simply overkill.

Let's go forward instead of applying blind fixes.
From: Paul Moore <pmoore@redhat.com>
Date: Mon, 08 Apr 2013 13:22:50 -0400

> Contrary to popular belief, my goal is not to destroy the scalability
> and/or performance of our network stack; I just want to make sure we
> have a quality network stack that is not only fast and scalable, but
> also preserves the security functionality that makes Linux attractive
> to a number of users.

Get the violin out.

> To that end, we could put an #ifdef in the middle of
> tcp_make_synack(), but that seems very ugly to me and I think it sets
> a bad precedent for the network stack and the kernel as a whole.

Not an ifdef, a run time state test.
On Monday, April 08, 2013 02:32:00 PM Paul Moore wrote:
> Actually, I wonder if this problem means it is a good time to revisit
> the no-security-blob-in-sk_buff decision? The management of the blob
> would be hidden behind the LSM hooks like everything else, and it
> would have a number of advantages, including making problems like the
> one we are seeing here easier to fix or avoid entirely. It would also
> make life much easier for those of us working on LSM stuff, and it
> would pave the way for including network access controls in the
> stacked-LSM stuff Casey is kicking around.

No comment, or am I just too anxious?
From: Paul Moore <pmoore@redhat.com>
Date: Mon, 08 Apr 2013 17:10:43 -0400

> On Monday, April 08, 2013 02:32:00 PM Paul Moore wrote:
>> Actually, I wonder if this problem means it is a good time to revisit
>> the no-security-blob-in-sk_buff decision? [...]
>
> No comment, or am I just too anxious?

There is no way I'm putting LSM overhead into sk_buff, it's already too
big.

I didn't comment because it wasn't worth a comment, but since you're
pushing me on the issue, I'll make the no explicit.
On Monday, April 08, 2013 05:15:12 PM David Miller wrote:
> There is no way I'm putting LSM overhead into sk_buff, it's already
> too big.
>
> I didn't comment because it wasn't worth a comment, but since you're
> pushing me on the issue, I'll make the no explicit.

If the void pointer is wrapped in an #ifdef (plenty of precedent for
that) and the management of that pointer is handled by LSM hooks, why is
it a concern? I apologize for pushing on the issue, but I'm having a
hard time reconciling the reason for the "no" with the
comments/decisions about the regression fix; at present there seems to
be a level of contradiction between the two.
From: Paul Moore <pmoore@redhat.com>
Date: Mon, 08 Apr 2013 17:24:50 -0400

> If the void pointer is wrapped in an #ifdef (plenty of precedent for
> that) and the management of that pointer is handled by LSM hooks, why
> is it a concern? I apologize for pushing on the issue, but I'm having
> a hard time reconciling the reason for the "no" with the
> comments/decisions about the regression fix; at present there seems to
> be a level of contradiction between the two.

8 bytes times however many millions of packets per second we can process
on a big machine; you do the math.

It's memory, less cache locality, etc. etc. etc.

It's the most important data structure in the entire networking stack,
and every single byte matters.

I want the overhead to be your problem, so that only users of your stuff
eat the overhead, rather than everyone.

And don't even mention ifdefs, that's bogus, because every distribution
turns every option on; 99.9999999% of users will therefore not see the
savings.

Really, this is a dead topic, let's move on.

Thanks.
On Mon, 2013-04-08 at 17:24 -0400, Paul Moore wrote:
> On Monday, April 08, 2013 05:15:12 PM David Miller wrote:
> > There is no way I'm putting LSM overhead into sk_buff, it's already
> > too big.
>
> If the void pointer is wrapped in an #ifdef (plenty of precedent for
> that) and the management of that pointer is handled by LSM hooks, why
> is it a concern? I apologize for pushing on the issue, but I'm having
> a hard time reconciling the reason for the "no" with the
> comments/decisions about the regression fix; at present there seems to
> be a level of contradiction between the two.

Most Linux users are running distribution kernels with one or more LSMs
built in. Anything you make dependent on CONFIG_SECURITY or a similar
generic symbol will have a cost for all those users, whether or not they
actually use the LSM.

Ben.
On Monday, April 08, 2013 05:33:25 PM David Miller wrote:
> 8 bytes times however many millions of packets per second we can
> process on a big machine; you do the math.
>
> It's memory, less cache locality, etc. etc. etc.
>
> It's the most important data structure in the entire networking stack,
> and every single byte matters.
>
> I want the overhead to be your problem, so that only users of your
> stuff eat the overhead, rather than everyone.

Okay, if the objection is really just one of structure size and not the
hooks, what if I did the work to consolidate the skb->secmark and
skb->sp fields into a new structure/pointer? Assuming it wasn't too
painful, it would be a net reduction of four bytes. If that worked,
would you have an objection to us adding an LSM security blob to this
new structure?
From: Paul Moore <pmoore@redhat.com>
Date: Mon, 08 Apr 2013 18:01:56 -0400

> Okay, if the objection is really just one of structure size and not
> the hooks, what if I did the work to consolidate the skb->secmark and
> skb->sp fields into a new structure/pointer? Assuming it wasn't too
> painful, it would be a net reduction of four bytes. If that worked,
> would you have an objection to us adding an LSM security blob to this
> new structure?

'sp' is separate from LSM, and making it indirect would hurt IPsec
performance.

Please, really, just drop this.

Thanks.
On 4/8/2013 2:33 PM, David Miller wrote:
> From: Paul Moore <pmoore@redhat.com>
> Date: Mon, 08 Apr 2013 17:24:50 -0400
>
> > If the void pointer is wrapped in an #ifdef (plenty of precedent for
> > that) and the management of that pointer is handled by LSM hooks,
> > why is it a concern? [...]
>
> 8 bytes times however many millions of packets per second we can
> process on a big machine; you do the math.

OK, let's do the math.

First off, it's 4 bytes, not 8. It replaces the secmark. Your increased
memory usage is going to be

    4 bytes/packet * M packets/second * N seconds

where M is the rate at which you're processing packets and N is the
length of time it takes to process a packet.

Let's pretend we have an embedded system that does nothing but send
128-byte packets on a 10Gb port. That's 10M packets/second. If it takes
a full second to process a packet, the overhead is 40MB for that second.
I have it on good authority that packets can be processed in
considerably less time than that. The real number is more like 0.05
seconds, which means your actual overhead is more like 1MB.

These are dumbed-down calculations; I am not a memory usage expert. I am
convinced that "real" calculations are going to get similar numbers. I
am, of course, willing to be swayed by evidence that I am wrong.

Compare that to the overhead associated with using CIPSO on packets that
never leave the box.

> It's memory, less cache locality, etc. etc. etc.
>
> It's the most important data structure in the entire networking stack,
> and every single byte matters.
>
> I want the overhead to be your problem, so that only users of your
> stuff eat the overhead, rather than everyone.
>
> And don't even mention ifdefs, that's bogus, because every
> distribution turns every option on; 99.9999999% of users will
> therefore not see the savings.
>
> Really, this is a dead topic, let's move on.
On Mon, 2013-04-08 at 16:40 -0700, Casey Schaufler wrote:
> OK, let's do the math.
>
> First off, it's 4 bytes, not 8. It replaces the secmark.
> Your increased memory usage is going to be
>
> 	4 bytes/packet * M packets/second * N seconds
>
> Where M is the rate at which you're processing packets and
> N is the length of time it takes to process a packet.
>
> Let's pretend we have an embedded system that does nothing but send
> 128 byte packets on a 10Gb port. That's 10M packets/second. If it
> takes a full second to process a packet the overhead is 40MB for that
> second. I have it on good authority that packets can be processed
> in considerably less time than that. The real number is more like
> 0.05 seconds. That means your actual overhead is more like 1MB.
>
> These are dumbed-down calculations. I am not a memory usage expert.
> I am convinced that "real" calculations are going to get similar
> numbers. I am, of course, willing to be swayed by evidence that I
> am wrong.
>
> Compare that to the overhead associated with using CIPSO on packets
> that never leave the box.

The math is not that simple, and it's not about the size of sk_buff, since
the number of in-flight skbs should be quite small.

It's the time to init this memory for _every_ packet.

sizeof(sk_buff) is 0xf8, very close to crossing the 256 byte limit.

Add a single _byte_ and it becomes a matter of adding a _cache_ line, and
that's a 25% cost, assuming 64-byte cache lines.

So instead of processing 10M packets per second, we would process 9M
packets per second, or maybe less.

Yes, 256 bytes per sk_buff, this is the current insane situation.
(Not counting the struct skb_shared_info, adding at least one additional
cache line)
On 4/8/2013 5:33 PM, Eric Dumazet wrote:
> On Mon, 2013-04-08 at 16:40 -0700, Casey Schaufler wrote:
>
>> OK, let's do the math.
>>
>> First off, it's 4 bytes, not 8. It replaces the secmark.
>> Your increased memory usage is going to be
>>
>> 	4 bytes/packet * M packets/second * N seconds
>>
>> Where M is the rate at which you're processing packets and
>> N is the length of time it takes to process a packet.
>>
>> Let's pretend we have an embedded system that does nothing but send
>> 128 byte packets on a 10Gb port. That's 10M packets/second. If it
>> takes a full second to process a packet the overhead is 40MB for that
>> second. I have it on good authority that packets can be processed
>> in considerably less time than that. The real number is more like
>> 0.05 seconds. That means your actual overhead is more like 1MB.
>>
>> These are dumbed-down calculations. I am not a memory usage expert.
>> I am convinced that "real" calculations are going to get similar
>> numbers. I am, of course, willing to be swayed by evidence that I
>> am wrong.
>>
>> Compare that to the overhead associated with using CIPSO on packets
>> that never leave the box.
>
> The math is not that simple,

I am willing to accept that. I am willing to be educated.
I would be interested to see what the "real" maths are, and
how far off my dumb version might actually be.

> and it's not about the size of sk_buff, since the
> number of in-flight skbs should be quite small.

The reason we're told that there can't be a blob pointer in
the sk_buff is that it increases the size of the sk_buff.
Yes, it *is* about the size.

> It's the time to init this memory for _every_ packet.

Which is a function of the size, no?

> sizeof(sk_buff) is 0xf8, very close to crossing the 256 byte limit.

0xf8 + 0x4 = 0xfc
248 + 4 = 252

> Add a single _byte_ and it becomes a matter of adding a _cache_ line,
> and that's a 25% cost, assuming 64-byte cache lines.

I don't see that with adding 4 bytes. Again, I'm willing to be
educated if I'm wrong.

> So instead of processing 10M packets per second, we would process 9M
> packets per second, or maybe less.
>
> Yes, 256 bytes per sk_buff, this is the current insane situation.
> (Not counting the struct skb_shared_info, adding at least one additional
> cache line)

Sorry, but I don't see what's insane with either the current 248 byte
sk_buff or a 252 byte sk_buff. As always, I am willing to be educated.

Thank you.
On Mon, 2013-04-08 at 17:59 -0700, Casey Schaufler wrote:
> I don't see that with adding 4 bytes. Again, I'm willing to be
> educated if I'm wrong.

Feel free to add 4 bytes without having the 'align to 8 bytes' problem
on 64 bit arches. Show us your patch.
On 4/8/2013 6:09 PM, Eric Dumazet wrote:
> On Mon, 2013-04-08 at 17:59 -0700, Casey Schaufler wrote:
>
>> I don't see that with adding 4 bytes. Again, I'm willing to be
>> educated if I'm wrong.
>
> Feel free to add 4 bytes without having the 'align to 8 bytes' problem
> on 64 bit arches. Show us your patch.

Recall that it's replacing an existing 4 byte value with an 8 byte value.
My compiler days were quite short and long ago, but it would seem that
an 8 byte value ought not have an 'align to 8 bytes' problem.

Again, I'm willing to be educated.
On Monday, April 08, 2013 06:24:59 PM Casey Schaufler wrote:
> On 4/8/2013 6:09 PM, Eric Dumazet wrote:
> > On Mon, 2013-04-08 at 17:59 -0700, Casey Schaufler wrote:
> >> I don't see that with adding 4 bytes. Again, I'm willing to be
> >> educated if I'm wrong.
> >
> > Feel free to add 4 bytes without having the 'align to 8 bytes' problem
> > on 64 bit arches. Show us your patch.
>
> Recall that it's replacing an existing 4 byte value with an 8 byte value.
> My compiler days were quite short and long ago, but it would seem that
> an 8 byte value ought not have an 'align to 8 bytes' problem.
>
> Again, I'm willing to be educated.

Armed with a cup of coffee I took a look at the sk_buff structure this
morning with the pahole tool; using the current sk_buff, if we turn on all
the #ifdefs here is what I see on x86_64:

struct sk_buff {
	struct sk_buff *        next;                  /*   0    8 */
	struct sk_buff *        prev;                  /*   8    8 */
	ktime_t                 tstamp;                /*  16    8 */
	struct sock *           sk;                    /*  24    8 */
	struct net_device *     dev;                   /*  32    8 */
	char                    cb[48];                /*  40   48 */
	/* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
	long unsigned int       _skb_refdst;           /*  88    8 */
	struct sec_path *       sp;                    /*  96    8 */
	unsigned int            len;                   /* 104    4 */
	unsigned int            data_len;              /* 108    4 */
	__u16                   mac_len;               /* 112    2 */
	__u16                   hdr_len;               /* 114    2 */
	union {
		__wsum          csum;                  /*        4 */
		struct {
			__u16   csum_start;            /* 116    2 */
			__u16   csum_offset;           /* 118    2 */
		};                                     /*        4 */
	};                                             /* 116    4 */
	__u32                   priority;              /* 120    4 */
	int                     flags1_begin[0];       /* 124    0 */
	__u8                    local_df:1;            /* 124:7  1 */
	__u8                    cloned:1;              /* 124:6  1 */
	__u8                    ip_summed:2;           /* 124:4  1 */
	__u8                    nohdr:1;               /* 124:3  1 */
	__u8                    nfctinfo:3;            /* 124:0  1 */
	__u8                    pkt_type:3;            /* 125:5  1 */
	__u8                    fclone:2;              /* 125:3  1 */
	__u8                    ipvs_property:1;       /* 125:2  1 */
	__u8                    peeked:1;              /* 125:1  1 */
	__u8                    nf_trace:1;            /* 125:0  1 */

	/* XXX 2 bytes hole, try to pack */

	/* --- cacheline 2 boundary (128 bytes) --- */
	int                     flags1_end[0];         /* 128    0 */
	__be16                  protocol;              /* 128    2 */

	/* XXX 6 bytes hole, try to pack */

	void                    (*destructor)(struct sk_buff *); /* 136 8 */
	struct nf_conntrack *   nfct;                  /* 144    8 */
	struct sk_buff *        nfct_reasm;            /* 152    8 */
	struct nf_bridge_info * nf_bridge;             /* 160    8 */
	int                     skb_iif;               /* 168    4 */
	__u32                   rxhash;                /* 172    4 */
	__u16                   vlan_tci;              /* 176    2 */
	__u16                   tc_index;              /* 178    2 */
	__u16                   tc_verd;               /* 180    2 */
	__u16                   queue_mapping;         /* 182    2 */
	int                     flags2_begin[0];       /* 184    0 */
	__u8                    ndisc_nodetype:2;      /* 184:6  1 */
	__u8                    pfmemalloc:1;          /* 184:5  1 */
	__u8                    ooo_okay:1;            /* 184:4  1 */
	__u8                    l4_rxhash:1;           /* 184:3  1 */
	__u8                    wifi_acked_valid:1;    /* 184:2  1 */
	__u8                    wifi_acked:1;          /* 184:1  1 */
	__u8                    no_fcs:1;              /* 184:0  1 */
	__u8                    head_frag:1;           /* 185:7  1 */
	__u8                    encapsulation:1;       /* 185:6  1 */

	/* XXX 6 bits hole, try to pack */
	/* XXX 2 bytes hole, try to pack */

	int                     flags2_end[0];         /* 188    0 */
	dma_cookie_t            dma_cookie;            /* 188    4 */
	/* --- cacheline 3 boundary (192 bytes) --- */
	__u32                   secmark;               /* 192    4 */
	union {
		__u32           mark;                  /*        4 */
		__u32           dropcount;             /*        4 */
		__u32           reserved_tailroom;     /*        4 */
	};                                             /* 196    4 */
	sk_buff_data_t          inner_transport_header; /* 200   8 */
	sk_buff_data_t          inner_network_header;  /* 208    8 */
	sk_buff_data_t          transport_header;      /* 216    8 */
	sk_buff_data_t          network_header;        /* 224    8 */
	sk_buff_data_t          mac_header;            /* 232    8 */
	sk_buff_data_t          tail;                  /* 240    8 */
	sk_buff_data_t          end;                   /* 248    8 */
	/* --- cacheline 4 boundary (256 bytes) --- */
	unsigned char *         head;                  /* 256    8 */
	unsigned char *         data;                  /* 264    8 */
	unsigned int            truesize;              /* 272    4 */
	atomic_t                users;                 /* 276    4 */

	/* size: 280, cachelines: 5, members: 62 */
	/* sum members: 270, holes: 3, sum holes: 10 */
	/* bit holes: 1, sum bit holes: 6 bits */
	/* last cacheline: 24 bytes */
};

It looks like there are some holes we might be able to capitalize on. If
we remove "secmark" (we can handle it inside a security blob) and move
"protocol" to after the flags2 bit field we can make an aligned 8 byte
hole for a security blob before "destructor". According to pahole the
structure size stays the same and the only field which moves to a
different cacheline is "dma_cookie", which moves from cacheline 2 to 3.
Here is the pahole output:

struct sk_buff_test {
	struct sk_buff *        next;                  /*   0    8 */
	struct sk_buff *        prev;                  /*   8    8 */
	ktime_t                 tstamp;                /*  16    8 */
	struct sock *           sk;                    /*  24    8 */
	struct net_device *     dev;                   /*  32    8 */
	char                    cb[48];                /*  40   48 */
	/* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
	long unsigned int       _skb_refdst;           /*  88    8 */
	struct sec_path *       sp;                    /*  96    8 */
	unsigned int            len;                   /* 104    4 */
	unsigned int            data_len;              /* 108    4 */
	__u16                   mac_len;               /* 112    2 */
	__u16                   hdr_len;               /* 114    2 */
	union {
		__wsum          csum;                  /*        4 */
		struct {
			__u16   csum_start;            /* 116    2 */
			__u16   csum_offset;           /* 118    2 */
		};                                     /*        4 */
	};                                             /* 116    4 */
	__u32                   priority;              /* 120    4 */
	int                     flags1_begin[0];       /* 124    0 */
	__u8                    local_df:1;            /* 124:7  1 */
	__u8                    cloned:1;              /* 124:6  1 */
	__u8                    ip_summed:2;           /* 124:4  1 */
	__u8                    nohdr:1;               /* 124:3  1 */
	__u8                    nfctinfo:3;            /* 124:0  1 */
	__u8                    pkt_type:3;            /* 125:5  1 */
	__u8                    fclone:2;              /* 125:3  1 */
	__u8                    ipvs_property:1;       /* 125:2  1 */
	__u8                    peeked:1;              /* 125:1  1 */
	__u8                    nf_trace:1;            /* 125:0  1 */

	/* XXX 2 bytes hole, try to pack */

	/* --- cacheline 2 boundary (128 bytes) --- */
	int                     flags1_end[0];         /* 128    0 */
	void *                  security;              /* 128    8 */
	void                    (*destructor)(struct sk_buff *); /* 136 8 */
	struct nf_conntrack *   nfct;                  /* 144    8 */
	struct sk_buff *        nfct_reasm;            /* 152    8 */
	struct nf_bridge_info * nf_bridge;             /* 160    8 */
	int                     skb_iif;               /* 168    4 */
	__u32                   rxhash;                /* 172    4 */
	__u16                   vlan_tci;              /* 176    2 */
	__u16                   tc_index;              /* 178    2 */
	__u16                   tc_verd;               /* 180    2 */
	__u16                   queue_mapping;         /* 182    2 */
	int                     flags2_begin[0];       /* 184    0 */
	__u8                    ndisc_nodetype:2;      /* 184:6  1 */
	__u8                    pfmemalloc:1;          /* 184:5  1 */
	__u8                    ooo_okay:1;            /* 184:4  1 */
	__u8                    l4_rxhash:1;           /* 184:3  1 */
	__u8                    wifi_acked_valid:1;    /* 184:2  1 */
	__u8                    wifi_acked:1;          /* 184:1  1 */
	__u8                    no_fcs:1;              /* 184:0  1 */
	__u8                    head_frag:1;           /* 185:7  1 */
	__u8                    encapsulation:1;       /* 185:6  1 */

	/* XXX 6 bits hole, try to pack */
	/* XXX 2 bytes hole, try to pack */

	int                     flags2_end[0];         /* 188    0 */
	__be16                  protocol;              /* 188    2 */

	/* XXX 2 bytes hole, try to pack */

	/* --- cacheline 3 boundary (192 bytes) --- */
	dma_cookie_t            dma_cookie;            /* 192    4 */
	union {
		__u32           mark;                  /*        4 */
		__u32           dropcount;             /*        4 */
		__u32           reserved_tailroom;     /*        4 */
	};                                             /* 196    4 */
	sk_buff_data_t          inner_transport_header; /* 200   8 */
	sk_buff_data_t          inner_network_header;  /* 208    8 */
	sk_buff_data_t          transport_header;      /* 216    8 */
	sk_buff_data_t          network_header;        /* 224    8 */
	sk_buff_data_t          mac_header;            /* 232    8 */
	sk_buff_data_t          tail;                  /* 240    8 */
	sk_buff_data_t          end;                   /* 248    8 */
	/* --- cacheline 4 boundary (256 bytes) --- */
	unsigned char *         head;                  /* 256    8 */
	unsigned char *         data;                  /* 264    8 */
	unsigned int            truesize;              /* 272    4 */
	atomic_t                users;                 /* 276    4 */

	/* size: 280, cachelines: 5, members: 62 */
	/* sum members: 274, holes: 3, sum holes: 6 */
	/* bit holes: 1, sum bit holes: 6 bits */
	/* last cacheline: 24 bytes */
};

As Casey already mentioned, if this isn't acceptable please help me
understand why.
On Tuesday, April 09, 2013 09:19:30 AM Paul Moore wrote:
> On Monday, April 08, 2013 06:24:59 PM Casey Schaufler wrote:
> > On 4/8/2013 6:09 PM, Eric Dumazet wrote:
> > > On Mon, 2013-04-08 at 17:59 -0700, Casey Schaufler wrote:
> > >> I don't see that with adding 4 bytes. Again, I'm willing to be
> > >> educated if I'm wrong.
> > >
> > > Feel free to add 4 bytes without having the 'align to 8 bytes' problem
> > > on 64 bit arches. Show us your patch.
> >
> > Recall that it's replacing an existing 4 byte value with an 8 byte
> > value. My compiler days were quite short and long ago, but it would
> > seem that an 8 byte value ought not have an 'align to 8 bytes' problem.
> >
> > Again, I'm willing to be educated.
>
> Armed with a cup of coffee I took a look at the sk_buff structure this
> morning with the pahole tool and using the current sk_buff if we turn on
> all the #ifdefs here is what I see on x86_64:
>
> struct sk_buff {
...
> 	/* size: 280, cachelines: 5, members: 62 */
> 	/* sum members: 270, holes: 3, sum holes: 10 */
> 	/* bit holes: 1, sum bit holes: 6 bits */
> 	/* last cacheline: 24 bytes */
> };
>
> It looks like there are some holes we might be able to capitalize on. If
> we remove "secmark" (we can handle it inside a security blob) and move
> "protocol" to after the flags2 bit field we can make an aligned 8 byte
> hole for a security blob before "destructor". According to pahole the
> structure size stays the same and the only field which moves to a
> different cacheline is "dma_cookie" which moves from cacheline 2 to 3.
> Here is the pahole output:
>
> struct sk_buff_test {
...
> 	/* size: 280, cachelines: 5, members: 62 */
> 	/* sum members: 274, holes: 3, sum holes: 6 */
> 	/* bit holes: 1, sum bit holes: 6 bits */
> 	/* last cacheline: 24 bytes */
> };
>
> As Casey already mentioned, if this isn't acceptable please help me
> understand why.

For the sake of completeness I also checked out the changes when compiled
for 32 bit and it was very much the same: same structure size and, in the
32 bit case, no field movement from one cacheline to another.
On Tue, 2013-04-09 at 09:19 -0400, Paul Moore wrote:
> As Casey already mentioned, if this isn't acceptable please help me
> understand why.

You see something which is not the reality. If you do such an analysis,
better do it properly, because any change you are going to submit will be
double-checked by people who really care.

sizeof(sk_buff) is not 280. (It is aligned to 320, because the cache line
is 64 bytes.)

Tell me what you have when doing:

	ls -l /sys/kernel/slab/skbuff_head_cache

Do you really see

$ ls -l /sys/kernel/slab/skbuff_head_cache
lrwxrwxrwx 1 root root 0 Apr  9 06:54 /sys/kernel/slab/skbuff_head_cache -> :t-0000320

Here I get:

$ ls -l /sys/kernel/slab/skbuff_head_cache
lrwxrwxrwx 1 root root 0 Apr  9 06:54 /sys/kernel/slab/skbuff_head_cache -> :t-0000256

because sizeof(sk_buff) <= 256
On Tue, 2013-04-09 at 09:19 -0400, Paul Moore wrote:
> On Monday, April 08, 2013 06:24:59 PM Casey Schaufler wrote:
> > On 4/8/2013 6:09 PM, Eric Dumazet wrote:
> > > On Mon, 2013-04-08 at 17:59 -0700, Casey Schaufler wrote:
> > >> I don't see that with adding 4 bytes. Again, I'm willing to be
> > >> educated if I'm wrong.
> > >
> > > Feel free to add 4 bytes without having the 'align to 8 bytes' problem
> > > on 64 bit arches. Show us your patch.
> >
> > Recall that it's replacing an existing 4 byte value with an 8 byte
> > value. My compiler days were quite short and long ago, but it would
> > seem that an 8 byte value ought not have an 'align to 8 bytes' problem.
> >
> > Again, I'm willing to be educated.
>
> Armed with a cup of coffee I took a look at the sk_buff structure this
> morning with the pahole tool and using the current sk_buff if we turn on
> all the #ifdefs here is what I see on x86_64:
>
> struct sk_buff {
[...]
> 	sk_buff_data_t          inner_transport_header; /* 200   8 */
> 	sk_buff_data_t          inner_network_header;  /* 208    8 */
> 	sk_buff_data_t          transport_header;      /* 216    8 */
> 	sk_buff_data_t          network_header;        /* 224    8 */
> 	sk_buff_data_t          mac_header;            /* 232    8 */
> 	sk_buff_data_t          tail;                  /* 240    8 */
> 	sk_buff_data_t          end;                   /* 248    8 */
[...]

This is wrong; sk_buff_data_t is always 32-bit:

#if BITS_PER_LONG > 32
#define NET_SKBUFF_DATA_USES_OFFSET 1
#endif

#ifdef NET_SKBUFF_DATA_USES_OFFSET
typedef unsigned int sk_buff_data_t;
#else
typedef unsigned char *sk_buff_data_t;
#endif

Ben.
On Tuesday, April 09, 2013 03:05:42 PM Ben Hutchings wrote:
> On Tue, 2013-04-09 at 09:19 -0400, Paul Moore wrote:
> > On Monday, April 08, 2013 06:24:59 PM Casey Schaufler wrote:
> > > On 4/8/2013 6:09 PM, Eric Dumazet wrote:
> > > > On Mon, 2013-04-08 at 17:59 -0700, Casey Schaufler wrote:
> > > >> I don't see that with adding 4 bytes. Again, I'm willing to be
> > > >> educated if I'm wrong.
> > > >
> > > > Feel free to add 4 bytes without having the 'align to 8 bytes'
> > > > problem on 64 bit arches. Show us your patch.
> > >
> > > Recall that it's replacing an existing 4 byte value with an 8 byte
> > > value. My compiler days were quite short and long ago, but it would
> > > seem that an 8 byte value ought not have an 'align to 8 bytes'
> > > problem.
> > >
> > > Again, I'm willing to be educated.
> >
> > Armed with a cup of coffee I took a look at the sk_buff structure this
> > morning with the pahole tool and using the current sk_buff if we turn
> > on all the #ifdefs here is what I see on x86_64:
> >
> > struct sk_buff {
> [...]
> > 	sk_buff_data_t          inner_transport_header; /* 200   8 */
> > 	sk_buff_data_t          inner_network_header;  /* 208    8 */
> > 	sk_buff_data_t          transport_header;      /* 216    8 */
> > 	sk_buff_data_t          network_header;        /* 224    8 */
> > 	sk_buff_data_t          mac_header;            /* 232    8 */
> > 	sk_buff_data_t          tail;                  /* 240    8 */
> > 	sk_buff_data_t          end;                   /* 248    8 */
> [...]
>
> This is wrong; sk_buff_data_t is always 32-bit:

Yep. My mistake. While looking at this a bit more after my original email
I noticed the same thing. Ultimately it doesn't change the size or
cachelines, as the sk_buff_data_t members sit at the bottom of sk_buff,
but here are the correct breakdowns:

ORIG:

struct sk_buff {
	struct sk_buff *        next;                  /*   0    8 */
	struct sk_buff *        prev;                  /*   8    8 */
	ktime_t                 tstamp;                /*  16    8 */
	struct sock *           sk;                    /*  24    8 */
	struct net_device *     dev;                   /*  32    8 */
	char                    cb[48];                /*  40   48 */
	/* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
	long unsigned int       _skb_refdst;           /*  88    8 */
	struct sec_path *       sp;                    /*  96    8 */
	unsigned int            len;                   /* 104    4 */
	unsigned int            data_len;              /* 108    4 */
	__u16                   mac_len;               /* 112    2 */
	__u16                   hdr_len;               /* 114    2 */
	union {
		__wsum          csum;                  /*        4 */
		struct {
			__u16   csum_start;            /* 116    2 */
			__u16   csum_offset;           /* 118    2 */
		};                                     /*        4 */
	};                                             /* 116    4 */
	__u32                   priority;              /* 120    4 */
	int                     flags1_begin[0];       /* 124    0 */
	__u8                    local_df:1;            /* 124:7  1 */
	__u8                    cloned:1;              /* 124:6  1 */
	__u8                    ip_summed:2;           /* 124:4  1 */
	__u8                    nohdr:1;               /* 124:3  1 */
	__u8                    nfctinfo:3;            /* 124:0  1 */
	__u8                    pkt_type:3;            /* 125:5  1 */
	__u8                    fclone:2;              /* 125:3  1 */
	__u8                    ipvs_property:1;       /* 125:2  1 */
	__u8                    peeked:1;              /* 125:1  1 */
	__u8                    nf_trace:1;            /* 125:0  1 */

	/* XXX 2 bytes hole, try to pack */

	/* --- cacheline 2 boundary (128 bytes) --- */
	int                     flags1_end[0];         /* 128    0 */
	__be16                  protocol;              /* 128    2 */

	/* XXX 6 bytes hole, try to pack */

	void                    (*destructor)(struct sk_buff *); /* 136 8 */
	struct nf_conntrack *   nfct;                  /* 144    8 */
	struct sk_buff *        nfct_reasm;            /* 152    8 */
	struct nf_bridge_info * nf_bridge;             /* 160    8 */
	int                     skb_iif;               /* 168    4 */
	__u32                   rxhash;                /* 172    4 */
	__u16                   vlan_tci;              /* 176    2 */
	__u16                   tc_index;              /* 178    2 */
	__u16                   tc_verd;               /* 180    2 */
	__u16                   queue_mapping;         /* 182    2 */
	int                     flags2_begin[0];       /* 184    0 */
	__u8                    ndisc_nodetype:2;      /* 184:6  1 */
	__u8                    pfmemalloc:1;          /* 184:5  1 */
	__u8                    ooo_okay:1;            /* 184:4  1 */
	__u8                    l4_rxhash:1;           /* 184:3  1 */
	__u8                    wifi_acked_valid:1;    /* 184:2  1 */
	__u8                    wifi_acked:1;          /* 184:1  1 */
	__u8                    no_fcs:1;              /* 184:0  1 */
	__u8                    head_frag:1;           /* 185:7  1 */
	__u8                    encapsulation:1;       /* 185:6  1 */

	/* XXX 6 bits hole, try to pack */
	/* XXX 2 bytes hole, try to pack */

	int                     flags2_end[0];         /* 188    0 */
	dma_cookie_t            dma_cookie;            /* 188    4 */
	/* --- cacheline 3 boundary (192 bytes) --- */
	__u32                   secmark;               /* 192    4 */
	union {
		__u32           mark;                  /*        4 */
		__u32           dropcount;             /*        4 */
		__u32           reserved_tailroom;     /*        4 */
	};                                             /* 196    4 */
	sk_buff_data_t          inner_transport_header; /* 200   4 */
	sk_buff_data_t          inner_network_header;  /* 204    4 */
	sk_buff_data_t          transport_header;      /* 208    4 */
	sk_buff_data_t          network_header;        /* 212    4 */
	sk_buff_data_t          mac_header;            /* 216    4 */
	sk_buff_data_t          tail;                  /* 220    4 */
	sk_buff_data_t          end;                   /* 224    4 */

	/* XXX 4 bytes hole, try to pack */

	unsigned char *         head;                  /* 232    8 */
	unsigned char *         data;                  /* 240    8 */
	unsigned int            truesize;              /* 248    4 */
	atomic_t                users;                 /* 252    4 */
	/* --- cacheline 4 boundary (256 bytes) --- */

	/* size: 256, cachelines: 4, members: 62 */
	/* sum members: 242, holes: 4, sum holes: 14 */
	/* bit holes: 1, sum bit holes: 6 bits */
};

W/BLOB:

struct sk_buff_test {
	struct sk_buff *        next;                  /*   0    8 */
	struct sk_buff *        prev;                  /*   8    8 */
	ktime_t                 tstamp;                /*  16    8 */
	struct sock *           sk;                    /*  24    8 */
	struct net_device *     dev;                   /*  32    8 */
	char                    cb[48];                /*  40   48 */
	/* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
	long unsigned int       _skb_refdst;           /*  88    8 */
	struct sec_path *       sp;                    /*  96    8 */
	unsigned int            len;                   /* 104    4 */
	unsigned int            data_len;              /* 108    4 */
	__u16                   mac_len;               /* 112    2 */
	__u16                   hdr_len;               /* 114    2 */
	union {
		__wsum          csum;                  /*        4 */
		struct {
			__u16   csum_start;            /* 116    2 */
			__u16   csum_offset;           /* 118    2 */
		};                                     /*        4 */
	};                                             /* 116    4 */
	__u32                   priority;              /* 120    4 */
	int                     flags1_begin[0];       /* 124    0 */
	__u8                    local_df:1;            /* 124:7  1 */
	__u8                    cloned:1;              /* 124:6  1 */
	__u8                    ip_summed:2;           /* 124:4  1 */
	__u8                    nohdr:1;               /* 124:3  1 */
	__u8                    nfctinfo:3;            /* 124:0  1 */
	__u8                    pkt_type:3;            /* 125:5  1 */
	__u8                    fclone:2;              /* 125:3  1 */
	__u8                    ipvs_property:1;       /* 125:2  1 */
	__u8                    peeked:1;              /* 125:1  1 */
	__u8                    nf_trace:1;            /* 125:0  1 */

	/* XXX 2 bytes hole, try to pack */

	/* --- cacheline 2 boundary (128 bytes) --- */
	int                     flags1_end[0];         /* 128    0 */
	void *                  security;              /* 128    8 */
	void                    (*destructor)(struct sk_buff *); /* 136 8 */
	struct nf_conntrack *   nfct;                  /* 144    8 */
	struct sk_buff *        nfct_reasm;            /* 152    8 */
	struct nf_bridge_info * nf_bridge;             /* 160    8 */
	int                     skb_iif;               /* 168    4 */
	__u32                   rxhash;                /* 172    4 */
	__u16                   vlan_tci;              /* 176    2 */
	__u16                   tc_index;              /* 178    2 */
	__u16                   tc_verd;               /* 180    2 */
	__u16                   queue_mapping;         /* 182    2 */
	int                     flags2_begin[0];       /* 184    0 */
	__u8                    ndisc_nodetype:2;      /* 184:6  1 */
	__u8                    pfmemalloc:1;          /* 184:5  1 */
	__u8                    ooo_okay:1;            /* 184:4  1 */
	__u8                    l4_rxhash:1;           /* 184:3  1 */
	__u8                    wifi_acked_valid:1;    /* 184:2  1 */
	__u8                    wifi_acked:1;          /* 184:1  1 */
	__u8                    no_fcs:1;              /* 184:0  1 */
	__u8                    head_frag:1;           /* 185:7  1 */
	__u8                    encapsulation:1;       /* 185:6  1 */

	/* XXX 6 bits hole, try to pack */
	/* XXX 2 bytes hole, try to pack */

	int                     flags2_end[0];         /* 188    0 */
	__be16                  protocol;              /* 188    2 */

	/* XXX 2 bytes hole, try to pack */

	/* --- cacheline 3 boundary (192 bytes) --- */
	dma_cookie_t            dma_cookie;            /* 192    4 */
	union {
		__u32           mark;                  /*        4 */
		__u32           dropcount;             /*        4 */
		__u32           reserved_tailroom;     /*        4 */
	};                                             /* 196    4 */
	sk_buff_data_t          inner_transport_header; /* 200   4 */
	sk_buff_data_t          inner_network_header;  /* 204    4 */
	sk_buff_data_t          transport_header;      /* 208    4 */
	sk_buff_data_t          network_header;        /* 212    4 */
	sk_buff_data_t          mac_header;            /* 216    4 */
	sk_buff_data_t          tail;                  /* 220    4 */
	sk_buff_data_t          end;                   /* 224    4 */

	/* XXX 4 bytes hole, try to pack */

	unsigned char *         head;                  /* 232    8 */
	unsigned char *         data;                  /* 240    8 */
	unsigned int            truesize;              /* 248    4 */
	atomic_t                users;                 /* 252    4 */
	/* --- cacheline 4 boundary (256 bytes) --- */

	/* size: 256, cachelines: 4, members: 62 */
	/* sum members: 246, holes: 4, sum holes: 10 */
	/* bit holes: 1, sum bit holes: 6 bits */
};
On Tuesday, April 09, 2013 07:00:22 AM Eric Dumazet wrote:
> On Tue, 2013-04-09 at 09:19 -0400, Paul Moore wrote:
> > As Casey already mentioned, if this isn't acceptable please help me
> > understand why.
>
> You see something which is not the reality. If you do such analysis,
> better do it properly, because any change you are going to submit will
> be doubly checked by people who really care.

I am attempting to do it properly, I simply made a mistake. Ben also
pointed it out. As you wrote yesterday, "Lets go forward".

After fixing the BITS_PER_LONG problem I looked at it again and it appears
that by simply replacing the "secmark" field with a blob we retain the
size of the sk_buff as well as the cacheline positions of all the fields,
e.g. dma_cookie no longer moves cachelines. Thoughts?

struct sk_buff_test {
	struct sk_buff *        next;                  /*   0    8 */
	struct sk_buff *        prev;                  /*   8    8 */
	ktime_t                 tstamp;                /*  16    8 */
	struct sock *           sk;                    /*  24    8 */
	struct net_device *     dev;                   /*  32    8 */
	char                    cb[48];                /*  40   48 */
	/* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
	long unsigned int       _skb_refdst;           /*  88    8 */
	struct sec_path *       sp;                    /*  96    8 */
	unsigned int            len;                   /* 104    4 */
	unsigned int            data_len;              /* 108    4 */
	__u16                   mac_len;               /* 112    2 */
	__u16                   hdr_len;               /* 114    2 */
	union {
		__wsum          csum;                  /*        4 */
		struct {
			__u16   csum_start;            /* 116    2 */
			__u16   csum_offset;           /* 118    2 */
		};                                     /*        4 */
	};                                             /* 116    4 */
	__u32                   priority;              /* 120    4 */
	int                     flags1_begin[0];       /* 124    0 */
	__u8                    local_df:1;            /* 124:7  1 */
	__u8                    cloned:1;              /* 124:6  1 */
	__u8                    ip_summed:2;           /* 124:4  1 */
	__u8                    nohdr:1;               /* 124:3  1 */
	__u8                    nfctinfo:3;            /* 124:0  1 */
	__u8                    pkt_type:3;            /* 125:5  1 */
	__u8                    fclone:2;              /* 125:3  1 */
	__u8                    ipvs_property:1;       /* 125:2  1 */
	__u8                    peeked:1;              /* 125:1  1 */
	__u8                    nf_trace:1;            /* 125:0  1 */

	/* XXX 2 bytes hole, try to pack */

	/* --- cacheline 2 boundary (128 bytes) --- */
	int                     flags1_end[0];         /* 128    0 */
	__be16                  protocol;              /* 128    2 */

	/* XXX 6 bytes hole, try to pack */

	void                    (*destructor)(struct sk_buff *); /* 136 8 */
	struct nf_conntrack *   nfct;                  /* 144    8 */
	struct sk_buff *        nfct_reasm;            /* 152    8 */
	struct nf_bridge_info * nf_bridge;             /* 160    8 */
	int                     skb_iif;               /* 168    4 */
	__u32                   rxhash;                /* 172    4 */
	__u16                   vlan_tci;              /* 176    2 */
	__u16                   tc_index;              /* 178    2 */
	__u16                   tc_verd;               /* 180    2 */
	__u16                   queue_mapping;         /* 182    2 */
	int                     flags2_begin[0];       /* 184    0 */
	__u8                    ndisc_nodetype:2;      /* 184:6  1 */
	__u8                    pfmemalloc:1;          /* 184:5  1 */
	__u8                    ooo_okay:1;            /* 184:4  1 */
	__u8                    l4_rxhash:1;           /* 184:3  1 */
	__u8                    wifi_acked_valid:1;    /* 184:2  1 */
	__u8                    wifi_acked:1;          /* 184:1  1 */
	__u8                    no_fcs:1;              /* 184:0  1 */
	__u8                    head_frag:1;           /* 185:7  1 */
	__u8                    encapsulation:1;       /* 185:6  1 */

	/* XXX 6 bits hole, try to pack */
	/* XXX 2 bytes hole, try to pack */

	int                     flags2_end[0];         /* 188    0 */
	dma_cookie_t            dma_cookie;            /* 188    4 */
	/* --- cacheline 3 boundary (192 bytes) --- */
	void *                  security;              /* 192    8 */
	union {
		__u32           mark;                  /*        4 */
		__u32           dropcount;             /*        4 */
		__u32           reserved_tailroom;     /*        4 */
	};                                             /* 200    4 */
	sk_buff_data_t          inner_transport_header; /* 204   4 */
	sk_buff_data_t          inner_network_header;  /* 208    4 */
	sk_buff_data_t          transport_header;      /* 212    4 */
	sk_buff_data_t          network_header;        /* 216    4 */
	sk_buff_data_t          mac_header;            /* 220    4 */
	sk_buff_data_t          tail;                  /* 224    4 */
	sk_buff_data_t          end;                   /* 228    4 */
	unsigned char *         head;                  /* 232    8 */
	unsigned char *         data;                  /* 240    8 */
	unsigned int            truesize;              /* 248    4 */
	atomic_t                users;                 /* 252    4 */
	/* --- cacheline 4 boundary (256 bytes) --- */

	/* size: 256, cachelines: 4, members: 62 */
	/* sum members: 246, holes: 3, sum holes: 10 */
	/* bit holes: 1, sum bit holes: 6 bits */
};
On Tue, 2013-04-09 at 10:19 -0400, Paul Moore wrote:
> On Tuesday, April 09, 2013 07:00:22 AM Eric Dumazet wrote:
> > On Tue, 2013-04-09 at 09:19 -0400, Paul Moore wrote:
> > > As Casey already mentioned, if this isn't acceptable please help me
> > > understand why.
> >
> > You see something which is not the reality. If you do such analysis,
> > better do it properly, because any change you are going to submit will
> > be doubly checked by people who really care.
>
> I am attempting to do it properly, I simply made a mistake. Ben also
> pointed it out. As you wrote yesterday, "Lets go forward".
>
> After fixing the BITS_PER_LONG problem I looked at it again and it
> appears that by simply replacing the "secmark" field with a blob we
> retain the size of the sk_buff as well as the cacheline positions of all
> the fields, e.g. dma_cookie no longer moves cachelines. Thoughts?

If you take a look at recent history of changes on sk_buff, you can see we
added very recently fields for encapsulation support. These were
absolutely wanted for modern operations at datacenter level.

This effort might still need new room, so I prefer not filling sk_buff
right now.

Take a look at the cloned sk_buff. We need an extra atomic_t at the end,
so if we make sk_buff bigger than 0xf8 bytes, fclone_cache will use an
extra cache line as well. Not a big deal, but RPC workloads like netperf
-t TCP_RR will probably show a regression.

	ls -l /sys/kernel/slab/skbuff_fclone_cache
On Tuesday, April 09, 2013 07:31:04 AM Eric Dumazet wrote:
> On Tue, 2013-04-09 at 10:19 -0400, Paul Moore wrote:
> > On Tuesday, April 09, 2013 07:00:22 AM Eric Dumazet wrote:
> > > On Tue, 2013-04-09 at 09:19 -0400, Paul Moore wrote:
> > > > As Casey already mentioned, if this isn't acceptable please help me
> > > > understand why.
> > >
> > > You see something which is not the reality. If you do such analysis,
> > > better do it properly, because any change you are going to submit
> > > will be doubly checked by people who really care.
> >
> > I am attempting to do it properly, I simply made a mistake. Ben also
> > pointed it out. As you wrote yesterday, "Lets go forward".
> >
> > After fixing the BITS_PER_LONG problem I looked at it again and it
> > appears that by simply replacing the "secmark" field with a blob we
> > retain the size of the sk_buff as well as the cacheline positions of
> > all the fields, e.g. dma_cookie no longer moves cachelines. Thoughts?
>
> If you take a look at recent history of changes on sk_buff, you can see
> we added very recently fields for encapsulation support. These were
> absolutely wanted for modern operations at datacenter level.
>
> This effort might still need new room, so I prefer not filling sk_buff
> right now.

Has anyone proposed any additional encapsulation patches which need
additional fields in the sk_buff? Are you aware of any additional
encapsulation patches which are in progress? When would you consider it
"safe"?

> Take a look at the cloned sk_buff. We need an extra atomic_t at the end,
> so if we make sk_buff bigger than 0xf8 bytes, fclone_cache will use an
> extra cache line as well. Not a big deal, but RPC workloads like netperf
> -t TCP_RR will probably show a regression.
>
> 	ls -l /sys/kernel/slab/skbuff_fclone_cache

Perhaps I'm misunderstanding, but these comments above only apply if we
were to increase the size of the sk_buff struct, yes? What I proposed,
replacing "secmark" with a blob, does not currently change the size of the
sk_buff struct, so the performance and memory usage should remain
unchanged as well.
On Tuesday, April 09, 2013 10:52:17 AM Paul Moore wrote:
> On Tuesday, April 09, 2013 07:31:04 AM Eric Dumazet wrote:
> > On Tue, 2013-04-09 at 10:19 -0400, Paul Moore wrote:
> > > On Tuesday, April 09, 2013 07:00:22 AM Eric Dumazet wrote:
> > > > On Tue, 2013-04-09 at 09:19 -0400, Paul Moore wrote:
> > > > > As Casey already mentioned, if this isn't acceptable please help
> > > > > me understand why.
> > > >
> > > > You see something which is not the reality. If you do such analysis,
> > > > better do it properly, because any change you are going to submit
> > > > will be doubly checked by people who really care.
> > >
> > > I am attempting to do it properly, I simply made a mistake. Ben also
> > > pointed it out. As you wrote yesterday, "Lets go forward".
> > >
> > > After fixing the BITS_PER_LONG problem I looked at it again and it
> > > appears that by simply replacing the "secmark" field with a blob we
> > > retain the size of the sk_buff as well as the cacheline positions of
> > > all the fields, e.g. dma_cookie no longer moves cachelines. Thoughts?
> >
> > If you take a look at recent history of changes on sk_buff, you can see
> > we added very recently fields for encapsulation support. These were
> > absolutely wanted for modern operations at datacenter level.
> >
> > This effort might still need new room, so I prefer not filling sk_buff
> > right now.
>
> Has anyone proposed any additional encapsulation patches which need
> additional fields in the sk_buff? Are you aware of any additional
> encapsulation patches which are in progress? When would you consider it
> "safe"?

Another thought. If we move the "protocol" field after the flags2 bitfield,
remove "secmark", and add the blob before/after "destructor" then we
preserve the existing four byte hole in cacheline #3 for potential use by
the encapsulation folks and move the dma_cookie into cacheline #3 as well,
which may actually be a good thing (corrections welcome on this comment).
The overall size remains the same at 256 bytes.

struct sk_buff_test {
	struct sk_buff *        next;                 /*     0     8 */
	struct sk_buff *        prev;                 /*     8     8 */
	ktime_t                 tstamp;               /*    16     8 */
	struct sock *           sk;                   /*    24     8 */
	struct net_device *     dev;                  /*    32     8 */
	char                    cb[48];               /*    40    48 */
	/* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
	long unsigned int       _skb_refdst;          /*    88     8 */
	struct sec_path *       sp;                   /*    96     8 */
	unsigned int            len;                  /*   104     4 */
	unsigned int            data_len;             /*   108     4 */
	__u16                   mac_len;              /*   112     2 */
	__u16                   hdr_len;              /*   114     2 */
	union {
		__wsum          csum;                 /*           4 */
		struct {
			__u16   csum_start;           /*   116     2 */
			__u16   csum_offset;          /*   118     2 */
		};                                    /*           4 */
	};                                            /*   116     4 */
	__u32                   priority;             /*   120     4 */
	int                     flags1_begin[0];      /*   124     0 */
	__u8                    local_df:1;           /*   124: 7  1 */
	__u8                    cloned:1;             /*   124: 6  1 */
	__u8                    ip_summed:2;          /*   124: 4  1 */
	__u8                    nohdr:1;              /*   124: 3  1 */
	__u8                    nfctinfo:3;           /*   124: 0  1 */
	__u8                    pkt_type:3;           /*   125: 5  1 */
	__u8                    fclone:2;             /*   125: 3  1 */
	__u8                    ipvs_property:1;      /*   125: 2  1 */
	__u8                    peeked:1;             /*   125: 1  1 */
	__u8                    nf_trace:1;           /*   125: 0  1 */

	/* XXX 2 bytes hole, try to pack */

	/* --- cacheline 2 boundary (128 bytes) --- */
	int                     flags1_end[0];        /*   128     0 */
	void *                  security;             /*   128     8 */
	void                    (*destructor)(struct sk_buff *); /* 136 8 */
	struct nf_conntrack *   nfct;                 /*   144     8 */
	struct sk_buff *        nfct_reasm;           /*   152     8 */
	struct nf_bridge_info * nf_bridge;            /*   160     8 */
	int                     skb_iif;              /*   168     4 */
	__u32                   rxhash;               /*   172     4 */
	__u16                   vlan_tci;             /*   176     2 */
	__u16                   tc_index;             /*   178     2 */
	__u16                   tc_verd;              /*   180     2 */
	__u16                   queue_mapping;        /*   182     2 */
	int                     flags2_begin[0];      /*   184     0 */
	__u8                    ndisc_nodetype:2;     /*   184: 6  1 */
	__u8                    pfmemalloc:1;         /*   184: 5  1 */
	__u8                    ooo_okay:1;           /*   184: 4  1 */
	__u8                    l4_rxhash:1;          /*   184: 3  1 */
	__u8                    wifi_acked_valid:1;   /*   184: 2  1 */
	__u8                    wifi_acked:1;         /*   184: 1  1 */
	__u8                    no_fcs:1;             /*   184: 0  1 */
	__u8                    head_frag:1;          /*   185: 7  1 */
	__u8                    encapsulation:1;      /*   185: 6  1 */

	/* XXX 6 bits hole, try to pack */
	/* XXX 2 bytes hole, try to pack */

	int                     flags2_end[0];        /*   188     0 */
	__be16                  protocol;             /*   188     2 */

	/* XXX 2 bytes hole, try to pack */

	/* --- cacheline 3 boundary (192 bytes) --- */
	dma_cookie_t            dma_cookie;           /*   192     4 */
	union {
		__u32           mark;                 /*           4 */
		__u32           dropcount;            /*           4 */
		__u32           reserved_tailroom;    /*           4 */
	};                                            /*   196     4 */
	sk_buff_data_t          inner_transport_header; /* 200     4 */
	sk_buff_data_t          inner_network_header; /*   204     4 */
	sk_buff_data_t          transport_header;     /*   208     4 */
	sk_buff_data_t          network_header;       /*   212     4 */
	sk_buff_data_t          mac_header;           /*   216     4 */
	sk_buff_data_t          tail;                 /*   220     4 */
	sk_buff_data_t          end;                  /*   224     4 */

	/* XXX 4 bytes hole, try to pack */

	unsigned char *         head;                 /*   232     8 */
	unsigned char *         data;                 /*   240     8 */
	unsigned int            truesize;             /*   248     4 */
	atomic_t                users;                /*   252     4 */
	/* --- cacheline 4 boundary (256 bytes) --- */

	/* size: 256, cachelines: 4, members: 62 */
	/* sum members: 246, holes: 4, sum holes: 10 */
	/* bit holes: 1, sum bit holes: 6 bits */
};
On Tue, 2013-04-09 at 10:52 -0400, Paul Moore wrote:
> Perhaps I'm misunderstanding, but these comments above only apply if we
> were to increase the size of the sk_buff struct, yes? What I proposed,
> replacing "secmark" with a blob, does not currently change the size of
> the sk_buff struct so the performance and memory usage should remain
> unchanged as well.

If blob size is 4 bytes, that's fine.

If not, read again my mail.
On Tuesday, April 09, 2013 08:07:06 AM Eric Dumazet wrote:
> On Tue, 2013-04-09 at 10:52 -0400, Paul Moore wrote:
> > Perhaps I'm misunderstanding, but these comments above only apply if we
> > were to increase the size of the sk_buff struct, yes? What I proposed,
> > replacing "secmark" with a blob, does not currently change the size of
> > the sk_buff struct so the performance and memory usage should remain
> > unchanged as well.
>
> If blob size is 4 bytes, that's fine.
>
> If not, read again my mail.

The "blob" is a void pointer, so 8 bytes. We're talking about removing the
"secmark" field (4 bytes) and adding a void pointer (8 bytes). I've shown
several different approaches that make this change without increasing the
overall size of the sk_buff struct. One of the proposals makes use of the
existing holes in the third cacheline to make the change without any
increase in size, misalignment, or cacheline changes.

You were concerned that at some point in the future, the hardware
encapsulation developers *might* want to add another field. Taking your
comments into consideration, I just made another proposal which preserves
the overall size of the sk_buff struct, as well as the 4 byte hole in the
third cacheline (for possible use by hw encapsulation folks at some later
date). It does shift the "dma_cookie" field from the second to the third
cacheline, but considering the fields already in the third cacheline this
may be a good thing.

To the best of my knowledge, all of the proposals I've posted this morning
do not change the size of the sk_buff, so the cloned sk_buff
functionality/performance/etc. should not be affected. If that is not the
case, please let me know because I'm currently at a loss (even after
re-reading your email).
On Tue, 2013-04-09 at 11:17 -0400, Paul Moore wrote:
> The "blob" is a void pointer, so 8 bytes. We're talking about removing
> the "secmark" field (4 bytes) and adding a void pointer (8 bytes). I've
> shown several different approaches that make this change without
> increasing the overall size of the sk_buff struct.

You want to use 4 extra bytes in sk_buff. You'll have to show us why we
should close the way for other valid uses of the current holes.

I have no idea why it's needed, and why it can't be solved in another way.

It looks like _I_ have to do your work. Sorry, I have no more time to spend
on this topic. You'll have to convince David, not me.
On Tuesday, April 09, 2013 08:32:56 AM Eric Dumazet wrote:
> On Tue, 2013-04-09 at 11:17 -0400, Paul Moore wrote:
> > The "blob" is a void pointer, so 8 bytes. We're talking about removing
> > the "secmark" field (4 bytes) and adding a void pointer (8 bytes). I've
> > shown several different approaches that make this change without
> > increasing the overall size of the sk_buff struct.
>
> You want to use 4 extra bytes in sk_buff. You'll have to show us why we
> close the way for other valid uses of the current holes.
>
> I have no idea why its needed, and why it can't be solved in another
> way.

FWIW, I was focusing on arriving at a basic design that addressed the
initial reasons for not including a security blob in a sk_buff. In the
beginning I thought it was both the need for LSM hooks in the skb management
routines as well as the memory overhead in the skb itself. During the course
of our discussion it became clear that the hooks were acceptable; it was the
memory overhead that was the concern, so that is what I (and Casey) focused
on.

Based on your latest comment, it appears that we have some possible
candidates for adding a security blob (void *) to the sk_buff that address
your technical arguments. I wasn't aware we had reached that point, but it
is indeed good news. Now we just need to make our case that it is the
"Right Thing to Do"; that is perfectly reasonable.

> It looks like _I_ have to do your work.

I don't believe I ever asked you to do anything other than to repost a patch
you posted to the LSM list so we could get it included upstream. A patch
that you created to counter my proposed fix for a SELinux regression.
Further, I tested your patch and ACK'd it earlier this morning. I suppose I
also asked you to explain/clarify a few of your technical objections a bit
further so I could address them, but that just seems like normal peer design
review.

> Sorry, I have no more time to spend on this topic. You'll have to
> convince David, not me.

Well, thank you for your time; I'm sure we'll get to talk about this again
in the future. It looks like we've had enough of a conversation now that I
can start working on some patches.
On 4/9/2013 8:32 AM, Eric Dumazet wrote:
> On Tue, 2013-04-09 at 11:17 -0400, Paul Moore wrote:
>
>> The "blob" is a void pointer, so 8 bytes. We're talking about removing
>> the "secmark" field (4 bytes) and adding a void pointer (8 bytes). I've
>> shown several different approaches that make this change without
>> increasing the overall size of the sk_buff struct.
>
> You want to use 4 extra bytes in sk_buff.

So far, so good.

> You'll have to show us why we close the way for other valid uses of the
> current holes.

We have what looks to us like a very legitimate use for a 4 byte hole, we
have that use defined, and we have it today. In fact, we've had it for the
last five years. The 4 byte secmark (instead of an 8 byte blob pointer) has
impacted design choices since the LSM broke out of the SELinux monopoly. Not
having a blob *will* negatively impact the forthcoming multiple concurrent
LSM support. That is not hypothetical, that is concrete and very much a
problem.

> I have no idea why its needed, and why it can't be solved in another
> way.

Ah, let's see if I can't help there. We have two LSMs today and a third on
the way that potentially want to use the secmark. SELinux uses the secmark,
Smack may pick it up for IPv6 support in the not too distant future, and
AppArmor appears to be considering it as well. Each of these LSMs uses (or
will use) a 4 byte secid to represent the security information about a
process or other relevant entity. This is what gets put into the secmark.
The secid maps to the real security information.

If I have two LSMs, say SELinux and AppArmor, I have a real problem. I have
8 bytes of data and a 4 byte secmark. If I have SELinux, Smack and AppArmor
I have 12 bytes to stuff into a 4 byte bag. To top it off, these are already
indirect references to the security data, not pointers to the data. What I
have to try to do is fit 3 pointers into 4 bytes. Not good.

If we replace secmark with secblob, I don't have to use the secid
indirection if there is only one LSM. If there are multiple LSMs the secblob
contains a pointer to a master blob, and it's still faster than the secmark
indirection.

The alternative is to restrict the use of secmarks to one LSM and let all
the others figure out some other method for communicating the security
information. I don't see that as a great choice.

> It looks like _I_ have to do your work. Sorry, I have no more time to
> spend on this topic. You'll have to convince David, not me.
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 09 Apr 2013 08:32:56 -0700

> It looks like _I_ have to do your work. Sorry, I have no more time to
> spend on this topic. You'll have to convince David, not me.

I already have no interest in considering these changes seriously, and I've
already asked Paul multiple times to drop this idea.
On Tuesday, April 09, 2013 12:56:35 PM David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 09 Apr 2013 08:32:56 -0700
>
> > It looks like _I_ have to do your work. Sorry, I have no more time to
> > spend on this topic. You'll have to convince David, not me.
>
> I already have no interest in considering these changes seriously,
> and I've already asked Paul multiple times to drop this idea.

If we address all of your technical concerns, why are you not interested in
allowing a security blob in the sk_buff?
From: Paul Moore <pmoore@redhat.com>
Date: Tue, 09 Apr 2013 13:00:06 -0400

> On Tuesday, April 09, 2013 12:56:35 PM David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Tue, 09 Apr 2013 08:32:56 -0700
>>
>> > It looks like _I_ have to do your work. Sorry, I have no more time to
>> > spend on this topic. You'll have to convince David, not me.
>>
>> I already have no interest in considering these changes seriously,
>> and I've already asked Paul multiple times to drop this idea.
>
> If we address all of your technical concerns, why are you not interested
> in allowing a security blob in the sk_buff?

If you have a patch that removes space, rather than adds space, to sk_buff,
then we can talk.

Otherwise, please don't waste our time, because I'm not adding anything to
sk_buff for LSM's sake.
And please, until you have a real patch, stop wasting our time.

Like Eric, as you can imagine, I have hundreds of other people's patches to
review and reports to analyze. Your issue is not so important compared to
all of the other ones that it should consume such an unbalanced amount of my
time, sorry.
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5d0b438..23cc295 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2705,6 +2705,8 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 		dst_release(dst);
 		return NULL;
 	}
+	skb->sk = sk;
+
 	/* Reserve space for headers. */
 	skb_reserve(skb, MAX_TCP_HEADER);
Commit 90ba9b1986b5ac4b2d184575847147ea7c4280a2 converted tcp_make_synack()
to use alloc_skb() directly instead of calling sock_wmalloc(), the goal
being the elimination of two atomic operations. Unfortunately, in doing so
the change broke certain SELinux/NetLabel configurations by no longer
correctly assigning the sock to the outgoing packet.

This patch fixes this regression by doing the skb->sk assignment directly
inside tcp_make_synack().

Reported-by: Miroslav Vadkerti <mvadkert@redhat.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
---
 net/ipv4/tcp_output.c | 2 ++
 1 file changed, 2 insertions(+)