Message ID | 20091002141407.30224.54207.stgit@dev.haskins.net |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
From: Gregory Haskins <ghaskins@novell.com> Date: Fri, 02 Oct 2009 10:20:00 -0400 > The following is an RFC for an attempt at addressing a zero-copy solution. > > To be perfectly honest, I have no idea if this is the best solution, or if > there is truly a problem with skb->destructor that requires an alternate > mechanism. What I do know is that this patch seems to work, and I would > like to see some kind of solution available upstream. So I thought I would > send my hack out as at least a point of discussion. FWIW: This has been > tested heavily in my rig and is technically suitable for inclusion after > review as is, if that is decided to be the optimal path forward here. > > Thanks for your review and consideration, I have no fundamental objections, but when you submit this for real can you show the code that uses it so we can get a better idea about things? Thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
David Miller wrote: > From: Gregory Haskins <ghaskins@novell.com> > Date: Fri, 02 Oct 2009 10:20:00 -0400 > >> The following is an RFC for an attempt at addressing a zero-copy solution. >> >> To be perfectly honest, I have no idea if this is the best solution, or if >> there is truly a problem with skb->destructor that requires an alternate >> mechanism. What I do know is that this patch seems to work, and I would >> like to see some kind of solution available upstream. So I thought I would >> send my hack out as at least a point of discussion. FWIW: This has been >> tested heavily in my rig and is technically suitable for inclusion after >> review as is, if that is decided to be the optimal path forward here. >> >> Thanks for your review and consideration, > > I have no fundamental objections, but when you submit this for > real can you show the code that uses it so we can get a better > idea about things? > > Thanks. Absolutely. I am not sure if this content would be appropriate for the patch header, so I will just reply to your request here. If you would like to see the patch resubmitted with the following in the header, let me know. The way we use this today is in the venet driver as part of the AlacrityVM hypervisor. We therefore have a guest and host environment, where the guest builds fully formed L2 frames, and the host generally acts as a conduit for passing those frames to a real physical device (such as through a soft-bridge, etc). We would like to do this without requiring copies for certain classes of packets (i.e. packets larger than a threshold). However we have to be smart about how we do this since the guest technically "owns" the pages, and therefore needs io-completion events to properly signify when pages are actually freed. The way this all looks today is (I hope this doesn't get mangled): ---------------------------------------------------------------------- | guest | host | ---------------------------------------------------------------------- | stack | venet | venetdev | phydev | ---------------------------------------------------------------------- | alloc_skb() | | dev_xmit() | | -> queue_tx(sg) | | -> dequeue_rx(sg) | | alloc_pskb() | | map_sg(sg)->pskb | | loop(get_page()) | | skb->release = cb | | -> dev_xmit() | | | | txc_isr() <- | | kfree_skb() | | skb->release() | | cb() <- | | queue_event(sg) | | txc_isr() <- | | kfree_skb() | | loop(put_page())| ---------------------------------------------------------------------- And here is the actual code in action (kernel/vbus/devices/venet/device.c) from the alacrityvm.git tree http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=kernel/vbus/devices/venet/device.c;h=d49ba7fa9f70cbb7e61c366d52d4c316d15f8b73;hb=HEAD Line 587 is the "dequeue_rx()" operation from the diagram. Line 627 is where we map in the guests pages to a scatterlist. Line 649 is where we update the skb_shinfo->frags with the mapping. And finally, Line 677 is where I register a callback for when the skb is released. Line 853 is the callback that the stack invokes when the phydev finally frees the packet. You can see that line 863 then sends an transmit-complete event back up to the guest. If this is not what you were looking for, please let me know. If this looks acceptable to you, please consider the original patch for inclusion at the next convenient merge window. Thanks David! -Greg
On Fri, Oct 02, 2009 at 10:20:00AM -0400, Gregory Haskins wrote: > (Applies to davem/net-2.6.git:4fdb78d30) > > Hi David, netdevs, > > The following is an RFC for an attempt at addressing a zero-copy solution. > > To be perfectly honest, I have no idea if this is the best solution, or if > there is truly a problem with skb->destructor that requires an alternate > mechanism. What I do know is that this patch seems to work, and I would > like to see some kind of solution available upstream. So I thought I would > send my hack out as at least a point of discussion. FWIW: This has been > tested heavily in my rig and is technically suitable for inclusion after > review as is, if that is decided to be the optimal path forward here. > > Thanks for your review and consideration, > > Kind regards, > -Greg > > ---------------------------------------- > From: Gregory Haskins <ghaskins@novell.com> > Subject: [RFC PATCH] net: add dataref destructor to sk_buff > > What: The skb->destructor field is reportedly unreliable for ensuring > that all shinfo users have dropped their references. Therefore, we add > a distinct ->release() method for the shinfo structure which is closely > tied to the underlying page resources we want to protect. > > Why: We want to add zero-copy transmit support for AlacrityVM guests. > In order to support this, the host kernel must map guest pages directly > into a paged-skb and send it as normal. put_page() alone is not > sufficient lifetime management since the pages are ultimately allocated > from within the guest. Therefore, we need higher-level notification > when the skb is finally freed on the host so we can then inject a proper > "tx-complete" event into the guest context. > > Signed-off-by: Gregory Haskins <ghaskins@novell.com> > --- > > include/linux/skbuff.h | 2 ++ > net/core/skbuff.c | 9 +++++++++ > 2 files changed, 11 insertions(+), 0 deletions(-) > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > index df7b23a..02cdab6 100644 > --- a/include/linux/skbuff.h > +++ b/include/linux/skbuff.h > @@ -207,6 +207,8 @@ struct skb_shared_info { > /* Intermediate layers must ensure that destructor_arg > * remains valid until skb destructor */ > void * destructor_arg; > + void * priv; > + void (*release)(struct sk_buff *skb); > }; > > /* We divide dataref into two halves. The higher 16 bits hold references > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 80a9616..a7e40a9 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -219,6 +219,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, > shinfo->tx_flags.flags = 0; > skb_frag_list_init(skb); > memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps)); > + shinfo->release = NULL; > + shinfo->priv = NULL; > > if (fclone) { > struct sk_buff *child = skb + 1; > @@ -350,6 +352,9 @@ static void skb_release_data(struct sk_buff *skb) > if (skb_has_frags(skb)) > skb_drop_fraglist(skb); > > + if (skb_shinfo(skb)->release) > + skb_shinfo(skb)->release(skb); > + > kfree(skb->head); > } > } > @@ -514,6 +519,8 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size) > shinfo->tx_flags.flags = 0; > skb_frag_list_init(skb); > memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps)); > + shinfo->release = NULL; > + shinfo->priv = NULL; > > memset(skb, 0, offsetof(struct sk_buff, tail)); > skb->data = skb->head + NET_SKB_PAD; > @@ -856,6 +863,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, > skb->hdr_len = 0; > skb->nohdr = 0; > atomic_set(&skb_shinfo(skb)->dataref, 1); > + skb_shinfo(skb)->release = NULL; > + skb_shinfo(skb)->priv = NULL; > return 0; > > nodata: This is basically subset of the skb data destructors patch, isn't it? Last time this was tried, this is the objection that was voiced: The problem with this patch is that it's tracking skb's, while you want use it to track pages for zero-copy. That just doesn't work. Through mechanisms like splice, individual pages in the skb can be detached and metastasize to other locations, e.g., the VFS. and I think this applies here. In other words, this only *seems* to work for you because you are not trying to do things like guest to host communication, with host doing smart things. Cc Herbert which was involved in the original discussion. In the specific case, it seems that things like pskb_copy, skb_split and others will also be broken, won't they?
>>> On 11/10/2009 at 6:53 AM, in message <20091110115335.GC6989@redhat.com>, "Michael S. Tsirkin" <mst@redhat.com> wrote: > On Fri, Oct 02, 2009 at 10:20:00AM -0400, Gregory Haskins wrote: >> (Applies to davem/net-2.6.git:4fdb78d30) >> >> Hi David, netdevs, >> >> The following is an RFC for an attempt at addressing a zero-copy solution. >> >> To be perfectly honest, I have no idea if this is the best solution, or if >> there is truly a problem with skb->destructor that requires an alternate >> mechanism. What I do know is that this patch seems to work, and I would >> like to see some kind of solution available upstream. So I thought I would >> send my hack out as at least a point of discussion. FWIW: This has been >> tested heavily in my rig and is technically suitable for inclusion after >> review as is, if that is decided to be the optimal path forward here. >> >> Thanks for your review and consideration, >> >> Kind regards, >> -Greg >> >> ---------------------------------------- >> From: Gregory Haskins <ghaskins@novell.com> >> Subject: [RFC PATCH] net: add dataref destructor to sk_buff >> >> What: The skb->destructor field is reportedly unreliable for ensuring >> that all shinfo users have dropped their references. Therefore, we add >> a distinct ->release() method for the shinfo structure which is closely >> tied to the underlying page resources we want to protect. >> >> Why: We want to add zero-copy transmit support for AlacrityVM guests. >> In order to support this, the host kernel must map guest pages directly >> into a paged-skb and send it as normal. put_page() alone is not >> sufficient lifetime management since the pages are ultimately allocated >> from within the guest. Therefore, we need higher-level notification >> when the skb is finally freed on the host so we can then inject a proper >> "tx-complete" event into the guest context. >> >> Signed-off-by: Gregory Haskins <ghaskins@novell.com> >> --- >> >> include/linux/skbuff.h | 2 ++ >> net/core/skbuff.c | 9 +++++++++ >> 2 files changed, 11 insertions(+), 0 deletions(-) >> >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h >> index df7b23a..02cdab6 100644 >> --- a/include/linux/skbuff.h >> +++ b/include/linux/skbuff.h >> @@ -207,6 +207,8 @@ struct skb_shared_info { >> /* Intermediate layers must ensure that destructor_arg >> * remains valid until skb destructor */ >> void * destructor_arg; >> + void * priv; >> + void (*release)(struct sk_buff *skb); >> }; >> >> /* We divide dataref into two halves. The higher 16 bits hold references >> diff --git a/net/core/skbuff.c b/net/core/skbuff.c >> index 80a9616..a7e40a9 100644 >> --- a/net/core/skbuff.c >> +++ b/net/core/skbuff.c >> @@ -219,6 +219,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t > gfp_mask, >> shinfo->tx_flags.flags = 0; >> skb_frag_list_init(skb); >> memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps)); >> + shinfo->release = NULL; >> + shinfo->priv = NULL; >> >> if (fclone) { >> struct sk_buff *child = skb + 1; >> @@ -350,6 +352,9 @@ static void skb_release_data(struct sk_buff *skb) >> if (skb_has_frags(skb)) >> skb_drop_fraglist(skb); >> >> + if (skb_shinfo(skb)->release) >> + skb_shinfo(skb)->release(skb); >> + >> kfree(skb->head); >> } >> } >> @@ -514,6 +519,8 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size) >> shinfo->tx_flags.flags = 0; >> skb_frag_list_init(skb); >> memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps)); >> + shinfo->release = NULL; >> + shinfo->priv = NULL; >> >> memset(skb, 0, offsetof(struct sk_buff, tail)); >> skb->data = skb->head + NET_SKB_PAD; >> @@ -856,6 +863,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int > ntail, >> skb->hdr_len = 0; >> skb->nohdr = 0; >> atomic_set(&skb_shinfo(skb)->dataref, 1); >> + skb_shinfo(skb)->release = NULL; >> + skb_shinfo(skb)->priv = NULL; >> return 0; >> >> nodata: > > This is basically subset of the skb data destructors patch, isn't it? Sort of, but the emphasis is different. Here are the main differences: 1) skb->destructor() is on the skb level, shinfo->release() is at the shared-info/page level 2) skb->destructor is (iiuc) modified during the skb's lifetime (for instance rmem hooks here to manage the buffer-space dynamically), whereas shinfo->release is designed to be used by "the owner" and is thus only set at creation. 3) shinfo->release tracks the lifetime of the pages, not the skb. This means it transcends the skb's lifetime (such as splits for clone, etc) and focuses on the shared component: the pages > Last time this was tried, this is the objection that was voiced: > > The problem with this patch is that it's tracking skb's, while > you want use it to track pages for zero-copy. That just doesn't > work. Through mechanisms like splice, individual pages in the > skb can be detached and metastasize to other locations, e.g., > the VFS. Right, and I don't think this applies here because I specifically chose the shinfo level to try to properly track the page level avoid this issue. Multiple skb's can point to a single shinfo, iiuc. > > and I think this applies here. I don't think so, but if you think I missed something, do not be shy (not that you ever are). > In other words, this only *seems* > to work for you because you are not trying to do things like > guest to host communication, with host doing smart things. I am not following what you mean here, as I do use this for guest->host and guest->host->remote, and it works quite nicely. I map the guest pages in, and when the last reference to the pages are dropped, I release the pages back to the guest. It doesn't matter if the skb egresses out a physical adapter or is received locally. All that matters is the lifetime of the shinfo (and thus its pages) is handled correctly. > > Cc Herbert which was involved in the original discussion. > > In the specific case, it seems that things like pskb_copy, > skb_split and others will also be broken, won't they? Not to my knowledge. They up the reference to the shinfo before proceeding. Kind Regards, -Greg -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Nov 10, 2009 at 05:40:50AM -0700, Gregory Haskins wrote: > >>> On 11/10/2009 at 6:53 AM, in message <20091110115335.GC6989@redhat.com>, > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > On Fri, Oct 02, 2009 at 10:20:00AM -0400, Gregory Haskins wrote: > >> (Applies to davem/net-2.6.git:4fdb78d30) > >> > >> Hi David, netdevs, > >> > >> The following is an RFC for an attempt at addressing a zero-copy solution. > >> > >> To be perfectly honest, I have no idea if this is the best solution, or if > >> there is truly a problem with skb->destructor that requires an alternate > >> mechanism. What I do know is that this patch seems to work, and I would > >> like to see some kind of solution available upstream. So I thought I would > >> send my hack out as at least a point of discussion. FWIW: This has been > >> tested heavily in my rig and is technically suitable for inclusion after > >> review as is, if that is decided to be the optimal path forward here. > >> > >> Thanks for your review and consideration, > >> > >> Kind regards, > >> -Greg > >> > >> ---------------------------------------- > >> From: Gregory Haskins <ghaskins@novell.com> > >> Subject: [RFC PATCH] net: add dataref destructor to sk_buff > >> > >> What: The skb->destructor field is reportedly unreliable for ensuring > >> that all shinfo users have dropped their references. Therefore, we add > >> a distinct ->release() method for the shinfo structure which is closely > >> tied to the underlying page resources we want to protect. > >> > >> Why: We want to add zero-copy transmit support for AlacrityVM guests. > >> In order to support this, the host kernel must map guest pages directly > >> into a paged-skb and send it as normal. put_page() alone is not > >> sufficient lifetime management since the pages are ultimately allocated > >> from within the guest. Therefore, we need higher-level notification > >> when the skb is finally freed on the host so we can then inject a proper > >> "tx-complete" event into the guest context. > >> > >> Signed-off-by: Gregory Haskins <ghaskins@novell.com> > >> --- > >> > >> include/linux/skbuff.h | 2 ++ > >> net/core/skbuff.c | 9 +++++++++ > >> 2 files changed, 11 insertions(+), 0 deletions(-) > >> > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > >> index df7b23a..02cdab6 100644 > >> --- a/include/linux/skbuff.h > >> +++ b/include/linux/skbuff.h > >> @@ -207,6 +207,8 @@ struct skb_shared_info { > >> /* Intermediate layers must ensure that destructor_arg > >> * remains valid until skb destructor */ > >> void * destructor_arg; > >> + void * priv; > >> + void (*release)(struct sk_buff *skb); > >> }; > >> > >> /* We divide dataref into two halves. The higher 16 bits hold references > >> diff --git a/net/core/skbuff.c b/net/core/skbuff.c > >> index 80a9616..a7e40a9 100644 > >> --- a/net/core/skbuff.c > >> +++ b/net/core/skbuff.c > >> @@ -219,6 +219,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t > > gfp_mask, > >> shinfo->tx_flags.flags = 0; > >> skb_frag_list_init(skb); > >> memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps)); > >> + shinfo->release = NULL; > >> + shinfo->priv = NULL; > >> > >> if (fclone) { > >> struct sk_buff *child = skb + 1; > >> @@ -350,6 +352,9 @@ static void skb_release_data(struct sk_buff *skb) > >> if (skb_has_frags(skb)) > >> skb_drop_fraglist(skb); > >> > >> + if (skb_shinfo(skb)->release) > >> + skb_shinfo(skb)->release(skb); > >> + > >> kfree(skb->head); > >> } > >> } > >> @@ -514,6 +519,8 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size) > >> shinfo->tx_flags.flags = 0; > >> skb_frag_list_init(skb); > >> memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps)); > >> + shinfo->release = NULL; > >> + shinfo->priv = NULL; > >> > >> memset(skb, 0, offsetof(struct sk_buff, tail)); > >> skb->data = skb->head + NET_SKB_PAD; > >> @@ -856,6 +863,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int > > ntail, > >> skb->hdr_len = 0; > >> skb->nohdr = 0; > >> atomic_set(&skb_shinfo(skb)->dataref, 1); > >> + skb_shinfo(skb)->release = NULL; > >> + skb_shinfo(skb)->priv = NULL; > >> return 0; > >> > >> nodata: > > > > This is basically subset of the skb data destructors patch, isn't it? > > Sort of, but the emphasis is different. Here are the main differences: > > 1) skb->destructor() is on the skb level, shinfo->release() is at the shared-info/page level > > 2) skb->destructor is (iiuc) modified during the skb's lifetime (for instance rmem hooks here to > manage the buffer-space dynamically), whereas shinfo->release is designed to be used by > "the owner" and is thus only set at creation. > > 3) shinfo->release tracks the lifetime of the pages, not the skb. This means it transcends > the skb's lifetime (such as splits for clone, etc) and focuses on the shared component: the pages You are comparing with skb destructors, not skb data destructors :) skb data destructor is Rusty's patch which he wanted to use for vringfd. I mean e.g. this: http://lists.openwall.net/netdev/2008/04/18/7 > > Last time this was tried, this is the objection that was voiced: > > > > The problem with this patch is that it's tracking skb's, while > > you want use it to track pages for zero-copy. That just doesn't > > work. Through mechanisms like splice, individual pages in the > > skb can be detached and metastasize to other locations, e.g., > > the VFS. > > Right, and I don't think this applies here because I specifically chose the shinfo level to try to properly > track the page level avoid this issue. Multiple skb's can point to a single shinfo, iiuc. VFS does not know about shinfo either, does it? > > > > and I think this applies here. > > I don't think so, but if you think I missed something, do not be shy (not that you ever are). Well, I hope the reviews are helpful. I'll be happy if we learn to track pages involved in transmit, but need to be careful. > > In other words, this only *seems* > > to work for you because you are not trying to do things like > > guest to host communication, with host doing smart things. > > I am not following what you mean here, as I do use this for guest->host and guest->host->remote, and > it works quite nicely. I map the guest pages in, and when the last reference to the pages are dropped, > I release the pages back to the guest. It doesn't matter if the skb egresses out a physical adapter or is > received locally. All that matters is the lifetime of the shinfo (and thus its pages) is handled correctly. Not if someone else is referencing the pages without a reference to shinfo. > > > > Cc Herbert which was involved in the original discussion. > > > > In the specific case, it seems that things like pskb_copy, > > skb_split and others will also be broken, won't they? > > Not to my knowledge. They up the reference to the shinfo before proceeding. I don't seem to find where does skb_split reference the shinfo. It seems to do get_page on individual pages? > Kind Regards, > -Greg > > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin wrote: > On Tue, Nov 10, 2009 at 05:40:50AM -0700, Gregory Haskins wrote: >>>>> On 11/10/2009 at 6:53 AM, in message <20091110115335.GC6989@redhat.com>, >> "Michael S. Tsirkin" <mst@redhat.com> wrote: >>> On Fri, Oct 02, 2009 at 10:20:00AM -0400, Gregory Haskins wrote: >>>> (Applies to davem/net-2.6.git:4fdb78d30) >>>> >>>> Hi David, netdevs, >>>> >>>> The following is an RFC for an attempt at addressing a zero-copy solution. >>>> >>>> To be perfectly honest, I have no idea if this is the best solution, or if >>>> there is truly a problem with skb->destructor that requires an alternate >>>> mechanism. What I do know is that this patch seems to work, and I would >>>> like to see some kind of solution available upstream. So I thought I would >>>> send my hack out as at least a point of discussion. FWIW: This has been >>>> tested heavily in my rig and is technically suitable for inclusion after >>>> review as is, if that is decided to be the optimal path forward here. >>>> >>>> Thanks for your review and consideration, >>>> >>>> Kind regards, >>>> -Greg >>>> >>>> ---------------------------------------- >>>> From: Gregory Haskins <ghaskins@novell.com> >>>> Subject: [RFC PATCH] net: add dataref destructor to sk_buff >>>> >>>> What: The skb->destructor field is reportedly unreliable for ensuring >>>> that all shinfo users have dropped their references. Therefore, we add >>>> a distinct ->release() method for the shinfo structure which is closely >>>> tied to the underlying page resources we want to protect. >>>> >>>> Why: We want to add zero-copy transmit support for AlacrityVM guests. >>>> In order to support this, the host kernel must map guest pages directly >>>> into a paged-skb and send it as normal. put_page() alone is not >>>> sufficient lifetime management since the pages are ultimately allocated >>>> from within the guest. Therefore, we need higher-level notification >>>> when the skb is finally freed on the host so we can then inject a proper >>>> "tx-complete" event into the guest context. >>>> >>>> Signed-off-by: Gregory Haskins <ghaskins@novell.com> >>>> --- >>>> >>>> include/linux/skbuff.h | 2 ++ >>>> net/core/skbuff.c | 9 +++++++++ >>>> 2 files changed, 11 insertions(+), 0 deletions(-) >>>> >>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h >>>> index df7b23a..02cdab6 100644 >>>> --- a/include/linux/skbuff.h >>>> +++ b/include/linux/skbuff.h >>>> @@ -207,6 +207,8 @@ struct skb_shared_info { >>>> /* Intermediate layers must ensure that destructor_arg >>>> * remains valid until skb destructor */ >>>> void * destructor_arg; >>>> + void * priv; >>>> + void (*release)(struct sk_buff *skb); >>>> }; >>>> >>>> /* We divide dataref into two halves. The higher 16 bits hold references >>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c >>>> index 80a9616..a7e40a9 100644 >>>> --- a/net/core/skbuff.c >>>> +++ b/net/core/skbuff.c >>>> @@ -219,6 +219,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t >>> gfp_mask, >>>> shinfo->tx_flags.flags = 0; >>>> skb_frag_list_init(skb); >>>> memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps)); >>>> + shinfo->release = NULL; >>>> + shinfo->priv = NULL; >>>> >>>> if (fclone) { >>>> struct sk_buff *child = skb + 1; >>>> @@ -350,6 +352,9 @@ static void skb_release_data(struct sk_buff *skb) >>>> if (skb_has_frags(skb)) >>>> skb_drop_fraglist(skb); >>>> >>>> + if (skb_shinfo(skb)->release) >>>> + skb_shinfo(skb)->release(skb); >>>> + >>>> kfree(skb->head); >>>> } >>>> } >>>> @@ -514,6 +519,8 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size) >>>> shinfo->tx_flags.flags = 0; >>>> skb_frag_list_init(skb); >>>> memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps)); >>>> + shinfo->release = NULL; >>>> + shinfo->priv = NULL; >>>> >>>> memset(skb, 0, offsetof(struct sk_buff, tail)); >>>> skb->data = skb->head + NET_SKB_PAD; >>>> @@ -856,6 +863,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int >>> ntail, >>>> skb->hdr_len = 0; >>>> skb->nohdr = 0; >>>> atomic_set(&skb_shinfo(skb)->dataref, 1); >>>> + skb_shinfo(skb)->release = NULL; >>>> + skb_shinfo(skb)->priv = NULL; >>>> return 0; >>>> >>>> nodata: >>> This is basically subset of the skb data destructors patch, isn't it? >> Sort of, but the emphasis is different. Here are the main differences: >> >> 1) skb->destructor() is on the skb level, shinfo->release() is at the shared-info/page level >> >> 2) skb->destructor is (iiuc) modified during the skb's lifetime (for instance rmem hooks here to >> manage the buffer-space dynamically), whereas shinfo->release is designed to be used by >> "the owner" and is thus only set at creation. >> >> 3) shinfo->release tracks the lifetime of the pages, not the skb. This means it transcends >> the skb's lifetime (such as splits for clone, etc) and focuses on the shared component: the pages > > You are comparing with skb destructors, not skb data destructors :) skb > data destructor is Rusty's patch which he wanted to use for vringfd. I > mean e.g. this: > http://lists.openwall.net/netdev/2008/04/18/7 Ah, I wasn't aware. I believe Anthony Ligouri had pointed me at this same patch earlier this year. However, more recently when I saw skb->destructor() in mainline (seemingly from Rusty), I thought it had been accepted and didn't investigate further. I see now that you are talking about something else. > >>> Last time this was tried, this is the objection that was voiced: >>> >>> The problem with this patch is that it's tracking skb's, while >>> you want use it to track pages for zero-copy. That just doesn't >>> work. Through mechanisms like splice, individual pages in the >>> skb can be detached and metastasize to other locations, e.g., >>> the VFS. >> Right, and I don't think this applies here because I specifically chose the shinfo level to try to properly >> track the page level avoid this issue. Multiple skb's can point to a single shinfo, iiuc. > > VFS does not know about shinfo either, does it? I do not follow the reference. Where does VFS come into play? > >>> and I think this applies here. >> I don't think so, but if you think I missed something, do not be shy (not that you ever are). > > Well, I hope the reviews are helpful. Yes, I didn't mean it as a slam. Was only trying to convey that I didn't think I was wrong but am open minded to the possibility, so please keep the discussion going. > I'll be happy if we learn to > track pages involved in transmit, but need to be careful. > Agreed. >>> In other words, this only *seems* >>> to work for you because you are not trying to do things like >>> guest to host communication, with host doing smart things. >> I am not following what you mean here, as I do use this for guest->host and guest->host->remote, and >> it works quite nicely. I map the guest pages in, and when the last reference to the pages are dropped, >> I release the pages back to the guest. It doesn't matter if the skb egresses out a physical adapter or is >> received locally. All that matters is the lifetime of the shinfo (and thus its pages) is handled correctly. > > Not if someone else is referencing the pages without a reference to shinfo. I agree that if we can reference pages outside of the skb/shinfo then there is a problem. I wasn't aware that we could do this, tbh. However, it seems to me that this is a problem with the overall stack, if true....isn't it? For instance, if I do a sendmsg() from a userspace app and block until its consumed, how can the system function sanely if the app returns from the call but something is still referencing the page(s)? This seems broken to me. > >>> Cc Herbert which was involved in the original discussion. >>> >>> In the specific case, it seems that things like pskb_copy, >>> skb_split and others will also be broken, won't they? >> Not to my knowledge. They up the reference to the shinfo before proceeding. > > I don't seem to find where does skb_split reference the shinfo. > It seems to do get_page on individual pages? Ill take a look. If so, one alternate solution that I had considered was to look at some kind of page->release() hook. There are obvious disadvantages to such a solution (for one, it has no such notion, and second we have to manage the aggregate collection which has overhead), so I shied away from that design in favor of this one. But perhaps we will ultimately have no choice. Kind Regards, -Greg
On Tue, Nov 10, 2009 at 09:11:10AM -0500, Gregory Haskins wrote: > Michael S. Tsirkin wrote: > > On Tue, Nov 10, 2009 at 05:40:50AM -0700, Gregory Haskins wrote: > >>>>> On 11/10/2009 at 6:53 AM, in message <20091110115335.GC6989@redhat.com>, > >> "Michael S. Tsirkin" <mst@redhat.com> wrote: > >>> On Fri, Oct 02, 2009 at 10:20:00AM -0400, Gregory Haskins wrote: > >>>> (Applies to davem/net-2.6.git:4fdb78d30) > >>>> > >>>> Hi David, netdevs, > >>>> > >>>> The following is an RFC for an attempt at addressing a zero-copy solution. > >>>> > >>>> To be perfectly honest, I have no idea if this is the best solution, or if > >>>> there is truly a problem with skb->destructor that requires an alternate > >>>> mechanism. What I do know is that this patch seems to work, and I would > >>>> like to see some kind of solution available upstream. So I thought I would > >>>> send my hack out as at least a point of discussion. FWIW: This has been > >>>> tested heavily in my rig and is technically suitable for inclusion after > >>>> review as is, if that is decided to be the optimal path forward here. > >>>> > >>>> Thanks for your review and consideration, > >>>> > >>>> Kind regards, > >>>> -Greg > >>>> > >>>> ---------------------------------------- > >>>> From: Gregory Haskins <ghaskins@novell.com> > >>>> Subject: [RFC PATCH] net: add dataref destructor to sk_buff > >>>> > >>>> What: The skb->destructor field is reportedly unreliable for ensuring > >>>> that all shinfo users have dropped their references. Therefore, we add > >>>> a distinct ->release() method for the shinfo structure which is closely > >>>> tied to the underlying page resources we want to protect. > >>>> > >>>> Why: We want to add zero-copy transmit support for AlacrityVM guests. > >>>> In order to support this, the host kernel must map guest pages directly > >>>> into a paged-skb and send it as normal. put_page() alone is not > >>>> sufficient lifetime management since the pages are ultimately allocated > >>>> from within the guest. Therefore, we need higher-level notification > >>>> when the skb is finally freed on the host so we can then inject a proper > >>>> "tx-complete" event into the guest context. > >>>> > >>>> Signed-off-by: Gregory Haskins <ghaskins@novell.com> > >>>> --- > >>>> > >>>> include/linux/skbuff.h | 2 ++ > >>>> net/core/skbuff.c | 9 +++++++++ > >>>> 2 files changed, 11 insertions(+), 0 deletions(-) > >>>> > >>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > >>>> index df7b23a..02cdab6 100644 > >>>> --- a/include/linux/skbuff.h > >>>> +++ b/include/linux/skbuff.h > >>>> @@ -207,6 +207,8 @@ struct skb_shared_info { > >>>> /* Intermediate layers must ensure that destructor_arg > >>>> * remains valid until skb destructor */ > >>>> void * destructor_arg; > >>>> + void * priv; > >>>> + void (*release)(struct sk_buff *skb); > >>>> }; > >>>> > >>>> /* We divide dataref into two halves. The higher 16 bits hold references > >>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c > >>>> index 80a9616..a7e40a9 100644 > >>>> --- a/net/core/skbuff.c > >>>> +++ b/net/core/skbuff.c > >>>> @@ -219,6 +219,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t > >>> gfp_mask, > >>>> shinfo->tx_flags.flags = 0; > >>>> skb_frag_list_init(skb); > >>>> memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps)); > >>>> + shinfo->release = NULL; > >>>> + shinfo->priv = NULL; > >>>> > >>>> if (fclone) { > >>>> struct sk_buff *child = skb + 1; > >>>> @@ -350,6 +352,9 @@ static void skb_release_data(struct sk_buff *skb) > >>>> if (skb_has_frags(skb)) > >>>> skb_drop_fraglist(skb); > >>>> > >>>> + if (skb_shinfo(skb)->release) > >>>> + skb_shinfo(skb)->release(skb); > >>>> + > >>>> kfree(skb->head); > >>>> } > >>>> } > >>>> @@ -514,6 +519,8 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size) > >>>> shinfo->tx_flags.flags = 0; > >>>> skb_frag_list_init(skb); > >>>> memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps)); > >>>> + shinfo->release = NULL; > >>>> + shinfo->priv = NULL; > >>>> > >>>> memset(skb, 0, offsetof(struct sk_buff, tail)); > >>>> skb->data = skb->head + NET_SKB_PAD; > >>>> @@ -856,6 +863,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int > >>> ntail, > >>>> skb->hdr_len = 0; > >>>> skb->nohdr = 0; > >>>> atomic_set(&skb_shinfo(skb)->dataref, 1); > >>>> + skb_shinfo(skb)->release = NULL; > >>>> + skb_shinfo(skb)->priv = NULL; > >>>> return 0; > >>>> > >>>> nodata: > >>> This is basically subset of the skb data destructors patch, isn't it? > >> Sort of, but the emphasis is different. Here are the main differences: > >> > >> 1) skb->destructor() is on the skb level, shinfo->release() is at the shared-info/page level > >> > >> 2) skb->destructor is (iiuc) modified during the skb's lifetime (for instance rmem hooks here to > >> manage the buffer-space dynamically), whereas shinfo->release is designed to be used by > >> "the owner" and is thus only set at creation. > >> > >> 3) shinfo->release tracks the lifetime of the pages, not the skb. This means it transcends > >> the skb's lifetime (such as splits for clone, etc) and focuses on the shared component: the pages > > > > You are comparing with skb destructors, not skb data destructors :) skb > > data destructor is Rusty's patch which he wanted to use for vringfd. I > > mean e.g. this: > > http://lists.openwall.net/netdev/2008/04/18/7 > > Ah, I wasn't aware. I believe Anthony Ligouri had pointed me at this > same patch earlier this year. However, more recently when I saw > skb->destructor() in mainline (seemingly from Rusty), I thought it had > been accepted and didn't investigate further. I see now that you are > talking about something else. > > > > >>> Last time this was tried, this is the objection that was voiced: > >>> > >>> The problem with this patch is that it's tracking skb's, while > >>> you want use it to track pages for zero-copy. That just doesn't > >>> work. Through mechanisms like splice, individual pages in the > >>> skb can be detached and metastasize to other locations, e.g., > >>> the VFS. > >> Right, and I don't think this applies here because I specifically chose the shinfo level to try to properly > >> track the page level avoid this issue. Multiple skb's can point to a single shinfo, iiuc. > > > > VFS does not know about shinfo either, does it? > > I do not follow the reference. Where does VFS come into play? "Through mechanisms like splice, individual pages in the skb can be detached and metastasize to other locations, e.g., the VFS" > > > >>> and I think this applies here. > >> I don't think so, but if you think I missed something, do not be shy (not that you ever are). > > > > Well, I hope the reviews are helpful. > > Yes, I didn't mean it as a slam. Was only trying to convey that I > didn't think I was wrong but am open minded to the possibility, so > please keep the discussion going. > > > I'll be happy if we learn to > > track pages involved in transmit, but need to be careful. > > > > Agreed. > > >>> In other words, this only *seems* > >>> to work for you because you are not trying to do things like > >>> guest to host communication, with host doing smart things. > >> I am not following what you mean here, as I do use this for guest->host and guest->host->remote, and > >> it works quite nicely. I map the guest pages in, and when the last reference to the pages are dropped, > >> I release the pages back to the guest. It doesn't matter if the skb egresses out a physical adapter or is > >> received locally. All that matters is the lifetime of the shinfo (and thus its pages) is handled correctly. > > > > Not if someone else is referencing the pages without a reference to shinfo. > > I agree that if we can reference pages outside of the skb/shinfo then > there is a problem. I wasn't aware that we could do this, tbh. > > However, it seems to me that this is a problem with the overall stack, > if true....isn't it? For instance, if I do a sendmsg() from a userspace > app and block until its consumed, consumed == memcpy_from_iovec? > how can the system function sanely if > the app returns from the call but something is still referencing the > page(s)? which pages? > This seems broken to me. > > > >>> Cc Herbert which was involved in the original discussion. > >>> > >>> In the specific case, it seems that things like pskb_copy, > >>> skb_split and others will also be broken, won't they? > >> Not to my knowledge. They up the reference to the shinfo before proceeding. > > > > I don't seem to find where does skb_split reference the shinfo. > > It seems to do get_page on individual pages? > > Ill take a look. > > If so, one alternate solution that I had considered was to look at some > kind of page->release() hook. There are obvious disadvantages to such a > solution (for one, it has no such notion, and second we have to manage > the aggregate collection which has overhead), so I shied away from that > design in favor of this one. But perhaps we will ultimately have no choice. > > Kind Regards, > -Greg > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Nov 10, 2009 at 09:11:10AM -0500, Gregory Haskins wrote: > > You are comparing with skb destructors, not skb data destructors :) skb > > data destructor is Rusty's patch which he wanted to use for vringfd. I > > mean e.g. this: > > http://lists.openwall.net/netdev/2008/04/18/7 > > Ah, I wasn't aware. I believe Anthony Ligouri had pointed me at this > same patch earlier this year. However, more recently when I saw > skb->destructor() in mainline (seemingly from Rusty), I thought it had > been accepted and didn't investigate further. skb->destructor seems to be there as far back as 2.2.0 http://lxr.linux.no/#linux-old+v2.2.0/include/linux/skbuff.h
Michael S. Tsirkin wrote: > On Tue, Nov 10, 2009 at 09:11:10AM -0500, Gregory Haskins wrote: >> Michael S. Tsirkin wrote: >>> On Tue, Nov 10, 2009 at 05:40:50AM -0700, Gregory Haskins wrote: >>>>>>> On 11/10/2009 at 6:53 AM, in message <20091110115335.GC6989@redhat.com>, >>>> "Michael S. Tsirkin" <mst@redhat.com> wrote: >> >>>>> Last time this was tried, this is the objection that was voiced: >>>>> >>>>> The problem with this patch is that it's tracking skb's, while >>>>> you want use it to track pages for zero-copy. That just doesn't >>>>> work. Through mechanisms like splice, individual pages in the >>>>> skb can be detached and metastasize to other locations, e.g., >>>>> the VFS. >>>> Right, and I don't think this applies here because I specifically chose the shinfo level to try to properly >>>> track the page level avoid this issue. Multiple skb's can point to a single shinfo, iiuc. >>> VFS does not know about shinfo either, does it? >> I do not follow the reference. Where does VFS come into play? > > "Through mechanisms like splice, individual pages in the > skb can be detached and metastasize to other locations, e.g., > the VFS" Right, understood. What I mean is: How is that actually used in real-life in a way that is valid? What I am getting at is as follows: From a real basic perspective, you can look at all of this as a simple synchronous call (i.e. sendmsg()). The "app" (be it a userspace app, or a guest) prepares a buffer for transmission, and offers it to the next layer in the stack. The app must maintain the integrity of that buffer at least until the layer below it signifies that it is "consumed". This may mean its a synchronous call, like sendmsg(), or it may be asynchronous, like AIO. But the key thing here is that at some point, the lower layer has to signify that the buffer stability constraint has been met. In either case, we have a clear delineated event: the io-completes = the buffer is free to be reused. In the simple case, the buffer in question is copied to a kernel buffer, and the io completes immediately. In other cases (such as zero copy), the buffer is mapped into the skb, and we have to wait for even lower layers to signify the completion. I am not a stack expert, but I was under the impression that we use this model for userspace pages today as well using the wmem callbacks in skb->destructor(). If so, I do not see how you could do something like detach a page from a pskb and still expect to have a proper event that delineates the io-completion to the higher layers. So the questions are: 1) do we in fact map userspace pages to pskbs today? 2a) if so, how do we delineate the completion event? 2b) and how do we prevent worrying about the get_page() issue you refer to. >> >>>>> In other words, this only *seems* >>>>> to work for you because you are not trying to do things like >>>>> guest to host communication, with host doing smart things. >>>> I am not following what you mean here, as I do use this for guest->host and guest->host->remote, and >>>> it works quite nicely. I map the guest pages in, and when the last reference to the pages are dropped, >>>> I release the pages back to the guest. It doesn't matter if the skb egresses out a physical adapter or is >>>> received locally. All that matters is the lifetime of the shinfo (and thus its pages) is handled correctly. >>> Not if someone else is referencing the pages without a reference to shinfo. >> I agree that if we can reference pages outside of the skb/shinfo then >> there is a problem. I wasn't aware that we could do this, tbh. >> >> However, it seems to me that this is a problem with the overall stack, >> if true....isn't it? For instance, if I do a sendmsg() from a userspace >> app and block until its consumed, > > consumed == memcpy_from_iovec? For non-zero-copy, sure why not. > >> how can the system function sanely if >> the app returns from the call but something is still referencing the >> page(s)? > > which pages? You said that there are paths that get_page() out of shinfo without holding a shinfo reference. Kind Regards, -Greg
Michael S. Tsirkin wrote: > On Tue, Nov 10, 2009 at 09:11:10AM -0500, Gregory Haskins wrote: >>> You are comparing with skb destructors, not skb data destructors :) skb >>> data destructor is Rusty's patch which he wanted to use for vringfd. I >>> mean e.g. this: >>> http://lists.openwall.net/netdev/2008/04/18/7 >> Ah, I wasn't aware. I believe Anthony Ligouri had pointed me at this >> same patch earlier this year. However, more recently when I saw >> skb->destructor() in mainline (seemingly from Rusty), I thought it had >> been accepted and didn't investigate further. > > skb->destructor seems to be there as far back as 2.2.0 > http://lxr.linux.no/#linux-old+v2.2.0/include/linux/skbuff.h > Right, I meant recent activity around it for fixing it up related to leaks, etc.
On Tue, Nov 10, 2009 at 10:45:16AM -0500, Gregory Haskins wrote: > Michael S. Tsirkin wrote: > > On Tue, Nov 10, 2009 at 09:11:10AM -0500, Gregory Haskins wrote: > >> Michael S. Tsirkin wrote: > >>> On Tue, Nov 10, 2009 at 05:40:50AM -0700, Gregory Haskins wrote: > >>>>>>> On 11/10/2009 at 6:53 AM, in message <20091110115335.GC6989@redhat.com>, > >>>> "Michael S. Tsirkin" <mst@redhat.com> wrote: > >> > >>>>> Last time this was tried, this is the objection that was voiced: > >>>>> > >>>>> The problem with this patch is that it's tracking skb's, while > >>>>> you want use it to track pages for zero-copy. That just doesn't > >>>>> work. Through mechanisms like splice, individual pages in the > >>>>> skb can be detached and metastasize to other locations, e.g., > >>>>> the VFS. > >>>> Right, and I don't think this applies here because I specifically chose the shinfo level to try to properly > >>>> track the page level avoid this issue. Multiple skb's can point to a single shinfo, iiuc. > >>> VFS does not know about shinfo either, does it? > >> I do not follow the reference. Where does VFS come into play? > > > > "Through mechanisms like splice, individual pages in the > > skb can be detached and metastasize to other locations, e.g., > > the VFS" > > Right, understood. What I mean is: How is that actually used in > real-life in a way that is valid? > > What I am getting at is as follows: From a real basic perspective, you > can look at all of this as a simple synchronous call (i.e. sendmsg()). > The "app" (be it a userspace app, or a guest) prepares a buffer for > transmission, and offers it to the next layer in the stack. The app > must maintain the integrity of that buffer at least until the layer > below it signifies that it is "consumed". This may mean its a > synchronous call, like sendmsg(), or it may be asynchronous, like AIO. > > But the key thing here is that at some point, the lower layer has to > signify that the buffer stability constraint has been met. In either > case, we have a clear delineated event: the io-completes = the buffer is > free to be reused. > > In the simple case, the buffer in question is copied to a kernel buffer, > and the io completes immediately. In other cases (such as zero copy), > the buffer is mapped into the skb, and we have to wait for even lower > layers to signify the completion. > > I am not a stack expert, but I was under the impression that we use this > model for userspace pages today as well using the wmem callbacks in > skb->destructor(). If so, I do not see how you could do something like > detach a page from a pskb and still expect to have a proper event that > delineates the io-completion to the higher layers. I think linux only cares about that for accounting purposes (stuff like socket sndbuff size). If someone takes over the page, the socket can stop worrying about it. > So the questions are: > > 1) do we in fact map userspace pages to pskbs today? I don't think so. > 2a) if so, how do we delineate the completion event? > 2b) and how do we prevent worrying about the get_page() issue you refer > to. > > > >> > >>>>> In other words, this only *seems* > >>>>> to work for you because you are not trying to do things like > >>>>> guest to host communication, with host doing smart things. > >>>> I am not following what you mean here, as I do use this for guest->host and guest->host->remote, and > >>>> it works quite nicely. I map the guest pages in, and when the last reference to the pages are dropped, > >>>> I release the pages back to the guest. It doesn't matter if the skb egresses out a physical adapter or is > >>>> received locally. All that matters is the lifetime of the shinfo (and thus its pages) is handled correctly. > >>> Not if someone else is referencing the pages without a reference to shinfo. > >> I agree that if we can reference pages outside of the skb/shinfo then > >> there is a problem. I wasn't aware that we could do this, tbh. > >> > >> However, it seems to me that this is a problem with the overall stack, > >> if true....isn't it? For instance, if I do a sendmsg() from a userspace > >> app and block until its consumed, > > > > consumed == memcpy_from_iovec? > > For non-zero-copy, sure why not. > > > > >> how can the system function sanely if > >> the app returns from the call but something is still referencing the > >> page(s)? > > > > which pages? > > You said that there are paths that get_page() out of shinfo without > holding a shinfo reference. Without zero copy, application does not care about these, they have been allocated by kernel.
Michael S. Tsirkin wrote: > On Tue, Nov 10, 2009 at 10:45:16AM -0500, Gregory Haskins wrote: >> I am not a stack expert, but I was under the impression that we use this >> model for userspace pages today as well using the wmem callbacks in >> skb->destructor(). If so, I do not see how you could do something like >> detach a page from a pskb and still expect to have a proper event that >> delineates the io-completion to the higher layers. > > I think linux only cares about that for accounting purposes (stuff like > socket sndbuff size). If someone takes over the page, the socket can > stop worrying about it. Only if there isn't zero-copy. > >> So the questions are: >> >> 1) do we in fact map userspace pages to pskbs today? > > I don't think so. What about things like sendfile()? There has to be *some* way to synchronize with the io-completion event, I would think. Whatever that is, I'd like to tap into it. >>> which pages? >> >> You said that there are paths that get_page() out of shinfo without >> holding a shinfo reference. > > Without zero copy, application does not care about these, > they have been allocated by kernel. Agreed in the non-zero copy case. I am not yet convinced that we do not do zero copy in some form, however. Ill have to dig through the code when I get a chance to confirm. Kind Regards, -Greg
On Tue, Nov 10, 2009 at 01:36:23PM -0500, Gregory Haskins (ghaskins@novell.com) wrote: > What about things like sendfile()? There has to be *some* way to > synchronize with the io-completion event, I would think. Whatever that > is, I'd like to tap into it. All skb manipulation functions properly maintain data reference counters, so pages will not be freed until all data is consumed. But there is no guarantee that data placed in given page will not be overwritten while page is being held somewhere in the stack. Putting shared info destructor will allow to get notification, that given shared info processing is over, i.e. that network stack does not use data placed in shared info for given skb, but if it was copied or VFS hold those pages, they may or may not be freed.
On Tue, Nov 10, 2009 at 10:45:16AM -0500, Gregory Haskins wrote: > > What I am getting at is as follows: From a real basic perspective, you > can look at all of this as a simple synchronous call (i.e. sendmsg()). > The "app" (be it a userspace app, or a guest) prepares a buffer for > transmission, and offers it to the next layer in the stack. The app > must maintain the integrity of that buffer at least until the layer > below it signifies that it is "consumed". This may mean its a > synchronous call, like sendmsg(), or it may be asynchronous, like AIO. Neither sendmsg() nor sendfile() is synchronous in the way you imagine. Cheers,
Herbert Xu wrote: > On Tue, Nov 10, 2009 at 10:45:16AM -0500, Gregory Haskins wrote: >> What I am getting at is as follows: From a real basic perspective, you >> can look at all of this as a simple synchronous call (i.e. sendmsg()). >> The "app" (be it a userspace app, or a guest) prepares a buffer for >> transmission, and offers it to the next layer in the stack. The app >> must maintain the integrity of that buffer at least until the layer >> below it signifies that it is "consumed". This may mean its a >> synchronous call, like sendmsg(), or it may be asynchronous, like AIO. > > Neither sendmsg() nor sendfile() is synchronous in the way you > imagine. Well, not with respect to the overall protocol, of course not. But with respect to the buffer in question, it _has_ to be. Or am I missing something? Kind Regards, -Greg
On Fri, Nov 13, 2009 at 08:33:35PM -0500, Gregory Haskins wrote: > > Well, not with respect to the overall protocol, of course not. But with > respect to the buffer in question, it _has_ to be. Or am I missing > something? sendfile() has never guaranteed that the kernel is finished with the underlying pages when it returns. Cheers,
Herbert Xu wrote: > On Fri, Nov 13, 2009 at 08:33:35PM -0500, Gregory Haskins wrote: >> Well, not with respect to the overall protocol, of course not. But with >> respect to the buffer in question, it _has_ to be. Or am I missing >> something? > > sendfile() has never guaranteed that the kernel is finished with > the underlying pages when it returns. > > Cheers, Clearly there must be _some_ mechanism to synchronize (e.g. flush/barrier) though, right? Otherwise, that interface would seem to be quite prone to races and would likely be unusable. So what does said flush use to know when the buffer is free? -Greg
On Fri, Nov 13, 2009 at 09:27:57PM -0500, Gregory Haskins wrote: > > Clearly there must be _some_ mechanism to synchronize (e.g. > flush/barrier) though, right? Otherwise, that interface would seem to > be quite prone to races and would likely be unusable. So what does > said flush use to know when the buffer is free? It is up to the application to devise its own sync mechanism. For TCP, you may query the kernel for the last sequence number acked by the other side. Cheers,
On Fri, 13 Nov 2009 21:27:57 -0500 Gregory Haskins <gregory.haskins@gmail.com> wrote: > Herbert Xu wrote: > > On Fri, Nov 13, 2009 at 08:33:35PM -0500, Gregory Haskins wrote: > >> Well, not with respect to the overall protocol, of course not. But with > >> respect to the buffer in question, it _has_ to be. Or am I missing > >> something? > > > > sendfile() has never guaranteed that the kernel is finished with > > the underlying pages when it returns. > > > > Cheers, > > Clearly there must be _some_ mechanism to synchronize (e.g. > flush/barrier) though, right? Otherwise, that interface would seem to > be quite prone to races and would likely be unusable. So what does > said flush use to know when the buffer is free? No all the interfaces require a copy. Actually, sendfile makes no guarantee about synchronization because the receiver of said file could be arbitrarily slow, and any attempt at locking down current contents of file is a denial of service exposure. People have tried doing copy-less send by page flipping, but the overhead of the IPI to invalidate the TLB exceeded the overhead of the copy. There was an Intel paper on this in at Linux Symposium (Ottawa) several years ago. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Nov 13, 2009 at 06:45:03PM -0800, Stephen Hemminger wrote: > > No all the interfaces require a copy. Actually, sendfile makes no guarantee about synchronization Actually sendfile does not require a copy. > because the receiver of said file could be arbitrarily slow, and any attempt at locking down > current contents of file is a denial of service exposure. Agreed. Cheers,
From: Gregory Haskins <gregory.haskins@gmail.com> Date: Fri, 13 Nov 2009 20:33:35 -0500 > Well, not with respect to the overall protocol, of course not. But with > respect to the buffer in question, it _has_ to be. Or am I missing > something? sendfile() absolutely, and positively, is not. Any entity can write to the pages being send via sendfile(), at will, and those writes will show up in the packet stream if they occur before the NIC DMA's the memory backed by those pages into it's buffer. There is zero data synchronization whatsoever, we don't lock the pages, we don't block their usage while they are queued up in the socket send queue, nothing like that. The user returns long before it every hits the wire and there is zero "notification" to the user that the pages in question for the sendfile() request are no longer in use. It seems that your understanding of how buffering and synchronization works in the TCP stack has come out of a fairy tale :-) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Gregory Haskins <gregory.haskins@gmail.com> Date: Fri, 13 Nov 2009 21:27:57 -0500 > Clearly there must be _some_ mechanism to synchronize (e.g. > flush/barrier) though, right? Otherwise, that interface would seem to > be quite prone to races and would likely be unusable. So what does > said flush use to know when the buffer is free? There is no such synchronization at all. If some synchronization is required, it must be done at a higher level. For example, SAMBA can only use sendfile() to serve file contents when the client holds an SMB "OP Lock" on the file. Luckily, by default pretty much all SMB client implementations hold the OP Lock on a file even just to read it (this is so that self-modifying autoexec.bat scripts work :-) But frankly most sendfile() consumers don't care, and the only thing that really matters is that the checksum on the final TCP packet that goes out is correct, and since we require card checksums to sendfile() without copying, that is taken care of transparently. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Stephen Hemminger wrote: > On Fri, 13 Nov 2009 21:27:57 -0500 > Gregory Haskins <gregory.haskins@gmail.com> wrote: > >> Herbert Xu wrote: >>> On Fri, Nov 13, 2009 at 08:33:35PM -0500, Gregory Haskins wrote: >>>> Well, not with respect to the overall protocol, of course not. But with >>>> respect to the buffer in question, it _has_ to be. Or am I missing >>>> something? >>> sendfile() has never guaranteed that the kernel is finished with >>> the underlying pages when it returns. >>> >>> Cheers, >> Clearly there must be _some_ mechanism to synchronize (e.g. >> flush/barrier) though, right? Otherwise, that interface would seem to >> be quite prone to races and would likely be unusable. So what does >> said flush use to know when the buffer is free? > > No all the interfaces require a copy. I'm sorry, but I do not think that is correct. As others have pointed out, that would not appear to be true for at least sendfile. > Actually, sendfile makes no guarantee about synchronization > because the receiver of said file could be arbitrarily slow, and any attempt at locking down > current contents of file is a denial of service exposure. I think you are inverting the problem space. It is fully expected that changing the "file", or the pages that represent the file before the packet is queued would constitute the modification of the stream on the wire. I am more thinking about the applications of mmap+sendfile to implement a zero-copy interface. As David mentions in another mail, it seems that there is no sync mechanism available, so this would not appear to be a viable use case today, unfortunately. > > People have tried doing copy-less send by page flipping, but the overhead of the IPI to > invalidate the TLB exceeded the overhead of the copy. There was an Intel paper on this in > at Linux Symposium (Ottawa) several years ago. I think you are confusing copy-less tx with copy-less rx. You can try to do copy-less rx with page flipping, which has the IPI/TLB thrashing properties you mention, and I agree is problematic. We are talking about copy-less tx here, however, and therefore no page-flipping is involved. Rather, we are just posting SG lists of pages directly to the NIC (assuming the nic supports HIGH_DMA, etc). You do not need to flip the page, or invalidate the TLB (and thus IPI the other cores) to do this to my knowledge. Kind Regards, -Greg
David Miller wrote: > From: Gregory Haskins <gregory.haskins@gmail.com> > Date: Fri, 13 Nov 2009 20:33:35 -0500 > >> Well, not with respect to the overall protocol, of course not. But with >> respect to the buffer in question, it _has_ to be. Or am I missing >> something? > > sendfile() absolutely, and positively, is not. > > Any entity can write to the pages being send via sendfile(), at will, > and those writes will show up in the packet stream if they occur > before the NIC DMA's the memory backed by those pages into it's > buffer. Right, understood. > > There is zero data synchronization whatsoever, we don't lock the > pages, we don't block their usage while they are queued up in the > socket send queue, nothing like that. Understood. > > The user returns long before it every hits the wire and there is zero > "notification" to the user that the pages in question for the > sendfile() request are no longer in use. Ok, this was the part I didn't know. > > It seems that your understanding of how buffering and synchronization > works in the TCP stack has come out of a fairy tale :-) I understand that we do not protect the buffers from modification from other entities in process. This was purely a question of synchronization from the producers standpoint. IOW: for (;;) { char buf[512]; memcpy(buf, next, sizeof(buf)); write(fd, buf); } would work without worrying that the producer will stomp on buf itself. It is now my understanding that for things other than sendfile, this works because the buffer is copied before it returns control to the app. For sendfile(), the producer is more or less on its own and therefore has to be careful if they are reusing previous mmapped buffers. Ok. But really, this is somewhat orthogonal to the original problem, so let me see if we can bring it back on topic. Michael stated that this patch in question may be problematic because there are places in the stack that can get_page() without also maintaining a reference to the shinfo object. Evgeniy seems to say the opposite. I am not sure who is right, or if I misunderstood one or both of them. Any thoughts? Kind Regards, -Greg
On 11/14/2009 05:04 AM, David Miller wrote: > From: Gregory Haskins<gregory.haskins@gmail.com> > Date: Fri, 13 Nov 2009 20:33:35 -0500 > > >> Well, not with respect to the overall protocol, of course not. But with >> respect to the buffer in question, it _has_ to be. Or am I missing >> something? >> > sendfile() absolutely, and positively, is not. > > Any entity can write to the pages being send via sendfile(), at will, > and those writes will show up in the packet stream if they occur > before the NIC DMA's the memory backed by those pages into it's > buffer. > > There is zero data synchronization whatsoever, we don't lock the > pages, we don't block their usage while they are queued up in the > socket send queue, nothing like that. > > But it must maintain a reference count on the page being dmaed and drop it only after dma is complete. Otherwise we risk the page being recycled and arbitrary memory sent out on the wire; and an application can trivially cause this by truncate()ing a sendfile. > The user returns long before it every hits the wire and there is zero > "notification" to the user that the pages in question for the > sendfile() request are no longer in use. > The put_page() is a notification except it doesn't reach the caller. Gregory's patch (and previous shared info destructor patches) is an attempt to make it reach the caller, IIUC.
On Sat, 14 Nov 2009 00:27:46 -0500 Gregory Haskins <gregory.haskins@gmail.com> wrote: > Stephen Hemminger wrote: > > On Fri, 13 Nov 2009 21:27:57 -0500 > > Gregory Haskins <gregory.haskins@gmail.com> wrote: > > > >> Herbert Xu wrote: > >>> On Fri, Nov 13, 2009 at 08:33:35PM -0500, Gregory Haskins wrote: > >>>> Well, not with respect to the overall protocol, of course not. But with > >>>> respect to the buffer in question, it _has_ to be. Or am I missing > >>>> something? > >>> sendfile() has never guaranteed that the kernel is finished with > >>> the underlying pages when it returns. > >>> > >>> Cheers, > >> Clearly there must be _some_ mechanism to synchronize (e.g. > >> flush/barrier) though, right? Otherwise, that interface would seem to > >> be quite prone to races and would likely be unusable. So what does > >> said flush use to know when the buffer is free? > > > > No all the interfaces require a copy. > > I'm sorry, but I do not think that is correct. As others have pointed > out, that would not appear to be true for at least sendfile. Correct. > > > Actually, sendfile makes no guarantee about synchronization > > because the receiver of said file could be arbitrarily slow, and any attempt at locking down > > current contents of file is a denial of service exposure. > > I think you are inverting the problem space. It is fully expected that > changing the "file", or the pages that represent the file before the > packet is queued would constitute the modification of the stream on the > wire. > > I am more thinking about the applications of mmap+sendfile to implement > a zero-copy interface. As David mentions in another mail, it seems that > there is no sync mechanism available, so this would not appear to be a > viable use case today, unfortunately. yes, if you do mmap/sendfile then there is no synchronization, and the stack can hold onto your data for an arbitrary time. The file and mapping's can be closed but that risks tying up all of memory. > > > > People have tried doing copy-less send by page flipping, but the overhead of the IPI to > > invalidate the TLB exceeded the overhead of the copy. There was an Intel paper on this in > > at Linux Symposium (Ottawa) several years ago. > > I think you are confusing copy-less tx with copy-less rx. You can try > to do copy-less rx with page flipping, which has the IPI/TLB thrashing > properties you mention, and I agree is problematic. We are talking > about copy-less tx here, however, and therefore no page-flipping is > involved. Rather, we are just posting SG lists of pages directly to the > NIC (assuming the nic supports HIGH_DMA, etc). You do not need to flip > the page, or invalidate the TLB (and thus IPI the other cores) to do > this to my knowledge. > If you want to do copy-less tx for all applications, you have to do COW to handle the trivial case of : while (cc = read(infd, buffer, sizeof buffer)) { send(outsock, buffer, cc); }
Stephen Hemminger wrote: > On Sat, 14 Nov 2009 00:27:46 -0500 > Gregory Haskins <gregory.haskins@gmail.com> wrote: > >> Stephen Hemminger wrote: > >>> People have tried doing copy-less send by page flipping, but the overhead of the IPI to >>> invalidate the TLB exceeded the overhead of the copy. There was an Intel paper on this in >>> at Linux Symposium (Ottawa) several years ago. >> I think you are confusing copy-less tx with copy-less rx. You can try >> to do copy-less rx with page flipping, which has the IPI/TLB thrashing >> properties you mention, and I agree is problematic. We are talking >> about copy-less tx here, however, and therefore no page-flipping is >> involved. Rather, we are just posting SG lists of pages directly to the >> NIC (assuming the nic supports HIGH_DMA, etc). You do not need to flip >> the page, or invalidate the TLB (and thus IPI the other cores) to do >> this to my knowledge. >> > > If you want to do copy-less tx for all applications, you have to > do COW to handle the trivial case of : > > while (cc = read(infd, buffer, sizeof buffer)) { > send(outsock, buffer, cc); > } > > You certainly _could_ implement this as a COW I suppose, but that would be insane. If someone did do this, you are right: you need TLB invalidation. However, if I were going to actually propose the changeover of the system calls to use zero-copy (note that I am not), it would be based on the concept in this patch. That is: the send() would block until the NIC completes the DMA and the shinfo block is freed. Alternate implementations would be AIO based, where the shinfo destructor signifies the generation of the completion event. FWIW: The latter is conceptually similar to how this is being used in AlacrityVM. HTH Kind Regards, -Greg
On Mon, Nov 16, 2009 at 09:22:57AM -0500, Gregory Haskins wrote: > > But really, this is somewhat orthogonal to the original problem, so let > me see if we can bring it back on topic. Michael stated that this patch > in question may be problematic because there are places in the stack > that can get_page() without also maintaining a reference to the shinfo > object. Evgeniy seems to say the opposite. I am not sure who is right, > or if I misunderstood one or both of them. Any thoughts? There are loads of places where this can happen. Start with pskb_expand_head. Cheers,
Herbert Xu wrote: > On Mon, Nov 16, 2009 at 09:22:57AM -0500, Gregory Haskins wrote: >> But really, this is somewhat orthogonal to the original problem, so let >> me see if we can bring it back on topic. Michael stated that this patch >> in question may be problematic because there are places in the stack >> that can get_page() without also maintaining a reference to the shinfo >> object. Evgeniy seems to say the opposite. I am not sure who is right, >> or if I misunderstood one or both of them. Any thoughts? > > There are loads of places where this can happen. Start with > pskb_expand_head. Indeed, I see your point. Looks like we can potentially solve that with an extra level of indirection. E.g. skb->shinfo->owner, with a ref-count+callback. Thoughts? Kind Regards, -Greg
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index df7b23a..02cdab6 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -207,6 +207,8 @@ struct skb_shared_info { /* Intermediate layers must ensure that destructor_arg * remains valid until skb destructor */ void * destructor_arg; + void * priv; + void (*release)(struct sk_buff *skb); }; /* We divide dataref into two halves. The higher 16 bits hold references diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 80a9616..a7e40a9 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -219,6 +219,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, shinfo->tx_flags.flags = 0; skb_frag_list_init(skb); memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps)); + shinfo->release = NULL; + shinfo->priv = NULL; if (fclone) { struct sk_buff *child = skb + 1; @@ -350,6 +352,9 @@ static void skb_release_data(struct sk_buff *skb) if (skb_has_frags(skb)) skb_drop_fraglist(skb); + if (skb_shinfo(skb)->release) + skb_shinfo(skb)->release(skb); + kfree(skb->head); } } @@ -514,6 +519,8 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size) shinfo->tx_flags.flags = 0; skb_frag_list_init(skb); memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps)); + shinfo->release = NULL; + shinfo->priv = NULL; memset(skb, 0, offsetof(struct sk_buff, tail)); skb->data = skb->head + NET_SKB_PAD; @@ -856,6 +863,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, skb->hdr_len = 0; skb->nohdr = 0; atomic_set(&skb_shinfo(skb)->dataref, 1); + skb_shinfo(skb)->release = NULL; + skb_shinfo(skb)->priv = NULL; return 0; nodata: