Message ID | 494012C4.7090304@vlnb.net |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
Hi Vladislav. On Wed, Dec 10, 2008 at 10:04:36PM +0300, Vladislav Bolkhovitin (vst@vlnb.net) wrote: > In the chosen approach new optional field void *net_priv was added to > struct page. It is enclosed by There is a huge no-no in networking land on increasing skb. Reason is simple every skb will carry potentially unneded data as long as given option is enabled, and most of the time it will. To break this barrier one has to have (I wanted to write ego, but then decided to replace it with mojo) so huge reason to do this, that it is almost impossible to have. Something tells me that increasing page structure with 8 bytes because of zero-copy iscsi transfer is not that great idea, since basically every user out there will have it enabled in the distro config and will waste noticeble amount of ram. The same problem of not sending any kind of notification to the user when his pages are 'acked' by receiving some packet or freeing the data exists long ago and was tried to be fixed several times. The most applicable to your case maybe DST experience. DST is a block layer device and all its pages starting from quite recent kernels are not allowed to be slab ones (xfs was the last one who provided slab pages in the bios), so each page has two 'unused' pointers in lru list entry, which you may reuse. If scsi layer may have slab pages from some place (although this does not sound like a good idea, ->sendpage() will bug on on them anyway), this hack will not work, otherwise you only need to have net_page_get/put stuff in and do not mess with increasing page. And this was tested 3-4 kernel releases ago, so things may be changed. Another appropach is to increase skb's shared data (at the end of the skb->data), and this approach was not frowned upon too much either, but it requires to mess with skb->destructor, which may not be appropriate in some cases. If iscsi does not use sockets (it does iirc), things are much simpler. Hope this helps.
Hi Evgeniy, Evgeniy Polyakov wrote: > Hi Vladislav. > > On Wed, Dec 10, 2008 at 10:04:36PM +0300, Vladislav Bolkhovitin (vst@vlnb.net) wrote: >> In the chosen approach new optional field void *net_priv was added to >> struct page. It is enclosed by > > There is a huge no-no in networking land on increasing skb. > Reason is simple every skb will carry potentially unneded data as long > as given option is enabled, and most of the time it will. > To break this barrier one has to have (I wanted to write ego, but then > decided to replace it with mojo) so huge reason to do this, that it is > almost impossible to have. > > Something tells me that increasing page structure with 8 bytes because > of zero-copy iscsi transfer is not that great idea, since basically every > user out there will have it enabled in the distro config and will waste > noticeble amount of ram. The waste will be only 0.2% of RAM or 2MB per 1GB. Not much. Perhaps, not noticeable for an average user of distro kernels at all. Embedded people, who count each byte, almost always don't need iSCSI, so won't have any problems to disable TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION option. Also, as I wrote, iSCSI-SCST can work without this patch or with TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION option disabled, but with user space device handlers it will work considerably worse. Only few distro kernels users need an iSCSI target and only few among such users need to use a user space device handler. So, option TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION can be safely disabled in default configs of distro kernels. People who need both iSCSI target *and* fast working user space device handler would simply enable that option and rebuild the kernel. Rejecting this patch provides much worse alternative: those people would also have to *patch* the kernel at first, only then enable that option, then rebuild the kernel. > The same problem of not sending any kind of notification to the user > when his pages are 'acked' by receiving some packet or freeing the data > exists long ago and was tried to be fixed several times. > > The most applicable to your case maybe DST experience. DST is a block > layer device and all its pages starting from quite recent kernels are > not allowed to be slab ones (xfs was the last one who provided slab > pages in the bios), so each page has two 'unused' pointers in lru list > entry, which you may reuse. If scsi layer may have slab pages from some > place (although this does not sound like a good idea, ->sendpage() will > bug on on them anyway), this hack will not work, otherwise you only need > to have net_page_get/put stuff in and do not mess with increasing page. > And this was tested 3-4 kernel releases ago, so things may be changed. I've just rechecked if any of struct page fields can be shared with net_priv field. Unfortunately, it looks like none among the mandatory fields: - Transmitting pages can be mapped in some user space, so sharing in unions with _mapcount, mapping and index fields isn't possible - Transmitting pages can be in the page cache, so sharing lru field isn't possible as well It might, however, be shared with field "virtual", since SCST allocates only lowmem pages, but on non-HIGHMEM kernels and most HIGHMEM kernels WANT_PAGE_VIRTUAL isn't defined, so it will change basically nothing. > Another appropach is to increase skb's shared data (at the end of the > skb->data), and this approach was not frowned upon too much either, but > it requires to mess with skb->destructor, which may not be appropriate > in some cases. If iscsi does not use sockets (it does iirc), things are > much simpler. As I wrote, approach to use net_priv on the skb level was examined, but rejected as unpractical. Simply too many places should be modified to prevent merging skb's with different net_priv-like labels and those places are too inobvious. Implementation of this approach and, especially, maintenance it would be just a nightmare. Thanks, Vlad -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2008-12-11 at 21:16 +0300, Vladislav Bolkhovitin wrote: > Hi Evgeniy, > > Evgeniy Polyakov wrote: > > Hi Vladislav. > > > > On Wed, Dec 10, 2008 at 10:04:36PM +0300, Vladislav Bolkhovitin (vst@vlnb.net) wrote: > >> In the chosen approach new optional field void *net_priv was added to > >> struct page. It is enclosed by > > > > There is a huge no-no in networking land on increasing skb. > > Reason is simple every skb will carry potentially unneded data as long > > as given option is enabled, and most of the time it will. > > To break this barrier one has to have (I wanted to write ego, but then > > decided to replace it with mojo) so huge reason to do this, that it is > > almost impossible to have. > > > > Something tells me that increasing page structure with 8 bytes because > > of zero-copy iscsi transfer is not that great idea, since basically every > > user out there will have it enabled in the distro config and will waste > > noticeble amount of ram. > > The waste will be only 0.2% of RAM or 2MB per 1GB. Not much. Perhaps, > not noticeable for an average user of distro kernels at all. Embedded > people, who count each byte, almost always don't need iSCSI, so won't > have any problems to disable > TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION option. Actually, there are several other considerations: 1. struct page is a lowmem structure, so increasing its size becomes problematic on x86 PAE systems. 2. The current 64 bit struct page seems to be exactly pushing a cacheline boundary. Increasing it so it spills over will have a performance impact It's the performance problems that will be most critical, I suspect, so you'll need mm people buy in for doing this. One thing that leaps immediately to mind is that you could isolate this to the net layer by putting it in skb_frag_struct. However, such a move would require a proper API for this in net ... right now it looks like you're using the struct page addition to carry this information from SCSI to net, which is a bit of a layering violation. James -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
James Bottomley wrote: > On Thu, 2008-12-11 at 21:16 +0300, Vladislav Bolkhovitin wrote: >> Hi Evgeniy, >> >> Evgeniy Polyakov wrote: >>> Hi Vladislav. >>> >>> On Wed, Dec 10, 2008 at 10:04:36PM +0300, Vladislav Bolkhovitin (vst@vlnb.net) wrote: >>>> In the chosen approach new optional field void *net_priv was added to >>>> struct page. It is enclosed by >>> There is a huge no-no in networking land on increasing skb. >>> Reason is simple every skb will carry potentially unneded data as long >>> as given option is enabled, and most of the time it will. >>> To break this barrier one has to have (I wanted to write ego, but then >>> decided to replace it with mojo) so huge reason to do this, that it is >>> almost impossible to have. >>> >>> Something tells me that increasing page structure with 8 bytes because >>> of zero-copy iscsi transfer is not that great idea, since basically every >>> user out there will have it enabled in the distro config and will waste >>> noticeble amount of ram. >> The waste will be only 0.2% of RAM or 2MB per 1GB. Not much. Perhaps, >> not noticeable for an average user of distro kernels at all. Embedded >> people, who count each byte, almost always don't need iSCSI, so won't >> have any problems to disable >> TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION option. > > Actually, there are several other considerations: > > 1. struct page is a lowmem structure, so increasing its size > becomes problematic on x86 PAE systems. > 2. The current 64 bit struct page seems to be exactly pushing a > cacheline boundary. Increasing it so it spills over will have a > performance impact This is why I suggest to have CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION disabled in general kernels. ISCSI-SCST will still work with almost no performance loss for in-kernel backend and people would better recompile kernel, then patch it, then recompile. > It's the performance problems that will be most critical, I suspect, so > you'll need mm people buy in for doing this. I'll ask in linux-mm, thanks for the suggestion. > One thing that leaps immediately to mind is that you could isolate this > to the net layer by putting it in skb_frag_struct. However, such a move > would require a proper API for this in net ... To have net_priv analog in skb was the first idea I was tried. But I quickly gave up, because it would required that all the pages in each skb_frag_struct be from the same originator, i.e. with the same net_priv. It is unpractical to change all the operations with skb's to forbid merging them, if they have different net_priv. There are too many such places in very not obvious code pieces. > right now it looks like > you're using the struct page addition to carry this information from > SCSI to net, which is a bit of a layering violation. I don't think there is any layering violation here. Just lower layer notifies upper layer that transmission of a page has finished. It's done a bit not straightforward, but still basically the same as, for instance, on_free_cmd() callbacks which SCST core uses to notify target drivers and dev handlers that the corresponding command is about to be freed, so they can free associated with it data as well. Thanks, Vlad -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2008-12-12 at 22:25 +0300, Vladislav Bolkhovitin wrote: > > One thing that leaps immediately to mind is that you could isolate this > > to the net layer by putting it in skb_frag_struct. However, such a move > > would require a proper API for this in net ... > > To have net_priv analog in skb was the first idea I was tried. But I > quickly gave up, because it would required that all the pages in each > skb_frag_struct be from the same originator, i.e. with the same > net_priv. It is unpractical to change all the operations with skb's to > forbid merging them, if they have different net_priv. There are too many > such places in very not obvious code pieces. Actually, I said carry in skb_frag_struct not skb ... that allows for merging of skbs with different page sources. The API changes would have to allow setting at this level. > > right now it looks like > > you're using the struct page addition to carry this information from > > SCSI to net, which is a bit of a layering violation. > > I don't think there is any layering violation here. Just lower layer > notifies upper layer that transmission of a page has finished. It's done > a bit not straightforward, but still basically the same as, for > instance, on_free_cmd() callbacks which SCST core uses to notify target > drivers and dev handlers that the corresponding command is about to be > freed, so they can free associated with it data as well. The way you transmit the information you want notification is done by a private tag in struct page, so you're carrying the information on an object that belongs to neither layer ... that's the violation. It's essentially an extension of the net API that goes via the mm layer. James -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
James Bottomley wrote: > On Fri, 2008-12-12 at 22:25 +0300, Vladislav Bolkhovitin wrote: >>> One thing that leaps immediately to mind is that you could isolate this >>> to the net layer by putting it in skb_frag_struct. However, such a move >>> would require a proper API for this in net ... >> To have net_priv analog in skb was the first idea I was tried. But I >> quickly gave up, because it would required that all the pages in each >> skb_frag_struct be from the same originator, i.e. with the same >> net_priv. It is unpractical to change all the operations with skb's to >> forbid merging them, if they have different net_priv. There are too many >> such places in very not obvious code pieces. > > Actually, I said carry in skb_frag_struct not skb ... that allows for > merging of skbs with different page sources. The API changes would have > to allow setting at this level. Possibly, you are right, but the amount of changes would be very big, while the net_priv patch I did is just 309 lines long including all the descriptions and comments. And it's really simple and straightforward. >>> right now it looks like >>> you're using the struct page addition to carry this information from >>> SCSI to net, which is a bit of a layering violation. >> I don't think there is any layering violation here. Just lower layer >> notifies upper layer that transmission of a page has finished. It's done >> a bit not straightforward, but still basically the same as, for >> instance, on_free_cmd() callbacks which SCST core uses to notify target >> drivers and dev handlers that the corresponding command is about to be >> freed, so they can free associated with it data as well. > > The way you transmit the information you want notification is done by a > private tag in struct page, so you're carrying the information on an > object that belongs to neither layer ... that's the violation. It's > essentially an extension of the net API that goes via the mm layer. I understand your fears, but I think there can be another view on this. There are 2 layers: TCP and iSCSI. ISCSI asks TCP to send data and TCP performs that. ISCSI communicates with TCP using sendpage() function, it asks TCP to send pages. So, the unit of communications is page. When TCP finished transmitting a page, it notifies iSCSI that the page was transmitted. Now iSCSI needs to find out its own command, associated with that page. It does that using page->net_priv, which keeps pointer to the associated iSCSI command. I don't think there is any layer violation here, because iSCSI and TCP communicate using pages, not any other objects. If they communicated using, e.g., skb, no doubts, it would be a layers violation. But, since iSCSI and TCP communicate using pages, the situation is the same as between SCST core and a target driver in the on_free_cmd() example I used before. When SCST core notifies the target driver using on_free_cmd() callback that a command finished and is about to be freed, the target driver also needs to find out its own associated with the command data and it does that using tgt_priv pointer in struct scst_cmd. Similarly, for instance, struct scsi_cmnd has host_scribble pointer for low level drivers. Thanks, Vlad -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
If you don't believe that increasing struct page size for a fringe feature is a no-go submit the patch to the VM people and wait.. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Dec 10, 2008 at 10:45 PM, Evgeniy Polyakov <zbr@ioremap.net> wrote: > Another approach is to increase skb's shared data (at the end of the > skb->data), and this approach was not frowned upon too much either, but > it requires to mess with skb->destructor, which may not be appropriate > in some cases. If iscsi does not use sockets (it does iirc), things are > much simpler. Hello Evgeniy, Any idea whether it would be acceptable to extend "struct sock" with one or two pointers to callback functions ? These callback functions could be called from skb_release_data() etc. through the "struct sock* sk" member of "struct sk_buff". Bart. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Bart. On Tue, Dec 16, 2008 at 05:00:07PM +0100, Bart Van Assche (bart.vanassche@gmail.com) wrote: > Any idea whether it would be acceptable to extend "struct sock" with > one or two pointers to callback functions ? These callback functions > could be called from skb_release_data() etc. through the "struct sock* > sk" member of "struct sk_buff". That's unlikely, since again this will be unused by the all but iscsi users. But you can be tricky and replace skb destructor with your own pointer and call 'old' destructor yourself. Its main goal is to adjust socket memory statistics and since all iscsi sockets are under your full control, you can play with it. Although this sounds like a hack, with proper implementation it is not that bad idea. As additional pointer you can use sk_user_data which is used by rpc socket calls only iirc.
Christoph Hellwig wrote: > If you don't believe that increasing struct page size for a fringe > feature is a no-go submit the patch to the VM people and wait.. I guessed, it can be no-go, this is why I wrote that this feature isn't required for iSCSI-SCST functionality. But, since the implementation is *so* simple and doesn't do any layering violation, I have a hope that once it disabled by default it will be harmless and, hence, could be accepted. Only few people need this feature. Otherwise there will be an alternative for them between enable that feature, then recompile, vs patch the kernel, enable that feature, then recompile. When I was developing it my main goal was to do it as simple as possible. I believe, I succeeded in it. If it's rejected, it will simply live out of tree until me or somebody else finds time to reimplement it in an acceptable, although at least 10 times more complicated manner. Thanks, I'll follow your advise. Vlad -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello linux-mm, Recently I submitted a new SCSI target framework (SCST) and 4 target drivers for it for the first iteration of review and comments. See http://lkml.org/lkml/2008/12/10/245 for details. An iSCSI target driver iSCSI-SCST was a part of the patchset (http://lkml.org/lkml/2008/12/10/293). For it a nice optimization to have TCP zero-copy transmit of user space data was implemented. Patch, implementing this optimization was also sent in the patchset, see http://lkml.org/lkml/2008/12/10/296. I would like to ask, if the approach used in this patch can be acceptable from your point of view? I understand, that extending struct page is a very much undesirable, but, from other side: - This approach is very simple and straightforward. The patch is only 309 lines long, including comments. All other alternative implementations would be at least an order of magnitude more complicated. - Related kernel config option TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION should be disabled by default in general distro kernels, so the would be no harm at all from this patch. ISCSI-SCST can work without this patch or with TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION option disabled, although with user space device handlers it will work considerably worse. Only few distro kernels users need an iSCSI target and only few among such users need to use user space device handlers. People who need both iSCSI target *and* fast working user space device handlers would simply enable that option and rebuild the kernel. Rejecting this patch provides much worse alternative: those people would also have to *patch* the kernel at first, only then enable that option, then rebuild the kernel. - Although usage of struct page to keep network related pointer might look as a layering violation, it isn't. I wrote in http://lkml.org/lkml/2008/12/15/190 why. Thanks, Vlad -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/18/2008 12:35 PM, Vladislav Bolkhovitin wrote: > An iSCSI target driver iSCSI-SCST was a part of the patchset > (http://lkml.org/lkml/2008/12/10/293). For it a nice optimization to > have TCP zero-copy transmit of user space data was implemented. Patch, > implementing this optimization was also sent in the patchset, see > http://lkml.org/lkml/2008/12/10/296. I'm probably ignorant of about 90% of the context here, but isn't this the sort of problem that was supposed to have been solved by vmsplice(2)? - DML -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Vladislav Bolkhovitin <vst@vlnb.net> writes: > > - Although usage of struct page to keep network related pointer might > look as a layering violation, it isn't. I wrote in > http://lkml.org/lkml/2008/12/15/190 why. Sorry but extending struct page for this is really a bad idea because of the extreme memory overhead even when it's not used (which is a problem on distribution kernels) Find some other way to store this information. Even for patches with more general value it was not acceptable. -Andi
David M. Lloyd, on 12/18/2008 09:43 PM wrote: > On 12/18/2008 12:35 PM, Vladislav Bolkhovitin wrote: >> An iSCSI target driver iSCSI-SCST was a part of the patchset >> (http://lkml.org/lkml/2008/12/10/293). For it a nice optimization to >> have TCP zero-copy transmit of user space data was implemented. Patch, >> implementing this optimization was also sent in the patchset, see >> http://lkml.org/lkml/2008/12/10/296. > > I'm probably ignorant of about 90% of the context here, but isn't this the > sort of problem that was supposed to have been solved by vmsplice(2)? No, vmsplice can't help here. ISCSI-SCST is a kernel space driver. But, even if it was a user space driver, vmsplice wouldn't change anything much. It doesn't have a possibility for a user to know, when transmission of the data finished. So, it is intended to be used as: vmsplice() buffer -> munmap() the buffer -> mmap() new buffer -> vmsplice() it. But on the mmap() stage kernel has to zero all the newly mapped pages and zeroing memory isn't much faster, than copying it. Hence, there would be no considerable performance increase. Thanks, Vlad -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Andi Kleen, on 12/19/2008 02:27 PM wrote: > Vladislav Bolkhovitin <vst@vlnb.net> writes: >> - Although usage of struct page to keep network related pointer might >> look as a layering violation, it isn't. I wrote in >> http://lkml.org/lkml/2008/12/15/190 why. > > Sorry but extending struct page for this is really a bad idea because > of the extreme memory overhead even when it's not used (which is a > problem on distribution kernels) Find some other way to store this > information. Even for patches with more general value it was not > acceptable. Sure, this is why I propose to disable that option by default in distribution kernels, so it would produce no harm. ISCSI-SCST can work in this configuration quite well too. People who need both iSCSI target *and* fast working user space device handlers would simply enable that option and rebuild the kernel. Rejecting this patch provides much worse alternative: those people would also have to *patch* the kernel at first, only then enable that option, then rebuild the kernel. (I'm repeating it to make sure you didn't miss this my point; it was in the part of my original message, which you cut out.) Thanks, Vlad -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Andi Kleen, on 12/19/2008 09:00 PM wrote: >> Sure, this is why I propose to disable that option by default in >> distribution kernels, so it would produce no harm. > > That would make the option useless for most users. You might as well > not bother merging then. I believe 99.(9)% of users prefer don't patch kernel, if possible. >> first, only then enable that option, then rebuild the kernel. (I'm >> repeating it to make sure you didn't miss this my point; it was in the >> part of my original message, which you cut out.) > > That was such a ridiculous suggestion, I didn't take it seriously. > > Also it should be really not rocket science to use a separate > table for this. Sorry, what do you mean? If usage of something like a hash table to map pages to the corresponding iSCSI commands, this approach was evaluated and rejected, because it wouldn't provide much performance increase, which would worth the effort. See details in the end of the patch description in http://lkml.org/lkml/2008/12/10/296 Thanks, Vlad -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> Sure, this is why I propose to disable that option by default in > distribution kernels, so it would produce no harm. That would make the option useless for most users. You might as well not bother merging then. > first, only then enable that option, then rebuild the kernel. (I'm > repeating it to make sure you didn't miss this my point; it was in the > part of my original message, which you cut out.) That was such a ridiculous suggestion, I didn't take it seriously. Also it should be really not rocket science to use a separate table for this. -Andi
On Fri, Dec 19 2008, Vladislav Bolkhovitin wrote: > David M. Lloyd, on 12/18/2008 09:43 PM wrote: > >On 12/18/2008 12:35 PM, Vladislav Bolkhovitin wrote: > >>An iSCSI target driver iSCSI-SCST was a part of the patchset > >>(http://lkml.org/lkml/2008/12/10/293). For it a nice optimization to > >>have TCP zero-copy transmit of user space data was implemented. Patch, > >>implementing this optimization was also sent in the patchset, see > >>http://lkml.org/lkml/2008/12/10/296. > > > >I'm probably ignorant of about 90% of the context here, but isn't this the > >sort of problem that was supposed to have been solved by vmsplice(2)? > > No, vmsplice can't help here. ISCSI-SCST is a kernel space driver. But, > even if it was a user space driver, vmsplice wouldn't change anything > much. It doesn't have a possibility for a user to know, when > transmission of the data finished. So, it is intended to be used as: > vmsplice() buffer -> munmap() the buffer -> mmap() new buffer -> > vmsplice() it. But on the mmap() stage kernel has to zero all the newly > mapped pages and zeroing memory isn't much faster, than copying it. > Hence, there would be no considerable performance increase. vmsplice() isn't the right choice, but splice() very well could be. You could easily use splice internally as well. The vmsplice() part sort-of applies in the sense that you want to fill pages into a pipe, which is essentially what vmsplice() does. You'd need some helper to do that. And the ack-on-xmit-done bits is something that splice-to-socket needs anyway, so I think it'd be quite a suitable choice for this.
Jens Axboe, on 12/19/2008 10:07 PM wrote: > On Fri, Dec 19 2008, Vladislav Bolkhovitin wrote: >> David M. Lloyd, on 12/18/2008 09:43 PM wrote: >>> On 12/18/2008 12:35 PM, Vladislav Bolkhovitin wrote: >>>> An iSCSI target driver iSCSI-SCST was a part of the patchset >>>> (http://lkml.org/lkml/2008/12/10/293). For it a nice optimization to >>>> have TCP zero-copy transmit of user space data was implemented. Patch, >>>> implementing this optimization was also sent in the patchset, see >>>> http://lkml.org/lkml/2008/12/10/296. >>> I'm probably ignorant of about 90% of the context here, but isn't this the >>> sort of problem that was supposed to have been solved by vmsplice(2)? >> No, vmsplice can't help here. ISCSI-SCST is a kernel space driver. But, >> even if it was a user space driver, vmsplice wouldn't change anything >> much. It doesn't have a possibility for a user to know, when >> transmission of the data finished. So, it is intended to be used as: >> vmsplice() buffer -> munmap() the buffer -> mmap() new buffer -> >> vmsplice() it. But on the mmap() stage kernel has to zero all the newly >> mapped pages and zeroing memory isn't much faster, than copying it. >> Hence, there would be no considerable performance increase. > > vmsplice() isn't the right choice, but splice() very well could be. You > could easily use splice internally as well. The vmsplice() part sort-of > applies in the sense that you want to fill pages into a pipe, which is > essentially what vmsplice() does. You'd need some helper to do that. Sorry, Jens, but splice() works only if there is a file handle on the another side, so user space doesn't see data buffers. But SCST needs to serve a wider usage cases, like reading data with decompression from a virtual tape, where decompression is done in user space. For those only complete zero-copy network send, which I implemented, can give the best performance. > And > the ack-on-xmit-done bits is something that splice-to-socket needs > anyway, so I think it'd be quite a suitable choice for this. So, are you writing that splice() could also benefit from the zero-copy transmit feature, like I implemented? Thanks, Vlad -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Dec 19 2008, Vladislav Bolkhovitin wrote: > Jens Axboe, on 12/19/2008 10:07 PM wrote: > >On Fri, Dec 19 2008, Vladislav Bolkhovitin wrote: > >>David M. Lloyd, on 12/18/2008 09:43 PM wrote: > >>>On 12/18/2008 12:35 PM, Vladislav Bolkhovitin wrote: > >>>>An iSCSI target driver iSCSI-SCST was a part of the patchset > >>>>(http://lkml.org/lkml/2008/12/10/293). For it a nice optimization to > >>>>have TCP zero-copy transmit of user space data was implemented. Patch, > >>>>implementing this optimization was also sent in the patchset, see > >>>>http://lkml.org/lkml/2008/12/10/296. > >>>I'm probably ignorant of about 90% of the context here, but isn't this > >>>the sort of problem that was supposed to have been solved by vmsplice(2)? > >>No, vmsplice can't help here. ISCSI-SCST is a kernel space driver. But, > >>even if it was a user space driver, vmsplice wouldn't change anything > >>much. It doesn't have a possibility for a user to know, when > >>transmission of the data finished. So, it is intended to be used as: > >>vmsplice() buffer -> munmap() the buffer -> mmap() new buffer -> > >>vmsplice() it. But on the mmap() stage kernel has to zero all the newly > >>mapped pages and zeroing memory isn't much faster, than copying it. > >>Hence, there would be no considerable performance increase. > > > >vmsplice() isn't the right choice, but splice() very well could be. You > >could easily use splice internally as well. The vmsplice() part sort-of > >applies in the sense that you want to fill pages into a pipe, which is > >essentially what vmsplice() does. You'd need some helper to do that. > > Sorry, Jens, but splice() works only if there is a file handle on the > another side, so user space doesn't see data buffers. But SCST needs to > serve a wider usage cases, like reading data with decompression from a > virtual tape, where decompression is done in user space. For those only > complete zero-copy network send, which I implemented, can give the best > performance. __splice_from_pipe() takes a pipe, a descriptor and an actor. There's absolutely ZERO reason you could not reuse most of that for this implementation. The big bonus here is that getting the put correct from networking would even make splice() better for everyone. Win for Linux, win for you since it'll make it MUCH easier for you to get this stuff in. Looking at your original patch and I almost think it's a flame bait to induce discussion (nothing wrong with that, that approach works quite well and has been used before). There's no way in HELL that it'd ever be a merge candidate. And I suspect you know that, at least I hope you do or you are farther away from going forward with this than you think. So don't look at splice() the system call, look at the infrastructure and check if that could be useful for your case. To me it looks absolutely like it could, if you goal is just zero-copy transmit. The only missing piece is dropping the reference and signalling page consumption at the right point, which is when the data is safe to be reused. That very bit is missing, but that should be all as far as I can tell. > >And > >the ack-on-xmit-done bits is something that splice-to-socket needs > >anyway, so I think it'd be quite a suitable choice for this. > > So, are you writing that splice() could also benefit from the zero-copy > transmit feature, like I implemented? I like how you want to reinvent everything, perhaps you should spend a little more time looking into various other approaches? splice() already does zero-copy network transmit, there are no copies going on. Ideally, you'd have zero copies moving data into your pipe, but migrade/move isn't quite there yet. But that doesn't apply to your case at all. What is missing, as I wrote, is the 'release on ack' and not on pipe buffer release. This is similar to the get_page/put_page stuff you did in your patch, but don't go claiming that zero-copy transmit is a Vladislav original - the ->sendpage() does no copies.
Vladislav Bolkhovitin wrote: > This patch implements support for zero-copy TCP transmit of user space > data. It is necessary in iSCSI-SCST target driver for transmitting data > from user space buffers, supplied by user space backend handlers. In > this case SCST core needs to know when TCP finished transmitting the > data, so the corresponding buffers can be reused or freed. Without this > patch it isn't possible, so iSCSI-SCST has to use data copying to TCP > send buffers function sock_sendpage(). ISCSI-SCST also works without > this patch, but that this patch gives a nice performance improvement. > In Xen networking it looks like we're going to need to solve a very similar problem. When a guest (non-privileged, with no direct hardware access) wants to send a network packet, it passes it over to the privileged (host) domain, who then puts it into the network stack for transmission. The packet gets passed over in a page granted (read "borrowed") from the guest domain. We can't return it to the guest while its tangled up in the host's network stack, so we need notification of when the stack has finished with the page. The out of tree Xen patches do this by marking a page as having been allocated by a foreign allocator, and overloads the private memory of struct page with a destructor function pointer, which put_page calls as appropriate. We can do this because the page is definitely "owned" by the Xen subsystem, so most of the fields are available for recycling; the main problem is that we need to grab another page flag. Your case sounds more complex because the source page can be mapped by userspace and/or be in the pagecache, so everything is already claimed. As with your case, we can simply copy the page data if this mechanism isn't available. But it would be nice if it were. > 1. Add net_priv analog in struct sk_buff, not in struct page. But then > it would be required that all the pages in each skb must be from the > same originator, i.e. with the same net_priv. It is unpractical to > change all the operations with skb's to forbid merging them, if they > have different net_priv. I tried, but quickly gave up. There are too > many such places in very not obvious code pieces. > I think Rusty has a patch to put some kind of put notifier in struct skb_shared_info, but I'm not sure of the details. > 2. Have in iSCSI-SCST a hashed list to translate page to iSCSI cmd by a > simple search function. This approach was rejected, because to copy a > page a modern CPU needs using MMX about 1500 ticks. Is that the cold cache timing? > It was observed, > that each page can be referenced by TCP during transmit about 20 times > or even more. So, if each search needs, say, 20 ticks, the overall > search time will be 20*20*2 (to get() and put()) = 800 ticks. So, this > approach would considerably worse performance-wise to the chosen > approach and provide not too much benefit. > Wouldn't you only need to do the lookup on the last put? An external lookup table might well for for us, if the net_put_page() change is acceptable to the network folk. J -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Dec 19, 2008 at 08:27:36PM +0100, Jens Axboe (jens.axboe@oracle.com) wrote: > What is missing, as I wrote, is the 'release on ack' and not on pipe > buffer release. This is similar to the get_page/put_page stuff you did > in your patch, but don't go claiming that zero-copy transmit is a > Vladislav original - the ->sendpage() does no copies. Just my small rant: it does, when underlying device does not support hardware tx checksumming and scatter/gather, which is likely exception than a rule for the modern NICs. As of having notifications of the received ack (or from user's point of view notification of the freeing of the buffer), I have following idea in mind: extend skb ahsred info by copy of the frag array and additional destructor field, which will be invoked when not only skb but also all its clones are freed (that's when shared info is freed), so that user could save some per-page context in fraglist and work with it when data is not used anymore. Extending page or skb structure is a no-go for sure, and actually even shared info is not rubber, but there we can at least add something... If only destructor field is allowed (similar patch was not rejected), scsi can save its pages in the tree (indexed by the page pointer) and traverse it when destructor is invoked selecting pages found in the freed skb.
On Fri, Dec 19, 2008 at 12:21:41PM -0800, Jeremy Fitzhardinge (jeremy@goop.org) wrote: > I think Rusty has a patch to put some kind of put notifier in struct > skb_shared_info, but I'm not sure of the details. Yes, he added destructor callback into shared info. There maybe a problem though, if iscsi will run over xen network, but in this case xen may copy the data, or iscsi may do that after determining that underlying network device does not allow shared info destructor (vis device/route flag for example). > Wouldn't you only need to do the lookup on the last put? > > An external lookup table might well for for us, if the net_put_page() > change is acceptable to the network folk. That sounds like the best solution for this problem. David, will you accept shared info destructor? And second fraglist? (Put to the new line to easily say no here :)
Evgeniy Polyakov wrote: > On Fri, Dec 19, 2008 at 12:21:41PM -0800, Jeremy Fitzhardinge (jeremy@goop.org) wrote: > >> I think Rusty has a patch to put some kind of put notifier in struct >> skb_shared_info, but I'm not sure of the details. >> > > Yes, he added destructor callback into shared info. > > There maybe a problem though, if iscsi will run over xen network, but in > this case xen may copy the data, or iscsi may do that after determining > that underlying network device does not allow shared info destructor > (vis device/route flag for example). > Xen only needs the callback in the case of traffic originating in another Xen domain. If iscsi is involved, it will be running in the other domain, and so all its callbacks and so on will happen there. There's no conflict. >> Wouldn't you only need to do the lookup on the last put? >> >> An external lookup table might well for for us, if the net_put_page() >> change is acceptable to the network folk. >> > > That sounds like the best solution for this problem. > An external lookup would just need to change put_page -> net_put_page, I think. > David, will you accept shared info destructor? > I'm not very familiar with the network stack, but am I right in assuming that the shared_info destructor would be called when the network stack has finished with all the pages it refers to? And those pages could be the combination of pages separately submitted in different skbs? J -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Dec 19, 2008 at 02:21:18PM -0800, Jeremy Fitzhardinge (jeremy@goop.org) wrote: > >There maybe a problem though, if iscsi will run over xen network, but in > >this case xen may copy the data, or iscsi may do that after determining > >that underlying network device does not allow shared info destructor > >(vis device/route flag for example). > > > > Xen only needs the callback in the case of traffic originating in > another Xen domain. If iscsi is involved, it will be running in the > other domain, and so all its callbacks and so on will happen there. > There's no conflict. That's good news. > >>Wouldn't you only need to do the lookup on the last put? > >> > >>An external lookup table might well for for us, if the net_put_page() > >>change is acceptable to the network folk. > >> > > > >That sounds like the best solution for this problem. > > > > An external lookup would just need to change put_page -> net_put_page, I > think. It is not needed I think, having skb shared info destructor has access to the all pages in that skb. > >David, will you accept shared info destructor? > > > > I'm not very familiar with the network stack, but am I right in assuming > that the shared_info destructor would be called when the network stack > has finished with all the pages it refers to? And those pages could be > the combination of pages separately submitted in different skbs? Shared info is freed when there are no skbs referring to the shared info in question. Skb holds all pages in shared info in the fraglist array, so when it is about to be freed, it means that network stack does not use it (particulary it will putpage every page in fraglist). Usually there are two skbs in the network stack per packet in TCP (allocated at once though via fastclone mechanims): one is provided to the device (and will be freed there) and another one is placed into retransmit queue, where it will be located and freed when ack has been received. There may be another layers which may clone skb, but its shared info structure (shared between the clones) will only be freed when all users freed appropriate cloned skbs.
Evgeniy Polyakov wrote: > Shared info is freed when there are no skbs referring to the shared info > in question. Skb holds all pages in shared info in the fraglist array, > so when it is about to be freed, it means that network stack does not > use it (particulary it will putpage every page in fraglist). Usually > there are two skbs in the network stack per packet in TCP (allocated at > once though via fastclone mechanims): one is provided to the device > (and will be freed there) and another one is placed into retransmit > queue, where it will be located and freed when ack has been received. > > There may be another layers which may clone skb, but its shared info > structure (shared between the clones) will only be freed when all users > freed appropriate cloned skbs. > I see. One more question: how would I go about submitting some data with the callback attached to it? J -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Dec 19, 2008 at 05:56:11PM -0800, Jeremy Fitzhardinge wrote: > Evgeniy Polyakov wrote: >> Shared info is freed when there are no skbs referring to the shared info >> in question. Skb holds all pages in shared info in the fraglist array, >> so when it is about to be freed, it means that network stack does not >> use it (particulary it will putpage every page in fraglist). Usually >> there are two skbs in the network stack per packet in TCP (allocated at >> once though via fastclone mechanims): one is provided to the device >> (and will be freed there) and another one is placed into retransmit >> queue, where it will be located and freed when ack has been received. >> >> There may be another layers which may clone skb, but its shared info >> structure (shared between the clones) will only be freed when all users >> freed appropriate cloned skbs. This is all correct. However, please note that that if any clone does a pskb_expand_head then it will get its own private copy of of the shared info. So you can't use the shared info to ref count the pages in it. Cheers,
Herbert Xu wrote: > On Fri, Dec 19, 2008 at 05:56:11PM -0800, Jeremy Fitzhardinge wrote: > >> Evgeniy Polyakov wrote: >> >>> Shared info is freed when there are no skbs referring to the shared info >>> in question. Skb holds all pages in shared info in the fraglist array, >>> so when it is about to be freed, it means that network stack does not >>> use it (particulary it will putpage every page in fraglist). Usually >>> there are two skbs in the network stack per packet in TCP (allocated at >>> once though via fastclone mechanims): one is provided to the device >>> (and will be freed there) and another one is placed into retransmit >>> queue, where it will be located and freed when ack has been received. >>> >>> There may be another layers which may clone skb, but its shared info >>> structure (shared between the clones) will only be freed when all users >>> freed appropriate cloned skbs. >>> > > This is all correct. However, please note that that if any clone > does a pskb_expand_head then it will get its own private copy of > of the shared info. So you can't use the shared info to ref count > the pages in it. Ah, so the lifetime of the shared_info structure doesn't match the lifetime of the underlying pages, and this mechanism would be insufficient for my purposes? If so, how can it be solved? Thanks, J -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Dec 19, 2008 at 10:14:47PM -0800, Jeremy Fitzhardinge wrote: > > Ah, so the lifetime of the shared_info structure doesn't match the > lifetime of the underlying pages, and this mechanism would be > insufficient for my purposes? Right. > If so, how can it be solved? Well since each individual page can be pulled out I think you just have to track them. Cheers,
Herbert Xu wrote: >> If so, how can it be solved? >> > > Well since each individual page can be pulled out I think you > just have to track them. > Hm. So if I get a destructor call from the shared_info, can I go an inspect the page refcounts to see if its really the last use? J -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Dec 19, 2008 at 11:43:34PM -0800, Jeremy Fitzhardinge wrote: > > Hm. So if I get a destructor call from the shared_info, can I go an > inspect the page refcounts to see if its really the last use? The pages that were originally in the shared_info at creation time may no longer be there by the time it's freed because of pskb_pull_tail. Cheers,
On Sat, Dec 20, 2008 at 07:10:45PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote: > > Hm. So if I get a destructor call from the shared_info, can I go an > > inspect the page refcounts to see if its really the last use? > > The pages that were originally in the shared_info at creation > time may no longer be there by the time it's freed because of > pskb_pull_tail. Things should work fine, since pskb_expand_head() copies whole shared info structure (and thus will copy destructor), get all pages and then copy all pointers into the new skb, and then release old skb's data. So destructor for the pages should not rely on which skb it is called on and check if pages are about to be really freed (i.e. check theirs reference counter). __pskb_pull_tail() is tricky, it just puts some pages it does not want to be present in the skb, but it could be possible to add there destructor callback from the original skb with partial flag (or just having destructor with two parameters: skb and page, and if page is not NULL, then actually only given page is freed, otherwise the whole skb).
Evgeniy Polyakov wrote: > Things should work fine, since pskb_expand_head() copies whole shared > info structure (and thus will copy destructor), get all pages and then > copy all pointers into the new skb, and then release old skb's data. > > So destructor for the pages should not rely on which skb it is called on > and check if pages are about to be really freed (i.e. check theirs > reference counter). > OK. > __pskb_pull_tail() is tricky, it just puts some pages it does not want > to be present in the skb, but it could be possible to add there > destructor callback from the original skb with partial flag (or just > having destructor with two parameters: skb and page, and if page is not > NULL, then actually only given page is freed, otherwise the whole skb). > Yes, that doesn't sound too bad. J -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Jens Axboe, on 12/19/2008 10:27 PM wrote: >>>>>> An iSCSI target driver iSCSI-SCST was a part of the patchset >>>>>> (http://lkml.org/lkml/2008/12/10/293). For it a nice optimization to >>>>>> have TCP zero-copy transmit of user space data was implemented. Patch, >>>>>> implementing this optimization was also sent in the patchset, see >>>>>> http://lkml.org/lkml/2008/12/10/296. >>>>> I'm probably ignorant of about 90% of the context here, but isn't this >>>>> the sort of problem that was supposed to have been solved by vmsplice(2)? >>>> No, vmsplice can't help here. ISCSI-SCST is a kernel space driver. But, >>>> even if it was a user space driver, vmsplice wouldn't change anything >>>> much. It doesn't have a possibility for a user to know, when >>>> transmission of the data finished. So, it is intended to be used as: >>>> vmsplice() buffer -> munmap() the buffer -> mmap() new buffer -> >>>> vmsplice() it. But on the mmap() stage kernel has to zero all the newly >>>> mapped pages and zeroing memory isn't much faster, than copying it. >>>> Hence, there would be no considerable performance increase. >>> vmsplice() isn't the right choice, but splice() very well could be. You >>> could easily use splice internally as well. The vmsplice() part sort-of >>> applies in the sense that you want to fill pages into a pipe, which is >>> essentially what vmsplice() does. You'd need some helper to do that. >> Sorry, Jens, but splice() works only if there is a file handle on the >> another side, so user space doesn't see data buffers. But SCST needs to >> serve a wider usage cases, like reading data with decompression from a >> virtual tape, where decompression is done in user space. For those only >> complete zero-copy network send, which I implemented, can give the best >> performance. > > __splice_from_pipe() takes a pipe, a descriptor and an actor. There's > absolutely ZERO reason you could not reuse most of that for this > implementation. The big bonus here is that getting the put correct from > networking would even make splice() better for everyone. Win for Linux, > win for you since it'll make it MUCH easier for you to get this stuff > in. Looking at your original patch and I almost think it's a flame bait > to induce discussion (nothing wrong with that, that approach works quite > well and has been used before). There's no way in HELL that it'd ever be > a merge candidate. And I suspect you know that, at least I hope you do > or you are farther away from going forward with this than you think. > > So don't look at splice() the system call, look at the infrastructure > and check if that could be useful for your case. To me it looks > absolutely like it could, if you goal is just zero-copy transmit. I looked at the splice code again to make sure I don't miss anything. __splice_from_pipe() leads to pipe_to_sendpage(), which leads to sock_sendpage, then to sock->sendpage(). Sorry, but I don't see any point why to go over all the complicated splice infrastructure instead of directly call sock->sendpage(), as I do. > The > only missing piece is dropping the reference and signalling page > consumption at the right point, which is when the data is safe to be > reused. That very bit is missing, but that should be all as far as I can > tell. This is exactly what I implemented in the patch we are discussing. >>> And >>> the ack-on-xmit-done bits is something that splice-to-socket needs >>> anyway, so I think it'd be quite a suitable choice for this. >> So, are you writing that splice() could also benefit from the zero-copy >> transmit feature, like I implemented? > > I like how you want to reinvent everything, perhaps you should spend a > little more time looking into various other approaches? splice() already > does zero-copy network transmit, there are no copies going on. Ideally, > you'd have zero copies moving data into your pipe, but migrade/move > isn't quite there yet. But that doesn't apply to your case at all. > > What is missing, as I wrote, is the 'release on ack' and not on pipe > buffer release. This is similar to the get_page/put_page stuff you did > in your patch, but don't go claiming that zero-copy transmit is a > Vladislav original - the ->sendpage() does no copies. Jens, I have never claimed I reinvented ->sendpage(). Quite opposite, I use it. I only extended it by a missing feature. Although, seems, since you were misleaded, I should apologize for not too good description of the patch. Thanks, Vlad -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Jeremy Fitzhardinge, on 12/19/2008 11:21 PM wrote: [...] > As with your case, we can simply copy the page data if this mechanism > isn't available. But it would be nice if it were. > >> 1. Add net_priv analog in struct sk_buff, not in struct page. But then >> it would be required that all the pages in each skb must be from the >> same originator, i.e. with the same net_priv. It is unpractical to >> change all the operations with skb's to forbid merging them, if they >> have different net_priv. I tried, but quickly gave up. There are too >> many such places in very not obvious code pieces. >> > > I think Rusty has a patch to put some kind of put notifier in struct > skb_shared_info, but I'm not sure of the details. > >> 2. Have in iSCSI-SCST a hashed list to translate page to iSCSI cmd by a >> simple search function. This approach was rejected, because to copy a >> page a modern CPU needs using MMX about 1500 ticks. > > Is that the cold cache timing? Should be L2 cache hot, which is almost always the case if FILEIO is used, because data are just copied from the page cache. Although, frankly, at the moment I can't find from where I got that number.. >> It was observed, >> that each page can be referenced by TCP during transmit about 20 times >> or even more. So, if each search needs, say, 20 ticks, the overall >> search time will be 20*20*2 (to get() and put()) = 800 ticks. So, this >> approach would considerably worse performance-wise to the chosen >> approach and provide not too much benefit. > > Wouldn't you only need to do the lookup on the last put? No, because you can't say which one is the last. E.g., a page can be mmaped to another process, while it's being transmitted. So, the only possible way is to track all gets and puts done by networking using some external reference counting (net_ref_cnt in case if iscsi-scst). Vlad -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov, on 12/20/2008 01:32 PM wrote: > On Sat, Dec 20, 2008 at 07:10:45PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote: >>> Hm. So if I get a destructor call from the shared_info, can I go an >>> inspect the page refcounts to see if its really the last use? >> The pages that were originally in the shared_info at creation >> time may no longer be there by the time it's freed because of >> pskb_pull_tail. > > Things should work fine, since pskb_expand_head() copies whole shared > info structure (and thus will copy destructor), get all pages and then > copy all pointers into the new skb, and then release old skb's data. > > So destructor for the pages should not rely on which skb it is called on > and check if pages are about to be really freed (i.e. check theirs > reference counter). > > __pskb_pull_tail() is tricky, it just puts some pages it does not want > to be present in the skb, but it could be possible to add there > destructor callback from the original skb with partial flag (or just > having destructor with two parameters: skb and page, and if page is not > NULL, then actually only given page is freed, otherwise the whole skb). Actually, there's another way, which seems to be a lot simpler. Alexey Kuznetsov privately suggested it to me. In skb_shared_info new pointer transaction_token would be added, which would point on: struct sk_transaction_token { atomic_t io_count; struct sk_transaction_token *next; unsigned long token; unsigned long private; void (*finish_callback)(struct sk_transaction_token *); }; When skb is translated, transaction_token inherited. If 2 skb are merged (the same places where I put net_get_page's in my patch), the *older* token is inherited. This is the main point of this idea. Before starting new asynchronous send a client would open a new token. Everything sent then would receive that token. Finish_callback() would be called and the corresponding token freed, when io_count == 0 *AND* all previous tokens closed. This idea seems to be simpler, than even what Rusty implemented. Correct me, if I wrong. But, unfortunately, in the near future I will have no time to develop it.. :-( Vlad -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Dec 23, 2008 at 10:16:25PM +0300, Vladislav Bolkhovitin (vst@vlnb.net) wrote: > Actually, there's another way, which seems to be a lot simpler. Alexey > Kuznetsov privately suggested it to me. > > In skb_shared_info new pointer transaction_token would be added, which > would point on: > > struct sk_transaction_token > { > atomic_t io_count; > struct sk_transaction_token *next; > unsigned long token; > unsigned long private; > void (*finish_callback)(struct > sk_transaction_token *); > }; > > When skb is translated, transaction_token inherited. If 2 skb are merged > (the same places where I put net_get_page's in my patch), the *older* > token is inherited. This is the main point of this idea. > > Before starting new asynchronous send a client would open a new token. > Everything sent then would receive that token. Finish_callback() would > be called and the corresponding token freed, when io_count == 0 *AND* > all previous tokens closed. > > This idea seems to be simpler, than even what Rusty implemented. Correct > me, if I wrong. But, unfortunately, in the near future I will have no > time to develop it.. :-( Yes, it is simpler and cleaner, but it requires additional allocation. This is additional (and quite noticeble) overhead.
Evgeniy Polyakov, on 12/24/2008 12:38 AM wrote: > On Tue, Dec 23, 2008 at 10:16:25PM +0300, Vladislav Bolkhovitin (vst@vlnb.net) wrote: >> Actually, there's another way, which seems to be a lot simpler. Alexey >> Kuznetsov privately suggested it to me. >> >> In skb_shared_info new pointer transaction_token would be added, which >> would point on: >> >> struct sk_transaction_token >> { >> atomic_t io_count; >> struct sk_transaction_token *next; >> unsigned long token; >> unsigned long private; >> void (*finish_callback)(struct >> sk_transaction_token *); >> }; >> >> When skb is translated, transaction_token inherited. If 2 skb are merged >> (the same places where I put net_get_page's in my patch), the *older* >> token is inherited. This is the main point of this idea. >> >> Before starting new asynchronous send a client would open a new token. >> Everything sent then would receive that token. Finish_callback() would >> be called and the corresponding token freed, when io_count == 0 *AND* >> all previous tokens closed. >> >> This idea seems to be simpler, than even what Rusty implemented. Correct >> me, if I wrong. But, unfortunately, in the near future I will have no >> time to develop it.. :-( > > Yes, it is simpler and cleaner, but it requires additional allocation. > This is additional (and quite noticeble) overhead. Not necessary requires. For instance, in iscsi-scst sk_transaction_token can (and should) be part of iSCSI cmd structure, so no additional allocations would be needed. Thanks, Vlad -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Dec 24, 2008 at 05:37:51PM +0300, Vladislav Bolkhovitin (vst@vlnb.net) wrote: > >Yes, it is simpler and cleaner, but it requires additional allocation. > >This is additional (and quite noticeble) overhead. > > Not necessary requires. For instance, in iscsi-scst sk_transaction_token > can (and should) be part of iSCSI cmd structure, so no additional > allocations would be needed. This is special case, I'm not sure it is always possible to cache that token and attach to every skb, but if it can be done, then of course this does not end up with additional overhead.
Evgeniy Polyakov, on 12/24/2008 05:44 PM wrote: > On Wed, Dec 24, 2008 at 05:37:51PM +0300, Vladislav Bolkhovitin (vst@vlnb.net) wrote: >>> Yes, it is simpler and cleaner, but it requires additional allocation. >>> This is additional (and quite noticeble) overhead. >> Not necessary requires. For instance, in iscsi-scst sk_transaction_token >> can (and should) be part of iSCSI cmd structure, so no additional >> allocations would be needed. > > This is special case, I'm not sure it is always possible to cache that > token and attach to every skb, but if it can be done, then of course > this does not end up with additional overhead. I think in most cases there would be possibility to embed sk_transaction_token to some higher level structure. E.g. Xen apparently should have something to track packets passed through host/guest boundary. From other side, kmem cache is too well polished to have much overhead. I doubt, you would even notice it in this application. In most cases allocation of such small object in it using SLUB is just about the same as a list_del() under disabled IRQs. Vlad -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Dec 24, 2008 at 08:46:56PM +0300, Vladislav Bolkhovitin (vst@vlnb.net) wrote: > I think in most cases there would be possibility to embed > sk_transaction_token to some higher level structure. E.g. Xen apparently > should have something to track packets passed through host/guest > boundary. From other side, kmem cache is too well polished to have much > overhead. I doubt, you would even notice it in this application. In most > cases allocation of such small object in it using SLUB is just about the > same as a list_del() under disabled IRQs. I definitely would not rely on that, especially at cache reclaim time. But it of course depends on the workload and maybe appropriate for the cases in question. The best solution I think is to combine tag and separate destructur, so that those who do not want to allocate a token could still get notification via destructor callback.
Evgeniy Polyakov, on 12/24/2008 09:08 PM wrote: > On Wed, Dec 24, 2008 at 08:46:56PM +0300, Vladislav Bolkhovitin (vst@vlnb.net) wrote: >> I think in most cases there would be possibility to embed >> sk_transaction_token to some higher level structure. E.g. Xen apparently >> should have something to track packets passed through host/guest >> boundary. From other side, kmem cache is too well polished to have much >> overhead. I doubt, you would even notice it in this application. In most >> cases allocation of such small object in it using SLUB is just about the >> same as a list_del() under disabled IRQs. > > I definitely would not rely on that, especially at cache reclaim time. > But it of course depends on the workload and maybe appropriate for the > cases in question. The best solution I think is to combine tag and > separate destructur, so that those who do not want to allocate a token > could still get notification via destructor callback. Although I agree that any additional allocation is something, which should be avoided, *if possible*. But you shouldn't overestimate the overhead of the sk_transaction_token allocation in cases, when it would be needed. At first, sk_transaction_token is quite small, so a single page in the kmem cache would keep about 100 of them, hence the slow allocation path would be called only once per 100 objects. Second, in many cases ->sendpages() needs to allocate a new skb, so already there is at least one such allocations on the fast path. Actually, it doesn't look like the skb shared info destructor alone can't solve the task we are solving, because we need to know not when an skb transmittion finished, but when transmittion of our *set of pages* finished. Hence, with skb shared info destructor we would need also to invent some way to track set of pages <-> set of skbs translation (you refer it as combining tag and separate destructor), which would bring this solution on the entire new complexity level for no gain over the sk_transaction_token solution. Thanks, Vlad -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Vlad. On Tue, Dec 30, 2008 at 08:37:00PM +0300, Vladislav Bolkhovitin (vst@vlnb.net) wrote: > Although I agree that any additional allocation is something, which > should be avoided, *if possible*. But you shouldn't overestimate the > overhead of the sk_transaction_token allocation in cases, when it would > be needed. At first, sk_transaction_token is quite small, so a single > page in the kmem cache would keep about 100 of them, hence the slow > allocation path would be called only once per 100 objects. Second, in > many cases ->sendpages() needs to allocate a new skb, so already there > is at least one such allocations on the fast path. Once per 100 objects? With millions of packets per second at extreme cases this does not scale. Even more common thousand of usual packets per second with 1.5k mtu will show up (especially freeing actually). Any additional overhead has to be avoided if possible, even if it looks innocent. BSD guys already learned this lesson with packet processing tags at every layer. > Actually, it doesn't look like the skb shared info destructor alone > can't solve the task we are solving, because we need to know not when an > skb transmittion finished, but when transmittion of our *set of pages* > finished. Hence, with skb shared info destructor we would need also to > invent some way to track set of pages <-> set of skbs translation (you > refer it as combining tag and separate destructor), which would bring > this solution on the entire new complexity level for no gain over the > sk_transaction_token solution. You really do not need to know when transmission is over, but when remote side acks it (or connection is reset by the timeout). There is no way to know when transmission is over without creating own skbs and submitting them avoiding usual tcp/ip stack machinery. You do not need to know which skbs contain which pages, system only should track page pointers freed at skb destruction (shared info destruction actually) time, no matter who owns those pages (since new pages can be added into the page and some of the old ones can be freed early). This will be effectively the same token, but it does not mean that everyone who needs notification will have to perform additional allocation. Put two pointers: destructor and token and do whatever you like if one of them is non-empty, but try to avoid unneded overhead when it is possible.
diff -upr linux-2.6.26/include/linux/mm_types.h linux-2.6.26/include/linux/mm_types.h --- linux-2.6.26/include/linux/mm_types.h 2008-07-14 01:51:29.000000000 +0400 +++ linux-2.6.26/include/linux/mm_types.h 2008-07-22 20:30:21.000000000 +0400 @@ -92,6 +92,18 @@ struct page { void *virtual; /* Kernel virtual address (NULL if not kmapped, ie. highmem) */ #endif /* WANT_PAGE_VIRTUAL */ + +#if defined(CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION) + /* + * Used to implement support for notification on zero-copy TCP transfer + * completion. It might look as not good to have this field here and + * it's better to have it in struct sk_buff, but it would make the code + * much more complicated and fragile, since all skb then would have to + * contain only pages with the same value in this field. + */ + void *net_priv; +#endif + #ifdef CONFIG_CGROUP_MEM_RES_CTLR unsigned long page_cgroup; #endif diff -upr linux-2.6.26/include/linux/net.h linux-2.6.26/include/linux/net.h --- linux-2.6.26/include/linux/net.h 2008-07-14 01:51:29.000000000 +0400 +++ linux-2.6.26/include/linux/net.h 2008-07-29 20:48:07.000000000 +0400 @@ -57,6 +57,7 @@ typedef enum { #include <linux/random.h> #include <linux/wait.h> #include <linux/fcntl.h> /* For O_CLOEXEC and O_NONBLOCK */ +#include <linux/mm.h> struct poll_table_struct; struct pipe_inode_info; @@ -354,5 +354,44 @@ extern int net_msg_cost; extern struct ratelimit_state net_ratelimit_state; #endif +#if defined(CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION) +/* Support for notification on zero-copy TCP transfer completion */ +typedef void (*net_get_page_callback_t)(struct page *page); +typedef void (*net_put_page_callback_t)(struct page *page); + +extern net_get_page_callback_t net_get_page_callback; +extern net_put_page_callback_t net_put_page_callback; + +extern int net_set_get_put_page_callbacks( + net_get_page_callback_t get_callback, + net_put_page_callback_t put_callback); + +/* + * See comment for net_set_get_put_page_callbacks() why those functions + * don't need any protection. + */ +static inline void net_get_page(struct page *page) +{ + if (page->net_priv != 0) + net_get_page_callback(page); + get_page(page); +} +static inline void net_put_page(struct page *page) +{ + if (page->net_priv != 0) + net_put_page_callback(page); + put_page(page); +} +#else +static inline void net_get_page(struct page *page) +{ + get_page(page); +} +static inline void net_put_page(struct page *page) +{ + put_page(page); +} +#endif /* CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION */ + #endif /* __KERNEL__ */ #endif /* _LINUX_NET_H */ diff -upr linux-2.6.26/net/core/skbuff.c linux-2.6.26/net/core/skbuff.c --- linux-2.6.26/net/core/skbuff.c 2008-07-14 01:51:29.000000000 +0400 +++ linux-2.6.26/net/core/skbuff.c 2008-07-22 20:28:41.000000000 +0400 @@ -319,7 +319,7 @@ static void skb_release_data(struct sk_b if (skb_shinfo(skb)->nr_frags) { int i; for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) - put_page(skb_shinfo(skb)->frags[i].page); + net_put_page(skb_shinfo(skb)->frags[i].page); } if (skb_shinfo(skb)->frag_list) @@ -658,7 +658,7 @@ struct sk_buff *pskb_copy(struct sk_buff for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { skb_shinfo(n)->frags[i] = skb_shinfo(skb)->frags[i]; - get_page(skb_shinfo(n)->frags[i].page); + net_get_page(skb_shinfo(n)->frags[i].page); } skb_shinfo(n)->nr_frags = i; } @@ -721,7 +721,7 @@ int pskb_expand_head(struct sk_buff *skb sizeof(struct skb_shared_info)); for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) - get_page(skb_shinfo(skb)->frags[i].page); + net_get_page(skb_shinfo(skb)->frags[i].page); if (skb_shinfo(skb)->frag_list) skb_clone_fraglist(skb); @@ -990,7 +990,7 @@ drop_pages: skb_shinfo(skb)->nr_frags = i; for (; i < nfrags; i++) - put_page(skb_shinfo(skb)->frags[i].page); + net_put_page(skb_shinfo(skb)->frags[i].page); if (skb_shinfo(skb)->frag_list) skb_drop_fraglist(skb); @@ -1159,7 +1159,7 @@ pull_pages: k = 0; for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { if (skb_shinfo(skb)->frags[i].size <= eat) { - put_page(skb_shinfo(skb)->frags[i].page); + net_put_page(skb_shinfo(skb)->frags[i].page); eat -= skb_shinfo(skb)->frags[i].size; } else { skb_shinfo(skb)->frags[k] = skb_shinfo(skb)->frags[i]; @@ -1916,7 +1916,7 @@ static inline void skb_split_no_header(s * where splitting is expensive. * 2. Split is accurately. We make this. */ - get_page(skb_shinfo(skb)->frags[i].page); + net_get_page(skb_shinfo(skb)->frags[i].page); skb_shinfo(skb1)->frags[0].page_offset += len - pos; skb_shinfo(skb1)->frags[0].size -= len - pos; skb_shinfo(skb)->frags[i].size = len - pos; @@ -2284,7 +2284,7 @@ struct sk_buff *skb_segment(struct sk_bu BUG_ON(i >= nfrags); *frag = skb_shinfo(skb)->frags[i]; - get_page(frag->page); + net_get_page(frag->page); size = frag->size; if (pos < offset) { diff -upr linux-2.6.26/net/ipv4/ip_output.c linux-2.6.26/net/ipv4/ip_output.c --- linux-2.6.26/net/ipv4/ip_output.c 2008-07-14 01:51:29.000000000 +0400 +++ linux-2.6.26/net/ipv4/ip_output.c 2008-07-22 20:28:41.000000000 +0400 @@ -1007,7 +1007,7 @@ alloc_new_skb: err = -EMSGSIZE; goto error; } - get_page(page); + net_get_page(page); skb_fill_page_desc(skb, i, page, sk->sk_sndmsg_off, 0); frag = &skb_shinfo(skb)->frags[i]; } @@ -1165,7 +1165,7 @@ ssize_t ip_append_page(struct sock *sk, if (skb_can_coalesce(skb, i, page, offset)) { skb_shinfo(skb)->frags[i-1].size += len; } else if (i < MAX_SKB_FRAGS) { - get_page(page); + net_get_page(page); skb_fill_page_desc(skb, i, page, offset, len); } else { err = -EMSGSIZE; diff -upr linux-2.6.26/net/ipv4/Makefile linux-2.6.26/net/ipv4/Makefile --- linux-2.6.26/net/ipv4/Makefile 2008-07-14 01:51:29.000000000 +0400 +++ linux-2.6.26/net/ipv4/Makefile 2008-07-22 20:35:05.000000000 +0400 @@ -50,6 +50,7 @@ obj-$(CONFIG_TCP_CONG_LP) += tcp_lp.o obj-$(CONFIG_TCP_CONG_YEAH) += tcp_yeah.o obj-$(CONFIG_TCP_CONG_ILLINOIS) += tcp_illinois.o obj-$(CONFIG_NETLABEL) += cipso_ipv4.o +obj-$(CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION) += tcp_zero_copy.o obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \ xfrm4_output.o diff -upr linux-2.6.26/net/ipv4/tcp.c linux-2.6.26/net/ipv4/tcp.c --- linux-2.6.26/net/ipv4/tcp.c 2008-07-14 01:51:29.000000000 +0400 +++ linux-2.6.26/net/ipv4/tcp.c 2008-07-22 20:28:41.000000000 +0400 @@ -712,7 +712,7 @@ new_segment: if (can_coalesce) { skb_shinfo(skb)->frags[i - 1].size += copy; } else { - get_page(page); + net_get_page(page); skb_fill_page_desc(skb, i, page, offset, copy); } @@ -917,7 +917,7 @@ new_segment: goto new_segment; } else if (page) { if (off == PAGE_SIZE) { - put_page(page); + net_put_page(page); TCP_PAGE(sk) = page = NULL; off = 0; } @@ -958,9 +958,9 @@ new_segment: } else { skb_fill_page_desc(skb, i, page, off, copy); if (TCP_PAGE(sk)) { - get_page(page); + net_get_page(page); } else if (off + copy < PAGE_SIZE) { - get_page(page); + net_get_page(page); TCP_PAGE(sk) = page; } } diff -upr linux-2.6.26/net/ipv4/tcp_output.c linux-2.6.26/net/ipv4/tcp_output.c --- linux-2.6.26/net/ipv4/tcp_output.c 2008-07-14 01:51:29.000000000 +0400 +++ linux-2.6.26/net/ipv4/tcp_output.c 2008-07-22 20:28:41.000000000 +0400 @@ -854,7 +854,7 @@ static void __pskb_trim_head(struct sk_b k = 0; for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { if (skb_shinfo(skb)->frags[i].size <= eat) { - put_page(skb_shinfo(skb)->frags[i].page); + net_put_page(skb_shinfo(skb)->frags[i].page); eat -= skb_shinfo(skb)->frags[i].size; } else { skb_shinfo(skb)->frags[k] = skb_shinfo(skb)->frags[i]; diff -upr linux-2.6.26/net/ipv4/tcp_zero_copy.c linux-2.6.26/net/ipv4/tcp_zero_copy.c --- linux-2.6.26/net/ipv4/tcp_zero_copy.c 2008-07-22 20:12:35.000000000 +0400 +++ linux-2.6.26/net/ipv4/tcp_zero_copy.c 2008-07-31 21:21:13.000000000 +0400 @@ -0,0 +1,49 @@ +/* + * Support routines for TCP zero copy transmit + * + * Created by Vladislav Bolkhovitin + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * version 2 as published by the Free Software Foundation. + */ + +#include <linux/skbuff.h> + +net_get_page_callback_t net_get_page_callback __read_mostly; +EXPORT_SYMBOL(net_get_page_callback); + +net_put_page_callback_t net_put_page_callback __read_mostly; +EXPORT_SYMBOL(net_put_page_callback); + +/* + * Caller of this function must ensure that at the moment when it's called + * there are no pages in the system with net_priv field set to non-zero + * value. Hence, this function, as well as net_get_page() and net_put_page(), + * don't need any protection. + */ +int net_set_get_put_page_callbacks( + net_get_page_callback_t get_callback, + net_put_page_callback_t put_callback) +{ + int res = 0; + + if ((net_get_page_callback != NULL) && (get_callback != NULL) && + (net_get_page_callback != get_callback)) { + res = -EBUSY; + goto out; + } + + if ((net_put_page_callback != NULL) && (put_callback != NULL) && + (net_put_page_callback != put_callback)) { + res = -EBUSY; + goto out; + } + + net_get_page_callback = get_callback; + net_put_page_callback = put_callback; + +out: + return res; +} +EXPORT_SYMBOL(net_set_get_put_page_callbacks); diff -upr linux-2.6.26/net/ipv6/ip6_output.c linux-2.6.26/net/ipv6/ip6_output.c --- linux-2.6.26/net/ipv6/ip6_output.c 2008-07-14 01:51:29.000000000 +0400 +++ linux-2.6.26/net/ipv6/ip6_output.c 2008-07-22 20:28:41.000000000 +0400 @@ -1349,7 +1349,7 @@ alloc_new_skb: err = -EMSGSIZE; goto error; } - get_page(page); + net_get_page(page); skb_fill_page_desc(skb, i, page, sk->sk_sndmsg_off, 0); frag = &skb_shinfo(skb)->frags[i]; } diff -upr linux-2.6.26/net/Kconfig linux-2.6.26/net/Kconfig --- linux-2.6.26/net/Kconfig 2008-07-14 01:51:29.000000000 +0400 +++ linux-2.6.26/net/Kconfig 2008-07-29 21:15:39.000000000 +0400 @@ -59,6 +59,18 @@ config INET Short answer: say Y. +config TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION + bool "TCP/IP zero-copy transfer completion notification" + depends on INET + default SCST_ISCSI + ---help--- + Adds support for sending a notification upon completion of a + zero-copy TCP/IP transfer. This can speed up certain TCP/IP + software. Currently this is only used by the iSCSI target driver + iSCSI-SCST. + + If unsure, say N. + if INET source "net/ipv4/Kconfig" source "net/ipv6/Kconfig"
This patch implements support for zero-copy TCP transmit of user space data. It is necessary in iSCSI-SCST target driver for transmitting data from user space buffers, supplied by user space backend handlers. In this case SCST core needs to know when TCP finished transmitting the data, so the corresponding buffers can be reused or freed. Without this patch it isn't possible, so iSCSI-SCST has to use data copying to TCP send buffers function sock_sendpage(). ISCSI-SCST also works without this patch, but that this patch gives a nice performance improvement. In the chosen approach new optional field void *net_priv was added to struct page. It is enclosed by #if defined(CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION), so if one doesn't need this functionality, net_priv won't consume space in struct page. Then, 2 new global callbacks net_get_page_callback and net_put_page_callback together with 2 new inline functions net_get_page() and net_put_page() were added. If CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION not defined net_get_page() and net_put_page() effectively become get_page() and put_page() correspondingly. Those functions, if the corresponding net_get_page_callback or net_put_page_callback assigned, call it, then do get_page() or put_page(). Then in net/ subdirectory all get_page() calls were replaced by net_get_page() and put_page() - by net_put_page(). How it works. ISCSI-SCST assigns net_get_page_callback and net_put_page_callback to its internal functions. Each page before being sent to TCP's sendpage has net_priv field set to pointer to the corresponding iSCSI command. Then in each net_get_page_callback handler reference counter for that command increased and in each net_put_page_callback - decreased. When it reaches zero, then all the data for this command were transferred, so the command and its buffer can be freed. You can find how it used in the iSCSI-SCST patch (number 21 in this series). Global callbacks were chosen, because this is the simplest and most performance effective approach, fully following section 2 subsection 4 of SubmittingPatches file: "Don't over-design". If accepted, iSCSI-SCST will be the only user of this functionality. Requirements to call net_set_get_put_page_callbacks() (see comment in the patch) allows to not protect those callbacks anyhow. Then, if in the future there is another user of that functionality, it will be possible to convert those callbacks to RCU-protected list of callbacks. But for now there's no need to overcomplicate the code. During development the following approaches were also examined and rejected: 1. Add net_priv analog in struct sk_buff, not in struct page. But then it would be required that all the pages in each skb must be from the same originator, i.e. with the same net_priv. It is unpractical to change all the operations with skb's to forbid merging them, if they have different net_priv. I tried, but quickly gave up. There are too many such places in very not obvious code pieces. 2. Have in iSCSI-SCST a hashed list to translate page to iSCSI cmd by a simple search function. This approach was rejected, because to copy a page a modern CPU needs using MMX about 1500 ticks. It was observed, that each page can be referenced by TCP during transmit about 20 times or even more. So, if each search needs, say, 20 ticks, the overall search time will be 20*20*2 (to get() and put()) = 800 ticks. So, this approach would considerably worse performance-wise to the chosen approach and provide not too much benefit. Please, if you reject this approach, advice any other way to implement the required functionality. Signed-off-by: Vladislav Bolkhovitin <vst@vlnb.net> --- include/linux/mm_types.h | 12 +++++++++++ include/linux/net.h | 40 ++++++++++++++++++++++++++++++++++++++ net/Kconfig | 12 +++++++++++ net/core/skbuff.c | 14 ++++++------- net/ipv4/Makefile | 1 net/ipv4/ip_output.c | 4 +-- net/ipv4/tcp.c | 8 +++---- net/ipv4/tcp_output.c | 2 - net/ipv4/tcp_zero_copy.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++ net/ipv6/ip6_output.c | 2 - 10 files changed, 129 insertions(+), 15 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html