[RFC,23/23] : Support for zero-copy TCP transmit of user space data

This patch implements support for zero-copy TCP transmit of user space
data. The iSCSI-SCST target driver needs it to transmit data from user
space buffers supplied by user space backend handlers. In this case the
SCST core needs to know when TCP has finished transmitting the data, so
that the corresponding buffers can be reused or freed. Without this
patch that isn't possible, so iSCSI-SCST has to use the function
sock_sendpage(), which copies the data to TCP send buffers. iSCSI-SCST
also works without this patch, but the patch gives a nice performance
improvement.

In the chosen approach a new optional field, void *net_priv, is added to
struct page. It is enclosed by

#if defined(CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION)

so if one doesn't need this functionality, net_priv won't consume space
in struct page.

Then two new global callbacks, net_get_page_callback and
net_put_page_callback, are added together with two new inline functions,
net_get_page() and net_put_page(). If
CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION is not defined,
net_get_page() and net_put_page() effectively become get_page() and
put_page() respectively.

Each of those functions calls the corresponding net_get_page_callback or
net_put_page_callback, if it is assigned, and then does get_page() or
put_page().
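
The wrappers described above could look roughly like the following user
space sketch. It is not the code from the patch: struct page, get_page()
and put_page() are reduced stand-ins, the typedef net_page_callback_t
and the demo_* helper are hypothetical names, and the config symbol is
defined locally only so that the callback path is compiled in.

```c
#include <assert.h>
#include <stddef.h>

/* Defined here so the sketch exercises the callback path; remove this
 * line to see the wrappers collapse to plain get/put. */
#define CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION 1

/* User space stand-in for struct page (hypothetical layout). */
struct page {
	int refcount;
#if defined(CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION)
	void *net_priv;
#endif
};

/* Stand-ins for the kernel's get_page()/put_page(). */
static void get_page(struct page *page)
{
	page->refcount++;
}

static void put_page(struct page *page)
{
	assert(page->refcount > 0);
	page->refcount--;
}

#if defined(CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION)
/* Hypothetical typedef name; NULL means no callback is assigned. */
typedef void (*net_page_callback_t)(struct page *page);

static net_page_callback_t net_get_page_callback;
static net_page_callback_t net_put_page_callback;

/* Call the callback if one is assigned, then take the reference. */
static inline void net_get_page(struct page *page)
{
	if (net_get_page_callback)
		net_get_page_callback(page);
	get_page(page);
}

/* Call the callback if one is assigned, then drop the reference. */
static inline void net_put_page(struct page *page)
{
	if (net_put_page_callback)
		net_put_page_callback(page);
	put_page(page);
}
#else
/* Without the option the wrappers are plain get_page()/put_page(). */
static inline void net_get_page(struct page *page) { get_page(page); }
static inline void net_put_page(struct page *page) { put_page(page); }
#endif

/* Demo-only callback that counts how often it was invoked. */
static int demo_callback_calls;

static void demo_count_callback(struct page *page)
{
	(void)page;
	demo_callback_calls++;
}
```

With no callback assigned, net_get_page() only bumps the page reference
count. After assigning demo_count_callback to both globals, every
net_get_page()/net_put_page() pair invokes it twice.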

Then, in the net/ subdirectory, all get_page() calls are replaced by
net_get_page() and all put_page() calls by net_put_page().

How it works: iSCSI-SCST assigns net_get_page_callback and
net_put_page_callback to its internal functions. Before a page is passed
to TCP's sendpage, its net_priv field is set to point to the
corresponding iSCSI command. Then each net_get_page_callback call
increases the reference counter of that command and each
net_put_page_callback call decreases it. When the counter reaches zero,
all the data for this command has been transferred, so the command and
its buffer can be freed.
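
The reference counting flow described above can be illustrated with a
small user space sketch. All names ending in _sketch, the net_ref_cnt
and completed fields, and the reduced struct page are assumptions made
for illustration; the real handlers live in the iSCSI-SCST patch and may
differ, for example in how the sender holds its own reference while
pages are still being queued.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an iSCSI command that owns the pages. */
struct iscsi_cmd_sketch {
	int net_ref_cnt;	/* references the network stack holds */
	int completed;		/* set once the last reference is dropped */
};

/* User space stand-in for struct page with the new field enabled. */
struct page {
	int refcount;
	void *net_priv;		/* owning command, or NULL for other pages */
};

/* Handler the target would assign to net_get_page_callback. */
static void sketch_get_page_callback(struct page *page)
{
	struct iscsi_cmd_sketch *cmd = page->net_priv;

	if (cmd)
		cmd->net_ref_cnt++;
}

/* Handler the target would assign to net_put_page_callback. */
static void sketch_put_page_callback(struct page *page)
{
	struct iscsi_cmd_sketch *cmd = page->net_priv;

	if (!cmd)
		return;
	assert(cmd->net_ref_cnt > 0);
	if (--cmd->net_ref_cnt == 0)
		cmd->completed = 1;	/* buffers may be reused or freed */
}

/* Mimic the network stack taking a page reference. */
static void net_get_page(struct page *page)
{
	sketch_get_page_callback(page);
	page->refcount++;
}

/* Mimic the network stack dropping a page reference. */
static void net_put_page(struct page *page)
{
	sketch_put_page_callback(page);
	page->refcount--;
}
```

Setting net_priv to a command, taking two references and dropping them
again leaves the command marked completed only after the second
net_put_page(), which is the point where its buffer could be released.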

You can find how it is used in the iSCSI-SCST patch (number 21 in this series).

Global callbacks were chosen because this is the simplest and most
performance-effective approach, fully following section 2, subsection 4
of the SubmittingPatches file: "Don't over-design". If accepted,
iSCSI-SCST will be the only user of this functionality. The requirements
for calling net_set_get_put_page_callbacks() (see the comment in the
patch) make it unnecessary to protect those callbacks in any way. If
another user of this functionality appears in the future, it will be
possible to convert those callbacks to an RCU-protected list of
callbacks. But for now there's no need to overcomplicate the code.

During development the following approaches were also examined and rejected:

1. Add a net_priv analog to struct sk_buff instead of struct page. But
then all the pages in each skb would have to come from the same
originator, i.e. have the same net_priv. It is impractical to change all
the operations on skbs to forbid merging them if they have different
net_priv. I tried, but quickly gave up: there are too many such places
in very non-obvious pieces of code.

2. Have in iSCSI-SCST a hashed list to translate a page to its iSCSI cmd
by a simple search function. This approach was rejected for the
following reason. A modern CPU using MMX needs about 1500 ticks to copy
a page. It was observed that each page can be referenced by TCP about 20
times or even more during transmit. So, if each search needs, say, 20
ticks, the overall search time will be 20*20*2 (for get() and put()) =
800 ticks. So this approach would be considerably worse performance-wise
than the chosen approach and would provide little benefit over simply
copying the data.
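
The estimate above can be written as a tiny helper, which makes it easy
to try other assumptions. The function names are hypothetical, and the
inputs (20 references per page, 20 ticks per lookup, 1500 ticks per page
copy) are just the rough figures quoted in the text, not measurements
taken by this code.

```c
#include <assert.h>

/* Per-page lookup cost in ticks: every reference costs one lookup on
 * get() and one on put(). */
static unsigned long lookup_overhead_ticks(unsigned long refs_per_page,
					   unsigned long ticks_per_lookup)
{
	return refs_per_page * ticks_per_lookup * 2;
}

/* Lookup cost as a whole-number percentage of the cost of copying the
 * page (rounded down). */
static unsigned long overhead_percent_of_copy(unsigned long overhead_ticks,
					      unsigned long copy_ticks)
{
	assert(copy_ticks > 0);
	return overhead_ticks * 100 / copy_ticks;
}
```

With the figures from the text, lookup_overhead_ticks(20, 20) gives 800
ticks, and overhead_percent_of_copy(800, 1500) gives 53, i.e. the
lookups alone would cost roughly half as much as copying the page.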

Please, if you reject this approach, advise any other way to implement
the required functionality.

Signed-off-by: Vladislav Bolkhovitin <vst@vlnb.net>
---
  include/linux/mm_types.h |   12 +++++++++++
  include/linux/net.h      |   40 ++++++++++++++++++++++++++++++++++++++
  net/Kconfig              |   12 +++++++++++
  net/core/skbuff.c        |   14 ++++++-------
  net/ipv4/Makefile        |    1
  net/ipv4/ip_output.c     |    4 +--
  net/ipv4/tcp.c           |    8 +++----
  net/ipv4/tcp_output.c    |    2 -
  net/ipv4/tcp_zero_copy.c |   49 +++++++++++++++++++++++++++++++++++++++++++++++
  net/ipv6/ip6_output.c    |    2 -
  10 files changed, 129 insertions(+), 15 deletions(-)


Message ID	494012C4.7090304@vlnb.net
State	RFC, archived
Delegated to:	David Miller
Headers
Date: Wed, 10 Dec 2008 22:04:36 +0300
From: Vladislav Bolkhovitin <vst@vlnb.net>
To: linux-scsi@vger.kernel.org
CC: James Bottomley <James.Bottomley@HansenPartnership.com>,
    Andrew Morton <akpm@linux-foundation.org>,
    FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>,
    Mike Christie <michaelc@cs.wisc.edu>, Jeff Garzik <jeff@garzik.org>,
    Boaz Harrosh <bharrosh@panasas.com>,
    Linus Torvalds <torvalds@linux-foundation.org>,
    linux-kernel@vger.kernel.org, scst-devel@lists.sourceforge.net,
    Bart Van Assche <bart.vanassche@gmail.com>,
    "Nicholas A. Bellinger" <nab@linux-iscsi.org>, netdev@vger.kernel.org
Subject: [PATCH][RFC 23/23]: Support for zero-copy TCP transmit of user space data
References: <494009D7.4020602@vlnb.net>
In-Reply-To: <494009D7.4020602@vlnb.net>
