From patchwork Thu May 3 14:56:08 2012
X-Patchwork-Submitter: Ian Campbell
X-Patchwork-Id: 156715
X-Patchwork-Delegate: davem@davemloft.net
From: Ian Campbell
To: netdev@vger.kernel.org
CC: David Miller, Eric Dumazet, "Michael S. Tsirkin", Ian Campbell,
    Michał Mirosław
Subject: [PATCH 6/9] net: add support for per-paged-fragment destructors
Date: Thu, 3 May 2012 15:56:08 +0100
Message-ID: <1336056971-7839-6-git-send-email-ian.campbell@citrix.com>
X-Mailer: git-send-email 1.7.2.5
In-Reply-To: <1336056915.20716.96.camel@zakaz.uk.xensource.com>
References: <1336056915.20716.96.camel@zakaz.uk.xensource.com>
X-Mailing-List: netdev@vger.kernel.org

Entities which care about the complete lifecycle of pages which they
inject into the network stack via an skb paged fragment can choose to
set this destructor in order to receive a callback when the stack is
really finished with a page (including all clones, retransmits,
pull-ups, etc.).

This destructor will always be propagated alongside the struct page
when copying skb_frag_t->page. This is the reason I chose to embed the
destructor in a "struct { } page" within the skb_frag_t, rather than as
a separate field, since it allows existing code which propagates
->frags[N].page to Just Work(tm).

When the destructor is present the page reference counting is done
slightly differently. No references are held by the network stack on
the struct page (it is up to the caller to manage this as necessary);
instead the network stack will track references via the count embedded
in the destructor structure. When this reference count reaches zero
then the destructor will be called and the caller can take the
necessary steps to release the page (i.e. release the struct page
reference itself).

The intention is that callers can use this callback to delay completion
to _their_ callers until the network stack has completely released the
page, in order to prevent use-after-free or modification of data pages
which are still in use by the stack.

It is allowable (indeed expected) for a caller to share a single
destructor instance between multiple pages injected into the stack,
e.g. a group of pages included in a single higher-level operation might
share a destructor which is used to complete that higher-level
operation.

Previous changes have ensured that, even with the increase in frag
size, the hot fields (nr_frags through to at least frags[0]) fit within
and are aligned to a 64-byte cache line.

Signed-off-by: Ian Campbell
Cc: "David S. Miller"
Cc: Eric Dumazet
Cc: "Michał Mirosław"
Cc: netdev@vger.kernel.org
---
 include/linux/skbuff.h |   50 ++++++++++++++++++++++++++++++++++++++++++++++-
 net/core/skbuff.c      |   18 +++++++++++++++++
 net/ipv4/ip_output.c   |    2 +-
 net/ipv4/tcp.c         |    4 +-
 4 files changed, 69 insertions(+), 5 deletions(-)
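As an illustration of the intended calling convention, a caller might
share one destructor across all pages of a single operation along the
following lines. This is a hypothetical sketch, not part of the patch:
my_op, my_op_destroy and my_op_add_page are made-up names, and it
assumes the caller takes one destructor reference per injected frag in
place of get_page():

#include <linux/skbuff.h>
#include <linux/slab.h>

/* Hypothetical caller-side sketch (not part of this patch): one
 * destructor shared by every page of a single higher-level operation. */
struct my_op {
	struct skb_frag_destructor destructor;
	/* ... caller-private state needed to complete the operation ... */
};

static int my_op_destroy(struct skb_frag_destructor *d)
{
	struct my_op *op = container_of(d, struct my_op, destructor);

	/* The stack holds no further references to any of our pages
	 * (clones, retransmits, pull-ups included): complete the
	 * operation towards our own caller, then free our state. */
	kfree(op);
	return 0;
}

static void my_op_add_page(struct my_op *op, struct sk_buff *skb,
			   int i, struct page *page)
{
	skb_fill_page_desc(skb, i, page, 0, PAGE_SIZE);
	skb_frag_set_destructor(skb, i, &op->destructor);
	/* Account for this frag on the destructor instead of taking a
	 * struct page reference; from here on the stack refs/unrefs
	 * op->destructor.ref for this frag. */
	skb_frag_destructor_ref(&op->destructor);
}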
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 3698625..ccc7d93 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -168,9 +168,15 @@ struct sk_buff;
 
 typedef struct skb_frag_struct skb_frag_t;
 
+struct skb_frag_destructor {
+	atomic_t ref;
+	int (*destroy)(struct skb_frag_destructor *destructor);
+};
+
 struct skb_frag_struct {
 	struct {
 		struct page *p;
+		struct skb_frag_destructor *destructor;
 	} page;
 #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
 	__u32 page_offset;
@@ -1232,6 +1238,31 @@ static inline int skb_pagelen(const struct sk_buff *skb)
 }
 
 /**
+ * skb_frag_set_destructor - set destructor for a paged fragment
+ * @skb: buffer containing fragment to be initialised
+ * @i: paged fragment index to initialise
+ * @destroy: the destructor to use for this fragment
+ *
+ * Sets @destroy as the destructor to be called when all references to
+ * the frag @i in @skb (tracked over skb_clone, retransmit, pull-ups,
+ * etc) are released.
+ *
+ * When a destructor is set then reference counting is performed on
+ * @destroy->ref. When the ref reaches zero then @destroy->destroy
+ * will be called. The caller is responsible for holding and managing
+ * any other references (such as the struct page reference count).
+ *
+ * This function must be called before any use of skb_frag_ref() or
+ * skb_frag_unref().
+ */
+static inline void skb_frag_set_destructor(struct sk_buff *skb, int i,
+					   struct skb_frag_destructor *destroy)
+{
+	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+	frag->page.destructor = destroy;
+}
+
+/**
  * __skb_fill_page_desc - initialise a paged fragment in an skb
  * @skb: buffer containing fragment to be initialised
  * @i: paged fragment index to initialise
@@ -1250,6 +1281,7 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
 	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 
 	frag->page.p = page;
+	frag->page.destructor = NULL;
 	frag->page_offset = off;
 	skb_frag_size_set(frag, size);
 }
@@ -1766,6 +1798,9 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
 	return frag->page.p;
 }
 
+extern void skb_frag_destructor_ref(struct skb_frag_destructor *destroy);
+extern void skb_frag_destructor_unref(struct skb_frag_destructor *destroy);
+
 /**
  * __skb_frag_ref - take an addition reference on a paged fragment.
  * @frag: the paged fragment
@@ -1774,6 +1809,10 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
  */
 static inline void __skb_frag_ref(skb_frag_t *frag)
 {
+	if (unlikely(frag->page.destructor)) {
+		skb_frag_destructor_ref(frag->page.destructor);
+		return;
+	}
 	get_page(skb_frag_page(frag));
 }
 
@@ -1797,6 +1836,10 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f)
  */
 static inline void __skb_frag_unref(skb_frag_t *frag)
 {
+	if (unlikely(frag->page.destructor)) {
+		skb_frag_destructor_unref(frag->page.destructor);
+		return;
+	}
 	put_page(skb_frag_page(frag));
 }
 
@@ -1994,13 +2037,16 @@ static inline int skb_add_data(struct sk_buff *skb,
 }
 
 static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
-				    const struct page *page, int off)
+				    const struct page *page,
+				    const struct skb_frag_destructor *destroy,
+				    int off)
 {
 	if (i) {
 		const struct skb_frag_struct *frag =
 			&skb_shinfo(skb)->frags[i - 1];
 
 		return page == skb_frag_page(frag) &&
-		       off == frag->page_offset + skb_frag_size(frag);
+		       off == frag->page_offset + skb_frag_size(frag) &&
+		       frag->page.destructor == destroy;
 	}
 	return false;
 }
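To illustrate the kernel-doc note above that the caller, not the stack,
holds any struct page references: a destroy callback would typically
recover its containing state with container_of() and release those
references itself. The following sketch is hypothetical and not part of
the patch (single_page_op and single_page_destroy are made-up names):

#include <linux/skbuff.h>
#include <linux/slab.h>

/* Hypothetical sketch (not part of this patch): a destructor covering
 * exactly one page. */
struct single_page_op {
	struct skb_frag_destructor destructor;
	struct page *page;
};

static int single_page_destroy(struct skb_frag_destructor *d)
{
	struct single_page_op *op =
		container_of(d, struct single_page_op, destructor);

	/* Release the struct page reference taken by the caller at
	 * injection time; the stack never held one of its own. */
	put_page(op->page);
	kfree(op);
	return 0;
}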
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index fab6de0..945b807 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -353,6 +353,23 @@ struct sk_buff *dev_alloc_skb(unsigned int length)
 }
 EXPORT_SYMBOL(dev_alloc_skb);
 
+void skb_frag_destructor_ref(struct skb_frag_destructor *destroy)
+{
+	BUG_ON(destroy == NULL);
+	atomic_inc(&destroy->ref);
+}
+EXPORT_SYMBOL(skb_frag_destructor_ref);
+
+void skb_frag_destructor_unref(struct skb_frag_destructor *destroy)
+{
+	if (destroy == NULL)
+		return;
+
+	if (atomic_dec_and_test(&destroy->ref))
+		destroy->destroy(destroy);
+}
+EXPORT_SYMBOL(skb_frag_destructor_unref);
+
 static void skb_drop_list(struct sk_buff **listp)
 {
 	struct sk_buff *list = *listp;
@@ -2334,6 +2351,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
 	 */
 	if (!to ||
 	    !skb_can_coalesce(tgt, to, skb_frag_page(fragfrom),
+			      fragfrom->page.destructor,
 			      fragfrom->page_offset)) {
 		merge = -1;
 	} else {
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 4910176..7652751 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1242,7 +1242,7 @@ ssize_t ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
 		i = skb_shinfo(skb)->nr_frags;
 		if (len > size)
 			len = size;
-		if (skb_can_coalesce(skb, i, page, offset)) {
+		if (skb_can_coalesce(skb, i, page, NULL, offset)) {
 			skb_frag_size_add(&skb_shinfo(skb)->frags[i-1], len);
 		} else if (i < MAX_SKB_FRAGS) {
 			get_page(page);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9670af3..2d590ca 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -870,7 +870,7 @@ new_segment:
 		copy = size;
 
 		i = skb_shinfo(skb)->nr_frags;
-		can_coalesce = skb_can_coalesce(skb, i, page, offset);
+		can_coalesce = skb_can_coalesce(skb, i, page, NULL, offset);
 		if (!can_coalesce && i >= MAX_SKB_FRAGS) {
 			tcp_mark_push(tp, skb);
 			goto new_segment;
@@ -1124,7 +1124,7 @@ new_segment:
 
 		off = sk->sk_sndmsg_off;
 
-		if (skb_can_coalesce(skb, i, page, off) &&
+		if (skb_can_coalesce(skb, i, page, NULL, off) &&
 		    off != PAGE_SIZE) {
 			/* We can extend the last page
 			 * fragment. */
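For completeness, a hypothetical end-to-end lifecycle building on the
single_page_op sketch above. The patch itself does not mandate an
initialisation convention; this assumes the caller starts the count at
1 as a setup reference and drops it once injection is done, so the
destroy callback fires only after the stack has released every frag:

/* Hypothetical lifecycle sketch (not part of this patch). */
static int single_page_send(struct sk_buff *skb, struct page *page)
{
	struct single_page_op *op = kmalloc(sizeof(*op), GFP_KERNEL);

	if (!op)
		return -ENOMEM;

	atomic_set(&op->destructor.ref, 1);	/* setup reference */
	op->destructor.destroy = single_page_destroy;
	get_page(page);				/* released in destroy */
	op->page = page;

	skb_fill_page_desc(skb, 0, page, 0, PAGE_SIZE);
	skb_frag_set_destructor(skb, 0, &op->destructor);
	skb_frag_destructor_ref(&op->destructor);
	/* (A real caller would also update skb->len, data_len and
	 * truesize here, then hand the skb to the stack.) */

	/* Drop the setup reference; single_page_destroy() runs once
	 * all clones, retransmits, pull-ups etc. have been released. */
	skb_frag_destructor_unref(&op->destructor);
	return 0;
}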