diff mbox

[3/4] llc: use a device based hash table to speed up multicast delivery

Message ID 1259879498-27860-4-git-send-email-opurdila@ixiacom.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Octavian Purdila Dec. 3, 2009, 10:31 p.m. UTC
This patch adds a per-SAP, device-based hash table to solve the
multicast delivery scalability issues for the case where a large
number of interfaces and a large number of sockets (bound to the
same SAP) are used.

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
---
 include/net/llc.h      |   20 ++++++++++++++++----
 include/net/llc_conn.h |    1 +
 net/llc/llc_conn.c     |   18 +++++++++++++++++-
 net/llc/llc_core.c     |    3 +++
 net/llc/llc_sap.c      |   11 ++++++-----
 5 files changed, 43 insertions(+), 10 deletions(-)

Comments

Eric Dumazet Dec. 3, 2009, 10:59 p.m. UTC | #1
Octavian Purdila wrote:
> This patch adds a per-SAP, device-based hash table to solve the
> multicast delivery scalability issues for the case where a large
> number of interfaces and a large number of sockets (bound to the
> same SAP) are used.
> 
> Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
> ---
>  include/net/llc.h      |   20 ++++++++++++++++----
>  include/net/llc_conn.h |    1 +
>  net/llc/llc_conn.c     |   18 +++++++++++++++++-
>  net/llc/llc_core.c     |    3 +++
>  net/llc/llc_sap.c      |   11 ++++++-----
>  5 files changed, 43 insertions(+), 10 deletions(-)
> 
> diff --git a/include/net/llc.h b/include/net/llc.h
> index 7940da1..31e9902 100644
> --- a/include/net/llc.h
> +++ b/include/net/llc.h
> @@ -31,6 +31,14 @@ struct llc_addr {
>  #define LLC_SAP_STATE_INACTIVE	1
>  #define LLC_SAP_STATE_ACTIVE	2
>  
> +#define LLC_SK_DEV_HASH_BITS 10
> +#define LLC_SK_DEV_HASH_ENTRIES (1<<LLC_SK_DEV_HASH_BITS)
> +
> +struct llc_sk_list {
> +	rwlock_t lock;
> +	struct hlist_head list;
> +};
> +

This patch introduces a big hash table, and 1024 rwlocks, for IXIACOM's needs.

Is the problem both lock contention and lookup?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
stephen hemminger Dec. 3, 2009, 11:25 p.m. UTC | #2
On Fri,  4 Dec 2009 00:31:37 +0200
Octavian Purdila <opurdila@ixiacom.com> wrote:

> This patch adds a per-SAP, device-based hash table to solve the
> multicast delivery scalability issues for the case where a large
> number of interfaces and a large number of sockets (bound to the
> same SAP) are used.

Rather than adding hash table and rwlock, why not hash list RCU
and a single spin lock
Octavian Purdila Dec. 3, 2009, 11:30 p.m. UTC | #3
On Friday 04 December 2009 00:59:58 you wrote:
> Octavian Purdila wrote:
> > This patch adds a per-SAP, device-based hash table to solve the
> > multicast delivery scalability issues for the case where a large
> > number of interfaces and a large number of sockets (bound to the
> > same SAP) are used.
> >
> > Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
> > ---
> >  include/net/llc.h      |   20 ++++++++++++++++----
> >  include/net/llc_conn.h |    1 +
> >  net/llc/llc_conn.c     |   18 +++++++++++++++++-
> >  net/llc/llc_core.c     |    3 +++
> >  net/llc/llc_sap.c      |   11 ++++++-----
> >  5 files changed, 43 insertions(+), 10 deletions(-)
> >
> > diff --git a/include/net/llc.h b/include/net/llc.h
> > index 7940da1..31e9902 100644
> > --- a/include/net/llc.h
> > +++ b/include/net/llc.h
> > @@ -31,6 +31,14 @@ struct llc_addr {
> >  #define LLC_SAP_STATE_INACTIVE	1
> >  #define LLC_SAP_STATE_ACTIVE	2
> >
> > +#define LLC_SK_DEV_HASH_BITS 10
> > +#define LLC_SK_DEV_HASH_ENTRIES (1<<LLC_SK_DEV_HASH_BITS)
> > +
> > +struct llc_sk_list {
> > +	rwlock_t lock;
> > +	struct hlist_head list;
> > +};
> > +
> 
> This patch introduces a big hash table, and 1024 rwlocks, for IXIACOM need.
> 

Yes, that is probably not appropriate for upstream. What would be a good 
value?

> Is the problem both on lock contention and lookup ?

Since at this point we are using UP ports, contention is not really an issue 
for us. I've extrapolated this (lock per hash bucket) from how locking is 
done in other places, like UDP. 
Eric Dumazet Dec. 3, 2009, 11:52 p.m. UTC | #4
Octavian Purdila wrote:
> 
> Yes, that is probably not appropriate for upstream. What would be a good 
> value?
> 

A small one to begin (say 64).

> Since at this point we are using UP ports contention is not really an issue 
> for us. I've extrapolated this (lock per hash bucket) based on how locking is 
> done in other places, like UDP. 

Yes, but you know we want to remove those locks per UDP hash bucket, since we don't
really need them anymore. ;)


If you remember, we had in the past one rwlock for the whole UDP table.

Then this was converted to one spinlock per hash slot (128 slots) + RCU lookups for unicast RX.

Then we dynamically sized the UDP table at boot (up to 65536 slots).

Then multicast optimization (holding the lock for a small duration + double hashing).

Then bind optimization (thanks to double hashing).

To be done:

1) multicast RX can be done without taking any lock, with RCU lookups
2) zap all locks and use one lock, or a small array of hashed spinlocks


Octavian Purdila Dec. 3, 2009, 11:53 p.m. UTC | #5
On Friday 04 December 2009 01:25:13 you wrote:
> On Fri,  4 Dec 2009 00:31:37 +0200
> 
> Octavian Purdila <opurdila@ixiacom.com> wrote:
> > This patch adds a per-SAP, device-based hash table to solve the
> > multicast delivery scalability issues for the case where a large
> > number of interfaces and a large number of sockets (bound to the
> > same SAP) are used.
> 
> Rather than adding hash table and rwlock, why not hash list RCU
> and a single spin lock
> 

I have a partial version with RCU and a single spinlock, but then I ran into a 
patch (Eric's, I think) which moved the UDP lock per bucket. And since RCU 
can't help on the write side (in this instance, each time we bind or delete 
the socket) it was not clear to me what the best approach is.

Should I go ahead with the RCU and single lock?

Octavian Purdila Dec. 4, 2009, 12:15 a.m. UTC | #6
On Friday 04 December 2009 01:52:44 you wrote:
> Octavian Purdila wrote:

> > Since at this point we are using UP ports contention is not really an
> > issue for us. I've extrapolated this (lock per hash bucket) based on how
> > locking is done in other places, like UDP.
> 
> Yes but you know we want to remove those locks per UDP hash bucket, since
>  we dont really need them anymore. ;)
> 
> If you remember, we had in the past one rwlock for the whole UDP table.
> 
> Then this was converted to one spinlock per hash slot (128 slots) + RCU
>  lookups for unicast RX
> 
> Then we dynamically sized udp table at boot (up to 65536 slots)
> 
> multicast optimization (holding lock for small duration + double hashing)
> 
> bind optimization (thanks to double hashing)
> 
> To be done :
> 
> 1) multicast RX can be done without taking any lock, and RCU lookups
> 2) zap all locks and use one lock, or a small array of hashed spinlocks
> 

Thanks for the nice summary, Eric!

I still have one doubt related to this: we still need locking for creating and 
destroying sockets, to insert/remove them into/from the hash; RCU can't help us 
here, right? 

In that case, wouldn't spinlock contention become an issue for short-lived 
connections? Probably not for UDP (or LLC), but for TCP I can certainly think 
of a few use cases for short-lived connections.
Eric Dumazet Dec. 4, 2009, 12:28 a.m. UTC | #7
Octavian Purdila wrote:
> On Friday 04 December 2009 01:52:44 you wrote:
>> Octavian Purdila wrote:
> 
>>> Since at this point we are using UP ports contention is not really an
>>> issue for us. I've extrapolated this (lock per hash bucket) based on how
>>> locking is done in other places, like UDP.
>> Yes but you know we want to remove those locks per UDP hash bucket, since
>>  we dont really need them anymore. ;)
>>
>> If you remember, we had in the past one rwlock for the whole UDP table.
>>
>> Then this was converted to one spinlock per hash slot (128 slots) + RCU
>>  lookups for unicast RX
>>
>> Then we dynamically sized udp table at boot (up to 65536 slots)
>>
>> multicast optimization (holding lock for small duration + double hashing)
>>
>> bind optimization (thanks to double hashing)
>>
>> To be done :
>>
>> 1) multicast RX can be done without taking any lock, and RCU lookups
>> 2) zap all locks and use one lock, or a small array of hashed spinlocks
>>
> 
> Thanks for the nice summary Eric !
> 
> I still have one doubt related to this: we still need locking for creating and 
> destroying sockets to insert/remove them into/from the hash, RCU can't help us 
> here, right? 

Sure. RCU is used for readers only. We still need locks to protect writers
from each other.

> 
> In that case wouldn't spinlock contention become an issue for short lived 
> connections? Probably not for UDP (or LLC), but for TCP I certainly can think 
> of a few usecases for short lived connections.

Yes, this is why an array of hashed spinlocks would be good (as already done with TCP,
for example, or the IP route cache).

(Say you have a table of 65536 UDP slots on your high-performance server,
handling millions of UDP sockets: you don't _need_ 65536 spinlocks, just some number
related to the number of CPUs.)


stephen hemminger Dec. 4, 2009, 12:37 a.m. UTC | #8
On Fri, 4 Dec 2009 01:53:11 +0200
Octavian Purdila <opurdila@ixiacom.com> wrote:

> On Friday 04 December 2009 01:25:13 you wrote:
> > On Fri,  4 Dec 2009 00:31:37 +0200
> > 
> > Octavian Purdila <opurdila@ixiacom.com> wrote:
> > > This patch adds a per-SAP, device-based hash table to solve the
> > > multicast delivery scalability issues for the case where a large
> > > number of interfaces and a large number of sockets (bound to the
> > > same SAP) are used.
> > 
> > Rather than adding hash table and rwlock, why not hash list RCU
> > and a single spin lock
> > 
> 
> I have a partial version with RCU and single spinlock, but then I ran into a 
> (Eric's I think) patch which moved the UDP lock per bucket. And since RCU 
> can't help on the write side (in this instance each time we bound or delete 
> the socket) it was not clear to me what is the best approach.

The lock is held for such a brief period on connection setup that a single
spinlock is probably ok.

Patch

diff --git a/include/net/llc.h b/include/net/llc.h
index 7940da1..31e9902 100644
--- a/include/net/llc.h
+++ b/include/net/llc.h
@@ -31,6 +31,14 @@  struct llc_addr {
 #define LLC_SAP_STATE_INACTIVE	1
 #define LLC_SAP_STATE_ACTIVE	2
 
+#define LLC_SK_DEV_HASH_BITS 10
+#define LLC_SK_DEV_HASH_ENTRIES (1<<LLC_SK_DEV_HASH_BITS)
+
+struct llc_sk_list {
+	rwlock_t lock;
+	struct hlist_head list;
+};
+
 /**
  * struct llc_sap - Defines the SAP component
  *
@@ -53,12 +61,16 @@  struct llc_sap {
 				     struct net_device *orig_dev);
 	struct llc_addr	 laddr;
 	struct list_head node;
-	struct {
-		rwlock_t	  lock;
-		struct hlist_head list;
-	} sk_list;
+	struct llc_sk_list sk_list;
+	struct llc_sk_list sk_dev_hash[LLC_SK_DEV_HASH_ENTRIES];
 };
 
+static inline
+struct llc_sk_list *llc_sk_dev_hash(struct llc_sap *sap, int ifindex)
+{
+	return &sap->sk_dev_hash[ifindex % LLC_SK_DEV_HASH_ENTRIES];
+}
+
 #define LLC_DEST_INVALID         0      /* Invalid LLC PDU type */
 #define LLC_DEST_SAP             1      /* Type 1 goes here */
 #define LLC_DEST_CONN            2      /* Type 2 goes here */
diff --git a/include/net/llc_conn.h b/include/net/llc_conn.h
index fe982fd..2f97d8d 100644
--- a/include/net/llc_conn.h
+++ b/include/net/llc_conn.h
@@ -77,6 +77,7 @@  struct llc_sock {
 					      received and caused sending FRMR.
 					      Used for resending FRMR */
 	u32		    cmsg_flags;
+	struct hlist_node   dev_hash_node;
 };
 
 static inline struct llc_sock *llc_sk(const struct sock *sk)
diff --git a/net/llc/llc_conn.c b/net/llc/llc_conn.c
index c6bab39..c3f47ec 100644
--- a/net/llc/llc_conn.c
+++ b/net/llc/llc_conn.c
@@ -651,11 +651,19 @@  static int llc_find_offset(int state, int ev_type)
  */
 void llc_sap_add_socket(struct llc_sap *sap, struct sock *sk)
 {
+	struct llc_sock *llc = llc_sk(sk);
+	struct llc_sk_list *dev_hb = llc_sk_dev_hash(sap, llc->dev->ifindex);
+
 	llc_sap_hold(sap);
+	llc->sap = sap;
+
 	write_lock_bh(&sap->sk_list.lock);
-	llc_sk(sk)->sap = sap;
 	sk_add_node(sk, &sap->sk_list.list);
 	write_unlock_bh(&sap->sk_list.lock);
+
+	write_lock_bh(&dev_hb->lock);
+	hlist_add_head(&llc->dev_hash_node, &dev_hb->list);
+	write_unlock_bh(&dev_hb->lock);
 }
 
 /**
@@ -668,9 +676,17 @@  void llc_sap_add_socket(struct llc_sap *sap, struct sock *sk)
  */
 void llc_sap_remove_socket(struct llc_sap *sap, struct sock *sk)
 {
+	struct llc_sock *llc = llc_sk(sk);
+	struct llc_sk_list *dev_hb = llc_sk_dev_hash(sap, llc->dev->ifindex);
+
 	write_lock_bh(&sap->sk_list.lock);
 	sk_del_node_init(sk);
 	write_unlock_bh(&sap->sk_list.lock);
+
+	write_lock_bh(&dev_hb->lock);
+	hlist_del(&llc->dev_hash_node);
+	write_unlock_bh(&dev_hb->lock);
+
 	llc_sap_put(sap);
 }
 
diff --git a/net/llc/llc_core.c b/net/llc/llc_core.c
index ff4c0ab..1d79cda 100644
--- a/net/llc/llc_core.c
+++ b/net/llc/llc_core.c
@@ -33,12 +33,15 @@  DEFINE_RWLOCK(llc_sap_list_lock);
 static struct llc_sap *llc_sap_alloc(void)
 {
 	struct llc_sap *sap = kzalloc(sizeof(*sap), GFP_ATOMIC);
+	int i;
 
 	if (sap) {
 		/* sap->laddr.mac - leave as a null, it's filled by bind */
 		sap->state = LLC_SAP_STATE_ACTIVE;
 		rwlock_init(&sap->sk_list.lock);
 		atomic_set(&sap->refcnt, 1);
+		for (i = 0; i < LLC_SK_DEV_HASH_ENTRIES; i++)
+			rwlock_init(&sap->sk_dev_hash[i].lock);
 	}
 	return sap;
 }
diff --git a/net/llc/llc_sap.c b/net/llc/llc_sap.c
index 008de1f..11bad55 100644
--- a/net/llc/llc_sap.c
+++ b/net/llc/llc_sap.c
@@ -340,12 +340,13 @@  static void llc_sap_mcast(struct llc_sap *sap,
 			  const struct llc_addr *laddr,
 			  struct sk_buff *skb)
 {
-	struct sock *sk;
+	struct llc_sock *llc;
 	struct hlist_node *node;
+	struct llc_sk_list *dev_hb = llc_sk_dev_hash(sap, skb->dev->ifindex);
 
-	read_lock_bh(&sap->sk_list.lock);
-	sk_for_each(sk, node, &sap->sk_list.list) {
-		struct llc_sock *llc = llc_sk(sk);
+	read_lock_bh(&dev_hb->lock);
+	hlist_for_each_entry(llc, node, &dev_hb->list, dev_hash_node) {
+		struct sock *sk = &llc->sk;
 		struct sk_buff *skb1;
 
 		if (sk->sk_type != SOCK_DGRAM)
@@ -365,7 +366,7 @@  static void llc_sap_mcast(struct llc_sap *sap,
 		llc_sap_rcv(sap, skb1, sk);
 		sock_put(sk);
 	}
-	read_unlock_bh(&sap->sk_list.lock);
+	read_unlock_bh(&dev_hb->lock);
 }