
[4/6] bpf: hash: convert per-hashtable lock into per-bucket bit spinlock

Message ID: 1450178464-27721-5-git-send-email-tom.leiming@gmail.com
State: Changes Requested, archived
Delegated to: David Miller

Commit Message

Ming Lei Dec. 15, 2015, 11:21 a.m. UTC
Both htab_map_update_elem() and htab_map_delete_elem() can be
called from an eBPF program, and they may be on a kernel hot path,
so it isn't efficient to use a per-hashtable lock in these two
helpers.

The per-hashtable spinlock is used only to protect each bucket's
hlist, so a per-bucket lock is enough. This patch converts the
per-hashtable lock into a per-bucket bit spinlock, so that contention
is reduced a lot and no extra memory is consumed for these locks.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 kernel/bpf/hashtab.c | 38 ++++++++++++++++++++++++++------------
 1 file changed, 26 insertions(+), 12 deletions(-)
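
To see the idea at a glance, here is a minimal standalone sketch of a bucket
update under a bit spinlock packed into the hlist head pointer (illustrative
kernel-style C, not the patch itself; HLIST_LOCK_BIT and the
lock-bit-preserving list helpers are assumed to come from earlier patches in
the series):

#include <linux/bit_spinlock.h>
#include <linux/list.h>

/* hlist nodes are pointer-aligned, so bit 0 of hlist_head->first is always
 * zero for a real pointer and can double as a per-bucket spinlock bit,
 * costing no extra memory per bucket.
 */
#define HLIST_LOCK_BIT	0	/* assumed value; the series defines the real one */

static void bucket_modify(struct hlist_head *head)
{
	unsigned long flags;

	/* the update helpers can run in_irq(), hence irqsave + bit lock */
	raw_local_irq_save(flags);
	bit_spin_lock(HLIST_LOCK_BIT, (unsigned long *)&head->first);

	/* ... add/delete elements with helpers that preserve the lock bit ... */

	bit_spin_unlock(HLIST_LOCK_BIT, (unsigned long *)&head->first);
	raw_local_irq_restore(flags);
}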

Comments

Alexei Starovoitov Dec. 15, 2015, 10:51 p.m. UTC | #1
On Tue, Dec 15, 2015 at 07:21:02PM +0800, Ming Lei wrote:
> Both htab_map_update_elem() and htab_map_delete_elem() can be
> called from an eBPF program, and they may be on a kernel hot path,
> so it isn't efficient to use a per-hashtable lock in these two
> helpers.
>
> The per-hashtable spinlock is used only to protect each bucket's
> hlist, so a per-bucket lock is enough. This patch converts the
> per-hashtable lock into a per-bucket bit spinlock, so that contention
> is reduced a lot and no extra memory is consumed for these locks.
> 
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>

thank you for working on this.
Interesting stuff!

>  	/* bpf_map_update_elem() can be called in_irq() */
> -	raw_spin_lock_irqsave(&htab->lock, flags);
> +	raw_local_irq_save(flags);
> +	bit_spin_lock(HLIST_LOCK_BIT, (unsigned long *)&head->first);

can you add a helper for bit_spin_lock/unlock as well so that whole hlist+bit
api looks consistent?
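
For example, the pair could be wrapped along these lines (hypothetical names,
only to illustrate the suggested helper; not code from the patch):

static inline void hlist_bit_lock(struct hlist_head *head)
{
	/* take the per-bucket lock stored in bit 0 of head->first */
	bit_spin_lock(HLIST_LOCK_BIT, (unsigned long *)&head->first);
}

static inline void hlist_bit_unlock(struct hlist_head *head)
{
	bit_spin_unlock(HLIST_LOCK_BIT, (unsigned long *)&head->first);
}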

>  
> -	l_old = lookup_elem_raw(head, l_new->hash, key, key_size);
> +	l_old = lookup_elem_raw(hlist_get_head_lock(head, &h), l_new->hash,
> +			key, key_size);
>  
>  	if (!l_old && unlikely(atomic_read(&htab->count) >= map->max_entries)) {
>  		/* if elem with this 'key' doesn't exist and we've reached
> @@ -278,18 +284,20 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>  	/* add new element to the head of the list, so that concurrent
>  	 * search will find it before old elem
>  	 */
> -	hlist_add_head_rcu(&l_new->hash_node, head);
> +	hlist_add_head_rcu_lock(&l_new->hash_node, head);

I think the new macros have confusing names:

+#define hlist_get_first(h)     \
+       (struct hlist_node *)((unsigned long)(h)->first & ~HLIST_LOCK_MASK)
+#define hlist_get_lock(h)    ((unsigned long)(h)->first & HLIST_LOCK_MASK)
+#define hlist_make_1st_node(h, n) \
+       (struct hlist_node *)((unsigned long)(n) | hlist_get_lock(h))
+
+static inline struct hlist_head *hlist_get_head_lock(
+               struct hlist_head *head, struct hlist_head *new_head)
+{
+       new_head->first = hlist_get_first(head);
+       return new_head;
+}

This new hlist_add_head_rcu_lock() adds a new element, but it doesn't take a new lock.
Maybe rename this new API family to 'locked_hlist'?
Then all normal helpers will convert like:
hlist_add_head_rcu() -> locked_hlist_add_head_rcu()

>  	if (l_old) {
> -		hlist_del_rcu(&l_old->hash_node);
> +		hlist_del_rcu_lock(&l_old->hash_node);

and here it will be:
hlist_del_rcu() -> locked_hlist_del_rcu()

Also is there a race here ?
+static inline void hlist_add_head_rcu_lock(struct hlist_node *n,
+                                          struct hlist_head *h)
+{
+       struct hlist_node *first = hlist_get_first(h);
+
+       n->next = first;
+       n->pprev = &h->first;
+       rcu_assign_pointer(hlist_first_rcu(h), hlist_make_1st_node(h, n));
+       if (first)
+               first->pprev = &n->next;
+}

Do you need cmpxchg when updating hlist_head->first?

Overall looks like a very interesting idea, but I'm not sure about
trade-off of saving 8 bytes for rounded-up spinlock per bucket.
The loss of lockdep is concerning.
Have you compared performance of this bit-lock embedded inside hlist_head vs
just adding spinlock_t for every hlist_head?
Another concern is that hlist walking may be more costly due to
requirement of having to clear that bit while doing the walk in lookup().
I guess I would prefer normal spinlock per bucket. Additional 8 * n_buckets bytes
don't seem to be worth optimizing.

Ming Lei Dec. 16, 2015, 2:57 a.m. UTC | #2
Hi Alexei,

On Wed, Dec 16, 2015 at 6:51 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Tue, Dec 15, 2015 at 07:21:02PM +0800, Ming Lei wrote:
>> Both htab_map_update_elem() and htab_map_delete_elem() can be
>> called from an eBPF program, and they may be on a kernel hot path,
>> so it isn't efficient to use a per-hashtable lock in these two
>> helpers.
>>
>> The per-hashtable spinlock is used only to protect each bucket's
>> hlist, so a per-bucket lock is enough. This patch converts the
>> per-hashtable lock into a per-bucket bit spinlock, so that contention
>> is reduced a lot and no extra memory is consumed for these locks.
>>
>> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
>
> thank you for working on this.
> Interesting stuff!

Thanks for your review!

>
>>       /* bpf_map_update_elem() can be called in_irq() */
>> -     raw_spin_lock_irqsave(&htab->lock, flags);
>> +     raw_local_irq_save(flags);
>> +     bit_spin_lock(HLIST_LOCK_BIT, (unsigned long *)&head->first);
>
> can you add a helper for bit_spin_lock/unlock as well so that whole hlist+bit
> api looks consistent?

OK, I will try to add it in V1.

>
>>
>> -     l_old = lookup_elem_raw(head, l_new->hash, key, key_size);
>> +     l_old = lookup_elem_raw(hlist_get_head_lock(head, &h), l_new->hash,
>> +                     key, key_size);
>>
>>       if (!l_old && unlikely(atomic_read(&htab->count) >= map->max_entries)) {
>>               /* if elem with this 'key' doesn't exist and we've reached
>> @@ -278,18 +284,20 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>>       /* add new element to the head of the list, so that concurrent
>>        * search will find it before old elem
>>        */
>> -     hlist_add_head_rcu(&l_new->hash_node, head);
>> +     hlist_add_head_rcu_lock(&l_new->hash_node, head);
>
> I think the new macros have confusing names:
>
> +#define hlist_get_first(h)     \
> +       (struct hlist_node *)((unsigned long)(h)->first & ~HLIST_LOCK_MASK)
> +#define hlist_get_lock(h)    ((unsigned long)(h)->first & HLIST_LOCK_MASK)
> +#define hlist_make_1st_node(h, n) \
> +       (struct hlist_node *)((unsigned long)(n) | hlist_get_lock(h))
> +
> +static inline struct hlist_head *hlist_get_head_lock(
> +               struct hlist_head *head, struct hlist_head *new_head)
> +{
> +       new_head->first = hlist_get_first(head);
> +       return new_head;
> +}
>
> This new hlist_add_head_rcu_lock() adds a new element, but it doesn't take a new lock.
> Maybe rename this new API family to 'locked_hlist'?

Looks much better, :-)

> Then all normal helpers will convert like:
> hlist_add_head_rcu() -> locked_hlist_add_head_rcu()
>
>>       if (l_old) {
>> -             hlist_del_rcu(&l_old->hash_node);
>> +             hlist_del_rcu_lock(&l_old->hash_node);
>
> and here it will be:
> hlist_del_rcu() -> locked_hlist_del_rcu()
>
> Also is there a race here ?

Both locked_hlist_add_head_rcu() and locked_hlist_del_rcu()
leave the lock bit of hlist_head->first untouched, and the lock has
to be held before calling either helper.

So I don't think there is a race.

> +static inline void hlist_add_head_rcu_lock(struct hlist_node *n,
> +                                          struct hlist_head *h)
> +{
> +       struct hlist_node *first = hlist_get_first(h);
> +
> +       n->next = first;
> +       n->pprev = &h->first;
> +       rcu_assign_pointer(hlist_first_rcu(h), hlist_make_1st_node(h, n));
> +       if (first)
> +               first->pprev = &n->next;
> +}
>
> Do you need cmpxchg when updating hlist_head->first?

Firstly, it is just one pointer update, and updating a single pointer
is implicitly guaranteed to be atomic.

Secondly, the bit spinlock is already held before calling the helper,
so no concurrent hlist add/del is possible.

So cmpxchg isn't needed here.
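
To make the contrast concrete, a rough sketch (illustrative only, not from
the patch):

/* Without a lock, publishing a new list head would need a cmpxchg()
 * retry loop so a concurrent writer could not be lost, roughly:
 *
 *	do {
 *		first = READ_ONCE(head->first);
 *		new->next = first;
 *	} while (cmpxchg(&head->first, first, new) != first);
 *
 * With the bucket's bit spinlock held there is exactly one writer per
 * bucket, so hlist_add_head_rcu_lock() can publish the node with a single
 * rcu_assign_pointer() store; hlist_make_1st_node() ORs the lock bit back
 * in, so that store never releases the lock, and readers are ordered by RCU.
 */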

>
> Overall looks like a very interesting idea, but I'm not sure about
> trade-off of saving 8 bytes for rounded-up spinlock per bucket.

The hashtab can be very big, and one of the reasons hlist_head is
defined as a single pointer is to save memory.
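
For a rough sense of the scale involved (hedged numbers, not measurements
from this thread):

/* A map sized for about one million entries ends up with roughly one
 * bucket per entry, rounded up to a power of two:
 *
 *	n_buckets           = 1 << 20             -> 1,048,576 buckets
 *	rounded-up spinlock = 8 bytes per bucket
 *	extra memory        = 8 * n_buckets       = 8 MiB per map
 *
 * versus zero extra bytes when the lock is a bit inside hlist_head.
 */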

> The loss of lockdep is concerning.

We are already using raw_spin_lock/unlock, so lockdep is bypassed
anyway; also, no other lock is nested inside the bit lock, so
lockdep isn't needed.

> Have you compared performance of this bit-lock embedded inside hlist_head vs
> just adding spinlock_t for every hlist_head?

Yes, I did, and no difference was observed.

> Another concern is that hlist walking may be more costly due to
> requirement of having to clear that bit while doing the walk in lookup().

Whether the bit is cleared or not, the head variable needs to be
loaded into cache and a register anyway, and clearing the bit is just
one instruction (or none), so I think the cost is a nop or close to one.
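
In other words, the reader side only has to do something like this (a sketch
using the names from this patch; hlist_get_head_lock() builds the on-stack
copy):

static struct htab_elem *bucket_lookup(struct hlist_head *head, u32 hash,
				       void *key, u32 key_size)
{
	struct hlist_head h;	/* on-stack head with the lock bit masked off */

	h.first = (struct hlist_node *)
			((unsigned long)head->first & ~HLIST_LOCK_MASK);

	/* one extra AND on an already-loaded pointer, then a normal walk */
	return lookup_elem_raw(&h, hash, key, key_size);
}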

> I guess I would prefer normal spinlock per bucket. Additional 8 * n_buckets bytes
> don't seem to be worth optimizing.

As I mentioned, the hashtable can be very big, and that is why
hlist_head is defined as a single pointer.  Also, a bit spinlock is
enough for this very low contention case, :-)

In the future we might even extend this into a generic hashtable
usage pattern, :-)

Thanks,
Ming Lei
Ming Lei Dec. 16, 2015, 6:58 a.m. UTC | #3
On Wed, Dec 16, 2015 at 1:01 PM, Yang Shi <shy828301@gmail.com> wrote:

>
> I recalled Steven confirmed raw_spin_lock has the lockdep benefit too in the
> patch review for changing to raw lock.
>
> Please check this thread out
>  http://lists.openwall.net/netdev/2015/10/31/7

OK, looks like I was wrong about the lockdep benefit, :-(

But for this lock, I think lockdep isn't that important, because it is the
innermost lock and it is only used to protect the bucket list;
nothing else needs to be covered.


Thanks,
Ming Lei
Alexei Starovoitov Dec. 18, 2015, 6:20 a.m. UTC | #4
On Wed, Dec 16, 2015 at 02:58:08PM +0800, Ming Lei wrote:
> On Wed, Dec 16, 2015 at 1:01 PM, Yang Shi <shy828301@gmail.com> wrote:
> 
> >
> > I recalled Steven confirmed raw_spin_lock has the lockdep benefit too in the
> > patch review for changing to raw lock.
> >
> > Please check this thread out
> >  http://lists.openwall.net/netdev/2015/10/31/7
> 
> OK, looks like I was wrong about the lockdep benefit, :-(
> 
> But for this lock, I think lockdep isn't that important, because it is the
> innermost lock and it is only used to protect the bucket list;
> nothing else needs to be covered.

I still think that overhead of normal spinlock per bucket is acceptable.
Makes the whole thing much easier to read.

Ming Lei Dec. 26, 2015, 8:58 a.m. UTC | #5
On Fri, Dec 18, 2015 at 2:20 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Wed, Dec 16, 2015 at 02:58:08PM +0800, Ming Lei wrote:
>> On Wed, Dec 16, 2015 at 1:01 PM, Yang Shi <shy828301@gmail.com> wrote:
>>
>> >
>> > I recalled Steven confirmed raw_spin_lock has the lockdep benefit too in the
>> > patch review for changing to raw lock.
>> >
>> > Please check this thread out
>> >  http://lists.openwall.net/netdev/2015/10/31/7
>>
>> OK, looks like I was wrong about the lockdep benefit, :-(
>>
>> But for this lock, I think lockdep isn't that important, because it is the
>> innermost lock and it is only used to protect the bucket list;
>> nothing else needs to be covered.
>
> I still think that overhead of normal spinlock per bucket is acceptable.
> Makes the whole thing much easier to read.

OK, let's use per-bucket spinlock first.
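
For reference, a minimal sketch of what a per-bucket spinlock layout could
look like (illustrative only; the actual follow-up patch may differ):

struct bucket {
	struct hlist_head head;
	raw_spinlock_t lock;		/* ~8 extra bytes per bucket */
};

struct bpf_htab {
	struct bpf_map map;
	struct bucket *buckets;
	atomic_t count;
	u32 n_buckets;
	u32 elem_size;
};

/* update/delete paths then simply do:
 *
 *	raw_spin_lock_irqsave(&b->lock, flags);
 *	... modify b->head with the ordinary hlist_*_rcu() helpers ...
 *	raw_spin_unlock_irqrestore(&b->lock, flags);
 */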

--
Ming Lei

Patch

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index d857fcb..8543fea 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -17,7 +17,6 @@ 
 struct bpf_htab {
 	struct bpf_map map;
 	struct hlist_head *buckets;
-	raw_spinlock_t lock;
 	atomic_t count;	/* number of elements in this hashtable */
 	u32 n_buckets;	/* number of hash buckets */
 	u32 elem_size;	/* size of each element in bytes */
@@ -105,7 +104,6 @@  static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	for (i = 0; i < htab->n_buckets; i++)
 		INIT_HLIST_HEAD(&htab->buckets[i]);
 
-	raw_spin_lock_init(&htab->lock);
 	atomic_set(&htab->count, 0);
 
 	return &htab->map;
@@ -142,6 +140,7 @@  static void *htab_map_lookup_elem(struct bpf_map *map, void *key)
 {
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
 	struct hlist_head *head;
+	struct hlist_head h;
 	struct htab_elem *l;
 	u32 hash, key_size;
 
@@ -153,6 +152,7 @@  static void *htab_map_lookup_elem(struct bpf_map *map, void *key)
 	hash = htab_map_hash(key, key_size);
 
 	head = select_bucket(htab, hash);
+	head = hlist_get_head_lock(head, &h);
 
 	l = lookup_elem_raw(head, hash, key, key_size);
 
@@ -167,6 +167,7 @@  static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 {
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
 	struct hlist_head *head;
+	struct hlist_head h;
 	struct htab_elem *l, *next_l;
 	u32 hash, key_size;
 	int i;
@@ -178,6 +179,7 @@  static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 	hash = htab_map_hash(key, key_size);
 
 	head = select_bucket(htab, hash);
+	head = hlist_get_head_lock(head, &h);
 
 	/* lookup the key */
 	l = lookup_elem_raw(head, hash, key, key_size);
@@ -205,6 +207,7 @@  find_first_elem:
 	/* iterate over buckets */
 	for (; i < htab->n_buckets; i++) {
 		head = select_bucket(htab, i);
+		head = hlist_get_head_lock(head, &h);
 
 		/* pick first element in the bucket */
 		next_l = hlist_entry_safe(rcu_dereference_raw(hlist_first_rcu(head)),
@@ -227,6 +230,7 @@  static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
 	struct htab_elem *l_new, *l_old;
 	struct hlist_head *head;
+	struct hlist_head h;
 	unsigned long flags;
 	u32 key_size;
 	int ret;
@@ -251,9 +255,11 @@  static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	head = select_bucket(htab, l_new->hash);
 
 	/* bpf_map_update_elem() can be called in_irq() */
-	raw_spin_lock_irqsave(&htab->lock, flags);
+	raw_local_irq_save(flags);
+	bit_spin_lock(HLIST_LOCK_BIT, (unsigned long *)&head->first);
 
-	l_old = lookup_elem_raw(head, l_new->hash, key, key_size);
+	l_old = lookup_elem_raw(hlist_get_head_lock(head, &h), l_new->hash,
+			key, key_size);
 
 	if (!l_old && unlikely(atomic_read(&htab->count) >= map->max_entries)) {
 		/* if elem with this 'key' doesn't exist and we've reached
@@ -278,18 +284,20 @@  static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	/* add new element to the head of the list, so that concurrent
 	 * search will find it before old elem
 	 */
-	hlist_add_head_rcu(&l_new->hash_node, head);
+	hlist_add_head_rcu_lock(&l_new->hash_node, head);
 	if (l_old) {
-		hlist_del_rcu(&l_old->hash_node);
+		hlist_del_rcu_lock(&l_old->hash_node);
 		kfree_rcu(l_old, rcu);
 	} else {
 		atomic_inc(&htab->count);
 	}
-	raw_spin_unlock_irqrestore(&htab->lock, flags);
+	bit_spin_unlock(HLIST_LOCK_BIT, (unsigned long *)&head->first);
+	raw_local_irq_restore(flags);
 
 	return 0;
 err:
-	raw_spin_unlock_irqrestore(&htab->lock, flags);
+	bit_spin_unlock(HLIST_LOCK_BIT, (unsigned long *)&head->first);
+	raw_local_irq_restore(flags);
 	kfree(l_new);
 	return ret;
 }
@@ -299,6 +307,7 @@  static int htab_map_delete_elem(struct bpf_map *map, void *key)
 {
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
 	struct hlist_head *head;
+	struct hlist_head h;
 	struct htab_elem *l;
 	unsigned long flags;
 	u32 hash, key_size;
@@ -311,18 +320,20 @@  static int htab_map_delete_elem(struct bpf_map *map, void *key)
 	hash = htab_map_hash(key, key_size);
 	head = select_bucket(htab, hash);
 
-	raw_spin_lock_irqsave(&htab->lock, flags);
+	raw_local_irq_save(flags);
+	bit_spin_lock(HLIST_LOCK_BIT, (unsigned long *)&head->first);
 
-	l = lookup_elem_raw(head, hash, key, key_size);
+	l = lookup_elem_raw(hlist_get_head_lock(head, &h), hash, key, key_size);
 
 	if (l) {
-		hlist_del_rcu(&l->hash_node);
+		hlist_del_rcu_lock(&l->hash_node);
 		atomic_dec(&htab->count);
 		kfree_rcu(l, rcu);
 		ret = 0;
 	}
 
-	raw_spin_unlock_irqrestore(&htab->lock, flags);
+	bit_spin_unlock(HLIST_LOCK_BIT, (unsigned long *)&head->first);
+	raw_local_irq_restore(flags);
 	return ret;
 }
 
@@ -332,9 +343,12 @@  static void delete_all_elements(struct bpf_htab *htab)
 
 	for (i = 0; i < htab->n_buckets; i++) {
 		struct hlist_head *head = select_bucket(htab, i);
+		struct hlist_head h;
 		struct hlist_node *n;
 		struct htab_elem *l;
 
+		head = hlist_get_head_lock(head, &h);
+
 		hlist_for_each_entry_safe(l, n, head, hash_node) {
 			hlist_del_rcu(&l->hash_node);
 			atomic_dec(&htab->count);