| Message ID | 20200214224302.229920-1-brianvv@google.com |
|---|---|
| State | Changes Requested |
| Delegated to | BPF Maintainers |
| Series | [bpf] bpf: Do not grab the bucket spinlock by default on htab batch ops |
On 2/14/20 2:43 PM, Brian Vazquez wrote:
> Grabbing the spinlock for every bucket even if it's empty, was causing
> significant perfomance cost when traversing htab maps that have only a
> few entries. This patch addresses the issue by checking first the
> bucket_cnt, if the bucket has some entries then we go and grab the
> spinlock and proceed with the batching.
>
> Tested with a htab of size 50K and different value of populated entries.
>
> [...]
>
> Fixes: 057996380a42 ("bpf: Add batch ops to all htab bpf map")
> Cc: Yonghong Song <yhs@fb.com>
> Signed-off-by: Brian Vazquez <brianvv@google.com>

Acked-by: Yonghong Song <yhs@fb.com>
On 2/18/20 4:43 PM, Yonghong Song wrote:
> On 2/14/20 2:43 PM, Brian Vazquez wrote:
>> Grabbing the spinlock for every bucket even if it's empty, was causing
>> significant perfomance cost when traversing htab maps that have only a
>> few entries. [...]
>
> Acked-by: Yonghong Song <yhs@fb.com>

I must probably be missing something, but how is this safe? Presume we
traverse in the walk with bucket_cnt = 0. Meanwhile a different CPU added
entries to this bucket since not locked. Same reader on the other CPU with
bucket_cnt = 0 then starts to traverse the second
hlist_nulls_for_each_entry_safe() unlocked e.g. deleting entries?
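To make the hazard Daniel describes concrete outside the kernel, here is a small, hypothetical userspace analogue (demo code only, not from the patch): a reader observes an "empty" count without taking the lock and then walks the list anyway, while a writer concurrently inserts and frees nodes. The names and the pthread spinlock are illustrative stand-ins for bucket_cnt and b->lock.

```c
/* unlocked_walk_race.c - userspace analogue of the race discussed above.
 * Build: gcc -pthread unlocked_walk_race.c -o unlocked_walk_race
 * Depending on timing it may crash (use-after-free) or appear to work,
 * which is exactly what makes this class of bug nasty.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct node { int val; struct node *next; };

static struct node *head;
static int count;                 /* plays the role of bucket_cnt */
static pthread_spinlock_t lock;   /* plays the role of b->lock */

static void *writer(void *arg)
{
	(void)arg;
	for (int i = 0; i < 10 * 1000 * 1000; i++) {
		struct node *n = malloc(sizeof(*n));

		n->val = i;
		pthread_spin_lock(&lock);
		n->next = head;
		head = n;
		count++;
		pthread_spin_unlock(&lock);

		pthread_spin_lock(&lock);
		n = head;
		head = n->next;
		count--;
		pthread_spin_unlock(&lock);
		free(n);          /* freed immediately; no RCU grace period */
	}
	return NULL;
}

static void *reader(void *arg)
{
	long sum = 0;

	(void)arg;
	for (;;) {
		if (count != 0)
			continue;
		/* BUG: "empty" was observed without holding the lock, yet we
		 * walk the list anyway; the writer may be mutating it and
		 * freeing nodes right now.
		 */
		for (struct node *n = head; n; n = n->next)
			sum += n->val;          /* possible use-after-free */
	}
	return (void *)sum;
}

int main(void)
{
	pthread_t w, r;

	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
	pthread_create(&r, NULL, reader, NULL);
	pthread_create(&w, NULL, writer, NULL);
	pthread_join(w, NULL);   /* reader is torn down when main returns */
	return 0;
}
```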
On 2/18/20 7:56 AM, Daniel Borkmann wrote:
> On 2/18/20 4:43 PM, Yonghong Song wrote:
>> On 2/14/20 2:43 PM, Brian Vazquez wrote:
>>> Grabbing the spinlock for every bucket even if it's empty, was causing
>>> significant perfomance cost when traversing htab maps that have only a
>>> few entries. [...]
>>
>> Acked-by: Yonghong Song <yhs@fb.com>
>
> I must probably be missing something, but how is this safe? Presume we
> traverse in the walk with bucket_cnt = 0. Meanwhile a different CPU added
> entries to this bucket since not locked. Same reader on the other CPU with
> bucket_cnt = 0 then starts to traverse the second
> hlist_nulls_for_each_entry_safe() unlocked e.g. deleting entries?

Thanks for pointing this out. Yes, you are correct. If bucket_cnt is 0
and buck->lock is not held, we should skip the

	hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
		...
	}

as another cpu may traverse the bucket in parallel by adding/deleting
the elements.
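Concretely, the adjustment Yonghong describes amounts to guarding the copy/delete walk on whether the bucket lock was actually taken. Below is a minimal illustrative sketch of such a guard; the label name and exact placement are assumptions, and the real follow-up patch may look different.

```c
	/* Only walk (and possibly delete) elements if we actually hold
	 * b->lock. With bucket_cnt == 0 the lock was never taken, and
	 * another CPU may be adding/removing elements concurrently.
	 */
	if (!locked)
		goto next_batch;        /* advance to the next bucket */

	hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
		/* copy key/value, optionally unlink for the delete variant */
	}
```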
On Tue, Feb 18, 2020 at 8:34 AM Yonghong Song <yhs@fb.com> wrote:
> On 2/18/20 7:56 AM, Daniel Borkmann wrote:
> > [...]
> >
> > I must probably be missing something, but how is this safe? Presume we
> > traverse in the walk with bucket_cnt = 0. Meanwhile a different CPU added
> > entries to this bucket since not locked. Same reader on the other CPU with
> > bucket_cnt = 0 then starts to traverse the second
> > hlist_nulls_for_each_entry_safe() unlocked e.g. deleting entries?
>
> Thanks for pointing this out. Yes, you are correct. If bucket_cnt is 0
> and buck->lock is not held, we should skip the
> hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
> 	...
> }
> as another cpu may traverse the bucket in parallel by adding/deleting
> the elements.

Makes sense. Let me fix it in the next version, thanks for reviewing it!
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 2d182c4ee9d99..fdbde28b0fe06 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1260,6 +1260,7 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 	struct hlist_nulls_head *head;
 	struct hlist_nulls_node *n;
 	unsigned long flags;
+	bool locked = false;
 	struct htab_elem *l;
 	struct bucket *b;
 	int ret = 0;
@@ -1319,15 +1320,25 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 	dst_val = values;
 	b = &htab->buckets[batch];
 	head = &b->head;
-	raw_spin_lock_irqsave(&b->lock, flags);
+	/* do not grab the lock unless need it (bucket_cnt > 0). */
+	if (locked)
+		raw_spin_lock_irqsave(&b->lock, flags);
 
 	bucket_cnt = 0;
 	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
 		bucket_cnt++;
 
+	if (bucket_cnt && !locked) {
+		locked = true;
+		goto again_nocopy;
+	}
+
 	if (bucket_cnt > (max_count - total)) {
 		if (total == 0)
 			ret = -ENOSPC;
+		/* Note that since bucket_cnt > 0 here, it is implicit
+		 * that the locked was grabbed, so release it.
+		 */
 		raw_spin_unlock_irqrestore(&b->lock, flags);
 		rcu_read_unlock();
 		this_cpu_dec(bpf_prog_active);
@@ -1337,6 +1348,9 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 
 	if (bucket_cnt > bucket_size) {
 		bucket_size = bucket_cnt;
+		/* Note that since bucket_cnt > 0 here, it is implicit
+		 * that the locked was grabbed, so release it.
+		 */
 		raw_spin_unlock_irqrestore(&b->lock, flags);
 		rcu_read_unlock();
 		this_cpu_dec(bpf_prog_active);
@@ -1379,7 +1393,10 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 		dst_val += value_size;
 	}
 
-	raw_spin_unlock_irqrestore(&b->lock, flags);
+	if (locked) {
+		raw_spin_unlock_irqrestore(&b->lock, flags);
+		locked = false;
+	}
 	/* If we are not copying data, we can go to next bucket and avoid
 	 * unlocking the rcu.
 	 */
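For readers skimming the hunks above, this is a condensed, hand-written paraphrase of the per-bucket control flow the patch introduces. It is illustrative pseudocode, not the actual hashtab.c source; error handling, buffer management, and the RCU/preemption bookkeeping are omitted.

```c
	bool locked = false;

again_nocopy:
	b = &htab->buckets[batch];
	head = &b->head;
	if (locked)                        /* only true on the second pass */
		raw_spin_lock_irqsave(&b->lock, flags);

	/* Pass 1 (unlocked, under RCU): just count the bucket. */
	bucket_cnt = 0;
	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
		bucket_cnt++;

	/* Non-empty bucket: redo the pass, this time with the lock held. */
	if (bucket_cnt && !locked) {
		locked = true;
		goto again_nocopy;
	}

	/* ... check output-buffer space, then copy (and optionally delete)
	 * the elements; per the review comments above, that walk must also
	 * be skipped when the lock was never taken ...
	 */

	if (locked) {
		raw_spin_unlock_irqrestore(&b->lock, flags);
		locked = false;
	}
	/* advance 'batch' to the next bucket */
```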
Grabbing the spinlock for every bucket, even if it is empty, was causing
a significant performance cost when traversing htab maps that have only
a few entries. This patch addresses the issue by checking bucket_cnt
first: only if the bucket has entries do we grab the spinlock and
proceed with the batching.

Tested with a htab of size 50K and different numbers of populated entries.

Before:
  Benchmark              Time(ns)        CPU(ns)
  ---------------------------------------------
  BM_DumpHashMap/1        2759655        2752033
  BM_DumpHashMap/10       2933722        2930825
  BM_DumpHashMap/200      3171680        3170265
  BM_DumpHashMap/500      3639607        3635511
  BM_DumpHashMap/1000     4369008        4364981
  BM_DumpHashMap/5k      11171919       11134028
  BM_DumpHashMap/20k     69150080       69033496
  BM_DumpHashMap/39k    190501036      190226162

After:
  Benchmark              Time(ns)        CPU(ns)
  ---------------------------------------------
  BM_DumpHashMap/1         202707         200109
  BM_DumpHashMap/10        213441         210569
  BM_DumpHashMap/200       478641         472350
  BM_DumpHashMap/500       980061         967102
  BM_DumpHashMap/1000     1863835        1839575
  BM_DumpHashMap/5k       8961836        8902540
  BM_DumpHashMap/20k     69761497       69322756
  BM_DumpHashMap/39k    187437830      186551111

Fixes: 057996380a42 ("bpf: Add batch ops to all htab bpf map")
Cc: Yonghong Song <yhs@fb.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
---
 kernel/bpf/hashtab.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)
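For context, the BM_DumpHashMap numbers above measure dumping a hash map from user space, which goes through the batch lookup path this patch touches. Below is a minimal sketch of such a dump loop using libbpf's bpf_map_lookup_batch(); the key/value sizes, the CHUNK size, and the dump_hash_map() helper are assumptions for illustration, not part of the patch.

```c
#include <errno.h>
#include <stdio.h>
#include <bpf/bpf.h>    /* bpf_map_lookup_batch(), DECLARE_LIBBPF_OPTS */

#define CHUNK 256       /* illustrative batch size per syscall */

/* Dump a BPF_MAP_TYPE_HASH map, assumed here to have 4-byte keys and
 * 8-byte values. 'map_fd' is an already-opened map file descriptor.
 * Link with -lbpf.
 */
static int dump_hash_map(int map_fd)
{
	__u32 keys[CHUNK];
	__u64 vals[CHUNK];
	__u32 in_batch = 0, out_batch = 0, count;
	void *in = NULL;        /* NULL means "start from the beginning" */
	DECLARE_LIBBPF_OPTS(bpf_map_batch_opts, opts);
	int err;

	do {
		count = CHUNK;
		err = bpf_map_lookup_batch(map_fd, in, &out_batch,
					   keys, vals, &count, &opts);
		if (err && errno != ENOENT)     /* ENOENT: traversal done */
			return -errno;

		for (__u32 i = 0; i < count; i++)
			printf("key %u -> value %llu\n", keys[i],
			       (unsigned long long)vals[i]);

		in_batch = out_batch;   /* resume token for the next call */
		in = &in_batch;
	} while (!err);

	return 0;
}
```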