Message ID | 1452527821-12276-6-git-send-email-tom.leiming@gmail.com |
---|---|
State | Deferred, archived |
Delegated to: | David Miller |
On Mon, Jan 11, 2016 at 11:56:57PM +0800, Ming Lei wrote:
> Prepare for supporting percpu map in the following patch.
>
> Now userspace can lookup/update mapped value in one specific
> CPU in case of percpu map.
>
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
...
> @@ -265,7 +272,10 @@ static int map_lookup_elem(union bpf_attr *attr)
>  		goto free_key;
>
>  	rcu_read_lock();
> -	ptr = map->ops->map_lookup_elem(map, key);
> +	if (!percpu)
> +		ptr = map->ops->map_lookup_elem(map, key);
> +	else
> +		ptr = map->ops->map_lookup_elem_percpu(map, key, attr->cpu);

I think this approach is less potent than Martin's for several reasons:
- bpf program shouldn't be supplying bpf_smp_processor_id(), since
  it's error prone and a bit slower than doing it explicitly as in:
  http://patchwork.ozlabs.org/patch/564482/
  although Martin's patch also needs to use this_cpu_ptr() instead
  of per_cpu_ptr(.., smp_processor_id());
- two new bpf helpers are not necessary in Martin's approach.
  regular map_lookup_elem() will work for both per-cpu maps.
- such map_lookup_elem_percpu() from syscall is not accurate.
  Martin's approach via smp_call_function_single() returns precise value,
  whereas here memcpy() will race with other cpus.

Overall I think both per-cpu hash and per-cpu array maps are quite useful.
For this particular set I would suggest to rebase on top of Martin's
to reuse BPF_MAP_LOOKUP_PERCPU_ELEM command that should be applicable
to both per-cpu array and per-cpu hash maps,
and add BPF_MAP_UPDATE_PERCPU_ELEM via smp_call as another patch
that should work for both as well.

Hi Alexei,

Thanks for your review.

On Tue, Jan 12, 2016 at 3:02 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> I think this approach is less potent than Martin's for several reasons:
> - bpf program shouldn't be supplying bpf_smp_processor_id(), since
>   it's error prone and a bit slower than doing it explicitly as in:
>   http://patchwork.ozlabs.org/patch/564482/
>   although Martin's patch also needs to use this_cpu_ptr() instead
>   of per_cpu_ptr(.., smp_processor_id());

For PERCPU map, smp_processor_id() is definitely required, and
Martin's patch needs that too, please see htab_percpu_map_lookup_elem()
in his patch.

> - two new bpf helpers are not necessary in Martin's approach.
>   regular map_lookup_elem() will work for both per-cpu maps.

For percpu ARRAY, they are not necessary, but it is flexible to
provide them since we should allow prog to retrieve the percpu
value, also it is easier to implement the system call with the two
helpers.

For percpu HASH, they are required since eBPF prog needs to support
deleting element, so we have to provide these helpers for prog to
retrieve percpu value before deleting the elem.

> - such map_lookup_elem_percpu() from syscall is not accurate.
>   Martin's approach via smp_call_function_single() returns precise value,

I don't understand why Martin's approach is precise and my patch isn't,
could you explain it a bit?

>   whereas here memcpy() will race with other cpus.
>
> Overall I think both per-cpu hash and per-cpu array maps are quite useful.

percpu hash isn't a must since we can get similar effect by making real_key
and cpu_id as key with less memory consumption, but we can introduce that.

> For this particular set I would suggest to rebase on top of Martin's
> to reuse BPF_MAP_LOOKUP_PERCPU_ELEM command that should be applicable
> to both per-cpu array and per-cpu hash maps.

Martin's patch doesn't introduce the two helpers, which are required for
percpu hash, and they also make the syscall easier to implement.

> and add BPF_MAP_UPDATE_PERCPU_ELEM via smp_call as another patch
> that should work for both as well.

Thanks,
Ming Lei

On Tue, Jan 12, 2016 at 01:00:00PM +0800, Ming Lei wrote:
> On Tue, Jan 12, 2016 at 3:02 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > - bpf program shouldn't be supplying bpf_smp_processor_id(), since
> >   it's error prone and a bit slower than doing it explicitly as in:
> >   http://patchwork.ozlabs.org/patch/564482/
> >   although Martin's patch also needs to use this_cpu_ptr() instead
> >   of per_cpu_ptr(.., smp_processor_id());
>
> For PERCPU map, smp_processor_id() is definitely required, and
> Martin's patch needs that too, please see htab_percpu_map_lookup_elem()
> in his patch.

hmm. it's definitely _not_ required. right?
bpf programs shouldn't be accessing other per-cpu regions,
only their own. That's what this_cpu_ptr is for.
I don't see a case where accessing other cpu per-cpu element
wouldn't be a bug in the program.

> > - two new bpf helpers are not necessary in Martin's approach.
> >   regular map_lookup_elem() will work for both per-cpu maps.
>
> For percpu ARRAY, they are not necessary, but it is flexible to
> provide them since we should allow prog to retrieve the percpu
> value, also it is easier to implement the system call with the two
> helpers.
>
> For percpu HASH, they are required since eBPF prog needs to support
> deleting element, so we have to provide these helpers for prog to
> retrieve percpu value before deleting the elem.

bpf programs cannot have loops, so there is no valid case to access
other cpu element, since program cannot aggregate all-cpu values.
Therefore the programs can only update/lookup this_cpu element and
delete such element across all cpus.

> > - such map_lookup_elem_percpu() from syscall is not accurate.
> >   Martin's approach via smp_call_function_single() returns precise value,
>
> I don't understand why Martin's approach is precise and my patch isn't,
> could you explain it a bit?

because simple memcpy() called from syscall will race with lookup/increment
done to this_cpu element on another cpu. To avoid this race the smp_call
is needed, so that memcpy() happens on the cpu that updated the element,
so smp_call's memcpy and bpf program won't be touching that cpu value
at the same time and user space will read the correct element values.
If program updates them a lot, the value that user space reads will become
stale very quickly, but it will be valid. That's especially important
when program has multiple counters inside single element value.

> >   whereas here memcpy() will race with other cpus.
> >
> > Overall I think both per-cpu hash and per-cpu array maps are quite useful.
>
> percpu hash isn't a must since we can get similar effect by making real_key
> and cpu_id as key with less memory consumption, but we can introduce that.

I don't think so. bpf programs shouldn't be dealing with smp_processor_id().
It was poor man's per-cpu hack and it had too many disadvantages.
Like get_next_key() doesn't work properly when key is {key+processor_id},
so walking over hash map to aggregate fake per-cpu elements requires
user space to create another map just for walking.
map->max_entries limit becomes bogus.
this_cpu_ptr(..) is typically faster than per_cpu_ptr(.., smp_proc_id())

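The race described above, where a syscall-side memcpy() copies an element while a program is partway through updating several counters in it, can be replayed deterministically on one thread. This is only an illustrative userspace sketch, not kernel code: `struct flow_stats`, the two update steps, and `torn_snapshot()` are hypothetical names invented for the example, and the interleaving is hard-coded rather than produced by real concurrency.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical element value holding two related counters. */
struct flow_stats {
	uint64_t packets;
	uint64_t bytes;
};

/* Step 1 of a two-step update: count the packet. */
static void update_step1(struct flow_stats *v)
{
	assert(v != NULL);
	v->packets += 1;
}

/* Step 2 of the same update: account the packet's bytes. */
static void update_step2(struct flow_stats *v, uint64_t len)
{
	assert(v != NULL);
	v->bytes += len;
}

/*
 * Replay one possible interleaving: the reader's memcpy() lands
 * between the writer's two steps, so the copy mixes old and new state.
 */
static struct flow_stats torn_snapshot(void)
{
	struct flow_stats elem = { .packets = 1, .bytes = 100 };
	struct flow_stats snap;

	update_step1(&elem);                /* writer: packets becomes 2 */
	memcpy(&snap, &elem, sizeof(snap)); /* reader copies mid-update  */
	update_step2(&elem, 100);           /* writer: bytes becomes 200 */
	return snap;
}
```

The snapshot holds 2 packets but only 100 bytes, a pair that never existed as a completed state (the completed states are 1/100 and 2/200). That is the sense in which a racing copy is "invalid" rather than merely stale.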
On Tue, Jan 12, 2016 at 1:49 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Tue, Jan 12, 2016 at 01:00:00PM +0800, Ming Lei wrote:
>> For percpu HASH, they are required since eBPF prog needs to support
>> deleting element, so we have to provide these helpers for prog to
>> retrieve percpu value before deleting the elem.
>
> bpf programs cannot have loops, so there is no valid case to access
> other cpu element, since program cannot aggregate all-cpu values.
> Therefore the programs can only update/lookup this_cpu element and
> delete such element across all cpus.

Looks like I missed the point of the looping constraint, then basically
the delete element helper doesn't make sense in percpu hash.

>> I don't understand why Martin's approach is precise and my patch isn't,
>> could you explain it a bit?
>
> because simple memcpy() called from syscall will race with lookup/increment
> done to this_cpu element on another cpu. To avoid this race the smp_call
> is needed, so that memcpy() happens on the cpu that updated the element,
> so smp_call's memcpy and bpf program won't be touching that cpu value
> at the same time and user space will read the correct element values.
> If program updates them a lot, the value that user space reads will become
> stale very quickly, but it will be valid. That's especially important
> when program has multiple counters inside single element value.

But smp_call is often very slow because of IPI, so the value accumulated
finally becomes stale easily even though the value from the requested cpu
is 'precise' at the exact time, especially when there are lots of CPUs, so I
think using smp_call is really a bad idea. And smp_call is worse than
iterating from CPUs simply.

>> percpu hash isn't a must since we can get similar effect by making real_key
>> and cpu_id as key with less memory consumption, but we can introduce that.
>
> I don't think so. bpf programs shouldn't be dealing with smp_processor_id().
> It was poor man's per-cpu hack and it had too many disadvantages.
> Like get_next_key() doesn't work properly when key is {key+processor_id},
> so walking over hash map to aggregate fake per-cpu elements requires
> user space to create another map just for walking.
> map->max_entries limit becomes bogus.
> this_cpu_ptr(..) is typically faster than per_cpu_ptr(.., smp_proc_id())

OK, then this_cpu_ptr() is better since we don't need to access the value
of other CPUs.

On Tue, Jan 12, 2016 at 07:05:47PM +0800, Ming Lei wrote:
> But smp_call is often very slow because of IPI, so the value accumulated
> finally becomes stale easily even though the value from the requested cpu
> is 'precise' at the exact time, especially when there are lots of CPUs, so I
> think using smp_call is really a bad idea. And smp_call is worse than
> iterating from CPUs simply.

The userspace usually only aggregates value across all cpu every X seconds.
I hardly consider some number of micro-seconds old data is stale.

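The periodic user-space aggregation mentioned above amounts to summing one value per CPU. A minimal sketch follows, assuming user space has already fetched each CPU's copy into a plain array; `sum_percpu()` and the array layout are hypothetical and are not part of either patch set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum one counter across per-CPU slots; vals[i] is CPU i's private copy. */
static uint64_t sum_percpu(const uint64_t *vals, size_t ncpus)
{
	uint64_t total = 0;

	/* An empty set of CPUs is allowed; otherwise vals must be valid. */
	assert(vals != NULL || ncpus == 0);

	for (size_t cpu = 0; cpu < ncpus; cpu++)
		total += vals[cpu];
	return total;
}
```

Each slot is read at a slightly different moment, so the total describes no single instant. The disagreement in this thread is about whether each individual slot is internally consistent when read, not about that small skew between slots.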
On Wed, Jan 13, 2016 at 3:10 AM, Martin KaFai Lau <kafai@fb.com> wrote:
> On Tue, Jan 12, 2016 at 07:05:47PM +0800, Ming Lei wrote:
>> But smp_call is often very slow because of IPI, so the value accumulated
>> finally becomes stale easily even though the value from the requested cpu
>> is 'precise' at the exact time, especially when there are lots of CPUs, so I
>> think using smp_call is really a bad idea. And smp_call is worse than
>> iterating from CPUs simply.
> The userspace usually only aggregates value across all cpu every X seconds.

That is just in your case, and Alexei worried about the issue of stale data.

> I hardly consider some number of micro-seconds old data is stale.

Firstly, a CPU can do huge things in micro-seconds, such as the if's irq
may just come during the period.

Secondly, the time can become longer (maybe dozens of us, or in
milli-seconds) if the CPU number is very big.

So why not do it in the quick way?

On Wed, Jan 13, 2016 at 08:38:18AM +0800, Ming Lei wrote:
> On Wed, Jan 13, 2016 at 3:10 AM, Martin KaFai Lau <kafai@fb.com> wrote:
> > The userspace usually only aggregates value across all cpu every X seconds.
>
> That is just in your case, and Alexei worried about the issue of stale data.

I believe we are talking about validity of a value. How to
make use of a less-stale but invalid data?

On Wed, Jan 13, 2016 at 10:22 AM, Martin KaFai Lau <kafai@fb.com> wrote:
> On Wed, Jan 13, 2016 at 08:38:18AM +0800, Ming Lei wrote:
>> > The userspace usually only aggregates value across all cpu every X seconds.
>>
>> That is just in your case, and Alexei worried about the issue of stale data.
> I believe we are talking about validity of a value. How to
> make use of a less-stale but invalid data?

About the 'invalidity' thing, it should be the same between using
smp_call (run in IPI irq handler) and simple memcpy().

When smp_call_function_single() is used to request a lookup of an element
on a specific CPU, the eBPF prog on that CPU may be in the middle of
updating the element's value; then the IPI comes and half-updated
data is still returned to the syscall.

Thanks,
Ming Lei

On Wed, Jan 13, 2016 at 11:17:23AM +0800, Ming Lei wrote:
> About the 'invalidity' thing, it should be the same between using
> smp_call (run in IPI irq handler) and simple memcpy().
>
> When smp_call_function_single() is used to request a lookup of an element
> on a specific CPU, the eBPF prog on that CPU may be in the middle of
> updating the element's value; then the IPI comes and half-updated
> data is still returned to the syscall.

hmm. I'm not following. bpf programs are executing with preempt disabled,
so smp_call_function_single is supposed to execute when bpf is not running.

On Wed, Jan 13, 2016 at 1:30 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> hmm. I'm not following. bpf programs are executing with preempt disabled,
> so smp_call_function_single is supposed to execute when bpf is not running.

Preempt disabled doesn't mean irq disabled, does it? So when bpf prog is
running, the IPI irq for smp_call still may come on that CPU.

Also in current non-percpu hash, the situation exists too between
lookup elem syscall and updating value of element from bpf prog in
SMP.

On Wed, Jan 13, 2016 at 10:56:38PM +0800, Ming Lei wrote:
> On Wed, Jan 13, 2016 at 1:30 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > hmm. I'm not following. bpf programs are executing with preempt disabled,
> > so smp_call_function_single is supposed to execute when bpf is not running.
>
> Preempt disabled doesn't mean irq disabled, does it? So when bpf prog is
> running, the IPI irq for smp_call still may come on that CPU.

In case of kprobes irqs are disabled, but yeah for sockets smp_call won't help.
Can probably use schedule_work_on(), but that's too heavy.
I guess we need bpf_map_lookup_and_delete_elem() syscall command, so we can
delete single pointer out of per-cpu hash map and in call_rcu() copy precise
counters.

> Also in current non-percpu hash, the situation exists too between
> lookup elem syscall and updating value of element from bpf prog in
> SMP.

looks like regular bpf_map_lookup_elem() syscall will return inaccurate data
even for per-cpu hash. hmm. we need to brainstorm more on it.

On Thu, Jan 14, 2016 at 9:19 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> On Wed, Jan 13, 2016 at 10:56:38PM +0800, Ming Lei wrote:
>> On Wed, Jan 13, 2016 at 1:30 PM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> > On Wed, Jan 13, 2016 at 11:17:23AM +0800, Ming Lei wrote:
>> >> On Wed, Jan 13, 2016 at 10:22 AM, Martin KaFai Lau <kafai@fb.com> wrote:
>> >> > On Wed, Jan 13, 2016 at 08:38:18AM +0800, Ming Lei wrote:
>> >> >> > The userspace usually only aggregates value across all cpu every X seconds.
>> >> >>
>> >> >> That is just in your case, and Alexei worried the issue of data stale.
>> >> > I believe we are talking about validity of a value. How to
>> >> > make use of a less-stale but invalid data?
>> >>
>> >> About the 'invalidity' thing, it should be same between using
>> >> smp_call(run in IPI irq handler) and simple memcpy().
>> >>
>> >> When smp_call_function_single() is used to request to lookup element in
>> >> the specific CPU, the value of the element may be in updating in that CPU
>> >> and not completed yet in eBPF prog, then IPI comes and half updated
>> >> data is still returned to syscall.
>> >
>> > hmm. I'm not following. bpf programs are executing with preempt disabled,
>> > so smp_call_function_single suppose to execute when bpf is not running.
>>
>> Preempt disabled doesn't mean irq disabled, does it? So when bpf prog is
>> running, the IPI irq for smp_call still may come on that CPU.
>
> In case of kprobes irqs are disabled, but yeah for sockets smp_call won't help.

From 'Documentation/kprobes.txt', looks irqs aren't disabled always, see below:

    Probe handlers are run with preemption disabled. Depending on the
    architecture and optimization state, handlers may also run with
    interrupts disabled (e.g., kretprobe handlers and optimized kprobe
    handlers run without interrupt disabled on x86/x86-64).

> Can probably use schedule_work_on(), but that's too heavy.
> I guess we need bpf_map_lookup_and_delete_elem() syscall command, so we can
> delete single pointer out of per-cpu hash map and in call_rcu() copy precise
> counters.

The partial update is one generic issue, not only on percpu map.

>
>> Also in current non-percpu hash, the situation exists too between
>> lookup elem syscall and updating value of element from bpf prog in
>> SMP.
>
> looks like regular bpf_map_lookup_elem() syscall will return inaccurate data
> even for per-cpu hash. hmm. we need to brain storm more on it.

That is the reason I don't like smp_call now, since the issue is generic
and not only on percpu map.

But any generic protection might introduce some cost in updating path
from eBPF prog, which we don't like too.

The partial update only exists when one element holds more than one
counter, or one element holds one 64bit counter on 32bit machine (which can
be thought as double counter too).

1) single counter case

- if the counter in the element may be updated concurrently, the counter
  has to be updated with atomic operation in prog, and that is percpu map's
  value to avoid the atomic operation
- now no protection is needed since the updating on the element is atomic

2) multiple counter case

- lots of protection can be used, such as per-element rw-spin, percpu lock,
  srcu, ..., but each one may introduce cost in update path of prog.
- prog code can choose if they want precise counting with the extra cost.
- the lock mechanism can be provided by bpf helpers

Thanks,
Ming Lei
On Thu, Jan 14, 2016 at 10:42:44AM +0800, Ming Lei wrote:
> >
> > In case of kprobes irqs are disabled, but yeah for sockets smp_call won't help.
>
> From 'Documentation/kprobes.txt', looks irqs aren't disabled always, see below:
>
> Probe handlers are run with preemption disabled. Depending on the
> architecture and optimization state, handlers may also run with
> interrupts disabled (e.g., kretprobe handlers and optimized kprobe
> handlers run without interrupt disabled on x86/x86-64).

bpf tracing progs go through ftrace that disables irqs even for
optimized kprobes on x64.
but yeah, there could be an arch that doesn't do it
and long term we probably want to do something about it on x64 as well.
tracepoints+bpf will be with irqs on as well.

> 2) multiple counter case
>
> - lots of protection can be used, such as per-element rw-spin, percpu lock,
> srcu, ..., but each one may introduce cost in update path of prog.
> - the lock mechanism can be provided by bpf helpers

The above techniques cannot be easily used with bpf progs, since it would
require very significant additions to verifier.
Say we introduce a helper that takes some hidden lock and increments
the counter which is part of map element value. What will you pass into it?
An address of the counter? How verifier can statically check it?
Theoretically it's doable, it's quite complex and run-time performance
would be bad if we have to do lock,++,unlock for every counter.
Existing bpf_xadd insn is likely going to be faster despite cache line
bouncing comparing to per-cpu lock,++,unlock

from your other email:
> 3) if we use syscall to implement Ri(i=1...3), the period between T(i)
> and T(i+1)
> can become quite big, for example dozens of seconds, so the accumulated value
> in A4 can't represent the actual/correct value(counter) at any time between T0
> and T4, and the value is wrong actually, and all events in above diagram
> (E0(0)~E0(2M), E1(0)~E1(1M), E2(0) .... E2(10K), ...) aren't counted at all,
> and the missed number can be quite huge.
> So does the value got by A4 make sense for user?

yes it does. In your example it's number of packets received.
Regardless how slow or fast the per-cpu loop is the aggregate value is still valid.
The kernel is full of loops like:
    for_each_possible_cpu(cpu) {
        struct stats *pcpu = per_cpu_ptr(stats, cpu);

        sum1 += pcpu->cnt1;
        sum2 += pcpu->cnt2;
    }
and they compute valid values.
It doesn't matter how slow or fast that loop is.
Obviously the faster it is the more accurate the aggregates will be,
but one can add mdelay() after each iteration and it's still valid.

Anyway, me and Martin had a discussion offline about this. To summarize:
. smp_call() is not a good approach, since it works only kprobe+bpf
. disable irqs for socket-style bpf programs is not an option either, since
  pushf/popf adds unnecessary overhead and having irqs off for the life of
  the program is bad
. optional irq off for bpf progs that use per-cpu maps is just as bad
. we can do bpf_map_lookup_and_delete() technique for per-cpu hash maps:
  delete elem and do for_each_possible_cpu() { copy values into buffer }
  from call_rcu() callback, but it needs extra sync wait logic in syscall,
  so complexity is probably not worth the gain, though nice that
  it's generic and works on all archs
. we can do for_each_possible_cpu() {atomic_long_memcpy of values} in
  bpf_map_lookup() syscall. since we know that hash map values are always
  8 byte aligned, atomic_long_memcpy() will be a loop of explicit
  4-byte or 8-byte copies on 32-bit and 64-bit archs respectively.
  User space would need to provide value_size*max_cpus buffer, which will
  be partially filled by kernel due to holes in possible_cpus mask.
  For #1 'counter' use case the userspace can bzero() the buffer
  and aggregate all slots ignoring possible holes, since they're zero.
  Doing syscall for each cpu is slower, since for 40+ cpus the cost adds up.
. bpf_map_update() becomes similar with atomic_long_memcpy
The most appealing to me is that no new helpers needed and no new
syscall commands. For per-cpu maps bpf_map_lookup/update() from kernel
operates only on this_cpu() and bpf_map_lookup/update() from syscall
use value_size*num_cpus buffer.
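The value_size*max_cpus buffer layout and the word-at-a-time copy summarized above can be sketched in plain userspace C. This is only an illustration under stated assumptions: copy_by_words() is a hypothetical stand-in for the "atomic_long_memcpy" idea (no such kernel function is implied to exist), sum_percpu_counter() is a made-up name, and the buffer is assumed to have been zeroed first so that slots for holes in the possible-CPU mask contribute nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical analogue of "atomic_long_memcpy": copy a value one
 * long-sized word at a time, so each word is read with a single load.
 * Assumption: aligned long-sized loads do not tear on the target; C
 * itself does not promise that. This only protects each word, not
 * consistency across several words of the same value.
 */
void copy_by_words(unsigned long *dst, const volatile unsigned long *src,
		   size_t bytes)
{
	assert(bytes % sizeof(unsigned long) == 0);

	for (size_t i = 0; i < bytes / sizeof(unsigned long); i++)
		dst[i] = src[i];
}

/*
 * Sum one 64-bit counter, located 'offset' bytes into each value,
 * across all per-CPU slots of a value_size * max_cpus buffer.
 * Slots the kernel left untouched must be zero (caller zeroed 'buf').
 */
uint64_t sum_percpu_counter(const void *buf, size_t value_size,
			    size_t max_cpus, size_t offset)
{
	uint64_t total = 0;

	assert(offset + sizeof(uint64_t) <= value_size);

	for (size_t cpu = 0; cpu < max_cpus; cpu++) {
		uint64_t v;

		memcpy(&v, (const char *)buf + cpu * value_size + offset,
		       sizeof(v));
		total += v;
	}
	return total;
}
```

For example, with four slots of two counters each, where only CPUs 0 and 2 were filled, summing at offset 0 and at offset 8 gives the two aggregates without userspace having to know which CPUs are holes.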
On Thu, Jan 14, 2016 at 1:08 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> On Thu, Jan 14, 2016 at 10:42:44AM +0800, Ming Lei wrote:
>> >
>> > In case of kprobes irqs are disabled, but yeah for sockets smp_call won't help.
>>
>> From 'Documentation/kprobes.txt', looks irqs aren't disabled always, see below:
>>
>> Probe handlers are run with preemption disabled. Depending on the
>> architecture and optimization state, handlers may also run with
>> interrupts disabled (e.g., kretprobe handlers and optimized kprobe
>> handlers run without interrupt disabled on x86/x86-64).
>
> bpf tracing progs go through ftrace that disables irqs even for
> optimized kprobes on x64.
> but yeah, there could be an arch that doesn't do it
> and long term we probably want to do something about it on x64 as well.
> tracepoints+bpf will be with irqs on as well.
>
>> 2) multiple counter case
>>
>> - lots of protection can be used, such as per-element rw-spin, percpu lock,
>> srcu, ..., but each one may introduce cost in update path of prog.
>> - the lock mechanism can be provided by bpf helpers
>
> The above techniques cannot be easily used with bpf progs, since it would
> require very significant additions to verifier.
> Say we introduce a helper that takes some hidden lock and increments
> the counter which is part of map element value. What will you pass into it?
> An address of the counter? How verifier can statically check it?
> Theoretically it's doable, it's quite complex and run-time performance
> would be bad if we have to do lock,++,unlock for every counter.
> Existing bpf_xadd insn is likely going to be faster despite cache line
> bouncing comparing to per-cpu lock,++,unlock

There are two simple approaches I thought of:

1) introduce two helpers of lookup_and_lock_element(map, key) &&
unlock_element(map, key)
- the disadvantage is that unlock_element() need one extra lookup
- verifier needn't any change

2) embedded one lock at the head of returned value
- looks a bit ugly
- it is tricky to obtain the lock in kernel(syscall path)
- still needn't verifier's change

Or other ideas?

>
> from your other email:
>> 3) if we use syscall to implement Ri(i=1...3), the period between T(i)
>> and T(i+1)
>> can become quite big, for example dozens of seconds, so the accumulated value
>> in A4 can't represent the actual/correct value(counter) at any time between T0
>> and T4, and the value is wrong actually, and all events in above diagram
>> (E0(0)~E0(2M), E1(0)~E1(1M), E2(0) .... E2(10K), ...) aren't counted at all,
>> and the missed number can be quite huge.
>> So does the value got by A4 make sense for user?
>
> yes it does. In your example it's number of packets received.
> Regardless how slow or fast the per-cpu loop is the aggregate value is still valid.
> The kernel is full of loops like:
>     for_each_possible_cpu(cpu) {
>         struct stats *pcpu = per_cpu_ptr(stats, cpu);
>         sum1 += pcpu->cnt1;
>         sum2 += pcpu->cnt2;
>     }

Yes, that is the way I suggest to use instead of using nr_cpu syscall to
aggregate percpu value, which kind of usage I never see before.

> and they compute valid values.
> It doesn't matter how slow or fast that loop is.

I don't think so, quantity breeds quality.

In syscall path, there are lots of possible and long delay (schedule out,
memory allocation, page in, ...) especially when system is in high loading.

In the above loop kernel is using, none of these possible delays exist.

> Obviously the faster it is the more accurate the aggregates will be,
> but one can add mdelay() after each iteration and it's still valid.

It might be valid, but the aggregates can deviate too much from the correct
value, and it becomes useless, then it doesn't matter about the validity,
does it?

If you don't object, I will try to figure out one patch to support
implementing the aggregate function from bpf prog and we can use the
kernel way to aggregate percpu value, then one single syscall is enough.
Looks one new prog type is needed, but it is very similar with
the sock filter usage.

> Anyway, me and Martin had a discussion offline about this. To summarize:
> . smp_call() is not a good approach, since it works only kprobe+bpf

Yes.

> . disable irqs for socket-style bpf programs is not an option either, since
> pushf/popf adds unnecessary overhead and having irqs off for the life of
> the program is bad

Agree.

> . optional irq off for bpf progs that use per-cpu maps is just as bad

Yes.

> . we can do bpf_map_lookup_and_delete() technique for per-cpu hash maps:
> delete elem and do for_each_possible_cpu() { copy values into buffer }
> from call_rcu() callback, but it needs extra sync wait logic in syscall,

call_rcu() callback often takes long time.

> so complexity is probably not worth the gain, though nice that
> it's generic and works on all archs

I suggest to not consider for percpu only, since it is a generic issue,
and the approach should cover current array/hash map.

> . we can do for_each_possible_cpu() {atomic_long_memcpy of values} in
> bpf_map_lookup() syscall. since we know that hash map values are always
> 8 byte aligned, atomic_long_memcpy() will be a loop of explicit
> 4-byte or 8-byte copies on 32-bit and 64-bit archs respectively.
> User space would need to provide value_size*max_cpus buffer, which will
> be partially filled by kernel due to holes in possible_cpus mask.
> For #1 'counter' use case the userspace can bzero() the buffer
> and aggregate all slots ignoring possible holes, since they're zero.
> Doing syscall for each cpu is slower, since for 40+ cpus the cost adds up.

Exactly, that is why I think it is good to define the aggregate function
into bpf prog code, then we can get the total value in single syscall.

> . bpf_map_update() becomes similar with atomic_long_memcpy
> The most appealing to me is that no new helpers needed and no new

I agree with you about no new helpers.

> syscall commands. For per-cpu maps bpf_map_lookup/update() from kernel

No new syscall means we have to aggregate the value from kernel via
bpf prog.

> operates only on this_cpu() and bpf_map_lookup/update() from syscall
> use value_size*num_cpus buffer.

Thanks,
Ming Lei
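Approach 2) from the message above (a lock embedded at the head of the value) can be sketched in userspace with C11 atomics. This is an analogy only, not kernel or eBPF code: struct locked_value and the value_*() functions are hypothetical names, and the spinlock is the simplest thing that shows the shape of the idea.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical map value with the lock embedded at its head. */
struct locked_value {
	atomic_flag lock;
	uint64_t packets;
	uint64_t bytes;
};

void value_init(struct locked_value *v)
{
	assert(v != NULL);
	atomic_flag_clear(&v->lock);
	v->packets = 0;
	v->bytes = 0;
}

static void value_lock(struct locked_value *v)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&v->lock,
						 memory_order_acquire))
		;
}

static void value_unlock(struct locked_value *v)
{
	atomic_flag_clear_explicit(&v->lock, memory_order_release);
}

/* Writer side: both counters change inside one critical section. */
void value_account(struct locked_value *v, uint64_t len)
{
	value_lock(v);
	v->packets += 1;
	v->bytes += len;
	value_unlock(v);
}

/* Reader side: the two counters are copied out as a consistent pair. */
void value_snapshot(struct locked_value *v, uint64_t *packets,
		    uint64_t *bytes)
{
	value_lock(v);
	*packets = v->packets;
	*bytes = v->bytes;
	value_unlock(v);
}
```

The sketch also shows where the cost raised in the thread comes from: every update pays a lock and an unlock. A spinlock like this would additionally deadlock if the reader could interrupt the writer on the same CPU, which is one way to read the "tricky to obtain the lock in kernel" remark.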
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2658917..63b04c6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -73,6 +73,8 @@ enum bpf_cmd {
 	BPF_PROG_LOAD,
 	BPF_OBJ_PIN,
 	BPF_OBJ_GET,
+	BPF_MAP_LOOKUP_ELEM_PERCPU,
+	BPF_MAP_UPDATE_ELEM_PERCPU,
 };
 
 enum bpf_map_type {
@@ -114,6 +116,7 @@ union bpf_attr {
 			__aligned_u64 next_key;
 		};
 		__u64		flags;
+		__u32		cpu;
 	};
 
 	struct { /* anonymous struct used by BPF_PROG_LOAD command */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 6373970..280c93b 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -231,8 +231,9 @@ static void __user *u64_to_ptr(__u64 val)
 
 /* last field in 'union bpf_attr' used by this command */
 #define BPF_MAP_LOOKUP_ELEM_LAST_FIELD value
+#define BPF_MAP_LOOKUP_ELEM_PERCPU_LAST_FIELD cpu
 
-static int map_lookup_elem(union bpf_attr *attr)
+static int map_lookup_elem(union bpf_attr *attr, bool percpu)
 {
 	void __user *ukey = u64_to_ptr(attr->key);
 	void __user *uvalue = u64_to_ptr(attr->value);
@@ -242,8 +243,14 @@ static int map_lookup_elem(union bpf_attr *attr)
 	struct fd f;
 	int err;
 
-	if (CHECK_ATTR(BPF_MAP_LOOKUP_ELEM))
-		return -EINVAL;
+	if (!percpu) {
+		if (CHECK_ATTR(BPF_MAP_LOOKUP_ELEM))
+			return -EINVAL;
+	} else {
+		if (CHECK_ATTR(BPF_MAP_LOOKUP_ELEM_PERCPU) ||
+		    attr->cpu >= num_possible_cpus())
+			return -EINVAL;
+	}
 
 	f = fdget(ufd);
 	map = __bpf_map_get(f);
@@ -265,7 +272,10 @@ static int map_lookup_elem(union bpf_attr *attr)
 		goto free_key;
 
 	rcu_read_lock();
-	ptr = map->ops->map_lookup_elem(map, key);
+	if (!percpu)
+		ptr = map->ops->map_lookup_elem(map, key);
+	else
+		ptr = map->ops->map_lookup_elem_percpu(map, key, attr->cpu);
 	if (ptr)
 		memcpy(value, ptr, map->value_size);
 	rcu_read_unlock();
@@ -290,8 +300,9 @@ err_put:
 }
 
 #define BPF_MAP_UPDATE_ELEM_LAST_FIELD flags
+#define BPF_MAP_UPDATE_ELEM_PERCPU_LAST_FIELD cpu
 
-static int map_update_elem(union bpf_attr *attr)
+static int map_update_elem(union bpf_attr *attr, bool percpu)
 {
 	void __user *ukey = u64_to_ptr(attr->key);
 	void __user *uvalue = u64_to_ptr(attr->value);
@@ -301,8 +312,14 @@ static int map_update_elem(union bpf_attr *attr)
 	struct fd f;
 	int err;
 
-	if (CHECK_ATTR(BPF_MAP_UPDATE_ELEM))
-		return -EINVAL;
+	if (!percpu) {
+		if (CHECK_ATTR(BPF_MAP_UPDATE_ELEM))
+			return -EINVAL;
+	} else {
+		if (CHECK_ATTR(BPF_MAP_UPDATE_ELEM_PERCPU) ||
+		    attr->cpu >= num_possible_cpus())
+			return -EINVAL;
+	}
 
 	f = fdget(ufd);
 	map = __bpf_map_get(f);
@@ -331,7 +348,12 @@ static int map_update_elem(union bpf_attr *attr)
 	 * therefore all map accessors rely on this fact, so do the same here
 	 */
 	rcu_read_lock();
-	err = map->ops->map_update_elem(map, key, value, attr->flags);
+	if (!percpu)
+		err = map->ops->map_update_elem(map, key, value, attr->flags);
+	else
+		err = map->ops->map_update_elem_percpu(map, key, value,
+				attr->flags, attr->cpu);
+
 	rcu_read_unlock();
 
 free_value:
@@ -772,10 +794,16 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 		err = map_create(&attr);
 		break;
 	case BPF_MAP_LOOKUP_ELEM:
-		err = map_lookup_elem(&attr);
+		err = map_lookup_elem(&attr, false);
+		break;
+	case BPF_MAP_LOOKUP_ELEM_PERCPU:
+		err = map_lookup_elem(&attr, true);
 		break;
 	case BPF_MAP_UPDATE_ELEM:
-		err = map_update_elem(&attr);
+		err = map_update_elem(&attr, false);
+		break;
+	case BPF_MAP_UPDATE_ELEM_PERCPU:
+		err = map_update_elem(&attr, true);
 		break;
 	case BPF_MAP_DELETE_ELEM:
 		err = map_delete_elem(&attr);
Prepare for supporting percpu map in the following patch.

Now userspace can lookup/update mapped value in one specific
CPU in case of percpu map.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 include/uapi/linux/bpf.h |  3 +++
 kernel/bpf/syscall.c     | 48 ++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 41 insertions(+), 10 deletions(-)