[5/9] bpf: syscall: add percpu version of lookup/update elem

Message ID 1452527821-12276-6-git-send-email-tom.leiming@gmail.com
State Deferred, archived
Delegated to: David Miller

Commit Message

Ming Lei Jan. 11, 2016, 3:56 p.m. UTC
Prepare for supporting percpu map in the following patch.

Now userspace can lookup/update mapped value in one specific
CPU in case of percpu map.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 include/uapi/linux/bpf.h |  3 +++
 kernel/bpf/syscall.c     | 48 ++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 41 insertions(+), 10 deletions(-)

Comments

Alexei Starovoitov Jan. 11, 2016, 7:02 p.m. UTC | #1
On Mon, Jan 11, 2016 at 11:56:57PM +0800, Ming Lei wrote:
> Prepare for supporting percpu map in the following patch.
> 
> Now userspace can lookup/update mapped value in one specific
> CPU in case of percpu map.
> 
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
...
> @@ -265,7 +272,10 @@ static int map_lookup_elem(union bpf_attr *attr)
>  		goto free_key;
>  
>  	rcu_read_lock();
> -	ptr = map->ops->map_lookup_elem(map, key);
> +	if (!percpu)
> +		ptr = map->ops->map_lookup_elem(map, key);
> +	else
> +		ptr = map->ops->map_lookup_elem_percpu(map, key, attr->cpu);

I think this approach is less potent than Martin's for several reasons:
- bpf program shouldn't be supplying bpf_smp_processor_id(), since
  it's error prone and a bit slower than doing it explicitly as in:
  http://patchwork.ozlabs.org/patch/564482/
  although Martin's patch also needs to use this_cpu_ptr() instead
  of per_cpu_ptr(.., smp_processor_id());

- two new bpf helpers are not necessary in Martin's approach.
  regular map_lookup_elem() will work for both per-cpu maps.

- such map_lookup_elem_percpu() from syscall is not accurate.
  Martin's approach via smp_call_function_single() returns precise value,
  whereas here memcpy() will race with other cpus.
 
Overall I think both per-cpu hash and per-cpu array maps are quite useful.
For this particular set I would suggest to rebase on top of Martin's
to reuse BPF_MAP_LOOKUP_PERCPU_ELEM command that should be applicable
to both per-cpu array and per-cpu hash maps.
and add BPF_MAP_UPDATE_PERCPU_ELEM via smp_call as another patch
that should work for both as well.
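
To illustrate the this_cpu_ptr() versus per_cpu_ptr(.., smp_processor_id()) point above, here is a minimal user-space model in C. It is only a sketch under stated assumptions: `struct percpu_long`, `model_this_cpu_ptr()`, `model_per_cpu_ptr()`, `model_current_cpu` and `NR_CPUS_MODEL` are hypothetical names made up for illustration, not kernel APIs, and a plain array stands in for a real per-cpu allocation.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS_MODEL 4 /* hypothetical CPU count for this model */

/* One value slot per CPU, standing in for a kernel per-cpu allocation. */
struct percpu_long {
	long slot[NR_CPUS_MODEL];
};

/* Hypothetical stand-in for "the CPU this code is running on". */
static int model_current_cpu;

/* Models per_cpu_ptr(ptr, cpu): the caller must name the CPU explicitly. */
static long *model_per_cpu_ptr(struct percpu_long *p, int cpu)
{
	assert(p != NULL);
	assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
	return &p->slot[cpu];
}

/* Models this_cpu_ptr(ptr): the CPU is implied, so it cannot be wrong. */
static long *model_this_cpu_ptr(struct percpu_long *p)
{
	assert(p != NULL);
	return &p->slot[model_current_cpu];
}
```

In this model a program-side helper would only ever call model_this_cpu_ptr(), so it has no way to pass a wrong CPU number; the explicit-CPU form is left for the syscall side.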
Ming Lei Jan. 12, 2016, 5 a.m. UTC | #2
Hi Alexei,

Thanks for your review.

On Tue, Jan 12, 2016 at 3:02 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Mon, Jan 11, 2016 at 11:56:57PM +0800, Ming Lei wrote:
>> Prepare for supporting percpu map in the following patch.
>>
>> Now userspace can lookup/update mapped value in one specific
>> CPU in case of percpu map.
>>
>> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> ...
>> @@ -265,7 +272,10 @@ static int map_lookup_elem(union bpf_attr *attr)
>>               goto free_key;
>>
>>       rcu_read_lock();
>> -     ptr = map->ops->map_lookup_elem(map, key);
>> +     if (!percpu)
>> +             ptr = map->ops->map_lookup_elem(map, key);
>> +     else
>> +             ptr = map->ops->map_lookup_elem_percpu(map, key, attr->cpu);
>
> I think this approach is less potent than Martin's for several reasons:
> - bpf program shouldn't be supplying bpf_smp_processor_id(), since
>   it's error prone and a bit slower than doing it explicitly as in:
>   http://patchwork.ozlabs.org/patch/564482/
>   although Martin's patch also needs to use this_cpu_ptr() instead
>   of per_cpu_ptr(.., smp_processor_id());

For PERCPU map, smp_processor_id() is definitely required, and
Martin's patch needs that too, please see htab_percpu_map_lookup_elem()
in his patch.

>
> - two new bpf helpers are not necessary in Martin's approach.
>   regular map_lookup_elem() will work for both per-cpu maps.

For percpu ARRAY, they are not necessary, but it is flexible to
provide them since we should allow prog to retrieve the percpu
value, also it is easier to implement the system call with the two
helpers.

For percpu HASH, they are required since eBPF prog needs to support
deleting element, so we have to provide these helpers for prog to retrieve
percpu value before deleting the elem.

>
> - such map_lookup_elem_percpu() from syscall is not accurate.
>   Martin's approach via smp_call_function_single() returns precise value,

I don't understand why Martin's approach is precise and my patch isn't,
could you explain it a bit?

>   whereas here memcpy() will race with other cpus.
>
> Overall I think both per-cpu hash and per-cpu array maps are quite useful.

percpu hash isn't a must, since we can get a similar effect with less
memory consumption by using {real_key, cpu_id} as the key, but we can
introduce that.

> For this particular set I would suggest to rebase on top of Martin's
> to reuse BPF_MAP_LOOKUP_PERCPU_ELEM command that should be applicable
> to both per-cpu array and per-cpu hash maps.

Martin's patch doesn't introduce the two helpers, which are required for percpu
hash, and which also make the syscall easier to implement.

> and add BPF_MAP_UPDATE_PERCPU_ELEM via smp_call as another patch
> that should work for both as well.




Thanks,
Ming Lei
Alexei Starovoitov Jan. 12, 2016, 5:49 a.m. UTC | #3
On Tue, Jan 12, 2016 at 01:00:00PM +0800, Ming Lei wrote:
> Hi Alexei,
> 
> Thanks for your review.
> 
> On Tue, Jan 12, 2016 at 3:02 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Mon, Jan 11, 2016 at 11:56:57PM +0800, Ming Lei wrote:
> >> Prepare for supporting percpu map in the following patch.
> >>
> >> Now userspace can lookup/update mapped value in one specific
> >> CPU in case of percpu map.
> >>
> >> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> > ...
> >> @@ -265,7 +272,10 @@ static int map_lookup_elem(union bpf_attr *attr)
> >>               goto free_key;
> >>
> >>       rcu_read_lock();
> >> -     ptr = map->ops->map_lookup_elem(map, key);
> >> +     if (!percpu)
> >> +             ptr = map->ops->map_lookup_elem(map, key);
> >> +     else
> >> +             ptr = map->ops->map_lookup_elem_percpu(map, key, attr->cpu);
> >
> > I think this approach is less potent than Martin's for several reasons:
> > - bpf program shouldn't be supplying bpf_smp_processor_id(), since
> >   it's error prone and a bit slower than doing it explicitly as in:
> >   http://patchwork.ozlabs.org/patch/564482/
> >   although Martin's patch also needs to use this_cpu_ptr() instead
> >   of per_cpu_ptr(.., smp_processor_id());
> 
> For PERCPU map, smp_processor_id() is definitely required, and
> Martin's patch need that too, please see htab_percpu_map_lookup_elem()
> in his patch.

hmm. it's definitely _not_ required. right?
bpf programs shouldn't be accessing other per-cpu regions,
only their own. That's what this_cpu_ptr is for.
I don't see a case where accessing another cpu's per-cpu element
wouldn't be a bug in the program.

> > - two new bpf helpers are not necessary in Martin's approach.
> >   regular map_lookup_elem() will work for both per-cpu maps.
> 
> For percpu ARRAY, they are not necessary, but it is flexible to
> provide them since we should allow prog to retrieve the percpu
> value, also it is easier to implement the system call with the two
> helpers.
> 
> For percpu HASH, they are required since eBPF prog needs to support
> deleting element, so we have to provide these helpers for prog to retrieve
> percpu value before deleting the elem.

bpf programs cannot have loops, so there is no valid case to access
other cpu element, since program cannot aggregate all-cpu values.
Therefore the programs can only update/lookup this_cpu element and
delete such element across all cpus.

> > - such map_lookup_elem_percpu() from syscall is not accurate.
> >   Martin's approach via smp_call_function_single() returns precise value,
> 
> I don't understand why Martin's approach is precise and my patch isn't,
> could you explain it a bit?

because simple memcpy() called from syscall will race with lookup/increment
done to this_cpu element on another cpu. To avoid this race the smp_call
is needed, so that memcpy() happens on the cpu that updated the element,
so smp_call's memcpy and bpf program won't touch that cpu value
at the same time and user space will read the correct element values.
If program updates them a lot, the value that user space reads will become
stale very quickly, but it will be valid. That's especially important
when program has multiple counters inside single element value.
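
The race described above can be shown with a deterministic, single-threaded model in C. This is only a sketch: `struct flow_stats` and the helper names are hypothetical, the two-step update stands in for a bpf program bumping two counters in one element, and the invariant "bytes == packets * len" is an assumption chosen so that a torn snapshot is easy to detect.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical element value holding two related counters. */
struct flow_stats {
	unsigned long packets;
	unsigned long bytes;
};

/* First half of an update: count the packet. */
static void update_step_packets(struct flow_stats *v)
{
	assert(v != NULL);
	v->packets += 1;
}

/* Second half of the same update: account for its length. */
static void update_step_bytes(struct flow_stats *v, unsigned long len)
{
	assert(v != NULL);
	v->bytes += len;
}

/* Reader side: an unsynchronized memcpy() of the whole value. */
static struct flow_stats snapshot(const struct flow_stats *v)
{
	struct flow_stats out;

	memcpy(&out, v, sizeof(out));
	return out;
}

/* Returns 1 when the copy satisfies bytes == packets * len, else 0. */
static int consistent(const struct flow_stats *s, unsigned long len)
{
	return s->bytes == s->packets * len;
}
```

A snapshot taken between update_step_packets() and update_step_bytes() fails consistent(), while one taken after both steps passes. That in-between copy is the "invalid" value the discussion is about, as opposed to a merely stale one.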

> >   whereas here memcpy() will race with other cpus.
> >
> > Overall I think both per-cpu hash and per-cpu array maps are quite useful.
> 
> percpu hash isn't a must since we can get similar effect by making real_key
> and cpu_id as key with less memory consumption, but we can introduce that.

I don't think so. bpf programs shouldn't be dealing with smp_processor_id().
It was a poor man's per-cpu hack and it had too many disadvantages.
Like get_next_key() doesn't work properly when key is {key+processor_id},
so walking over hash map to aggregate fake per-cpu elements requires
user space to create another map just for walking.
map->max_entries limit becomes bogus.
this_cpu_ptr(..) is typically faster than per_cpu_ptr(.., smp_proc_id()).
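
A small C sketch of the {key+processor_id} workaround may make the drawbacks listed above concrete. Everything here is hypothetical and for illustration only: `struct fake_percpu_key`, `entries_needed()` and `sum_logical_key()` are invented names, and a flat array stands in for the hash map contents that user space would have to walk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical composite key used to fake per-cpu storage in a plain hash. */
struct fake_percpu_key {
	uint32_t real_key;
	uint32_t cpu_id;
};

/* Map capacity needed so that every CPU can hold every logical key. */
static uint64_t entries_needed(uint32_t logical_keys, uint32_t nr_cpus)
{
	assert(nr_cpus > 0);
	return (uint64_t)logical_keys * nr_cpus;
}

/* User space has to filter and sum the per-cpu copies of one logical key. */
static uint64_t sum_logical_key(const struct fake_percpu_key *keys,
				const uint64_t *vals, size_t n,
				uint32_t real_key)
{
	uint64_t total = 0;

	for (size_t i = 0; i < n; i++)
		if (keys[i].real_key == real_key)
			total += vals[i];
	return total;
}
```

With this layout max_entries no longer counts logical keys: holding 1024 keys on 8 CPUs needs entries_needed(1024, 8) slots, which is why the limit stops being meaningful.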
Ming Lei Jan. 12, 2016, 11:05 a.m. UTC | #4
On Tue, Jan 12, 2016 at 1:49 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Tue, Jan 12, 2016 at 01:00:00PM +0800, Ming Lei wrote:
>> Hi Alexei,
>>
>> Thanks for your review.
>>
>> On Tue, Jan 12, 2016 at 3:02 AM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> > On Mon, Jan 11, 2016 at 11:56:57PM +0800, Ming Lei wrote:
>> >> Prepare for supporting percpu map in the following patch.
>> >>
>> >> Now userspace can lookup/update mapped value in one specific
>> >> CPU in case of percpu map.
>> >>
>> >> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
>> > ...
>> >> @@ -265,7 +272,10 @@ static int map_lookup_elem(union bpf_attr *attr)
>> >>               goto free_key;
>> >>
>> >>       rcu_read_lock();
>> >> -     ptr = map->ops->map_lookup_elem(map, key);
>> >> +     if (!percpu)
>> >> +             ptr = map->ops->map_lookup_elem(map, key);
>> >> +     else
>> >> +             ptr = map->ops->map_lookup_elem_percpu(map, key, attr->cpu);
>> >
>> > I think this approach is less potent than Martin's for several reasons:
>> > - bpf program shouldn't be supplying bpf_smp_processor_id(), since
>> >   it's error prone and a bit slower than doing it explicitly as in:
>> >   http://patchwork.ozlabs.org/patch/564482/
>> >   although Martin's patch also needs to use this_cpu_ptr() instead
>> >   of per_cpu_ptr(.., smp_processor_id());
>>
>> For PERCPU map, smp_processor_id() is definitely required, and
>> Martin's patch need that too, please see htab_percpu_map_lookup_elem()
>> in his patch.
>
> hmm. it's definitely _not_ required. right?
> bpf programs shouldn't be accessing other per-cpu regions
> only their own. That's what this_cpu_ptr is for.
> I don't see a case where accessing other cpu per-cpu element
> wouldn't be a bug in the program.
>
>> > - two new bpf helpers are not necessary in Martin's approach.
>> >   regular map_lookup_elem() will work for both per-cpu maps.
>>
>> For percpu ARRAY, they are not necessary, but it is flexible to
>> provide them since we should allow prog to retrieve the percpu
>> value, also it is easier to implement the system call with the two
>> helpers.
>>
>> For percpu HASH, they are required since eBPF prog needs to support
>> deleting element, so we have to provide these helpers for prog to retrieve
>> percpu value before deleting the elem.
>
> bpf programs cannot have loops, so there is no valid case to access
> other cpu element, since program cannot aggregate all-cpu values.
> Therefore the programs can only update/lookup this_cpu element and
> delete such element across all cpus.

Looks like I missed the looping constraint, so basically the delete element
helper doesn't make sense in percpu hash.

>
>> > - such map_lookup_elem_percpu() from syscall is not accurate.
>> >   Martin's approach via smp_call_function_single() returns precise value,
>>
>> I don't understand why Martin's approach is precise and my patch isn't,
>> could you explain it a bit?
>
> because simple memcpy() called from syscall will race with lookup/increment
> done to this_cpu element on another cpu. To avoid this race the smp_call
> is needed, so that memcpy() happens on the cpu that updated the element,
> so smp_call's memcpy and bpf program won't touch that cpu value
> at the same time and user space will read the correct element values.
> If program updates them a lot, the value that user space reads will become
> stale very quickly, but it will be valid. That's especially important
> when program has multiple counters inside single element value.

But smp_call is often very slow because of IPI, so the value accumulated
finally becomes stale easily even though the value from the requested cpu
is 'precise' at the exact time, especially when there are lots of CPUs, so I
think using smp_call is really a bad idea. And smp_call is worse than
simply iterating over CPUs.

>
>> >   whereas here memcpy() will race with other cpus.
>> >
>> > Overall I think both per-cpu hash and per-cpu array maps are quite useful.
>>
>> percpu hash isn't a must since we can get similar effect by making real_key
>> and cpu_id as key with less memory consumption, but we can introduce that.
>
> I don't think so. bpf programs shouldn't be dealing with smp_processor_id()
> It was poor man's per-cpu hack and it had too many disadvantages.
> Like get_next_key() doesn't work properly when key is {key+processor_id},
> so walking over hash map to aggregate fake per-cpu elements requires
> user space to create another map just for walking.
> map->max_entries limit becomes bogus.
> this_cpu_ptr(..) is typically faster than per_cpu_ptr(.., smp_proc_id())

OK, then this_cpu_ptr() is better since we don't need to access the value
of other CPUs.
Martin KaFai Lau Jan. 12, 2016, 7:10 p.m. UTC | #5
On Tue, Jan 12, 2016 at 07:05:47PM +0800, Ming Lei wrote:
> But smp_call is often very slow because of IPI, so the value accumulated
> finally becomes stale easily even though the value from the requested cpu
> is 'precise' at the exact time, especially when there are lots of CPUs, so I
> think using smp_call is really a bad idea. And smp_call is worse than
> simply iterating over CPUs.
Userspace usually only aggregates the values across all cpus every X seconds.
I would hardly consider data that is a few microseconds old to be stale.
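
The periodic aggregation mentioned above amounts to summing one value per CPU in user space. A minimal C sketch, assuming the per-cpu values have already been copied into a plain array (how they are fetched from the map is not modeled here, and `sum_percpu()` is a hypothetical helper name):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum the per-cpu copies of one counter; vals[i] is the value seen on CPU i. */
static uint64_t sum_percpu(const uint64_t *vals, size_t nr_cpus)
{
	uint64_t total = 0;

	assert(vals != NULL || nr_cpus == 0);
	for (size_t i = 0; i < nr_cpus; i++)
		total += vals[i];
	return total;
}
```

A monitoring tool would call something like this once per reporting interval, which is why a delay of a few microseconds per CPU is small compared to the interval itself.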
Ming Lei Jan. 13, 2016, 12:38 a.m. UTC | #6
On Wed, Jan 13, 2016 at 3:10 AM, Martin KaFai Lau <kafai@fb.com> wrote:
> On Tue, Jan 12, 2016 at 07:05:47PM +0800, Ming Lei wrote:
>> But smp_call is often very slow because of IPI, so the value accumulated
>> finally becomes stale easily even though the value from the requested cpu
>> is 'precise' at the exact time, especially when there are lots of CPUs, so I
>> think using smp_call is really a bad idea. And smp_call is worse than
>> simply iterating over CPUs.
> The userspace usually only aggregates value across all cpu every X seconds.

That is just your case, and Alexei was worried about the data being stale.

> I hardly consider some number of micro-seconds old data is stale.

Firstly, a CPU can do a huge amount of work in a few microseconds; for
example, the interface's irq may arrive during that period.

Secondly, the time can become longer (maybe dozens of microseconds, or even
milliseconds) if the number of CPUs is very large.

So why not do it in the quick way?
Martin KaFai Lau Jan. 13, 2016, 2:22 a.m. UTC | #7
On Wed, Jan 13, 2016 at 08:38:18AM +0800, Ming Lei wrote:
> On Wed, Jan 13, 2016 at 3:10 AM, Martin KaFai Lau <kafai@fb.com> wrote:
> > On Tue, Jan 12, 2016 at 07:05:47PM +0800, Ming Lei wrote:
> > The userspace usually only aggregates value across all cpu every X seconds.
>
> That is just in your case, and Alexei worried the issue of data stale.
I believe we are talking about the validity of a value.  How can we
make use of less-stale but invalid data?
Ming Lei Jan. 13, 2016, 3:17 a.m. UTC | #8
On Wed, Jan 13, 2016 at 10:22 AM, Martin KaFai Lau <kafai@fb.com> wrote:
> On Wed, Jan 13, 2016 at 08:38:18AM +0800, Ming Lei wrote:
>> > The userspace usually only aggregates value across all cpu every X seconds.
>>
>> That is just in your case, and Alexei worried the issue of data stale.
> I believe we are talking about validity of a value.  How to
> make use of a less-stale but invalid data?

About the 'invalidity' thing, it should be the same between using
smp_call (run in the IPI irq handler) and simple memcpy().

When smp_call_function_single() is used to request a lookup of an element on
a specific CPU, the eBPF prog on that CPU may be in the middle of updating
the element's value; the IPI then arrives and half-updated data is still
returned to the syscall.


Thanks,
Ming Lei
Alexei Starovoitov Jan. 13, 2016, 5:30 a.m. UTC | #9
On Wed, Jan 13, 2016 at 11:17:23AM +0800, Ming Lei wrote:
> On Wed, Jan 13, 2016 at 10:22 AM, Martin KaFai Lau <kafai@fb.com> wrote:
> > On Wed, Jan 13, 2016 at 08:38:18AM +0800, Ming Lei wrote:
> >> > The userspace usually only aggregates value across all cpu every X seconds.
> >>
> >> That is just in your case, and Alexei worried the issue of data stale.
> > I believe we are talking about validity of a value.  How to
> > make use of a less-stale but invalid data?
> 
> About the 'invalidity' thing, it should be same between using
> smp_call(run in IPI irq handler) and simple memcpy().
> 
> When smp_call_function_single() is used to request to lookup element in
> the specific CPU, the value of the element may be in updating in that CPU
> and not completed yet in eBPF prog, then IPI comes and half updated
> data is still returned to syscall.

hmm. I'm not following. bpf programs are executing with preempt disabled,
so smp_call_function_single is supposed to execute when bpf is not running.
Ming Lei Jan. 13, 2016, 2:56 p.m. UTC | #10
On Wed, Jan 13, 2016 at 1:30 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Wed, Jan 13, 2016 at 11:17:23AM +0800, Ming Lei wrote:
>> On Wed, Jan 13, 2016 at 10:22 AM, Martin KaFai Lau <kafai@fb.com> wrote:
>> > On Wed, Jan 13, 2016 at 08:38:18AM +0800, Ming Lei wrote:
>> >> > The userspace usually only aggregates value across all cpu every X seconds.
>> >>
>> >> That is just in your case, and Alexei worried the issue of data stale.
>> > I believe we are talking about validity of a value.  How to
>> > make use of a less-stale but invalid data?
>>
>> About the 'invalidity' thing, it should be same between using
>> smp_call(run in IPI irq handler) and simple memcpy().
>>
>> When smp_call_function_single() is used to request to lookup element in
>> the specific CPU, the value of the element may be in updating in that CPU
>> and not completed yet in eBPF prog, then IPI comes and half updated
>> data is still returned to syscall.
>
> hmm. I'm not following. bpf programs are executing with preempt disabled,
> so smp_call_function_single suppose to execute when bpf is not running.

Preempt disabled doesn't mean irq disabled, does it?  So when bpf prog is
running, the IPI irq for smp_call still may come on that CPU.

Also, in the current non-percpu hash, the same situation exists on SMP
between the lookup elem syscall and a bpf prog updating the value of the
element.
Alexei Starovoitov Jan. 14, 2016, 1:19 a.m. UTC | #11
On Wed, Jan 13, 2016 at 10:56:38PM +0800, Ming Lei wrote:
> On Wed, Jan 13, 2016 at 1:30 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Wed, Jan 13, 2016 at 11:17:23AM +0800, Ming Lei wrote:
> >> On Wed, Jan 13, 2016 at 10:22 AM, Martin KaFai Lau <kafai@fb.com> wrote:
> >> > On Wed, Jan 13, 2016 at 08:38:18AM +0800, Ming Lei wrote:
> >> >> > The userspace usually only aggregates value across all cpu every X seconds.
> >> >>
> >> >> That is just in your case, and Alexei worried the issue of data stale.
> >> > I believe we are talking about validity of a value.  How to
> >> > make use of a less-stale but invalid data?
> >>
> >> About the 'invalidity' thing, it should be same between using
> >> smp_call(run in IPI irq handler) and simple memcpy().
> >>
> >> When smp_call_function_single() is used to request to lookup element in
> >> the specific CPU, the value of the element may be in updating in that CPU
> >> and not completed yet in eBPF prog, then IPI comes and half updated
> >> data is still returned to syscall.
> >
> > hmm. I'm not following. bpf programs are executing with preempt disabled,
> > so smp_call_function_single suppose to execute when bpf is not running.
> 
> Preempt disabled doesn't mean irq disabled, does it?  So when bpf prog is
> running, the IPI irq for smp_call still may come on that CPU.

In case of kprobes irqs are disabled, but yeah for sockets smp_call won't help.
Can probably use schedule_work_on(), but that's too heavy.
I guess we need bpf_map_lookup_and_delete_elem() syscall command, so we can
delete single pointer out of per-cpu hash map and in call_rcu() copy precise
counters.

> Also in current non-percpu hash, the situation exists too between
> lookup elem syscall and updating value of element from bpf prog in
> SMP.

looks like regular bpf_map_lookup_elem() syscall will return inaccurate data
even for per-cpu hash. hmm. we need to brainstorm more on it.
Ming Lei Jan. 14, 2016, 2:42 a.m. UTC | #12
On Thu, Jan 14, 2016 at 9:19 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Wed, Jan 13, 2016 at 10:56:38PM +0800, Ming Lei wrote:
>> On Wed, Jan 13, 2016 at 1:30 PM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> > On Wed, Jan 13, 2016 at 11:17:23AM +0800, Ming Lei wrote:
>> >> On Wed, Jan 13, 2016 at 10:22 AM, Martin KaFai Lau <kafai@fb.com> wrote:
>> >> > On Wed, Jan 13, 2016 at 08:38:18AM +0800, Ming Lei wrote:
>> >> >> > The userspace usually only aggregates value across all cpu every X seconds.
>> >> >>
>> >> >> That is just in your case, and Alexei worried the issue of data stale.
>> >> > I believe we are talking about validity of a value.  How to
>> >> > make use of a less-stale but invalid data?
>> >>
>> >> About the 'invalidity' thing, it should be same between using
>> >> smp_call(run in IPI irq handler) and simple memcpy().
>> >>
>> >> When smp_call_function_single() is used to request to lookup element in
>> >> the specific CPU, the value of the element may be in updating in that CPU
>> >> and not completed yet in eBPF prog, then IPI comes and half updated
>> >> data is still returned to syscall.
>> >
>> > hmm. I'm not following. bpf programs are executing with preempt disabled,
>> > so smp_call_function_single suppose to execute when bpf is not running.
>>
>> Preempt disabled doesn't mean irq disabled, does it?  So when bpf prog is
>> running, the IPI irq for smp_call still may come on that CPU.
>
> In case of kprobes irqs are disabled, but yeah for sockets smp_call won't help.

From 'Documentation/kprobes.txt', it looks like irqs aren't always disabled, see below:

    Probe handlers are run with preemption disabled.  Depending on the
    architecture and optimization state, handlers may also run with
    interrupts disabled (e.g., kretprobe handlers and optimized kprobe
    handlers run without interrupt disabled on x86/x86-64).

> Can probably use schedule_work_on(), but that's too heavy.
> I guess we need bpf_map_lookup_and_delete_elem() syscall command, so we can
> delete single pointer out of per-cpu hash map and in call_rcu() copy precise
> counters.

The partial update is one generic issue, not only on percpu map.

>
>> Also in current non-percpu hash, the situation exists too between
>> lookup elem syscall and updating value of element from bpf prog in
>> SMP.
>
> looks like regular bpf_map_lookup_elem() syscall will return inaccurate data
> even for per-cpu hash. hmm. we need to brain storm more on it.

That is the reason I don't like smp_call now, since the issue is generic
and not only on percpu map.

But any generic protection might introduce some cost in the update path
of the eBPF prog, which we don't like either.

The partial update only exists when one element holds more than one
counter, or one element holds one 64bit counter on a 32bit machine (which
can be thought of as a double counter too).
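
To illustrate the 64bit-on-32bit case, here is a minimal user-space sketch
(not kernel code). The struct and function names are made up for this
example; it only models a non-atomic increment as two word stores and
shows what a reader sees if it runs between them.

```c
#include <stdint.h>

/* A 64-bit counter as a 32-bit CPU stores it: two separate words. */
struct split_counter {
	uint32_t lo;
	uint32_t hi;
};

/* Step 1 of a non-atomic increment: bump the low word, report the carry. */
static int inc_low_word(struct split_counter *c)
{
	c->lo++;
	return c->lo == 0;	/* low word wrapped, carry is pending */
}

/* Step 2: propagate the pending carry into the high word. */
static void inc_high_word(struct split_counter *c, int carry)
{
	if (carry)
		c->hi++;
}

/* What a reader (memcpy in the syscall path, or an IPI handler) observes. */
static uint64_t read_counter(const struct split_counter *c)
{
	return ((uint64_t)c->hi << 32) | c->lo;
}
```

Starting from 0xffffffff, a reader that runs after inc_low_word() but
before inc_high_word() gets 0, which is neither the old value nor the new
value 0x100000000.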

1) single counter case

- if the counter in the element may be updated concurrently, the counter
has to be updated with an atomic operation in the prog, and the value of
a percpu map is that it avoids the atomic operation

- no protection is needed then, since the update of the element is atomic

2) multiple counter case

- lots of protection can be used, such as per-element rw-spin, percpu lock,
srcu, ..., but each one may introduce cost in the update path of the prog.

- prog code can choose if they want precise counting with the extra cost.

- the lock mechanism can be provided by bpf helpers


Thanks,
Ming Lei
Alexei Starovoitov Jan. 14, 2016, 5:08 a.m. UTC | #13
On Thu, Jan 14, 2016 at 10:42:44AM +0800, Ming Lei wrote:
> >
> > In case of kprobes irqs are disabled, but yeah for sockets smp_call won't help.
> 
> From 'Documentation/kprobes.txt', looks irqs aren't disabled always, see blow:
> 
>     Probe handlers are run with preemption disabled.  Depending on the
>     architecture and optimization state, handlers may also run with
>     interrupts disabled (e.g., kretprobe handlers and optimized kprobe
>     handlers run without interrupt disabled on x86/x86-64).

bpf tracing progs go through ftrace that disables irqs even for
optimized kprobes on x64.
but yeah, there could be an arch that doesn't do it
and long term we probably want to do something about it on x64 as well.
tracepoints+bpf will be with irqs on as well.

> 2) multiple counter case
> 
> - lots of protection can be used, such per-element rw-spin, percpu lock,
> srcu, ..., but each each one may introduce cost in update path of prog.
> - the lock mechanism can be provided by bpf helpers

The above techniques cannot be easily used with bpf progs, since it would
require very significant additions to verifier.
Say we introduce a helper that takes some hidden lock and increments
the counter which is part of map element value. What will you pass into it?
An address of the counter? How verifier can statically check it?
Theoretically it's doable, but it's quite complex and run-time performance
would be bad if we have to do lock,++,unlock for every counter.
Existing bpf_xadd insn is likely going to be faster despite cache line
bouncing compared to per-cpu lock,++,unlock

from your other email:
> 3) if we use syscall to implement Ri(i=1...3), the period between T(i)
> and T(i+1)
> can become quite big, for example dozens of seconds, so the accumulated value
> in A4 can't represent the actual/correct value(counter) at any time between T0
> and T4, and the value is wrong actually, and all events in above diagram
> (E0(0)~E0(2M), E1(0)~E1(1M),  E2(0) .... E2(10K), ...) aren't counted at all,
> and the missed number can be quite huge.
> So does the value got by A4 make sense for user?

yes it does. In your example it's number of packets received.
Regardless of how slow or fast the per-cpu loop is, the aggregate value is still valid.
The kernel is full of loops like:
 for_each_possible_cpu(cpu) {
   struct stats *pcpu = per_cpu_ptr(stats, cpu);
   sum1 += pcpu->cnt1;
   sum2 += pcpu->cnt2;
 }
and they compute valid values.
It doesn't matter how slow or fast that loop is.
Obviously the faster it is the more accurate the aggregates will be,
but one can add mdelay() after each iteration and it's still valid.

Anyway, me and Martin had a discussion offline about this. To summarize:
. smp_call() is not a good approach, since it works only for kprobe+bpf
. disabling irqs for socket-style bpf programs is not an option either, since
  pushf/popf adds unnecessary overhead and having irqs off for the life of
  the program is bad
. optional irq off for bpf progs that use per-cpu maps is just as bad
. we can do bpf_map_lookup_and_delete() technique for per-cpu hash maps:
  delete elem and do for_each_possible_cpu() { copy values into buffer }
  from call_rcu() callback, but it needs extra sync wait logic in syscall,
  so complexity is probably not worth the gain, though nice that
  it's generic and works on all archs
. we can do for_each_possible_cpu() {atomic_long_memcpy of values} in
  bpf_map_lookup() syscall. since we know that hash map values are always
  8 byte aligned, atomic_long_memcpy() will be a loop of explicit
  4-byte or 8-byte copies on 32-bit and 64-bit archs respectively.
  User space would need to provide value_size*max_cpus buffer, which will
  be partially filled by kernel due to holes in possible_cpus mask.
  For #1 'counter' use case the userspace can bzero() the buffer
  and aggregate all slots ignoring possible holes, since they're zero.
  Doing syscall for each cpu is slower, since for 40+ cpus the cost adds up.
. bpf_map_update() becomes similar with atomic_long_memcpy
  The most appealing to me is that no new helpers needed and no new
  syscall commands. For per-cpu maps bpf_map_lookup/update() from kernel
  operates only on this_cpu() and bpf_map_lookup/update() from syscall
  use value_size*num_cpus buffer.
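
A minimal user-space sketch of the last two points, under stated
assumptions: long_copy() stands in for the proposed atomic_long_memcpy()
(which is only a name from this thread, not an existing kernel function),
a NULL entry in percpu_vals models a hole in the possible_cpus mask, and
values are treated as arrays of unsigned long.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy value_size bytes one machine word at a time. Each word is fetched
 * with a single load, so a word-sized counter is not seen half-updated
 * (assuming aligned word loads are atomic on the architecture).
 */
static void long_copy(void *dst, const void *src, size_t value_size)
{
	unsigned long *d = dst;
	const volatile unsigned long *s = src;
	size_t i;

	assert(value_size % sizeof(unsigned long) == 0);
	for (i = 0; i < value_size / sizeof(unsigned long); i++)
		d[i] = s[i];
}

/* Kernel side: fill one value_size slot per cpu id, skipping holes. */
static void fill_percpu_buffer(unsigned long *buf,
			       unsigned long *const *percpu_vals,
			       size_t max_cpus, size_t value_size)
{
	size_t words = value_size / sizeof(unsigned long);
	size_t cpu;

	for (cpu = 0; cpu < max_cpus; cpu++) {
		if (!percpu_vals[cpu])
			continue;	/* hole: this cpu id is not possible */
		long_copy(buf + cpu * words, percpu_vals[cpu], value_size);
	}
}

/* User side: sum the first counter of every slot; zeroed holes add 0. */
static unsigned long sum_first_counter(const unsigned long *buf,
				       size_t max_cpus, size_t value_size)
{
	size_t words = value_size / sizeof(unsigned long);
	unsigned long sum = 0;
	size_t cpu;

	for (cpu = 0; cpu < max_cpus; cpu++)
		sum += buf[cpu * words];
	return sum;
}
```

The caller is expected to zero the buffer before the lookup, so the slots
that were skipped as holes do not disturb the sum.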
Ming Lei Jan. 14, 2016, 7:16 a.m. UTC | #14
On Thu, Jan 14, 2016 at 1:08 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Thu, Jan 14, 2016 at 10:42:44AM +0800, Ming Lei wrote:
>> >
>> > In case of kprobes irqs are disabled, but yeah for sockets smp_call won't help.
>>
>> From 'Documentation/kprobes.txt', looks irqs aren't disabled always, see blow:
>>
>>     Probe handlers are run with preemption disabled.  Depending on the
>>     architecture and optimization state, handlers may also run with
>>     interrupts disabled (e.g., kretprobe handlers and optimized kprobe
>>     handlers run without interrupt disabled on x86/x86-64).
>
> bpf tracing progs go through ftrace that disables irqs even for
> optimized kprobes on x64.
> but yeah, there could be an arch that doesn't do it
> and long term we probably want to do something about it on x64 as well.
> tracepoints+bpf will be with irqs on as well.
>
>> 2) multiple counter case
>>
>> - lots of protection can be used, such per-element rw-spin, percpu lock,
>> srcu, ..., but each each one may introduce cost in update path of prog.
>> - the lock mechanism can be provided by bpf helpers
>
> The above techniques cannot be easily used with bpf progs, since it would
> require very significant additions to verifier.
> Say we introduce a helper that takes some hidden lock and increments
> the counter which is part of map element value. What will you pass into it?
> An address of the counter? How verifier can statically check it?
> Theoretically it's doable, it's quite complex and run-time performance
> would be bad if we have to do lock,++,unlock for every counter.
> Existing bpf_xadd insn is likely going to be faster despite cache line
> bouncing comparing to per-cpu lock,++,unlock

There are two simple approaches I thought of:

1) introduce two helpers of lookup_and_lock_element(map, key) &&
unlock_element(map, key)
- the disadvantage is that unlock_element() need one extra lookup
- verifier needn't any change

2) embedded one lock at the head of returned value
- looks a bit ugly
- it is tricky to obtain the lock in kernel(syscall path)
- still needn't verifier's change

Or other ideas?
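
A rough user-space sketch of approach 2), only to show the layout and the
cost in the update path. Everything here is hypothetical: the struct, the
function names, and the use of a C11 atomic_flag as the lock; a kernel
version would look different and still has the syscall-path problem
mentioned above.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Map value with a lock embedded at its head (hypothetical layout). */
struct locked_value {
	atomic_flag lock;
	uint64_t packets;
	uint64_t bytes;
};

static void value_init(struct locked_value *v)
{
	atomic_flag_clear(&v->lock);
	v->packets = 0;
	v->bytes = 0;
}

static void value_lock(struct locked_value *v)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&v->lock,
						 memory_order_acquire))
		;
}

static void value_unlock(struct locked_value *v)
{
	atomic_flag_clear_explicit(&v->lock, memory_order_release);
}

/* Update path: both counters change under the lock. */
static void account(struct locked_value *v, uint64_t len)
{
	value_lock(v);
	v->packets++;
	v->bytes += len;
	value_unlock(v);
}

/* Read path: take a consistent copy of both counters. */
static void snapshot(struct locked_value *v, uint64_t *packets,
		     uint64_t *bytes)
{
	value_lock(v);
	*packets = v->packets;
	*bytes = v->bytes;
	value_unlock(v);
}
```

Every update pays for the lock and unlock, which is the cost in the prog
update path discussed earlier in the thread.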

>
> from your other email:
>> 3) if we use syscall to implement Ri(i=1...3), the period between T(i)
>> and T(i+1)
>> can become quite big, for example dozens of seconds, so the accumulated value
>> in A4 can't represent the actual/correct value(counter) at any time between T0
>> and T4, and the value is wrong actually, and all events in above diagram
>> (E0(0)~E0(2M), E1(0)~E1(1M),  E2(0) .... E2(10K), ...) aren't counted at all,
>> and the missed number can be quite huge.
>> So does the value got by A4 make sense for user?
>
> yes it does. In your example it's number of packets received.
> Regardless how slow or fast the per-cpu loop is the aggreate value is still valid.
> The kernel is full of loops like:
>  for_each_possible_cpu(cpu) {
>    struct stats *pcpu = per_cpu_ptr(stats, cpu);
>    sum1 += pcpu->cnt1;
>    sum2 += pcpu->cnt2;
>  }

Yes, that is the way I suggest using, instead of nr_cpu syscalls to
aggregate the percpu value, a kind of usage I have never seen before.

> and they compute valid values.
> It doesn't matter how slow or fast that loop is.

I don't think so, quantity breeds quality. In the syscall path, there are
lots of possible long delays (schedule out, memory allocation,
page in, ...), especially when the system is under high load. The above
loop the kernel is using has none of these possible delays.

> Obviously the faster it is the more accurate the aggragtes will be,
> but one can add mdelay() after each iteration and it's still valid.

It might be valid, but the aggregates can deviate too much from the
correct value and become useless, and then the validity doesn't
matter, does it?

If you don't object, I will try to figure out one patch to support
implementing the aggregate function from a bpf prog, so we can use
the kernel way to aggregate the percpu value, and then one single syscall
is enough. It looks like one new prog type is needed, but it is very similar
to the sock filter usage.

> Anyway, me and Martin had a discussion offline about this. To summarize:
> . smp_call() is not a good approach, since it works only kprobe+bpf

Yes.

> . disable irqs for socket-style bpf programs is not an options either, since
>   pushf/popf adds unnecessary overhead and having irqs off for the life of
>   the program is bad

Agree.

> . optional irq off for bpf progs that use per-cpu maps is just as bad

Yes.

> . we can do bpf_map_lookup_and_delete() technique for per-cpu hash maps:
>   delete elem and do for_each_possible_cpu() { copy values into buffer }
>   from call_rcu() callback, but it needs extra sync wait logic in syscall,

The call_rcu() callback often takes a long time.

>   so complexity is probably not worth the gain, though nice that
>   it's generic and works on all archs

I suggest not considering percpu only, since it is a generic issue,
and the approach should cover the current array/hash maps too.

> . we can do for_each_possible_cpu() {atomic_long_memcpy of values} in
>   bpf_map_lookup() syscall. since we know that hash map values are always
>   8 byte aligned, atomic_long_memcpy() will be a loop of explicit
>   4-byte or 8-byte copies on 32-bit and 64-bit archs respectively.
>   User space would need to provide value_size*max_cpus buffer, which will
>   be partially filled by kernel due to holes in possible_cpus mask.
>   For #1 'counter' use case the userspace can bzero() the buffer
>   and aggregate all slots ignoring possible holes, since they're zero.
>   Doing syscall for each cpu is slower, since for 40+ cpus the cost adds up.

Exactly, that is why I think it is good to define the aggregate function into
bpf prog code, then we can get the total value in single syscall.

> . bpf_map_update() becomes similar with atomic_long_memcpy
>   The most appealing to me is that no new helpers needed and no new

I agree with you about no new helpers.

>   syscall commands. For per-cpu maps bpf_map_lookup/update() from kernel

No new syscall means we have to aggregate the value from kernel via
bpf prog.

>   operates only on this_cpu() and bpf_map_lookup/update() from syscall
>   use value_size*num_cpus buffer.


Thanks,
Ming Lei

Patch

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2658917..63b04c6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -73,6 +73,8 @@  enum bpf_cmd {
 	BPF_PROG_LOAD,
 	BPF_OBJ_PIN,
 	BPF_OBJ_GET,
+	BPF_MAP_LOOKUP_ELEM_PERCPU,
+	BPF_MAP_UPDATE_ELEM_PERCPU,
 };
 
 enum bpf_map_type {
@@ -114,6 +116,7 @@  union bpf_attr {
 			__aligned_u64 next_key;
 		};
 		__u64		flags;
+		__u32		cpu;
 	};
 
 	struct { /* anonymous struct used by BPF_PROG_LOAD command */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 6373970..280c93b 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -231,8 +231,9 @@  static void __user *u64_to_ptr(__u64 val)
 
 /* last field in 'union bpf_attr' used by this command */
 #define BPF_MAP_LOOKUP_ELEM_LAST_FIELD value
+#define BPF_MAP_LOOKUP_ELEM_PERCPU_LAST_FIELD cpu
 
-static int map_lookup_elem(union bpf_attr *attr)
+static int map_lookup_elem(union bpf_attr *attr, bool percpu)
 {
 	void __user *ukey = u64_to_ptr(attr->key);
 	void __user *uvalue = u64_to_ptr(attr->value);
@@ -242,8 +243,14 @@  static int map_lookup_elem(union bpf_attr *attr)
 	struct fd f;
 	int err;
 
-	if (CHECK_ATTR(BPF_MAP_LOOKUP_ELEM))
-		return -EINVAL;
+	if (!percpu) {
+		if (CHECK_ATTR(BPF_MAP_LOOKUP_ELEM))
+			return -EINVAL;
+	} else {
+		if (CHECK_ATTR(BPF_MAP_LOOKUP_ELEM_PERCPU) ||
+				attr->cpu >= num_possible_cpus())
+			return -EINVAL;
+	}
 
 	f = fdget(ufd);
 	map = __bpf_map_get(f);
@@ -265,7 +272,10 @@  static int map_lookup_elem(union bpf_attr *attr)
 		goto free_key;
 
 	rcu_read_lock();
-	ptr = map->ops->map_lookup_elem(map, key);
+	if (!percpu)
+		ptr = map->ops->map_lookup_elem(map, key);
+	else
+		ptr = map->ops->map_lookup_elem_percpu(map, key, attr->cpu);
 	if (ptr)
 		memcpy(value, ptr, map->value_size);
 	rcu_read_unlock();
@@ -290,8 +300,9 @@  err_put:
 }
 
 #define BPF_MAP_UPDATE_ELEM_LAST_FIELD flags
+#define BPF_MAP_UPDATE_ELEM_PERCPU_LAST_FIELD cpu
 
-static int map_update_elem(union bpf_attr *attr)
+static int map_update_elem(union bpf_attr *attr, bool percpu)
 {
 	void __user *ukey = u64_to_ptr(attr->key);
 	void __user *uvalue = u64_to_ptr(attr->value);
@@ -301,8 +312,14 @@  static int map_update_elem(union bpf_attr *attr)
 	struct fd f;
 	int err;
 
-	if (CHECK_ATTR(BPF_MAP_UPDATE_ELEM))
-		return -EINVAL;
+	if (!percpu) {
+		if (CHECK_ATTR(BPF_MAP_UPDATE_ELEM))
+			return -EINVAL;
+	} else {
+		if (CHECK_ATTR(BPF_MAP_UPDATE_ELEM_PERCPU) ||
+				attr->cpu >= num_possible_cpus())
+			return -EINVAL;
+	}
 
 	f = fdget(ufd);
 	map = __bpf_map_get(f);
@@ -331,7 +348,12 @@  static int map_update_elem(union bpf_attr *attr)
 	 * therefore all map accessors rely on this fact, so do the same here
 	 */
 	rcu_read_lock();
-	err = map->ops->map_update_elem(map, key, value, attr->flags);
+	if (!percpu)
+		err = map->ops->map_update_elem(map, key, value, attr->flags);
+	else
+		err = map->ops->map_update_elem_percpu(map, key, value,
+				attr->flags, attr->cpu);
+
 	rcu_read_unlock();
 
 free_value:
@@ -772,10 +794,16 @@  SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 		err = map_create(&attr);
 		break;
 	case BPF_MAP_LOOKUP_ELEM:
-		err = map_lookup_elem(&attr);
+		err = map_lookup_elem(&attr, false);
+		break;
+	case BPF_MAP_LOOKUP_ELEM_PERCPU:
+		err = map_lookup_elem(&attr, true);
 		break;
 	case BPF_MAP_UPDATE_ELEM:
-		err = map_update_elem(&attr);
+		err = map_update_elem(&attr, false);
+		break;
+	case BPF_MAP_UPDATE_ELEM_PERCPU:
+		err = map_update_elem(&attr, true);
 		break;
 	case BPF_MAP_DELETE_ELEM:
 		err = map_delete_elem(&attr);