Message ID | 20200908020939.7653-1-dust.li@linux.alibaba.com |
---|---|
State | Changes Requested |
Delegated to: | David Miller |
Headers | show |
Series | [v2] net: tracepoint: fix print wrong sysctl_mem value | expand |
From: Dust Li <dust.li@linux.alibaba.com> Date: Tue, 8 Sep 2020 10:09:39 +0800 > @@ -98,7 +98,7 @@ TRACE_EVENT(sock_exceed_buf_limit, > > TP_STRUCT__entry( > __array(char, name, 32) > - __field(long *, sysctl_mem) > + __array(long, sysctl_mem, 3) > __field(long, allocated) > __field(int, sysctl_rmem) > __field(int, rmem_alloc) > @@ -110,7 +110,9 @@ TRACE_EVENT(sock_exceed_buf_limit, > > TP_fast_assign( > strncpy(__entry->name, prot->name, 32); > - __entry->sysctl_mem = prot->sysctl_mem; > + __entry->sysctl_mem[0] = prot->sysctl_mem[0]; > + __entry->sysctl_mem[1] = prot->sysctl_mem[1]; > + __entry->sysctl_mem[2] = prot->sysctl_mem[2]; I can't understand at all why the current code doesn't work. We assign a pointer to entry->sysctl_mem and then print out the three words pointed to by that. It's so wasteful to copy this over every tracepoint entry so the pointer approach is very desirable.
On Tue, Sep 08, 2020 at 07:49:52PM -0700, David Miller wrote: >From: Dust Li <dust.li@linux.alibaba.com> >Date: Tue, 8 Sep 2020 10:09:39 +0800 > >> @@ -98,7 +98,7 @@ TRACE_EVENT(sock_exceed_buf_limit, >> >> TP_STRUCT__entry( >> __array(char, name, 32) >> - __field(long *, sysctl_mem) >> + __array(long, sysctl_mem, 3) >> __field(long, allocated) >> __field(int, sysctl_rmem) >> __field(int, rmem_alloc) >> @@ -110,7 +110,9 @@ TRACE_EVENT(sock_exceed_buf_limit, >> >> TP_fast_assign( >> strncpy(__entry->name, prot->name, 32); >> - __entry->sysctl_mem = prot->sysctl_mem; >> + __entry->sysctl_mem[0] = prot->sysctl_mem[0]; >> + __entry->sysctl_mem[1] = prot->sysctl_mem[1]; >> + __entry->sysctl_mem[2] = prot->sysctl_mem[2]; > >I can't understand at all why the current code doesn't work. > >We assign a pointer to entry->sysctl_mem and then print out the >three words pointed to by that. > >It's so wasteful to copy this over every tracepoint entry so >the pointer approach is very desirable. Thanks for your reply! I took a close look at the code generated by tracepoint and found the problem is not the tracepoint itself, but `perf trace`. My previous output was got by running: `perf trace -e sock:sock_exceed_buf_limit` This time, I tried directly read from /sys/kernel/debug/tracing/trace, and everything is right :) So I checked the code of perf tool, and found the foundamatal difference is `perf trace` did the string formatting in the userspace, but raw ftrace did it in the kernel. When using `perf trace`, the kernel passes the string format and the data to perf using the perf ringbuffer, and no one in the kernel will try to visit the pointer sysctl_mem is pointed to, so the the userspace perf got the original pointer of sysctl_mem and tries to do the formating, which result in the wrong data in the commit log. The key call trace when using `perf trace` is this: trace_sock_exceed_buf_limit() --> perf_trace_sock_exceed_buf_limit() { ... perf_fetch_caller_regs(__regs); { strncpy(entry->name, prot->name, 32); entry->sysctl_mem = prot->sysctl_mem; entry->allocated = allocated; entry->sysctl_rmem = sk_get_rmem0(sk, prot); entry->rmem_alloc = atomic_read(&sk->sk_backlog.rmem_alloc); entry->sysctl_wmem = sk_get_wmem0(sk, prot); entry->wmem_alloc = refcount_read(&sk->sk_wmem_alloc); entry->wmem_queued = sk->sk_wmem_queued; entry->kind = kind; } perf_trace_run_bpf_submit(entry, __entry_size, rctx, event_call, \ __count, __regs, head, __task); } Here *entry* is directly passed in to perf_trace_run_bpf_submit() as raw data, and perf_trace_run_bpf_submit() won't do string formatting but just pass them to the userspace perf, which will finnally did the formatting, but it's already too late to get sysctl_mem[x]. So, any pointer dereference in tracepoint entry should failed in `perf trace`. Thanks. Dust.Li
diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h index a966d4b5ab37..914e58938480 100644 --- a/include/trace/events/sock.h +++ b/include/trace/events/sock.h @@ -98,7 +98,7 @@ TRACE_EVENT(sock_exceed_buf_limit, TP_STRUCT__entry( __array(char, name, 32) - __field(long *, sysctl_mem) + __array(long, sysctl_mem, 3) __field(long, allocated) __field(int, sysctl_rmem) __field(int, rmem_alloc) @@ -110,7 +110,9 @@ TRACE_EVENT(sock_exceed_buf_limit, TP_fast_assign( strncpy(__entry->name, prot->name, 32); - __entry->sysctl_mem = prot->sysctl_mem; + __entry->sysctl_mem[0] = prot->sysctl_mem[0]; + __entry->sysctl_mem[1] = prot->sysctl_mem[1]; + __entry->sysctl_mem[2] = prot->sysctl_mem[2]; __entry->allocated = allocated; __entry->sysctl_rmem = sk_get_rmem0(sk, prot); __entry->rmem_alloc = atomic_read(&sk->sk_rmem_alloc);
sysctl_mem is an point, and tracepoint entry do not support been visited like an array. Use an __array(3) to store sysctl_mem instead. tracpoint output with and without this fix: - without fix: 28821.074 sock:sock_exceed_buf_limit:proto:UDP sysctl_mem=-1741233440,19,322156906942464 allocated=19 sysctl_rmem=4096 rmem_alloc=75008 sysctl_wmem=4096 wmem_alloc=1 wmem_queued=0 kind=SK_MEM_RECV - with fix: 2126.136 sock:sock_exceed_buf_limit:proto:UDP sysctl_mem=18,122845,184266 allocated=19 sysctl_rmem=4096 rmem_alloc=73728 sysctl_wmem=4096 wmem_alloc=1 wmem_queued=0 kind=SK_MEM_RECV v2: use __array(3) instead of 3 long type to store sysctl_mem Fixes: 3847ce32aea9fdf ("core: add tracepoints for queueing skb to rcvbuf") Signed-off-by: Dust Li <dust.li@linux.alibaba.com> --- include/trace/events/sock.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)