Message ID: 157918589199.29301.4419459150054220408.stgit@devnote2
State:      RFC
Series:     tracing: kprobes: Introduce async unregistration
On Thu, 16 Jan 2020 23:44:52 +0900
Masami Hiramatsu <mhiramat@kernel.org> wrote:

> Since the 5 jiffies delay for the optimizer is too
> short to wait for other probes, make it longer,
> like 1 second.

Hi Masami,

Can you explain more *why* 5 jiffies is too short.

Thanks!

-- Steve

> 
> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
> ---
>  kernel/kprobes.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index 0dacdcecc90f..9c6e230852ad 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -469,7 +469,8 @@ static int kprobe_optimizer_queue_update;
>  
>  static void kprobe_optimizer(struct work_struct *work);
>  static DECLARE_DELAYED_WORK(optimizing_work, kprobe_optimizer);
> -#define OPTIMIZE_DELAY 5
> +/* Wait 1 second for starting optimization */
> +#define OPTIMIZE_DELAY HZ
>  
>  /*
>   * Optimize (replace a breakpoint with a jump) kprobes listed on
On Tue, 21 Jan 2020 19:29:05 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 16 Jan 2020 23:44:52 +0900
> Masami Hiramatsu <mhiramat@kernel.org> wrote:
> 
> > Since the 5 jiffies delay for the optimizer is too
> > short to wait for other probes, make it longer,
> > like 1 second.
> 
> Hi Masami,
> 
> Can you explain more *why* 5 jiffies is too short.

Yes, I had introduced this 5 jiffies delay for multiple probe
registration and unregistration like systemtap, which uses an
array-based interface to register/unregister. In that case, 5 jiffies
is enough delay to wait for other kprobe
registrations/unregistrations.

However, since perf and ftrace register/unregister probes one by one
with an RCU synchronization interval, the optimizer is started before
all probes have finished registering/unregistering.
And the optimizer holds kprobe_mutex for a while -- for the RCU-tasks
synchronization. Since kprobe_mutex is also involved in disabling
kprobes, this also stalls probe-event disabling.

Maybe 5 jiffies is enough for adding/removing a few probe events, but
not enough for dozens of probe events.

Thank you,

> 
> Thanks!
> 
> -- Steve
[...]
On Wed, 22 Jan 2020 16:23:17 +0900
Masami Hiramatsu <mhiramat@kernel.org> wrote:

> On Tue, 21 Jan 2020 19:29:05 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > Can you explain more *why* 5 jiffies is too short.
> 
> Yes, I had introduced this 5 jiffies delay for multiple probe
> registration and unregistration like systemtap, which uses an
> array-based interface to register/unregister. In that case, 5 jiffies
> is enough delay to wait for other kprobe
> registrations/unregistrations.
> 
> However, since perf and ftrace register/unregister probes one by one
> with an RCU synchronization interval, the optimizer is started before
> all probes have finished registering/unregistering.
> And the optimizer holds kprobe_mutex for a while -- for the RCU-tasks
> synchronization. Since kprobe_mutex is also involved in disabling
> kprobes, this also stalls probe-event disabling.
> 
> Maybe 5 jiffies is enough for adding/removing a few probe events, but
> not enough for dozens of probe events.

Perhaps we should have a mechanism that can detect new probes being
added, and just continue to delay the optimization, instead of having
some arbitrary delay.

-- Steve
On Wed, 22 Jan 2020 07:11:15 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

[...]

> Perhaps we should have a mechanism that can detect new probes being
> added, and just continue to delay the optimization, instead of having
> some arbitrary delay.

Yes, that is what [03/13] does :)
Anyway, it seems that the RCU synchronization takes more than 5 jiffies.
And in that case, [03/13] still doesn't work. That's why I added this
patch after that.

Thank you,
On Wed, Jan 22, 2020 at 10:12:40PM +0900, Masami Hiramatsu wrote:
[...]
> > Perhaps we should have a mechanism that can detect new probes being
> > added, and just continue to delay the optimization, instead of having
> > some arbitrary delay.
> 
> Yes, that is what [03/13] does :)
> Anyway, it seems that the RCU synchronization takes more than 5 jiffies.
> And in that case, [03/13] still doesn't work. That's why I added this
> patch after that.

If the RCU synchronization is synchronize_rcu_tasks(), then yes, it
will often take way more than 5 jiffies. If it is synchronize_rcu(),
5 jiffies would not be unusual, especially on larger systems.

But in the case of synchronize_rcu(), one option is to instead use
synchronize_rcu_expedited(). It is not clear that this last is really
justified in this case, but figured it might be worth mentioning.

							Thanx, Paul
On Wed, 22 Jan 2020 08:54:32 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

[...]

> If the RCU synchronization is synchronize_rcu_tasks(), then yes, it
> will often take way more than 5 jiffies. If it is synchronize_rcu(),
> 5 jiffies would not be unusual, especially on larger systems.
> But in the case of synchronize_rcu(), one option is to instead use
> synchronize_rcu_expedited(). It is not clear that this last is really
> justified in this case, but figured it might be worth mentioning.

It is synchronize_rcu(), but it seems it is sometimes called several
times for one probe disabling.

Anyway, without this update, I added a printk to count the optimizer
queue length and found that every optimizer call ran with a single
kprobe on the queues. I think this depends on the machine, but as far
as I tested on an 8-thread qemu x86 guest, it shows this result...

Probes: 256 kprobe_events
Enable events
real	0m 0.02s
user	0m 0.00s
sys	0m 0.02s
[   17.730548] Queue-update: 180, skipped, Total Queued: 180
[   17.739445] Queue-update: 1, go, Total Queued: 180
Disable events
[   19.744634] Queue-update: 1, go, Total Queued: 1
[   19.770743] Queue-update: 1, go, Total Queued: 1
[   19.886625] Queue-update: 1, go, Total Queued: 1
[ ... roughly 175 more identical "Queue-update: 1, go, Total Queued: 1"
  lines, one about every 120 ms, elided ... ]
[   41.135594] Queue-update: 1, go, Total Queued: 1
real	0m 21.40s
user	0m 0.00s
sys	0m 0.04s
Remove events
real	0m 2.14s
user	0m 0.00s
sys	0m 0.04s

Thank you,
On Thu, Jan 23, 2020 at 10:33:34AM +0900, Masami Hiramatsu wrote:
[...]
> It is synchronize_rcu(), but it seems it is sometimes called several
> times for one probe disabling.

OK, good to know.

> Anyway, without this update, I added a printk to count the optimizer
> queue length and found that every optimizer call ran with a single
> kprobe on the queues. I think this depends on the machine, but as far
> as I tested on an 8-thread qemu x86 guest, it shows this result...
> 
> Probes: 256 kprobe_events
> Enable events
> real	0m 0.02s
> user	0m 0.00s
> sys	0m 0.02s
> [   17.730548] Queue-update: 180, skipped, Total Queued: 180
> [   17.739445] Queue-update: 1, go, Total Queued: 180
> Disable events
[ ... quoted "Queue-update: 1, go, Total Queued: 1" log elided ... ]
> [   41.135594] Queue-update: 1, go, Total Queued: 1
> real	0m 21.40s
> user	0m 0.00s
> sys	0m 0.04s

So 21.4s/256 = 84 milliseconds per event disable, correct?

It might be worth trying synchronize_rcu_expedited() just as an
experiment to see if it speeds things up significantly.

							Thanx, Paul

> Remove events
> real	0m 2.14s
> user	0m 0.00s
> sys	0m 0.04s
> 
> Thank you,
> 
> -- 
> Masami Hiramatsu <mhiramat@kernel.org>
On Wed, 22 Jan 2020 18:26:47 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> > Anyway, without this update, I added a printk to count optimizer
> > queue-length and found that every optimizer call had only a single
> > kprobe on its queues. I think this depends on the machine, but as
> > far as I tested on an 8-thread qemu x86, it shows this result...
> >
> > Probes: 256 kprobe_events
> > Enable events
> > real	0m 0.02s
> > user	0m 0.00s
> > sys	0m 0.02s
> > [   17.730548] Queue-update: 180, skipped, Total Queued: 180
> > [   17.739445] Queue-update: 1, go, Total Queued: 180
> > Disable events
[...]
> > [   41.135594] Queue-update: 1, go, Total Queued: 1
> > real	0m 21.40s
> > user	0m 0.00s
> > sys	0m 0.04s
>
> So 21.4s/256 = 84 milliseconds per event disable, correct?

Actually, only 172 probes were on the unoptimized list, so the
per-probe number is a bit different.

Anyway, the elapsed time above includes the non-batched optimizer
work, as follows:

(1) start looping over the probe events
(2) disable the kprobe
  (2.1) wait on kprobe_mutex if the optimizer is running
  (2.2) if the kprobe is optimized, queue it on the unoptimizing
        list and kick the optimizer with a 5-jiffies delay
(3) unlink the enabled event
(4) wait for synchronize_rcu()
(5) the optimizer starts optimizing before (4) finishes
(6) goto (1)

I think the disabling performance issue comes from (5) (and (2.1)).
Thus, if we change the delay in (2.2) to HZ jiffies, the optimizer
will start only after several loop iterations are done (and when the
optimizer detects an "active" queue, it postpones the process).

> It might be worth trying synchronize_rcu_expedited() just as an
> experiment to see if it speeds things up significantly.

Do you mean replacing synchronize_rcu() in the disabling loop, or
synchronize_rcu_tasks() in the optimizer?

I don't think that is the root cause of this behavior, since making
the optimizer delay 1 second seems enough to turn it into a batch
operation. See below; this is the result with the patched kernel
(HZ-jiffies, i.e. 1-second, delay).
Probes: 256 kprobe_events
Enable events
real	0m 0.07s
user	0m 0.00s
sys	0m 0.07s
[   19.191181] Queue-update: 180, skipped, Total Queued: 180
Disable events
[   20.214966] Queue-update: 1, go, Total Queued: 172
[   21.302924] Queue-update: 86, skipped, Total Queued: 86
real	0m 2.11s
user	0m 0.00s
sys	0m 0.03s
[   22.327173] Queue-update: 87, skipped, Total Queued: 172
[   23.350933] Queue-update: 1, go, Total Queued: 172
Remove events
real	0m 2.13s
user	0m 0.02s
sys	0m 0.02s

As you can see, the optimizer ran outside of the disabling loop.
In that case, it is OK to let the optimizer synchronize RCU tasks,
because it runs only *once* for multiple probe events.

From the result above, 86 probes were disabled per 1-second delay,
so each probe disabling took 11-12 msec on average. That means
(HZ / 10) could also be a good delay. (Note that the optimizer will
keep retrying each period until the disabling loop is finished.)

BTW, the test kernel was built with HZ=1000; with HZ=250 or HZ=100
the current code would behave differently. So I think we should use
an HZ-based delay instead of a fixed number.

Thank you,
On Thu, Jan 23, 2020 at 03:13:28PM +0900, Masami Hiramatsu wrote:
> On Wed, 22 Jan 2020 18:26:47 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>
> > > Anyway, without this update, I added a printk to count optimizer
> > > queue-length and found that every optimizer call had only a single
> > > kprobe on its queues. I think this depends on the machine, but as
> > > far as I tested on an 8-thread qemu x86, it shows this result...
> > >
> > > Probes: 256 kprobe_events
> > > Enable events
> > > real	0m 0.02s
> > > user	0m 0.00s
> > > sys	0m 0.02s
> > > [   17.730548] Queue-update: 180, skipped, Total Queued: 180
> > > [   17.739445] Queue-update: 1, go, Total Queued: 180
> > > Disable events
> [...]
> > > [   41.135594] Queue-update: 1, go, Total Queued: 1
> > > real	0m 21.40s
> > > user	0m 0.00s
> > > sys	0m 0.04s
> >
> > So 21.4s/256 = 84 milliseconds per event disable, correct?
>
> Actually, only 172 probes were on the unoptimized list, so the
> per-probe number is a bit different.
>
> Anyway, the elapsed time above includes the non-batched optimizer
> work, as follows:
>
> (1) start looping over the probe events
> (2) disable the kprobe
>   (2.1) wait on kprobe_mutex if the optimizer is running
>   (2.2) if the kprobe is optimized, queue it on the unoptimizing
>         list and kick the optimizer with a 5-jiffies delay
> (3) unlink the enabled event
> (4) wait for synchronize_rcu()
> (5) the optimizer starts optimizing before (4) finishes
> (6) goto (1)
>
> I think the disabling performance issue comes from (5) (and (2.1)).
> Thus, if we change the delay in (2.2) to HZ jiffies, the optimizer
> will start only after several loop iterations are done (and when the
> optimizer detects an "active" queue, it postpones the process).
>
> > It might be worth trying synchronize_rcu_expedited() just as an
> > experiment to see if it speeds things up significantly.
>
> Do you mean replacing synchronize_rcu() in the disabling loop, or
> synchronize_rcu_tasks() in the optimizer?

The former.  (There is no synchronize_rcu_tasks_expedited().)
> I don't think that is the root cause of this behavior, since making
> the optimizer delay 1 second seems enough to turn it into a batch
> operation. See below; this is the result with the patched kernel
> (HZ-jiffies, i.e. 1-second, delay).
>
> Probes: 256 kprobe_events
> Enable events
> real	0m 0.07s
> user	0m 0.00s
> sys	0m 0.07s
> [   19.191181] Queue-update: 180, skipped, Total Queued: 180
> Disable events
> [   20.214966] Queue-update: 1, go, Total Queued: 172
> [   21.302924] Queue-update: 86, skipped, Total Queued: 86
> real	0m 2.11s
> user	0m 0.00s
> sys	0m 0.03s
> [   22.327173] Queue-update: 87, skipped, Total Queued: 172
> [   23.350933] Queue-update: 1, go, Total Queued: 172
> Remove events
> real	0m 2.13s
> user	0m 0.02s
> sys	0m 0.02s
>
> As you can see, the optimizer ran outside of the disabling loop.
> In that case, it is OK to let the optimizer synchronize RCU tasks,
> because it runs only *once* for multiple probe events.

Even better!

> From the result above, 86 probes were disabled per 1-second delay,
> so each probe disabling took 11-12 msec on average. That means
> (HZ / 10) could also be a good delay. (Note that the optimizer will
> keep retrying each period until the disabling loop is finished.)
>
> BTW, the test kernel was built with HZ=1000; with HZ=250 or HZ=100
> the current code would behave differently. So I think we should use
> an HZ-based delay instead of a fixed number.

Agreed, the HZ-based delay seems worth a try.

							Thanx, Paul
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 0dacdcecc90f..9c6e230852ad 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -469,7 +469,8 @@ static int kprobe_optimizer_queue_update;
 
 static void kprobe_optimizer(struct work_struct *work);
 static DECLARE_DELAYED_WORK(optimizing_work, kprobe_optimizer);
-#define OPTIMIZE_DELAY 5
+/* Wait 1 second for starting optimization */
+#define OPTIMIZE_DELAY HZ
 
 /*
  * Optimize (replace a breakpoint with a jump) kprobes listed on
Since the 5 jiffies delay for the optimizer is too short to wait
for other probes, make it longer, like 1 second.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
---
 kernel/kprobes.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)