Message ID | 20231213000452.88295-1-graf@amazon.com |
---|---|
Series | kexec: Allow preservation of ftrace buffers |
On Wed, 13 Dec 2023 00:04:45 +0000
Alexander Graf <graf@amazon.com> wrote:

> With KHO (Kexec HandOver), we want to preserve trace buffers across
> kexec. To carry over their state between kernels, the kernel needs a
> common handle for them that exists on both sides. As handle we introduce
> names for ring buffers. In a follow-up patch, the kernel can then use
> these names to recover buffer contents for specific ring buffers.
>

Is there a way to use the trace_array name instead?

The trace_array is the structure that represents each tracing instance. And
it already has a name field. And you can get the associated ring buffer
from that too.

	struct trace_array *tr;

	tr->array_buffer.buffer

	tr->name

When you do: mkdir /sys/kernel/tracing/instances/foo

You create a new trace_array instance where tr->name = "foo", and the
buffer for it is allocated as well.

-- Steve
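A minimal sketch of the lookup this question implies, written as a userspace model rather than kernel code. Only the fields tr->name and tr->array_buffer.buffer come from the mail above; the contents of struct trace_buffer, the flat array of instances, and the helper find_buffer_by_instance_name() are assumptions made for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Placeholder body: the real ring buffer type is far larger. */
struct trace_buffer {
	int id;
};

struct array_buffer {
	struct trace_buffer *buffer;
};

/* Simplified stand-in: only the two fields named in the mail. */
struct trace_array {
	const char *name;                 /* "foo" for instances/foo */
	struct array_buffer array_buffer; /* holds the ring buffer pointer */
};

/*
 * Hypothetical helper: return the ring buffer that belongs to the
 * tracing instance called @name, or NULL if no such instance exists.
 */
static struct trace_buffer *
find_buffer_by_instance_name(struct trace_array *instances, size_t count,
			     const char *name)
{
	assert(name != NULL);

	for (size_t i = 0; i < count; i++) {
		/* Skip entries without a name in this model. */
		if (instances[i].name &&
		    strcmp(instances[i].name, name) == 0)
			return instances[i].array_buffer.buffer;
	}
	return NULL;
}
```

With two instances "foo" and "bar" in the array, calling the helper with "bar" yields the buffer stored in the second entry, so the instance name alone is enough to act as the handle and no second copy of the name is needed inside the buffer.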
Hi Steve,

On 13.12.23 01:15, Steven Rostedt wrote:
>
> On Wed, 13 Dec 2023 00:04:45 +0000
> Alexander Graf <graf@amazon.com> wrote:
>
>> With KHO (Kexec HandOver), we want to preserve trace buffers across
>> kexec. To carry over their state between kernels, the kernel needs a
>> common handle for them that exists on both sides. As handle we introduce
>> names for ring buffers. In a follow-up patch, the kernel can then use
>> these names to recover buffer contents for specific ring buffers.
>>
> Is there a way to use the trace_array name instead?
>
> The trace_array is the structure that represents each tracing instance. And
> it already has a name field. And you can get the associated ring buffer
> from that too.
>
>	struct trace_array *tr;
>
>	tr->array_buffer.buffer
>
>	tr->name
>
> When you do: mkdir /sys/kernel/tracing/instances/foo
>
> You create a new trace_array instance where tr->name = "foo", and the
> buffer for it is allocated as well.

The name in the ring buffer is pretty much just a copy of the trace
array name. I use it to reconstruct which buffer we're actually
referring to inside __ring_buffer_alloc().

I'm all ears for alternative suggestions. I suppose we could pass tr as
argument to ring_buffer_alloc() instead of the name?

Alex

Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
On Wed, 13 Dec 2023 01:35:16 +0100
Alexander Graf <graf@amazon.com> wrote:

> > The trace_array is the structure that represents each tracing instance. And
> > it already has a name field. And you can get the associated ring buffer
> > from that too.
> >
> >	struct trace_array *tr;
> >
> >	tr->array_buffer.buffer
> >
> >	tr->name
> >
> > When you do: mkdir /sys/kernel/tracing/instances/foo
> >
> > You create a new trace_array instance where tr->name = "foo", and the
> > buffer for it is allocated as well.
>
> The name in the ring buffer is pretty much just a copy of the trace
> array name. I use it to reconstruct which buffer we're actually
> referring to inside __ring_buffer_alloc().

No, I'd rather not tie the ring buffer to the trace_array.

>
> I'm all ears for alternative suggestions. I suppose we could pass tr as
> argument to ring_buffer_alloc() instead of the name?

I'll have to spend some time (that I don't currently have :-( ) on looking
at this more. I really don't like the copying of the name into the ring
buffer allocation, as it may be an unneeded burden to maintain, not to
mention the duplicate field.

-- Steve
On Wed, 13 Dec 2023 00:04:46 +0000
Alexander Graf <graf@amazon.com> wrote:

> With KHO (Kexec HandOver), we want to preserve trace buffers. To parse
> them, we need to ensure that all trace events that exist in the logs are
> identical to the ones we parse as. That means we need to match the
> events before and after kexec.
>
> As a first step towards that, let's give every event a unique name. That
> way we can clearly identify the event before and after kexec and restore
> its ID post-kexec.
>
> Signed-off-by: Alexander Graf <graf@amazon.com>
> ---
>  include/linux/trace_events.h         |  1 +
>  include/trace/trace_events.h         |  2 ++
>  kernel/trace/blktrace.c              |  1 +
>  kernel/trace/trace_branch.c          |  1 +
>  kernel/trace/trace_events.c          |  3 +++
>  kernel/trace/trace_functions_graph.c |  4 +++-
>  kernel/trace/trace_output.c          | 13 +++++++++++++
>  kernel/trace/trace_probe.c           |  3 +++
>  kernel/trace/trace_syscalls.c        | 29 ++++++++++++++++++++++++++++
>  9 files changed, 56 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index d68ff9b1247f..7670224aa92d 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -149,6 +149,7 @@ struct trace_event {
>  	struct hlist_node node;
>  	int type;
>  	struct trace_event_functions *funcs;
> +	const char *name;
>  };

OK, this is a hard no. We definitely need to find a different way to do
this. I'm trying hard to lower the footprint of tracing, and this just
added 8 bytes to every event on a 64-bit machine. On my box I have 1953
events, and that number keeps growing. This just added 15,624 bytes of
tracing overhead to that machine.

That may not sound like much, but as the field is only needed for this
feature, it just added 15K of overhead for the majority of users, who
will never use it.

I'm not sure how easy it is to make this a config option that takes away
that field when not set. But I would need that at a minimum.

-- Steve
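The footprint figure above is plain arithmetic: 1953 events times one 8-byte pointer is 15,624 bytes. A minimal sketch of that calculation, together with one way the field could be compiled out, follows. It is a userspace model under stated assumptions: the option name CONFIG_TRACE_EVENT_NAMES and the helper name_field_overhead() are hypothetical and do not appear in the thread or in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical config switch (the thread does not name one).
 * Leave it undefined to model a kernel built without the feature.
 */
/* #define CONFIG_TRACE_EVENT_NAMES 1 */

struct trace_event_functions; /* opaque in this sketch */

struct hlist_node {
	struct hlist_node *next, **pprev;
};

struct trace_event {
	struct hlist_node node;
	int type;
	struct trace_event_functions *funcs;
#ifdef CONFIG_TRACE_EVENT_NAMES
	const char *name; /* only costs memory when the option is set */
#endif
};

/*
 * Extra bytes consumed when each of @nr_events events gains one
 * pointer-sized field of @pointer_size bytes.
 */
static size_t name_field_overhead(size_t nr_events, size_t pointer_size)
{
	assert(pointer_size > 0);
	return nr_events * pointer_size;
}
```

Calling name_field_overhead(1953, 8) reproduces the 15,624 bytes quoted in the review. With the option left undefined, struct trace_event keeps its original layout, which is the minimum requirement stated above.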
On Wed, Dec 13, 2023 at 12:04:40AM +0000, Alexander Graf wrote:
> +int register_kho_notifier(struct notifier_block *nb)
> +{
> +	return blocking_notifier_chain_register(&kho.chain_head, nb);
> +}
> +EXPORT_SYMBOL_GPL(register_kho_notifier);
> +
> +int unregister_kho_notifier(struct notifier_block *nb)
> +{
> +	return blocking_notifier_chain_unregister(&kho.chain_head, nb);
> +}
> +EXPORT_SYMBOL_GPL(unregister_kho_notifier);
> +
> +bool kho_is_active(void)
> +{
> +	return kho.active;
> +}
> +EXPORT_SYMBOL_GPL(kho_is_active);
> +

Why should these helpers be restricted to GPL code?
On Wed, Dec 13, 2023 at 12:04:41AM +0000, Alexander Graf wrote:
> +void *kho_get_fdt(void)
> +{
> +	return fdt;
> +}
> +EXPORT_SYMBOL_GPL(kho_get_fdt);
> +

Same question here (and in other places of this code): shouldn't this
facility be provided to non-GPL drivers as well?

Also a minor nit: "const void *" looks like a cleaner prototype here.
On 13.12.23 19:36, Stanislav Kinsburskii wrote:
> On Wed, Dec 13, 2023 at 12:04:40AM +0000, Alexander Graf wrote:
>> +int register_kho_notifier(struct notifier_block *nb)
>> +{
>> +	return blocking_notifier_chain_register(&kho.chain_head, nb);
>> +}
>> +EXPORT_SYMBOL_GPL(register_kho_notifier);
>> +
>> +int unregister_kho_notifier(struct notifier_block *nb)
>> +{
>> +	return blocking_notifier_chain_unregister(&kho.chain_head, nb);
>> +}
>> +EXPORT_SYMBOL_GPL(unregister_kho_notifier);
>> +
>> +bool kho_is_active(void)
>> +{
>> +	return kho.active;
>> +}
>> +EXPORT_SYMBOL_GPL(kho_is_active);
>> +
> Why should these helpers be restricted to GPL code?

That's a simple one: Everything should be EXPORT_SYMBOL_GPL by default.
You need to have really good reasons to export anything for non-GPL
modules. I don't have a good reason for them, so it's GPL only :)

Alex
Alexander Graf <graf@amazon.com> writes:

> Kexec today considers itself purely a boot loader: When we enter the new
> kernel, any state the previous kernel left behind is irrelevant and the
> new kernel reinitializes the system.
>
> However, there are use cases where this mode of operation is not what we
> actually want. In virtualization hosts for example, we want to use kexec
> to update the host kernel while virtual machine memory stays untouched.
> When we add device assignment to the mix, we also need to ensure that
> IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
> need to do the same for the PCI subsystem. If we want to kexec while an
> SEV-SNP enabled virtual machine is running, we need to preserve the VM
> context pages and physical memory. See James' and my Linux Plumbers
> Conference 2023 presentation for details:
>
>   https://lpc.events/event/17/contributions/1485/
>
> To start us on the journey to support all the use cases above, this
> patch implements basic infrastructure to allow hand over of kernel state
> across kexec (Kexec HandOver, aka KHO). As example target, we use ftrace:
> With this patch set applied, you can read ftrace records from the
> pre-kexec environment in your post-kexec one. This creates a very powerful
> debugging and performance analysis tool for kexec. It's also slightly
> easier to reason about than full blown VFIO state preservation.
>
> == Alternatives ==
>
> There are alternative approaches to (parts of) the problems above:
>
>   * Memory Pools [1] - preallocated persistent memory region + allocator
>   * PRMEM [2] - resizable persistent memory regions with fixed metadata
>                 pointer on the kernel command line + allocator
>   * Pkernfs [3] - preallocated file system for in-kernel data with fixed
>                   address location on the kernel command line
>   * PKRAM [4] - handover of user space pages using a fixed metadata page
>                 specified via command line
>
> All of the approaches above fundamentally have the same problem: They
> require the administrator to explicitly carve out a physical memory
> location because they have no mechanism outside of the kernel command
> line to pass data (including memory reservations) between kexec'ing
> kernels.
>
> KHO provides that base foundation. We will determine later whether we
> still need any of the approaches above for fast bulk memory handover of,
> for example, IOMMU page tables. But IMHO they would all be users of KHO,
> with KHO providing the foundational primitive to pass metadata and bulk
> memory reservations as well as provide easy versioning for data.

What you are describing is in many ways the same problem as
kexec-on-panic. The goal of leaving devices running absolutely requires
carving out memory for the new kernel to live in while it is coming up
so that DMA from a device that was not shut down does not stomp the
kernel coming up.

If I understand the virtualization case some of those virtual machines
are going to have virtual NICs that are going to want to DMA memory to
the host system. Which if I understand things correctly means that
among the devices you explicitly want to keep running there is not
a way to avoid the chance of DMA coming in while the kernel is being
changed.

There is also a huge maintenance challenge associated with all of this.

If you go with something that is essentially kexec-on-panic and then
add a little bit to help find things in the memory of the previous
kernel while the new kernel is coming up I can see it as a possibility.

As an example I think preserving ftrace data across kexec seems bizarre.
I don't see how that is an interesting use case at all. Not in
the situation of preserving virtual machines, and not in the situation
of kexec on panic.

If you are doing an orderly shutdown and kernel switch you should be
able to manually change the memory. If you are not doing an orderly
shutdown then I really don't get it.

I don't hate the capability you are trying to build.

I have not read or looked at most of this so I am probably
missing subtle details.

As you are currently describing things I have the sense you have
completely misframed the problem and are trying to solve the wrong parts
of the problem.

Eric
Hey Eric,

On 14.12.23 15:58, Eric W. Biederman wrote:
> Alexander Graf <graf@amazon.com> writes:
>
>> Kexec today considers itself purely a boot loader: When we enter the new
>> kernel, any state the previous kernel left behind is irrelevant and the
>> new kernel reinitializes the system.
>>
>> However, there are use cases where this mode of operation is not what we
>> actually want. In virtualization hosts for example, we want to use kexec
>> to update the host kernel while virtual machine memory stays untouched.
>> When we add device assignment to the mix, we also need to ensure that
>> IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
>> need to do the same for the PCI subsystem. If we want to kexec while an
>> SEV-SNP enabled virtual machine is running, we need to preserve the VM
>> context pages and physical memory. See James' and my Linux Plumbers
>> Conference 2023 presentation for details:
>>
>>   https://lpc.events/event/17/contributions/1485/
>>
>> To start us on the journey to support all the use cases above, this
>> patch implements basic infrastructure to allow hand over of kernel state
>> across kexec (Kexec HandOver, aka KHO). As example target, we use ftrace:
>> With this patch set applied, you can read ftrace records from the
>> pre-kexec environment in your post-kexec one. This creates a very powerful
>> debugging and performance analysis tool for kexec. It's also slightly
>> easier to reason about than full blown VFIO state preservation.
>>
>> == Alternatives ==
>>
>> There are alternative approaches to (parts of) the problems above:
>>
>>   * Memory Pools [1] - preallocated persistent memory region + allocator
>>   * PRMEM [2] - resizable persistent memory regions with fixed metadata
>>                 pointer on the kernel command line + allocator
>>   * Pkernfs [3] - preallocated file system for in-kernel data with fixed
>>                   address location on the kernel command line
>>   * PKRAM [4] - handover of user space pages using a fixed metadata page
>>                 specified via command line
>>
>> All of the approaches above fundamentally have the same problem: They
>> require the administrator to explicitly carve out a physical memory
>> location because they have no mechanism outside of the kernel command
>> line to pass data (including memory reservations) between kexec'ing
>> kernels.
>>
>> KHO provides that base foundation. We will determine later whether we
>> still need any of the approaches above for fast bulk memory handover of,
>> for example, IOMMU page tables. But IMHO they would all be users of KHO,
>> with KHO providing the foundational primitive to pass metadata and bulk
>> memory reservations as well as provide easy versioning for data.
> What you are describing is in many ways the same problem as
> kexec-on-panic. The goal of leaving devices running absolutely requires
> carving out memory for the new kernel to live in while it is coming up
> so that DMA from a device that was not shut down does not stomp the
> kernel coming up.

Yes, part of the problem is similar: We need a safe space to boot from
that doesn't overwrite existing data.

What happens after is different: With panics, you're trying to rescue
previous state for post-mortem analysis. You may even have intrinsic
knowledge of the environment you came from, so you can optimize that
rescuing. Nobody wants to continue running the system as if nothing
happened after a panic.
With KHO, the kernels establish an ABI between each other to communicate
any state that needs to be preserved; everything else gets reinitialized.
After KHO, the new kernel continues executing workloads that were running
before.

The ABI is important because the next environment may not have a chance
to know about the previous environment's setup. Think for example of
roll-out and roll-back scenarios: If I roll back into my previous
environment because I determined something didn't work as expected after
update, I'm moving the system into an environment that was built when the
kexec source environment didn't even exist yet.

> If I understand the virtualization case some of those virtual machines
> are going to have virtual NICs that are going to want to DMA memory to
> the host system. Which if I understand things correctly means that

No, to the *guest* system. This is about device assignment: The guest is
in full control of the NICs that do DMA, so we have no chance to quiesce
them.

> among the devices you explicitly want to keep running there is not
> a way to avoid the chance of DMA coming in while the kernel is being
> changed.

Correct, because the host doesn't own the driver :).

> There is also a huge maintenance challenge associated with all of this.
>
> If you go with something that is essentially kexec-on-panic and then
> add a little bit to help find things in the memory of the previous
> kernel while the new kernel is coming up I can see it as a possibility.

That's roughly what the patch set is doing, yes. It avoids a static
allocation ahead of time for next-kernel memory, because I only know the
size of all components when we're actually doing the kexec. But the
principle is similar.

The bit where the new kernel finds bits in the old memory is the KHO DT:
A flattened device tree structure the old kernel passes to the new kernel.
That contains all memory locations as well as additional metadata to
"help find things" in a way that doesn't immediately break on every
kernel change.

> As an example I think preserving ftrace data across kexec seems bizarre.
> I don't see how that is an interesting use case at all. Not in
> the situation of preserving virtual machines, and not in the situation
> of kexec on panic.

It's super useful as a self-debugging aid: I already used it to profile
the kexec path to find a few performance issues :).

It's also really helpful - even without device assignment support yet -
when you use it in combination with KVM trace points: You have a VM
running backed by a DAX pmem device, then serialize its virtual device
state, kexec, restore from the virtual device state, then the VM
misbehaves. With ftrace handover in place, you get a full trace of the
flow, which simplifies debugging of issues that happen during/because of
the serialization/deserialization flow of KVM state.

But the main reason I chose ftrace to start with is that all other use
cases require another concept: fd preservation. All the typical "objects"
you want to preserve across kexec are anonymous file descriptors. So we
need to also build a way in Linux that allows user space to request the
kernel to preserve an fd using the kexec handover framework in this patch
set. But that is another big discussion I wanted to keep separate: Ftrace
is from kernel, to kernel and hence "easy".

> If you are doing an orderly shutdown and kernel switch you should be
> able to manually change the memory. If you are not doing an orderly
> shutdown then I really don't get it.

I don't follow the paragraph above?

> I don't hate the capability you are trying to build.
>
> I have not read or looked at most of this so I am probably
> missing subtle details.
>
> As you are currently describing things I have the sense you have
> completely misframed the problem and are trying to solve the wrong parts
> of the problem.
Very well possible :). I hope the above clarifies it a bit. If not, please
let me know where exactly it's unclear so I can elaborate.

If you have a few minutes, it would also be great if you could have a look
at our slides [1] or even video [2] from LPC 2023, which go into detail on
the end problem. Beware that I'm consciously *not* trying to solve the end
problem yet: I want to take baby steps towards it. Nobody wants to review
an 80-patch patch set where everything depends on everything else.

Alex

[1] https://lpc.events/event/17/contributions/1485/attachments/1296/2650/jgowans-preserving-across-kexec.pdf
[2] https://www.youtube.com/watch?v=cYrlV4bK1Y4
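The handover idea in this reply (the old kernel describes memory locations plus metadata, and the new kernel only picks up what it understands) can be modeled in a few lines. This is a minimal userspace sketch, not the KHO implementation: it uses a plain array instead of a flattened device tree, and struct handover_node, its fields, the format strings, and handover_find() are all hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical, simplified stand-in for one entry the old kernel
 * hands over: a versioned format identifier plus a memory range.
 */
struct handover_node {
	const char *compatible; /* e.g. "ftrace-v1" (made-up string) */
	uint64_t mem_start;     /* physical start of the preserved region */
	uint64_t mem_size;      /* size of the preserved region in bytes */
};

/*
 * Return the first node whose format string the new kernel
 * understands, or NULL. A NULL result models the fallback described
 * in the thread: state that is not recognized is not recovered and
 * the subsystem is simply reinitialized.
 */
static const struct handover_node *
handover_find(const struct handover_node *nodes, size_t count,
	      const char *understood)
{
	assert(understood != NULL);

	for (size_t i = 0; i < count; i++) {
		if (nodes[i].compatible &&
		    strcmp(nodes[i].compatible, understood) == 0)
			return &nodes[i];
	}
	return NULL;
}
```

Matching on an explicit format string rather than on struct layout is what lets a rolled-back, older kernel refuse data written by a newer one instead of misreading it, which is the roll-out and roll-back concern raised earlier in the reply.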