Message ID | 04be044c1bcd76b7438b7563edc35383417f12c8.camel@marvell.com |
---|---|
Headers | show |
Series | "Task_isolation" mode | expand |
Alex, Alex Belits <abelits@marvell.com> writes: > This is a new version of task isolation implementation. Previous version is at > https://lore.kernel.org/lkml/07c25c246c55012981ec0296eee23e68c719333a.camel@marvell.com/ > > Mostly this covers race conditions prevention on breaking isolation. Early after kernel entry, > task_isolation_enter() is called to update flags visible to other CPU cores and to perform > synchronization if necessary. Before this call only "safe" operations happen, as long as > CONFIG_TRACE_IRQFLAGS is not enabled. Without going into details of the individual patches, let me give you a high level view of this series: 1) Entry code handling: That's completely broken vs. the careful ordering and instrumentation protection of the entry code. You can't just slap stuff randomly into places which you think are safe w/o actually trying to understand why this code is ordered in the way it is. This clearly was never built and tested with any of the relevant debug options enabled. Both build and boot would have told you. 2) Instruction synchronization Trying to do instruction synchronization delayed is a clear recipe for hard to diagnose failures. Just because it blew not up in your face does not make it correct in any way. It's broken by design and violates _all_ rules of safe instruction patching and introduces a complete trainwreck in x86 NMI processing. If you really think that this is correct, then please have at least the courtesy to come up with a detailed and precise argumentation why this is a valid approach. While writing that up you surely will find out why it is not. 3) Debug calls Sprinkling debug calls around the codebase randomly is not going to happen. That's an unmaintainable mess. Aside of that none of these dmesg based debug things is necessary. This can simply be monitored with tracing. 
4) Tons of undocumented smp barriers See Documentation/process/submit-checklist.rst #25 5) Signal on page fault Why is this a magic task isolation feature instead of making it something which can be used in general? There are other legit reasons why a task might want a notification about an unexpected (resolved) page fault. 6) Coding style violations all over the place Using checkpatch.pl is mandatory 7) Not Cc'ed maintainers While your Cc list is huge, you completely fail to Cc the relevant maintainers of various files and subsystems as requested in Documentation/process/* 8) Changelogs Most of the changelogs have something along the lines: 'task isolation does not want X, so do Y to make it not do X' without any single line of explanation why this approach was chosen and why it is correct under all circumstances and cannot have nasty side effects. It's not the job of the reviewers/maintainers to figure this out. Please come up with a coherent design first and then address the identified issues one by one in a way which is palatable and reviewable. Throwing a big pile of completely undocumented 'works for me' mess over the fence does not get you anywhere, not even to the point that people are willing to review it in detail. Thanks, tglx
On Thu, Jul 23, 2020 at 03:17:04PM +0200, Thomas Gleixner wrote: > 2) Instruction synchronization > > Trying to do instruction synchronization delayed is a clear recipe > for hard to diagnose failures. Just because it blew not up in your > face does not make it correct in any way. It's broken by design and > violates _all_ rules of safe instruction patching and introduces a > complete trainwreck in x86 NMI processing. > > If you really think that this is correct, then please have at least > the courtesy to come up with a detailed and precise argumentation > why this is a valid approach. > > While writing that up you surely will find out why it is not. So delaying the sync_core() IPIs for kernel text patching _might_ be possible, but it very much wants to be a separate patchset and not something hidden inside a 'gem' like this.
On Thu, Jul 23, 2020 at 03:17:04PM +0200, Thomas Gleixner wrote: > 8) Changelogs > > Most of the changelogs have something along the lines: > > 'task isolation does not want X, so do Y to make it not do X' > > without any single line of explanation why this approach was chosen > and why it is correct under all circumstances and cannot have nasty > side effects. > > It's not the job of the reviewers/maintainers to figure this out. > > Please come up with a coherent design first and then address the > identified issues one by one in a way which is palatable and reviewable. > > Throwing a big pile of completely undocumented 'works for me' mess over > the fence does not get you anywhere, not even to the point that people > are willing to review it in detail. This.. as presented it is an absolutely unreviewable pile of junk. It presents code witout any coherent problem description and analysis. And the patches are not split sanely either.
Peter Zijlstra <peterz@infradead.org> writes: > On Thu, Jul 23, 2020 at 03:17:04PM +0200, Thomas Gleixner wrote: > >> 2) Instruction synchronization >> >> Trying to do instruction synchronization delayed is a clear recipe >> for hard to diagnose failures. Just because it blew not up in your >> face does not make it correct in any way. It's broken by design and >> violates _all_ rules of safe instruction patching and introduces a >> complete trainwreck in x86 NMI processing. >> >> If you really think that this is correct, then please have at least >> the courtesy to come up with a detailed and precise argumentation >> why this is a valid approach. >> >> While writing that up you surely will find out why it is not. > > So delaying the sync_core() IPIs for kernel text patching _might_ be > possible, but it very much wants to be a separate patchset and not > something hidden inside a 'gem' like this. I'm not saying it's impossible, but the proposed hack is definitely beyond broken and you really don't want to be the one who has to mop up the pieces later. Thanks, tglx
On Thu, 2020-07-23 at 15:17 +0200, Thomas Gleixner wrote: > > Without going into details of the individual patches, let me give you a > high level view of this series: > > 1) Entry code handling: > > That's completely broken vs. the careful ordering and instrumentation > protection of the entry code. You can't just slap stuff randomly > into places which you think are safe w/o actually trying to understand > why this code is ordered in the way it is. > > This clearly was never built and tested with any of the relevant > debug options enabled. Both build and boot would have told you. This is intended to avoid a race condition when entry or exit from isolation happens at the same time as an event that requires synchronization. The idea is, it is possible to insulate the core from all events while it is running isolated task in userspace, it will receive those calls normally after breaking isolation and entering kernel, and it will synchronize itself on kernel entry. This has two potential problems that I am trying to solve: 1. Without careful ordering, there will be a race condition with events that happen at the same time as kernel entry or exit. 2. CPU runs some kernel code after entering but before synchronization. This code should be restricted to early entry that is not affected by the "stale" state, similar to how IPI code that receives synchronization events does it normally. I can't say that I am completely happy with the amount of kernel entry handling that had to be added. The problem is, I am trying to introduce a feature that allows CPU cores to go into "de-synchronized" state while running isolated tasks and not receiving synchronization events that normally would reach them. This means, there should be established some point on kernel entry when it is safe for the core to catch up with the rest of kernel. 
It may be useful for other purposes, however at this point task isolation is the first to need it, so I had to determine where such point is for every supported architecture and method of kernel entry. I have found that each architecture has its own way of handling this, and sometimes individual interrupt controller drivers vary in their sequence of calls on early kernel entry. For x86 I also have an implementation for kernel 5.6, before your changes to IDT macros. That version is much less straightforward, so I am grateful for those relatively recent improvements. Nevertheless, I believe that the goal of finding those points and using them for synchronization is valid. If you can recommend me a better way for at least x86, I will be happy to follow your advice. I have tried to cover kernel entry in a generic way while making the changes least disruptive, and this is why it looks simple and spread over multiple places. I also had to do the same for arm and arm64 (that I use for development), and for each architecture I had to produce sequences of entry points and function calls to determine the correct placement of task_isolation_enter() calls in them. It is not random, however it does reflect the complex nature of kernel entry code. I believe, RCU implementation faced somewhat similar requirements for calls on kernel entry, however it is not completely unified, either > 2) Instruction synchronization > Trying to do instruction synchronization delayed is a clear recipe > for hard to diagnose failures. Just because it blew not up in your > face does not make it correct in any way. It's broken by design and > violates _all_ rules of safe instruction patching and introduces a > complete trainwreck in x86 NMI processing. The idea is that just like synchronization events are handled by regular IPI, we already use some code with the assumption that it is safe to be entered in "stale" state before synchronization. 
I have extended it to allow synchronization points on all kernel entry points. > If you really think that this is correct, then please have at least > the courtesy to come up with a detailed and precise argumentation > why this is a valid approach. > > While writing that up you surely will find out why it is not. > I had to document a sequence of calls for every entry point on three supported architectures, to determine the points for synchronization. It is possible that I have somehow missed something, however I don't see a better approach, save for establishing a kernel-wide infrastructure for this. And even if we did just that, it would be possible to implement this kind of synchronization point calls first, and convert them to something more generic later. > > 3) Debug calls > > Sprinkling debug calls around the codebase randomly is not going to > happen. That's an unmaintainable mess. Those report isolation breaking causes, and are intended for application and system debugging. > > Aside of that none of these dmesg based debug things is necessary. > This can simply be monitored with tracing. I think, it would be better to make all that information available to the userspace application, however I have based this on the Chris Metcalf code, and gradually updated the mechanisms and interfaces. The original reporting of isolation breaking causes had far greater problems, so at first I wanted to have something that produces easily visible and correct reporting, and does not break things while doing so. > > 4) Tons of undocumented smp barriers > > See Documentation/process/submit-checklist.rst #25 > That should be fixed. > 5) Signal on page fault > > Why is this a magic task isolation feature instead of making it > something which can be used in general? There are other legit > reasons why a task might want a notification about an unexpected > (resolved) page fault. Page fault causes isolation breaking. 
When a task runs in isolated mode it does so because it requires predictable timing, so causing page faults and expecting them to be handled along the way would defeat the purpose of isolation. So if page fault did happen, it is important that application will receive notification about isolation being broken, and then may decide to do something about it, re-enter isolation, etc. > > 6) Coding style violations all over the place > > Using checkpatch.pl is mandatory > > 7) Not Cc'ed maintainers > > While your Cc list is huge, you completely fail to Cc the relevant > maintainers of various files and subsystems as requested in > Documentation/process/* To be honest, I am not sure, whom I have missed, I tried to include everyone from my previous attempt. > > 8) Changelogs > > Most of the changelogs have something along the lines: > > 'task isolation does not want X, so do Y to make it not do X' > > without any single line of explanation why this approach was chosen > and why it is correct under all circumstances and cannot have nasty > side effects. This is the same as the previous version, except for the addition of kernel entry handling. As far as I can tell, the rest was discussed before, and not many questions remained except for the race condition on kernel entry. I agree that kernel entry handling is a complex issue in itself, so I have included explanation of entry points / function calls sequences for each supported architecture. I have longer call diagram, that I used to track each particular function, it probably should be included as a separate document. > It's not the job of the reviewers/maintainers to figure this out. > > Please come up with a coherent design first and then address the > identified issues one by one in a way which is palatable and reviewable. > > Throwing a big pile of completely undocumented 'works for me' mess over > the fence does not get you anywhere, not even to the point that people > are willing to review it in detail. 
There is a design, and it is a result of a careful tracking of calls in the kernel source. It has multiple point where task_isolation_enter() is called for a reason similar to why RCU-related functions are called in multiple places. If someone can recommend a better way to introduce a kernel entry checkpoint for synchronization that did not exist before, I will be happy to hear it.
On Thu, 2020-07-23 at 16:29 +0200, Peter Zijlstra wrote: > . > > This.. as presented it is an absolutely unreviewable pile of junk. It > presents code witout any coherent problem description and analysis. > And > the patches are not split sanely either. There is a more complete and slightly outdated description in the previous version of the patch at https://lore.kernel.org/lkml/07c25c246c55012981ec0296eee23e68c719333a.camel@marvell.com/ . It allows userspace application to take a CPU core for itself and run completely isolated, with no disturbances. There is work in progress that also disables and re-enables TLB flushes, and depending on CPU it may be possible to also pre-allocate cache, so it would not be affected by the rest of the system. Events that cause interaction with isolated task, cause isolation breaking, turning the task into a regular userspace task that can continue running normally and enter isolated state again if necessary. To make this feature suitable for any practical use, many mechanisms that normally would cause events on a CPU, should exclude CPU cores in this state, and synchronization should happen later, at the time of isolation breaking. There are three architectures supported, x86, arm and arm64, and it should be possible to extend it to others. Unfortunately kernel entry procedures are neither unified, nor straightforward, so introducing new feature to them causes an appearance of a mess.
On Thu, Jul 23, 2020 at 03:41:46PM +0000, Alex Belits wrote: > > On Thu, 2020-07-23 at 16:29 +0200, Peter Zijlstra wrote: > > . > > > > This.. as presented it is an absolutely unreviewable pile of junk. It > > presents code witout any coherent problem description and analysis. > > And > > the patches are not split sanely either. > > There is a more complete and slightly outdated description in the > previous version of the patch at > https://lore.kernel.org/lkml/07c25c246c55012981ec0296eee23e68c719333a.camel@marvell.com/ Not the point, you're mixing far too many things in one go. You also have the patches split like 'generic / arch-1 / arch-2' which is wrong per definition, as patches should be split per change and not care about sily boundaries. Also, if you want generic entry code, there's patches for that here: https://lkml.kernel.org/r/20200722215954.464281930@linutronix.de
On Thu, Jul 23, 2020 at 03:18:42PM +0000, Alex Belits wrote: > On Thu, 2020-07-23 at 15:17 +0200, Thomas Gleixner wrote: > > > > Without going into details of the individual patches, let me give you a > > high level view of this series: > > > > 1) Entry code handling: > > > > That's completely broken vs. the careful ordering and instrumentation > > protection of the entry code. You can't just slap stuff randomly > > into places which you think are safe w/o actually trying to understand > > why this code is ordered in the way it is. > > > > This clearly was never built and tested with any of the relevant > > debug options enabled. Both build and boot would have told you. > > This is intended to avoid a race condition when entry or exit from isolation > happens at the same time as an event that requires synchronization. The idea > is, it is possible to insulate the core from all events while it is running > isolated task in userspace, it will receive those calls normally after > breaking isolation and entering kernel, and it will synchronize itself on > kernel entry. 'What does noinstr mean? and why do we have it" -- don't dare touch the entry code until you can answer that.
On Thu, 2020-07-23 at 17:48 +0200, Peter Zijlstra wrote: > On Thu, Jul 23, 2020 at 03:41:46PM +0000, Alex Belits wrote: > > On Thu, 2020-07-23 at 16:29 +0200, Peter Zijlstra wrote: > > > . > > > > > > This.. as presented it is an absolutely unreviewable pile of > > > junk. It > > > presents code witout any coherent problem description and > > > analysis. > > > And > > > the patches are not split sanely either. > > > > There is a more complete and slightly outdated description in the > > previous version of the patch at > > https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_lkml_07c25c246c55012981ec0296eee23e68c719333a.camel-40marvell.com_&d=DwIBAg&c=nKjWec2b6R0mOyPaz7xtfQ&r=1qgvOnXfk3ZHJA3p7RIb6NFqs4SPPDyPI_PcwNFp8KY&m=shk9a5FDwktOZysSbFIjxmgUg-IPyw2UkbVAHGBhNV0&s=FFZaj-KanwqEiXYCdjd96JOgP_GAOnanpkw6bBvNrK4&e= > > Not the point, you're mixing far too many things in one go. You also > have the patches split like 'generic / arch-1 / arch-2' which is > wrong > per definition, as patches should be split per change and not care > about > sily boundaries. This follows the original patch by Chris Metcalf. There is a reason for that -- per-architecture changes are independent from each other and affect not just code but functionality that was implemented per- architecture. To support more architectures, it will be necessary to do it separately for each, and mark them supported with HAVE_ARCH_TASK_ISOLATION. Having only some architectures supported does not break anything for the rest -- architectures that are not covered, would not have this functionality. > > Also, if you want generic entry code, there's patches for that here: > > > https://urldefense.proofpoint.com/v2/url?u=https-3A__lkml.kernel.org_r_20200722215954.464281930-40linutronix.de&d=DwIBAg&c=nKjWec2b6R0mOyPaz7xtfQ&r=1qgvOnXfk3ZHJA3p7RIb6NFqs4SPPDyPI_PcwNFp8KY&m=shk9a5FDwktOZysSbFIjxmgUg-IPyw2UkbVAHGBhNV0&s=nZXIviY7rva31KvPgSVnTacwFNbsmkdW0LxSTfYSiqg&e= > > > That looks useful. 
Why didn't Thomas Gleixner mention it in his criticism of my approach if he already solved that exact problem, at least for x86?
On Thu, 2020-07-23 at 17:49 +0200, Peter Zijlstra wrote: > > 'What does noinstr mean? and why do we have it" -- don't dare touch > the > entry code until you can answer that. noinstr disables instrumentation, so there would not be calls and dependencies on other parts of the kernel when it's not yet safe to call them. Relevant functions already have it, and I add an inline call to perform flags update and synchronization. Unless something else is involved, those operations are safe, so I am not adding anything that can break those.
Alex, Alex Belits <abelits@marvell.com> writes: > On Thu, 2020-07-23 at 15:17 +0200, Thomas Gleixner wrote: >> >> Without going into details of the individual patches, let me give you a >> high level view of this series: >> >> 1) Entry code handling: >> >> That's completely broken vs. the careful ordering and instrumentation >> protection of the entry code. You can't just slap stuff randomly >> into places which you think are safe w/o actually trying to understand >> why this code is ordered in the way it is. >> >> This clearly was never built and tested with any of the relevant >> debug options enabled. Both build and boot would have told you. > > This is intended to avoid a race condition when entry or exit from isolation > happens at the same time as an event that requires synchronization. The idea > is, it is possible to insulate the core from all events while it is running > isolated task in userspace, it will receive those calls normally after > breaking isolation and entering kernel, and it will synchronize itself on > kernel entry. It does not matter what your intention is. Fact is that you disrupt a carefully designed entry code sequence without even trying to point out that you did so because you don't know how to do it better. There is a big fat comment above enter_from_user_mode() which should have make you ask at least. Peter and myself spent month on getting this correct vs. RCU, instrumentation, code patching and some more things. From someone who tries to fiddle with such a sensitive area of code it's not too much asked that he follows or reads up on these changes instead of just making uninformed choices of placement by defining that this new stuff is the most important thing on the planet or at least documenting why this is correct and not violating any of the existing constraints. > This has two potential problems that I am trying to solve: > > 1. 
Without careful ordering, there will be a race condition with events that > happen at the same time as kernel entry or exit. Entry code is all about ordering. News at 11. > 2. CPU runs some kernel code after entering but before synchronization. This > code should be restricted to early entry that is not affected by the "stale" > state, similar to how IPI code that receives synchronization events does it > normally. And because of that you define that you can place anything you need _before_ functionality which is essential for establishing kernel state correctly without providing the minimum proof that this does not violate any of the existing contraints. > reach them. This means, there should be established some point on kernel entry > when it is safe for the core to catch up with the rest of kernel. It may be > useful for other purposes, however at this point task isolation is the first > to need it, so I had to determine where such point is for every supported > architecture and method of kernel entry. You decided that your feature has to run first. Where is the analysis that this is safe and correct vs. the existing ordering constraints? Why does this trigger build and run time warnings? (I neither built nor ran it, but with full debug enabled it will for sure). > Nevertheless, I believe that the goal of finding those points and using > them for synchronization is valid. The goal does not justify the means. > If you can recommend me a better way for at least x86, I will be happy > to follow your advice. I have tried to cover kernel entry in a generic > way while making the changes least disruptive, and this is why it > looks simple and spread over multiple places. It does not look simple. It looks random and like the outcome of try and error. Oh, here it explodes, lets slap another instance into it. 
> I also had to do the same for arm and arm64 (that I use for > development), and for each architecture I had to produce sequences of > entry points and function calls to determine the correct placement of > task_isolation_enter() calls in them. It is not random, however it does > reflect the complex nature of kernel entry code. I believe, RCU > implementation faced somewhat similar requirements for calls on kernel > entry, however it is not completely unified, either But RCU has a well defined design and requirement list and people are working on making the entry sequence generic and convert architectures over to it. And no, we don't try to do 5 architectures at once. We did x86 with an eye on others. It's not perfect and it never will be because of hardware. >> 2) Instruction synchronization >> Trying to do instruction synchronization delayed is a clear recipe >> for hard to diagnose failures. Just because it blew not up in your >> face does not make it correct in any way. It's broken by design and >> violates _all_ rules of safe instruction patching and introduces a >> complete trainwreck in x86 NMI processing. > > The idea is that just like synchronization events are handled by regular IPI, > we already use some code with the assumption that it is safe to be entered in > "stale" state before synchronization. I have extended it to allow > synchronization points on all kernel entry points. The idea is clear, just where is the analysis that this is safe? Just from quickly skimming the code it's clear that this has never been done. Experimental development on the base of 'does not explode' is not a valid approach in the kernel whatever your goal is. >> If you really think that this is correct, then please have at least >> the courtesy to come up with a detailed and precise argumentation >> why this is a valid approach. >> >> While writing that up you surely will find out why it is not. 
> > I had to document a sequence of calls for every entry point on three supported > architectures, to determine the points for synchronization. Why is that documentation not part of the patches in form of documentation or proper changelogs? > It is possible that I have somehow missed something, however I don't > see a better approach, save for establishing a kernel-wide > infrastructure for this. And even if we did just that, it would be > possible to implement this kind of synchronization point calls first, > and convert them to something more generic later. You're putting the cart before the horse. You want delayed instruction patching synchronization. So the right approach is to: 1) Analyze the constraints of instruction patching on a given architecture. 2) Implement a scheme for this architecture to handle delayed patching as a stand alone feature with well documented and fine grained patches and proper prove that none of the constraints is violated. Find good arguments why such a feature is generally useful and not only for your personal pet pieve. Once you've done that, then you'll find out that there is no need for magic task isolation hackery simply because it's already there. Code patching is very much architecture specific and the constraints vary due to the different hardware requirements. The idea of making this generic is laudable, but naive at best. Once you have done #1 above on two architectures you will know why. >> 3) Debug calls >> >> Sprinkling debug calls around the codebase randomly is not going to >> happen. That's an unmaintainable mess. > > Those report isolation breaking causes, and are intended for application and > system debugging. I don't care what they do as that does not make them more palatable or maintainable. >> >> Aside of that none of these dmesg based debug things is necessary. >> This can simply be monitored with tracing. 
> > I think, it would be better to make all that information available to the > userspace application, however I have based this on the Chris Metcalf code, > and gradually updated the mechanisms and interfaces. The original reporting > of isolation breaking causes had far greater problems, so at first I wanted > to have something that produces easily visible and correct reporting, and > does not break things while doing so. Why are you exposing other people to these horrors? I don't care what you use in your development branch and I don't care what you share with your friends, but if you want maintainers and reviewers to look at that stuff then ensure that what you present: - Makes sense - Is properly implemented - Is properly documented - Is properly argumented why this is the right approach. 'I need', 'I want', 'this does' are non-arguments to begin with. >> 5) Signal on page fault >> >> Why is this a magic task isolation feature instead of making it >> something which can be used in general? There are other legit >> reasons why a task might want a notification about an unexpected >> (resolved) page fault. > > Page fault causes isolation breaking. When a task runs in isolated mode it > does so because it requires predictable timing, so causing page faults and > expecting them to be handled along the way would defeat the purpose of > isolation. So if page fault did happen, it is important that application will > receive notification about isolation being broken, and then may decide to do > something about it, re-enter isolation, etc. Did you actually read what I wrote? I very much understood what you are trying to do and why. Otherwise I wouldn't have written the above. 
>> 6) Coding style violations all over the place >> >> Using checkpatch.pl is mandatory >> >> 7) Not Cc'ed maintainers >> >> While your Cc list is huge, you completely fail to Cc the relevant >> maintainers of various files and subsystems as requested in >> Documentation/process/* > > To be honest, I am not sure, whom I have missed, I tried to include everyone > from my previous attempt. May I ask you to read, understand and follow the documentation I pointed you to? >> 8) Changelogs >> >> Most of the changelogs have something along the lines: >> >> 'task isolation does not want X, so do Y to make it not do X' >> >> without any single line of explanation why this approach was chosen >> and why it is correct under all circumstances and cannot have nasty >> side effects. > > This is the same as the previous version, except for the addition of kernel > entry handling. As far as I can tell, the rest was discussed before, and not > many questions remained except for the race condition on kernel entry. How is that related to changelogs which are useless? > agree that kernel entry handling is a complex issue in itself, so I have > included explanation of entry points / function calls sequences for each > supported architecture. Which explanations? Let's talk about 7/13 the x86 part: > In prepare_exit_to_usermode(), run cleanup for tasks exited fromi > isolation and call task_isolation_start() for tasks that entered > TIF_TASK_ISOLATION. > > In syscall_trace_enter(), add the necessary support for reporting > syscalls for task-isolation processes. > > Add task_isolation_remote() calls for the kernel exception types > that do not result in signals, namely non-signalling page faults. > > Add task_isolation_kernel_enter() calls to interrupt and syscall > entry handlers. > > This mechanism relies on calls to functions that call > task_isolation_kernel_enter() early after entry into kernel. 
Those > functions are: > > enter_from_user_mode() > called from do_syscall_64(), do_int80_syscall_32(), > do_fast_syscall_32(), idtentry_enter_user(), > idtentry_enter_cond_rcu() > idtentry_enter_cond_rcu() > called from non-raw IDT macros and other entry points > idtentry_enter_user() > nmi_enter() > xen_call_function_interrupt() > xen_call_function_single_interrupt() > xen_irq_work_interrupt() Can you point me to a single word of explanation in this blurb? It's a list of things WHAT the patch does without a single word of WHY and without a single word of WHY any of this would be correct. > I have longer call diagram, that I used to track each particular > function, it probably should be included as a separate document. Call diagrams are completely useless. The people who have to review this know how that works. They want real explanations: - Why is this the right approach - Why does this not violate constraints A, B, C - What are the potential side effects - ... All of this is asked for in Documentation/process/* for a reason. >> It's not the job of the reviewers/maintainers to figure this out. >> >> Please come up with a coherent design first and then address the >> identified issues one by one in a way which is palatable and reviewable. >> >> Throwing a big pile of completely undocumented 'works for me' mess over >> the fence does not get you anywhere, not even to the point that people >> are willing to review it in detail. > > There is a design, and it is a result of a careful tracking of calls in the > kernel source. It has multiple point where task_isolation_enter() is called > for a reason similar to why RCU-related functions are called in multiple > places. Design based on call tracking? That must be some newfangled method of design which was not taught when I was in school. You can do analysis with call tracking, but not design. Comparing this to RCU is beyond hillarious. 
RCU has design and requirements documented, and every single instance of
RCU state establishment has been argued in the changelogs and is most of
the time (except for the obvious places) extensively commented.

> If someone can recommend a better way to introduce a kernel entry
> checkpoint for synchronization that did not exist before, I will be happy
> to hear it.

Start with a coherent explanation of:

  - What you are trying to achieve

  - Which problems did you observe in your analysis, including the
    impact of each problem on your goal

  - A per-problem conceptual approach to solve it, along with cleanly
    implemented and independent RFC code for each particular problem,
    without tons of debug hacks and vain attempts to make everything
    generic

There might be common parts of it, but as explained with code patching
and #PF signals they can be completely independent of each other.

Thanks,

        tglx
Alex Belits <abelits@marvell.com> writes:
> On Thu, 2020-07-23 at 17:49 +0200, Peter Zijlstra wrote:
>> 'What does noinstr mean? and why do we have it' -- don't dare touch
>> the entry code until you can answer that.
>
> noinstr disables instrumentation, so there would not be calls and
> dependencies on other parts of the kernel when it's not yet safe to
> call them. Relevant functions already have it, and I add an inline call
> to perform flags update and synchronization. Unless something else is
> involved, those operations are safe, so I am not adding anything that
> can break those.

Sure.

 1) That inline function can be put out of line by the compiler and
    placed into the regular text section, which makes it subject to
    instrumentation.

 2) That inline function invokes local_irq_save(), which is subject to
    instrumentation _before_ the entry state for the instrumentation
    mechanisms is established.

 3) That inline function invokes sync_core() before important state has
    been established, which is especially interesting in NMI-like
    exceptions.

As you clearly documented why all of the above is safe and does not
cause any problems, it's just me and Peter being silly, right?

Try again.

Thanks,

        tglx
On Thu, 2020-07-23 at 23:44 +0200, Thomas Gleixner wrote:
> Alex Belits <abelits@marvell.com> writes:
> > On Thu, 2020-07-23 at 17:49 +0200, Peter Zijlstra wrote:
> > > 'What does noinstr mean? and why do we have it' -- don't dare touch
> > > the entry code until you can answer that.
> >
> > noinstr disables instrumentation, so there would not be calls and
> > dependencies on other parts of the kernel when it's not yet safe to
> > call them. Relevant functions already have it, and I add an inline
> > call to perform flags update and synchronization. Unless something
> > else is involved, those operations are safe, so I am not adding
> > anything that can break those.
>
> Sure.
>
> 1) That inline function can be put out of line by the compiler and
>    placed into the regular text section which makes it subject to
>    instrumentation
>
> 2) That inline function invokes local_irq_save() which is subject to
>    instrumentation _before_ the entry state for the instrumentation
>    mechanisms is established.
>
> 3) That inline function invokes sync_core() before important state has
>    been established, which is especially interesting in NMI like
>    exceptions.
>
> As you clearly documented why all of the above is safe and does not
> cause any problems, it's just me and Peter being silly, right?
>
> Try again.

I don't think accusations and mockery are really necessary here. I am
trying to do the right thing here. In particular, I am trying to port
code that was developed on platforms that have not yet implemented
those useful instrumentation safety features of x86 arch support.
For most of the development time I had to figure out where the
synchronization can be safely inserted into kernel entry code on three
platforms and tens of interrupt controller drivers, with some of those
presenting unusual exceptions (forgive me the pun) to platform-wide
conventions.

I really appreciate the work you did cleaning up kernel entry
procedures; my 5.6 version of this patch had to follow a much more
complex and, I would say, convoluted entry handling on x86, and now I
don't have to do that, thanks to you.

Unfortunately, most of my mental effort recently had to be spent on
three things:

1. (small): finding a way to safely enable events and synchronize state
   on kernel entry, so there will not be a race condition between
   isolation-breaking kernel entry and an event that was disabled while
   the task was isolated.

2. (big): trying to derive any useful rules applicable to kernel entry
   on various architectures, finding that there is very little
   consistency across architectures, and whatever exists can be broken
   by interrupt controller drivers that don't all follow the same rules
   as the rest of the platform.

3. (medium): introducing calls to synchronization in all kernel entry
   procedures, in places that are guaranteed not to have made any calls
   yet to parts of the kernel that may be affected by "stale" state,
   and doing it in a manner as consistent and generalized as possible.

The current state of kernel entry handling on arm and arm64
architectures has significant differences from x86 and from each other.
There is also the matter of interrupt controllers. As can be seen in
the interrupt controller-specific patch, I had to accommodate some
variety of custom interrupt entry code. What cannot be seen is that I
had to check all other interrupt controller drivers and
architecture-specific entry procedures, and find that they _do_ follow
some understandable rules -- unfortunately architecture-specific and
not documented in any manner.
I have no valid reasons for complaining about it. I could not expect
that authors of all kernel entry procedures would have any
foreknowledge that someone at some point may have a reason to establish
any kind of synchronization point for CPU cores. And this is why I had
to do my research by manually drawing call trees and sequences,
separately for every entry on every supported architecture, and across
two or three versions of the kernel, as those were changing along the
way.

The result of this may not be a "design" per se, but an understanding
of how things are implemented and what rules are being followed, so I
could add my code in a manner consistent with what is done, and
document the whole thing. Then there will be some written rules to
check against when anything of this kind becomes necessary again (say,
with TLB, or, considering how much is now done in userspace, possibly
to accommodate more exotic CPU features that may have state messed up
by userspace). I am afraid this task, kernel entry documentation, would
take me some time, and I did not want to delay my task isolation patch
for this reason.

As I followed whatever rules I have determined to be applicable, I have
produced code that introduces hooks in multiple, seemingly unrelated
places. Whenever there was a, forgive me the pun, exception to those
rules, another special case had to be handled. So no, I did not just
add entry hooks randomly, and your accusations of having no design are
unwarranted. My patches reflect what is already in code and in its
design. I have added one simple rule: the entry hook runs at the point
where no dependency on anything that requires synchronization exists
yet.

The entry hook is small; you have already seen all of it while listing
things that are not compatible with noinstr. Its mechanism and purpose
are explained in the general description of task isolation. I don't
think I can provide a better explanation.
I have to apologize for not taking into account all your carefully
built instrumentation safety support. That was one thing I have missed.
However, at this point the only way for me to find out that I am wrong
about it, and that my code does not comply with expectations defined by
the advanced state of x86 architecture development, was to present
whatever I could do right, based on experience with other platforms. I
don't think this warrants such hostility.

Another issue that you have asked me to defend is the existence and
scope of task isolation itself. I have provided a long explanation in
the changelog and previous discussions of this patch, and before me so
did Chris Metcalf and occasionally Yuri Norov. While I understand that
this is an unusual feature and by its nature it affects the kernel in
multiple places, it does not deserve to be called a "mess" and other
synonyms of "mess".

It's an attempt to introduce a feature that turns Linux userspace into
a superior replacement for an RTOS. Considering the current state of
CPU and SoC development, it is becoming very difficult even for a
vendor-specific RTOS to keep up with advanced hardware features. Linux
keeps up with them just fine; however, it lacks the ability to truly
leave the CPU core alone, to run the performance-critical and
latency-critical part of a task in a manner that an RTOS user would
expect. Very close, but not yet. Task isolation provides the last step
for this RTOS replacement. It is implemented in a manner that allows
the user to combine Linux resource handling and initialization with
RTOS isolation and latency. The decision about page faults is a part of
this design, as are many other decisions implemented in this code. Many
may disagree with either those decisions or the validity of the goal;
some may even argue that it's a bad thing to provide a reason to stop
RTOS development (I think this is a good thing, but that's not the
point).
However, most definitely this is not a "mess", and I do not believe
that I have to defend the validity of this direction of development, or
be accused of general incompetence every time someone finds a
frustrating mistake in my code.

As I said, I am trying to do the right thing, and want to bring my code
not only to the state where x86 support is on par with other platforms
(that is, working when instrumentation is disabled), but also to make
it fully compliant with the current requirements of the x86 platform.
Alex,

Alex Belits <abelits@marvell.com> writes:
> On Thu, 2020-07-23 at 23:44 +0200, Thomas Gleixner wrote:
>> 1) That inline function can be put out of line by the compiler and
>>    placed into the regular text section which makes it subject to
>>    instrumentation
>>
>> 2) That inline function invokes local_irq_save() which is subject to
>>    instrumentation _before_ the entry state for the instrumentation
>>    mechanisms is established.
>>
>> 3) That inline function invokes sync_core() before important state
>>    has been established, which is especially interesting in NMI like
>>    exceptions.
>>
>> As you clearly documented why all of the above is safe and does not
>> cause any problems, it's just me and Peter being silly, right?
>>
>> Try again.
>
> I don't think accusations and mockery are really necessary here.

Let's get some context to this. I told you in my first mail that this
breaks noinstr and that building with full debug would have told you.
Peter gave you a clear hint where to look. Now it might be expected
that you investigate that or at least ask questions before making the
bold claim:

>> > Unless something else is involved, those operations are safe, so I
>> > am not adding anything that can break those.

Surely I could have avoided the snide remark, but after you
demonstrably ignored technically valid concerns and suggestions in your
other reply, I was surely not in the mood to be overly careful in my
choice of words.

> The result of this may not be a "design" per se, but an understanding
> of how things are implemented and what rules are being followed, so I
> could add my code in a manner consistent with what is done, and
> document the whole thing.

Every other big and technically complex project which has to change the
very inner workings of the kernel started the same way. I'm not aware
of any of them getting accepted as is or in a big code dump.
What you have now qualifies as a proof of concept, and the big
challenge is to turn it into something which is acceptable and
maintainable.

You talk at great length about how inconsistent stuff is all over the
place. Yes, it is indeed. You even call that inconsistency an existing
design:

> My patches reflect what is already in code and in its design.

I agree that you just work with the code as is, but you might have
noticed that quite some of this stuff is clearly not designed at all or
designed badly. The solution is not to pile on top of the
inconsistency; the solution is to make it consistent in the first
place.

You are surely going to say that's beyond the scope of your project. I
can tell you that it is in the scope of your project, simply because
just proliferating the status quo and piling new stuff on top is not an
option. And no, there are no people waiting in a row to mop up after
you either.

Quite some of the big technology projects have spent, and still spend,
a considerable amount of time to do exactly this kind of consolidation
work upfront in order to make their features acceptable in a
maintainable form. All of these projects have been merged, or are still
being merged, piecewise in reviewable chunks.

We are talking about intrusive technology which requires very careful
integration to prevent it from becoming a roadblock or a maintenance
headache. The approach and implementation have to be _agreed_ on by the
involved parties, i.e. submitters, reviewers and maintainers.

> While I understand that this is an unusual feature and by its nature
> it affects the kernel in multiple places, it does not deserve to be
> called a "mess" and other synonyms of "mess".

The feature is perfectly fine and I completely understand why you want
it. Guess who started to lay the grounds for NOHZ_FULL more than a
decade ago and why?

The implementation is not acceptable on technical grounds, for
maintainability reasons, and for lack of design and proper consolidated
integration.
> Another issue that you have asked me to defend is the existence and
> scope of task isolation itself.

I have not asked you to defend the existence. I asked you for coherent
explanations of how the implementation works and why the chosen
approach is correct and valid. That's a completely different thing.

> It's an attempt to introduce a feature that turns Linux userspace into
> a superior replacement for an RTOS.....

Can you please spare me the advertising and marketing? I'm very well
aware what an RTOS is, and I'm also very well aware that there is no
such thing as a 'superior replacement' for RTOS in general.

If your view of RTOS is limited to this particular feature, then I have
to tell you that this particular feature is only useful for a very
small portion of the overall RTOS use cases.

> However, most definitely this is not a "mess", and I do not believe
> that I have to defend the validity of this direction of development,
> or be accused of general incompetence every time someone finds a
> frustrating mistake in my code.

Nobody accuses you of incompetence, but you will have to defend the
validity of your approach and implementation, and accept that things
might not be as shiny as you think they are. That's not hostility;
that's just how Linux kernel development works, whether you like it or
not.

I surely can understand your frustration over my view of this series,
but you might have noticed that aside from the criticism I gave you
very clear technical arguments and suggestions on how to proceed. It's
your decision what you make of that.

Thanks,

        tglx