Message ID | 20240401213950.3910531-3-paulmck@kernel.org |
---|---|
State | New |
Headers | show |
Series | None | expand |
On Mon, Apr 1, 2024, at 23:39, Paul E. McKenney wrote: > Use the new cmpxchg_emu_u8() and cmpxchg_emu_u16() to emulate one-byte > and two-byte cmpxchg() on arc. > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org> I'm missing the context here, is it now mandatory to have 16-bit cmpxchg() everywhere? I think we've historically tried hard to keep this out of common code since it's expensive on architectures that don't have native 16-bit load/store instructions (alpha, armv3) and or sub-word atomics (armv5, riscv, mips). Does the code that uses this rely on working concurrently with non-atomic stores to part of the 32-bit word? If we want to allow that, we need to merge my alpha ev4/45/5 removal series first. For the cmpxchg() interface, I would prefer to handle the 8-bit and 16-bit versions the same way as cmpxchg64() and provide separate cmpxchg8()/cmpxchg16()/cmpxchg32() functions by architectures that operate on fixed-size integer values but not compounds or pointers, and a generic cmpxchg() wrapper in common code that can handle the abtraction for pointers, long and (if absolutely necessary) compounds by multiplexing between cmpxchg32() and cmpxchg64() where needed. I did a prototype a few years ago and found that there is probably under a dozen users of the sub-word atomics in the tree, so this mostly requires changes to architecture code and less to drivers and core code. Arnd
On Tue, Apr 02, 2024 at 10:14:08AM +0200, Arnd Bergmann wrote: > On Mon, Apr 1, 2024, at 23:39, Paul E. McKenney wrote: > > Use the new cmpxchg_emu_u8() and cmpxchg_emu_u16() to emulate one-byte > > and two-byte cmpxchg() on arc. > > > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org> > > I'm missing the context here, is it now mandatory to have 16-bit > cmpxchg() everywhere? I think we've historically tried hard to > keep this out of common code since it's expensive on architectures > that don't have native 16-bit load/store instructions (alpha, armv3) > and or sub-word atomics (armv5, riscv, mips). I need 8-bit, and just added 16-bit because it was easy to do so. I would be OK dropping the 16-bit portions of this series, assuming that no-one needs it. And assuming that it is easier to drop it than to explain why it is not available. ;-) > Does the code that uses this rely on working concurrently with > non-atomic stores to part of the 32-bit word? If we want to > allow that, we need to merge my alpha ev4/45/5 removal series > first. For 8-but cmpxchg(), yes. There are potentially concurrent smp_load_acquire() and smp_store_release() operations to this same byte. Or is your question specific to the 16-bit primitives? (Full disclosure: I have no objection to removing Alpha ev4/45/5, having several times suggested removing Alpha entirely. And having the scars to prove it.) > For the cmpxchg() interface, I would prefer to handle the > 8-bit and 16-bit versions the same way as cmpxchg64() and > provide separate cmpxchg8()/cmpxchg16()/cmpxchg32() functions > by architectures that operate on fixed-size integer values > but not compounds or pointers, and a generic cmpxchg() wrapper > in common code that can handle the abtraction for pointers, > long and (if absolutely necessary) compounds by multiplexing > between cmpxchg32() and cmpxchg64() where needed. So as to support _acquire(), _relaxed(), and _release()? If so, I don't have any use cases for other than full ordering. > I did a prototype a few years ago and found that there is > probably under a dozen users of the sub-word atomics in > the tree, so this mostly requires changes to architecture > code and less to drivers and core code. Given this approach, the predominance of changes to architecture code seems quite likely to me. But do we really wish to invest that much work into architectures that might not be all that long for the world? (Quickly donning my old asbestos suit, the one with the tungsten pinstripes...) Thanx, Paul
On Tue, Apr 02, 2024 at 10:06:14AM -0700, Paul E. McKenney wrote: > On Tue, Apr 02, 2024 at 10:14:08AM +0200, Arnd Bergmann wrote: > > On Mon, Apr 1, 2024, at 23:39, Paul E. McKenney wrote: > > > Use the new cmpxchg_emu_u8() and cmpxchg_emu_u16() to emulate one-byte > > > and two-byte cmpxchg() on arc. > > > > > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org> > > > > I'm missing the context here, is it now mandatory to have 16-bit > > cmpxchg() everywhere? I think we've historically tried hard to > > keep this out of common code since it's expensive on architectures > > that don't have native 16-bit load/store instructions (alpha, armv3) > > and or sub-word atomics (armv5, riscv, mips). > > I need 8-bit, and just added 16-bit because it was easy to do so. > I would be OK dropping the 16-bit portions of this series, assuming > that no-one needs it. And assuming that it is easier to drop it than > to explain why it is not available. ;-) > > > Does the code that uses this rely on working concurrently with > > non-atomic stores to part of the 32-bit word? If we want to > > allow that, we need to merge my alpha ev4/45/5 removal series > > first. > > For 8-but cmpxchg(), yes. There are potentially concurrent > smp_load_acquire() and smp_store_release() operations to this same byte. > > Or is your question specific to the 16-bit primitives? (Full disclosure: > I have no objection to removing Alpha ev4/45/5, having several times > suggested removing Alpha entirely. And having the scars to prove it.) > > > For the cmpxchg() interface, I would prefer to handle the > > 8-bit and 16-bit versions the same way as cmpxchg64() and > > provide separate cmpxchg8()/cmpxchg16()/cmpxchg32() functions > > by architectures that operate on fixed-size integer values > > but not compounds or pointers, and a generic cmpxchg() wrapper > > in common code that can handle the abtraction for pointers, > > long and (if absolutely necessary) compounds by multiplexing > > between cmpxchg32() and cmpxchg64() where needed. > > So as to support _acquire(), _relaxed(), and _release()? > > If so, I don't have any use cases for other than full ordering. Nor any use cases other than integers. (In case another thing you are after here is good type-checking for non-integers combined with allowing C-language implicit conversions for integers.) Thanx, Paul > > I did a prototype a few years ago and found that there is > > probably under a dozen users of the sub-word atomics in > > the tree, so this mostly requires changes to architecture > > code and less to drivers and core code. > > Given this approach, the predominance of changes to architecture code > seems quite likely to me. > > But do we really wish to invest that much work into architectures that > might not be all that long for the world? (Quickly donning my old > asbestos suit, the one with the tungsten pinstripes...) > > Thanx, Paul
On Tue, Apr 2, 2024, at 19:06, Paul E. McKenney wrote: > On Tue, Apr 02, 2024 at 10:14:08AM +0200, Arnd Bergmann wrote: >> On Mon, Apr 1, 2024, at 23:39, Paul E. McKenney wrote: >> > Use the new cmpxchg_emu_u8() and cmpxchg_emu_u16() to emulate one-byte >> > and two-byte cmpxchg() on arc. >> > >> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org> >> >> I'm missing the context here, is it now mandatory to have 16-bit >> cmpxchg() everywhere? I think we've historically tried hard to >> keep this out of common code since it's expensive on architectures >> that don't have native 16-bit load/store instructions (alpha, armv3) >> and or sub-word atomics (armv5, riscv, mips). > > I need 8-bit, and just added 16-bit because it was easy to do so. > I would be OK dropping the 16-bit portions of this series, assuming > that no-one needs it. And assuming that it is easier to drop it than > to explain why it is not available. ;-) It certainly makes sense to handle both the same, whichever way we do this. >> Does the code that uses this rely on working concurrently with >> non-atomic stores to part of the 32-bit word? If we want to >> allow that, we need to merge my alpha ev4/45/5 removal series >> first. > > For 8-but cmpxchg(), yes. There are potentially concurrent > smp_load_acquire() and smp_store_release() operations to this same byte. > > Or is your question specific to the 16-bit primitives? (Full disclosure: > I have no objection to removing Alpha ev4/45/5, having several times > suggested removing Alpha entirely. And having the scars to prove it.) For the native sub-word access, alpha ev5 cannot do either of them, while armv3 could do byte access but not 16-bit words. It sounds like the old alphas are already broken then, which is a good reason to finally drop support. It would appear that the arm riscpc is not affected by this though. >> For the cmpxchg() interface, I would prefer to handle the >> 8-bit and 16-bit versions the same way as cmpxchg64() and >> provide separate cmpxchg8()/cmpxchg16()/cmpxchg32() functions >> by architectures that operate on fixed-size integer values >> but not compounds or pointers, and a generic cmpxchg() wrapper >> in common code that can handle the abtraction for pointers, >> long and (if absolutely necessary) compounds by multiplexing >> between cmpxchg32() and cmpxchg64() where needed. > > So as to support _acquire(), _relaxed(), and _release()? > > If so, I don't have any use cases for other than full ordering. My main goal here would be to simplify the definition of the very commonly used cmpxchg() macro so it doesn't have to deal with so many corner cases, and make it work the same way across all architectures. Without the type agnostic wrapper, there would also be the benefit of additional type checking that we get by replacing the macros with inline functions. We'd still need all the combinations of cmpxchg() and xchg() with the four sizes and ordering variants, but at least the latter should easily collapse on most architectures. At the moment, most architectures only provide the full-ordering version. Arnd
On Thu, Apr 04, 2024 at 01:57:32PM +0200, Arnd Bergmann wrote: > On Tue, Apr 2, 2024, at 19:06, Paul E. McKenney wrote: > > On Tue, Apr 02, 2024 at 10:14:08AM +0200, Arnd Bergmann wrote: > >> On Mon, Apr 1, 2024, at 23:39, Paul E. McKenney wrote: > >> > Use the new cmpxchg_emu_u8() and cmpxchg_emu_u16() to emulate one-byte > >> > and two-byte cmpxchg() on arc. > >> > > >> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org> > >> > >> I'm missing the context here, is it now mandatory to have 16-bit > >> cmpxchg() everywhere? I think we've historically tried hard to > >> keep this out of common code since it's expensive on architectures > >> that don't have native 16-bit load/store instructions (alpha, armv3) > >> and or sub-word atomics (armv5, riscv, mips). > > > > I need 8-bit, and just added 16-bit because it was easy to do so. > > I would be OK dropping the 16-bit portions of this series, assuming > > that no-one needs it. And assuming that it is easier to drop it than > > to explain why it is not available. ;-) > > It certainly makes sense to handle both the same, whichever > way we do this. Agreed, at least as long as the properties of the relevant hardware are consistent with doing so. > >> Does the code that uses this rely on working concurrently with > >> non-atomic stores to part of the 32-bit word? If we want to > >> allow that, we need to merge my alpha ev4/45/5 removal series > >> first. > > > > For 8-but cmpxchg(), yes. There are potentially concurrent > > smp_load_acquire() and smp_store_release() operations to this same byte. > > > > Or is your question specific to the 16-bit primitives? (Full disclosure: > > I have no objection to removing Alpha ev4/45/5, having several times > > suggested removing Alpha entirely. And having the scars to prove it.) > > For the native sub-word access, alpha ev5 cannot do either of > them, while armv3 could do byte access but not 16-bit words. > > It sounds like the old alphas are already broken then, which > is a good reason to finally drop support. It would appear that > the arm riscpc is not affected by this though. Good point, given that single-byte load/store and emulated single-byte cmpxchg() has been in common code for some time. So armv3 is OK with one-byte emulated cmpxchg(), but not with two-byte, which is consistent with the current state of this series in the -rcu tree. (My plan is to wait until Monday to re-send the series in order to allow the test robots to find yet more bugs.) Or is the plan to also drop support for armv3 in the near term? > >> For the cmpxchg() interface, I would prefer to handle the > >> 8-bit and 16-bit versions the same way as cmpxchg64() and > >> provide separate cmpxchg8()/cmpxchg16()/cmpxchg32() functions > >> by architectures that operate on fixed-size integer values > >> but not compounds or pointers, and a generic cmpxchg() wrapper > >> in common code that can handle the abtraction for pointers, > >> long and (if absolutely necessary) compounds by multiplexing > >> between cmpxchg32() and cmpxchg64() where needed. > > > > So as to support _acquire(), _relaxed(), and _release()? > > > > If so, I don't have any use cases for other than full ordering. > > My main goal here would be to simplify the definition of > the very commonly used cmpxchg() macro so it doesn't have > to deal with so many corner cases, and make it work the > same way across all architectures. Without the type > agnostic wrapper, there would also be the benefit of > additional type checking that we get by replacing the > macros with inline functions. > > We'd still need all the combinations of cmpxchg() and xchg() > with the four sizes and ordering variants, but at least the > latter should easily collapse on most architectures. At the > moment, most architectures only provide the full-ordering > version. That does make a lot of sense to me. Though C-language inline functions have some trouble with type-generic parameters. However, my patch is down at architecture-specific level, so should not affect the cmpxchg() macro, right? Or am I missing some aspect of your proposed refactoring? Thanx, Paul
On Thu, Apr 4, 2024, at 16:44, Paul E. McKenney wrote: > On Thu, Apr 04, 2024 at 01:57:32PM +0200, Arnd Bergmann wrote: >> On Tue, Apr 2, 2024, at 19:06, Paul E. McKenney wrote: >> > Or is your question specific to the 16-bit primitives? (Full disclosure: >> > I have no objection to removing Alpha ev4/45/5, having several times >> > suggested removing Alpha entirely. And having the scars to prove it.) >> >> For the native sub-word access, alpha ev5 cannot do either of >> them, while armv3 could do byte access but not 16-bit words. >> >> It sounds like the old alphas are already broken then, which >> is a good reason to finally drop support. It would appear that >> the arm riscpc is not affected by this though. > > Good point, given that single-byte load/store and emulated single-byte > cmpxchg() has been in common code for some time. > > So armv3 is OK with one-byte emulated cmpxchg(), but not with two-byte, > which is consistent with the current state of this series in the -rcu > tree. (My plan is to wait until Monday to re-send the series in order > to allow the test robots to find yet more bugs.) > > Or is the plan to also drop support for armv3 in the near term? Russell still has his RiscPC, which is probably the only one using armv3 kernels (it's actually v4 but relies on -march=armv3 to work around hardware quirks). Since armv3 support was dropped in gcc-9, it's a matter of time before we drop this as well, but it's not now. >> We'd still need all the combinations of cmpxchg() and xchg() >> with the four sizes and ordering variants, but at least the >> latter should easily collapse on most architectures. At the >> moment, most architectures only provide the full-ordering >> version. > > That does make a lot of sense to me. Though C-language inline functions > have some trouble with type-generic parameters. > > However, my patch is down at architecture-specific level, so should not > affect the cmpxchg() macro, right? Or am I missing some aspect of your > proposed refactoring? Today, arch_cmpxchg() and its variants are defined in each architecture to handle some subset of integer sizes. The way I'd like to do this in the future would be to remove that macro in favor of arch_{cmp,}xchg{8,16,32,64}{,_relaxed,_release, _acquire,_local} inline functions that each architecture needs to provide either directly or through a generic helper, with all the macros wrapping them moved to common code. I've been wanting to change this for years now and it never quite makes it to the top of my todo pile, so I guess it can wait a little longer. Arnd
diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig index 99d2845f3feb9..0b40039f38eb2 100644 --- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -14,6 +14,7 @@ config ARC select ARCH_HAS_SETUP_DMA_OPS select ARCH_HAS_SYNC_DMA_FOR_CPU select ARCH_HAS_SYNC_DMA_FOR_DEVICE + select ARCH_NEED_CMPXCHG_1_2_EMU select ARCH_SUPPORTS_ATOMIC_RMW if ARC_HAS_LLSC select ARCH_32BIT_OFF_T select BUILDTIME_TABLE_SORT diff --git a/arch/arc/include/asm/cmpxchg.h b/arch/arc/include/asm/cmpxchg.h index e138fde067dea..1e3e23adaca13 100644 --- a/arch/arc/include/asm/cmpxchg.h +++ b/arch/arc/include/asm/cmpxchg.h @@ -46,6 +46,12 @@ __typeof__(*(ptr)) _prev_; \ \ switch(sizeof((_p_))) { \ + case 1: \ + _prev_ = cmpxchg_emu_u8((volatile u8 *)_p_, _o_, _n_); \ + break; \ + case 2: \ + _prev_ = cmpxchg_emu_u16((volatile u16 *)_p_, _o_, _n_); \ + break; \ case 4: \ _prev_ = __cmpxchg(_p_, _o_, _n_); \ break; \ @@ -65,16 +71,30 @@ __typeof__(*(ptr)) _prev_; \ unsigned long __flags; \ \ - BUILD_BUG_ON(sizeof(_p_) != 4); \ + switch(sizeof((_p_))) { \ + case 1: \ + __flags = cmpxchg_emu_u8((volatile u8 *)_p_, _o_, _n_); \ + _prev_ = (__typeof__(*(ptr)))__flags; \ + break; \ + case 2: \ + __flags = cmpxchg_emu_u16((volatile u16 *)_p_, _o_, _n_); \ + _prev_ = (__typeof__(*(ptr)))__flags; \ + break; \ + case 4: \ + /* \ + * spin lock/unlock provide the needed smp_mb() \ + * before/after \ + */ \ + atomic_ops_lock(__flags); \ + _prev_ = *_p_; \ + if (_prev_ == _o_) \ + *_p_ = _n_; \ + atomic_ops_unlock(__flags); \ + break; \ + default: \ + BUILD_BUG(); \ + } \ \ - /* \ - * spin lock/unlock provide the needed smp_mb() before/after \ - */ \ - atomic_ops_lock(__flags); \ - _prev_ = *_p_; \ - if (_prev_ == _o_) \ - *_p_ = _n_; \ - atomic_ops_unlock(__flags); \ _prev_; \ })
Use the new cmpxchg_emu_u8() and cmpxchg_emu_u16() to emulate one-byte and two-byte cmpxchg() on arc. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Andi Shyti <andi.shyti@linux.intel.com> Cc: Andrzej Hajda <andrzej.hajda@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: <linux-snps-arc@lists.infradead.org> --- arch/arc/Kconfig | 1 + arch/arc/include/asm/cmpxchg.h | 38 ++++++++++++++++++++++++++-------- 2 files changed, 30 insertions(+), 9 deletions(-)