Message ID | 20200810135245.230600-1-anup.patel@wdc.com |
---|---|
State | Superseded |
Headers | show |
Series | firmware: fw_base: Improve exception stack setup in trap handler | expand |
On 10 Aug 2020, at 14:52, Anup Patel <Anup.Patel@wdc.com> wrote: > > Currently, the low-level trap handler (i.e. _trap_handler()) uses > branch instructions to conditionally setup exception stack based > on which mode trap occured. > > This patch implements exception stack setup using xor instructions > which is 2 instructions less compared to previous approach and faster > due to lack of branch instructions. > > The new exception stack setup approach can be best described by the > following pseudocode: > > Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1 This relies on PRIV 0b10 currently being unallocated, which doesn't make me feel warm and fuzzy inside. > Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP)) This relies on MUL with 0/1 being fast, but that may well be a different pipeline in your processor (in an OoO core) with fewer available, or a higher-latency path to an M-box in a simpler in-order pipeline. The MUL is also not compressible. You probably do want a branch, but in a way that macro-op fusion will hopefully turn it into a conditional move: mv Exception_Stack, tp beqz Came_From_M_Mode, 1f mv Exception_Stack, sp 1: Same number of instructions, but all compressible, no multiplications required and the lone branch can be macro-op fused. Jess > Came_From_M_Mode = 0 ==> Exception_Stack = TP > Came_From_M_Mode = 1 ==> Exception_Stack = SP > > Signed-off-by: Anup Patel <anup.patel@wdc.com> > --- > firmware/fw_base.S | 49 ++++++++++++++++++++-------------------------- > 1 file changed, 21 insertions(+), 28 deletions(-) > > diff --git a/firmware/fw_base.S b/firmware/fw_base.S > index b66ac41..53a9a48 100644 > --- a/firmware/fw_base.S > +++ b/firmware/fw_base.S > @@ -490,35 +490,28 @@ _trap_handler: > /* Save T0 in scratch space */ > REG_S t0, SBI_SCRATCH_TMP0_OFFSET(tp) > > - /* Check which mode we came from */ > + /* > + * Set T0 to appropriate exception stack > + * > + * Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1 > + * Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP)) > + * > + * Came_From_M_Mode = 0 ==> Exception_Stack = TP > + * Came_From_M_Mode = 1 ==> Exception_Stack = SP > + */ > csrr t0, CSR_MSTATUS > - srl t0, t0, MSTATUS_MPP_SHIFT > - and t0, t0, PRV_M > - xori t0, t0, PRV_M > - beq t0, zero, _trap_handler_m_mode > - > - /* We came from S-mode or U-mode */ > -_trap_handler_s_mode: > - /* Set T0 to original SP */ > - add t0, sp, zero > - > - /* Setup exception stack */ > - add sp, tp, -(SBI_TRAP_REGS_SIZE) > - > - /* Jump to code common for all modes */ > - j _trap_handler_all_mode > - > - /* We came from M-mode */ > -_trap_handler_m_mode: > - /* Set T0 to original SP */ > - add t0, sp, zero > - > - /* Re-use current SP as exception stack */ > - add sp, sp, -(SBI_TRAP_REGS_SIZE) > - > -_trap_handler_all_mode: > - /* Save original SP (from T0) on stack */ > - REG_S t0, SBI_TRAP_REGS_OFFSET(sp)(sp) > + srl t0, t0, (MSTATUS_MPP_SHIFT + 1) > + and t0, t0, 0x1 > + xor sp, sp, tp > + mul t0, t0, sp > + xor sp, sp, tp > + xor t0, tp, t0 > + > + /* Save original SP on exception stack */ > + REG_S sp, (SBI_TRAP_REGS_OFFSET(sp) - SBI_TRAP_REGS_SIZE)(t0) > + > + /* Set SP to exception stack and make room for trap registers */ > + add sp, t0, -(SBI_TRAP_REGS_SIZE) > > /* Restore T0 from scratch space */ > REG_L t0, SBI_SCRATCH_TMP0_OFFSET(tp) > -- > 2.25.1 > > > -- > opensbi mailing list > opensbi@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/opensbi
On Mon, Aug 10, 2020 at 7:35 PM Jessica Clarke <jrtc27@jrtc27.com> wrote: > > On 10 Aug 2020, at 14:52, Anup Patel <Anup.Patel@wdc.com> wrote: > > > > Currently, the low-level trap handler (i.e. _trap_handler()) uses > > branch instructions to conditionally setup exception stack based > > on which mode trap occured. > > > > This patch implements exception stack setup using xor instructions > > which is 2 instructions less compared to previous approach and faster > > due to lack of branch instructions. > > > > The new exception stack setup approach can be best described by the > > following pseudocode: > > > > Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1 > > This relies on PRIV 0b10 currently being unallocated, which doesn't > make me feel warm and fuzzy inside. The PRIV 0b10 mode is not used by any of the upcoming extensions. In future, we will require really strong justification in future to use the unused 0b10 mode. Even H-extension has cleanly avoided using PRIV 0b10 mode. In any case, we should document this assumption. > > > Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP)) > > This relies on MUL with 0/1 being fast, but that may well be a different The MUL can be replaced with AND instruction but it was taking one additional instruction. See below. /* * Set T0 to appropriate exception stack * * Came_From_M_Mode = ((MSTATUS.MPP & 0x2) >> 1) - 1 * Exception_Stack = SP ^ (Came_From_M_Mode & (TP ^ SP)) * * Came_From_M_Mode = -1UL ==> Exception_Stack = TP * Came_From_M_Mode = 0UL ==> Exception_Stack = SP */ csrr t0, CSR_MSTATUS srl t0, t0, (MSTATUS_MPP_SHIFT + 1) and t0, t0, 0x1 add t0, t0, -1 xor tp, tp, sp and t0, t0, tp xor tp, tp, sp xor t0, sp, t0 > pipeline in your processor (in an OoO core) with fewer available, or a > higher-latency path to an M-box in a simpler in-order pipeline. The MUL > is also not compressible. You probably do want a branch, but in a way > that macro-op fusion will hopefully turn it into a conditional move: > > mv Exception_Stack, tp > beqz Came_From_M_Mode, 1f > mv Exception_Stack, sp > 1: Actually, I had started off with branches but code without branches (this patch) is performing better on QEMU (because fewer TCG blocks required for translating _trap_handler()). This code (this patch) will be much faster on real HW because branch misses can cause bubbles in the instruction pipeline. The code with branches is as follows: csrr t0, CSR_MSTATUS srl t0, t0, (MSTATUS_MPP_SHIFT + 1) and t0, t0, 0x1 beq t0, zero, _trap_stack_s_mode _trap_stack_m_mode: add t0, sp, zero j _trap_stack_done _trap_stack_s_mode: add t0, tp, zero _trap_stack_done: > > Same number of instructions, but all compressible, no multiplications > required and the lone branch can be macro-op fused. Overall, the key thing is having no (or fewer) branches in _trap_handler() with the same number of (or fewer) instructions. Regards, Anup > > Jess > > > Came_From_M_Mode = 0 ==> Exception_Stack = TP > > Came_From_M_Mode = 1 ==> Exception_Stack = SP > > > > Signed-off-by: Anup Patel <anup.patel@wdc.com> > > --- > > firmware/fw_base.S | 49 ++++++++++++++++++++-------------------------- > > 1 file changed, 21 insertions(+), 28 deletions(-) > > > > diff --git a/firmware/fw_base.S b/firmware/fw_base.S > > index b66ac41..53a9a48 100644 > > --- a/firmware/fw_base.S > > +++ b/firmware/fw_base.S > > @@ -490,35 +490,28 @@ _trap_handler: > > /* Save T0 in scratch space */ > > REG_S t0, SBI_SCRATCH_TMP0_OFFSET(tp) > > > > - /* Check which mode we came from */ > > + /* > > + * Set T0 to appropriate exception stack > > + * > > + * Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1 > > + * Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP)) > > + * > > + * Came_From_M_Mode = 0 ==> Exception_Stack = TP > > + * Came_From_M_Mode = 1 ==> Exception_Stack = SP > > + */ > > csrr t0, CSR_MSTATUS > > - srl t0, t0, MSTATUS_MPP_SHIFT > > - and t0, t0, PRV_M > > - xori t0, t0, PRV_M > > - beq t0, zero, _trap_handler_m_mode > > - > > - /* We came from S-mode or U-mode */ > > -_trap_handler_s_mode: > > - /* Set T0 to original SP */ > > - add t0, sp, zero > > - > > - /* Setup exception stack */ > > - add sp, tp, -(SBI_TRAP_REGS_SIZE) > > - > > - /* Jump to code common for all modes */ > > - j _trap_handler_all_mode > > - > > - /* We came from M-mode */ > > -_trap_handler_m_mode: > > - /* Set T0 to original SP */ > > - add t0, sp, zero > > - > > - /* Re-use current SP as exception stack */ > > - add sp, sp, -(SBI_TRAP_REGS_SIZE) > > - > > -_trap_handler_all_mode: > > - /* Save original SP (from T0) on stack */ > > - REG_S t0, SBI_TRAP_REGS_OFFSET(sp)(sp) > > + srl t0, t0, (MSTATUS_MPP_SHIFT + 1) > > + and t0, t0, 0x1 > > + xor sp, sp, tp > > + mul t0, t0, sp > > + xor sp, sp, tp > > + xor t0, tp, t0 > > + > > + /* Save original SP on exception stack */ > > + REG_S sp, (SBI_TRAP_REGS_OFFSET(sp) - SBI_TRAP_REGS_SIZE)(t0) > > + > > + /* Set SP to exception stack and make room for trap registers */ > > + add sp, t0, -(SBI_TRAP_REGS_SIZE) > > > > /* Restore T0 from scratch space */ > > REG_L t0, SBI_SCRATCH_TMP0_OFFSET(tp) > > -- > > 2.25.1 > > > > > > -- > > opensbi mailing list > > opensbi@lists.infradead.org > > http://lists.infradead.org/mailman/listinfo/opensbi
On Tue, Aug 11, 2020 at 8:40 AM Anup Patel <anup@brainfault.org> wrote: > > On Mon, Aug 10, 2020 at 7:35 PM Jessica Clarke <jrtc27@jrtc27.com> wrote: > > > > On 10 Aug 2020, at 14:52, Anup Patel <Anup.Patel@wdc.com> wrote: > > > > > > Currently, the low-level trap handler (i.e. _trap_handler()) uses > > > branch instructions to conditionally setup exception stack based > > > on which mode trap occured. > > > > > > This patch implements exception stack setup using xor instructions > > > which is 2 instructions less compared to previous approach and faster > > > due to lack of branch instructions. > > > > > > The new exception stack setup approach can be best described by the > > > following pseudocode: > > > > > > Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1 > > > > This relies on PRIV 0b10 currently being unallocated, which doesn't > > make me feel warm and fuzzy inside. > > The PRIV 0b10 mode is not used by any of the upcoming extensions. In future, > we will require really strong justification in future to use the > unused 0b10 mode. > Even H-extension has cleanly avoided using PRIV 0b10 mode. > > In any case, we should document this assumption. > > > > > > Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP)) > > > > This relies on MUL with 0/1 being fast, but that may well be a different > > The MUL can be replaced with AND instruction but it was taking one > additional instruction. See below. > > /* > * Set T0 to appropriate exception stack > * > * Came_From_M_Mode = ((MSTATUS.MPP & 0x2) >> 1) - 1 > * Exception_Stack = SP ^ (Came_From_M_Mode & (TP ^ SP)) > * > * Came_From_M_Mode = -1UL ==> Exception_Stack = TP > * Came_From_M_Mode = 0UL ==> Exception_Stack = SP > */ > csrr t0, CSR_MSTATUS > srl t0, t0, (MSTATUS_MPP_SHIFT + 1) > and t0, t0, 0x1 > add t0, t0, -1 > xor tp, tp, sp > and t0, t0, tp > xor tp, tp, sp > xor t0, sp, t0 > > > pipeline in your processor (in an OoO core) with fewer available, or a > > higher-latency path to an M-box in a simpler in-order pipeline. The MUL > > is also not compressible. You probably do want a branch, but in a way > > that macro-op fusion will hopefully turn it into a conditional move: > > > > mv Exception_Stack, tp > > beqz Came_From_M_Mode, 1f > > mv Exception_Stack, sp > > 1: > > Actually, I had started off with branches but code without branches (this > patch) is performing better on QEMU (because fewer TCG blocks required > for translating _trap_handler()). This code (this patch) will be much faster > on real HW because branch misses can cause bubbles in the instruction > pipeline. > > The code with branches is as follows: > csrr t0, CSR_MSTATUS > srl t0, t0, (MSTATUS_MPP_SHIFT + 1) > and t0, t0, 0x1 > beq t0, zero, _trap_stack_s_mode > _trap_stack_m_mode: > add t0, sp, zero > j _trap_stack_done > _trap_stack_s_mode: > add t0, tp, zero > _trap_stack_done: > > > > > Same number of instructions, but all compressible, no multiplications > > required and the lone branch can be macro-op fused. > > Overall, the key thing is having no (or fewer) branches in _trap_handler() > with the same number of (or fewer) instructions. Here's another approach which has no assumption about PRIV 0b10 mode and it uses AND operation in-place of MUL. /* * Set T0 to appropriate exception stack * * Came_From_M_Mode = ((MSTATUS.MPP < 0x3) ? 1 : 0) - 1; * Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP)) * * Came_From_M_Mode = -1UL ==> Exception_Stack = SP * Came_From_M_Mode = 0UL ==> Exception_Stack = TP */ csrr t0, CSR_MSTATUS srl t0, t0, MSTATUS_MPP_SHIFT and t0, t0, PRV_M slti t0, t0, PRV_M add t0, t0, -1 xor sp, sp, tp and t0, t0, sp xor sp, sp, tp xor t0, tp, t0 Overall, the number of instructions are the same but no branches. Even this approach performs better on QEMU compared to code with branches. Here's the code with branches.. csrr t0, CSR_MSTATUS srl t0, t0, MSTATUS_MPP_SHIFT and t0, t0, PRV_M slti t0, t0, PRV_M beq t0, zero, _trap_stack_m_mode _trap_stack_s_mode: add t0, tp, zero j _trap_stack_done _trap_stack_m_mode: add t0, sp, zero _trap_stack_done: Regards, Anup > > Regards, > Anup > > > > > Jess > > > > > Came_From_M_Mode = 0 ==> Exception_Stack = TP > > > Came_From_M_Mode = 1 ==> Exception_Stack = SP > > > > > > Signed-off-by: Anup Patel <anup.patel@wdc.com> > > > --- > > > firmware/fw_base.S | 49 ++++++++++++++++++++-------------------------- > > > 1 file changed, 21 insertions(+), 28 deletions(-) > > > > > > diff --git a/firmware/fw_base.S b/firmware/fw_base.S > > > index b66ac41..53a9a48 100644 > > > --- a/firmware/fw_base.S > > > +++ b/firmware/fw_base.S > > > @@ -490,35 +490,28 @@ _trap_handler: > > > /* Save T0 in scratch space */ > > > REG_S t0, SBI_SCRATCH_TMP0_OFFSET(tp) > > > > > > - /* Check which mode we came from */ > > > + /* > > > + * Set T0 to appropriate exception stack > > > + * > > > + * Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1 > > > + * Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP)) > > > + * > > > + * Came_From_M_Mode = 0 ==> Exception_Stack = TP > > > + * Came_From_M_Mode = 1 ==> Exception_Stack = SP > > > + */ > > > csrr t0, CSR_MSTATUS > > > - srl t0, t0, MSTATUS_MPP_SHIFT > > > - and t0, t0, PRV_M > > > - xori t0, t0, PRV_M > > > - beq t0, zero, _trap_handler_m_mode > > > - > > > - /* We came from S-mode or U-mode */ > > > -_trap_handler_s_mode: > > > - /* Set T0 to original SP */ > > > - add t0, sp, zero > > > - > > > - /* Setup exception stack */ > > > - add sp, tp, -(SBI_TRAP_REGS_SIZE) > > > - > > > - /* Jump to code common for all modes */ > > > - j _trap_handler_all_mode > > > - > > > - /* We came from M-mode */ > > > -_trap_handler_m_mode: > > > - /* Set T0 to original SP */ > > > - add t0, sp, zero > > > - > > > - /* Re-use current SP as exception stack */ > > > - add sp, sp, -(SBI_TRAP_REGS_SIZE) > > > - > > > -_trap_handler_all_mode: > > > - /* Save original SP (from T0) on stack */ > > > - REG_S t0, SBI_TRAP_REGS_OFFSET(sp)(sp) > > > + srl t0, t0, (MSTATUS_MPP_SHIFT + 1) > > > + and t0, t0, 0x1 > > > + xor sp, sp, tp > > > + mul t0, t0, sp > > > + xor sp, sp, tp > > > + xor t0, tp, t0 > > > + > > > + /* Save original SP on exception stack */ > > > + REG_S sp, (SBI_TRAP_REGS_OFFSET(sp) - SBI_TRAP_REGS_SIZE)(t0) > > > + > > > + /* Set SP to exception stack and make room for trap registers */ > > > + add sp, t0, -(SBI_TRAP_REGS_SIZE) > > > > > > /* Restore T0 from scratch space */ > > > REG_L t0, SBI_SCRATCH_TMP0_OFFSET(tp) > > > -- > > > 2.25.1 > > > > > > > > > -- > > > opensbi mailing list > > > opensbi@lists.infradead.org > > > http://lists.infradead.org/mailman/listinfo/opensbi
On 11 Aug 2020, at 04:10, Anup Patel <anup@brainfault.org> wrote: > > On Mon, Aug 10, 2020 at 7:35 PM Jessica Clarke <jrtc27@jrtc27.com> wrote: >> >> On 10 Aug 2020, at 14:52, Anup Patel <Anup.Patel@wdc.com> wrote: >>> >>> Currently, the low-level trap handler (i.e. _trap_handler()) uses >>> branch instructions to conditionally setup exception stack based >>> on which mode trap occured. >>> >>> This patch implements exception stack setup using xor instructions >>> which is 2 instructions less compared to previous approach and faster >>> due to lack of branch instructions. >>> >>> The new exception stack setup approach can be best described by the >>> following pseudocode: >>> >>> Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1 >> >> This relies on PRIV 0b10 currently being unallocated, which doesn't >> make me feel warm and fuzzy inside. > > The PRIV 0b10 mode is not used by any of the upcoming extensions. In future, > we will require really strong justification in future to use the > unused 0b10 mode. > Even H-extension has cleanly avoided using PRIV 0b10 mode. > > In any case, we should document this assumption. > >> >>> Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP)) >> >> This relies on MUL with 0/1 being fast, but that may well be a different > > The MUL can be replaced with AND instruction but it was taking one > additional instruction. See below. > > /* > * Set T0 to appropriate exception stack > * > * Came_From_M_Mode = ((MSTATUS.MPP & 0x2) >> 1) - 1 > * Exception_Stack = SP ^ (Came_From_M_Mode & (TP ^ SP)) > * > * Came_From_M_Mode = -1UL ==> Exception_Stack = TP > * Came_From_M_Mode = 0UL ==> Exception_Stack = SP > */ > csrr t0, CSR_MSTATUS > srl t0, t0, (MSTATUS_MPP_SHIFT + 1) > and t0, t0, 0x1 > add t0, t0, -1 > xor tp, tp, sp > and t0, t0, tp > xor tp, tp, sp > xor t0, sp, t0 > >> pipeline in your processor (in an OoO core) with fewer available, or a >> higher-latency path to an M-box in a simpler in-order pipeline. The MUL >> is also not compressible. You probably do want a branch, but in a way >> that macro-op fusion will hopefully turn it into a conditional move: >> >> mv Exception_Stack, tp >> beqz Came_From_M_Mode, 1f >> mv Exception_Stack, sp >> 1: > > Actually, I had started off with branches but code without branches (this > patch) is performing better on QEMU (because fewer TCG blocks required > for translating _trap_handler()). This code (this patch) will be much faster > on real HW because branch misses can cause bubbles in the instruction > pipeline. I don't think you properly read what I wrote. I specifically made it so there was a single branch to skip a single instruction and therefore a macro-op fusion candidate. That means the branch+move is turned into a conditional move, with no need to branch predict (and potentially suffer from a miss). I don't know whether QEMU recognises the pattern, but if not that's just a deficiency in QEMU itself, and fixing that would suddenly make my proposed solution faster than the one involving multiplication. You have also neglected to address the fact that my proposed code is more compressible than yours. Optimising for QEMU is also not the right approach in my opinion; we should optimise for hardware first and foremost if we ever want RISC-V to actually succeed in the real world. Real hardware does macro-op fusion in many cases, and even in the cases it doesn't, multiplication is not cheap. In a classic 5-stage pipeline, multiplication is done a stage after execute, and may well be multi-cycle even for multiplying by 0/1, which would cause the pipeline to *always* stall, rather than *sometimes* in the case of a branch mispredict if macro-op fusion is not implemented. Given it's a direct branch to a nearby instruction, I'd expect both the misprediction and the corrected prediction to be L1 I-cache hits, so an in-order core should see very little latency. A good out-of-order core like SonicBOOM will correctly decode it to a conditional move (in fact I believe it decodes short conditional branches as being a series of predicated instructions). Also, if MUL with 0/1 were faster, do you not think compilers would already be emitting that for RISC-V? Because both GCC and LLVM will use (compressed) branch + (compressed) move to do a conditional move. > The code with branches is as follows: > csrr t0, CSR_MSTATUS > srl t0, t0, (MSTATUS_MPP_SHIFT + 1) > and t0, t0, 0x1 > beq t0, zero, _trap_stack_s_mode > _trap_stack_m_mode: > add t0, sp, zero > j _trap_stack_done > _trap_stack_s_mode: > add t0, tp, zero > _trap_stack_done: Again, that's not what I wrote. Mine was shorter with only a single branch. Jess >> >> Same number of instructions, but all compressible, no multiplications >> required and the lone branch can be macro-op fused. > > Overall, the key thing is having no (or fewer) branches in _trap_handler() > with the same number of (or fewer) instructions. > > Regards, > Anup > >> >> Jess >> >>> Came_From_M_Mode = 0 ==> Exception_Stack = TP >>> Came_From_M_Mode = 1 ==> Exception_Stack = SP >>> >>> Signed-off-by: Anup Patel <anup.patel@wdc.com> >>> --- >>> firmware/fw_base.S | 49 ++++++++++++++++++++-------------------------- >>> 1 file changed, 21 insertions(+), 28 deletions(-) >>> >>> diff --git a/firmware/fw_base.S b/firmware/fw_base.S >>> index b66ac41..53a9a48 100644 >>> --- a/firmware/fw_base.S >>> +++ b/firmware/fw_base.S >>> @@ -490,35 +490,28 @@ _trap_handler: >>> /* Save T0 in scratch space */ >>> REG_S t0, SBI_SCRATCH_TMP0_OFFSET(tp) >>> >>> - /* Check which mode we came from */ >>> + /* >>> + * Set T0 to appropriate exception stack >>> + * >>> + * Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1 >>> + * Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP)) >>> + * >>> + * Came_From_M_Mode = 0 ==> Exception_Stack = TP >>> + * Came_From_M_Mode = 1 ==> Exception_Stack = SP >>> + */ >>> csrr t0, CSR_MSTATUS >>> - srl t0, t0, MSTATUS_MPP_SHIFT >>> - and t0, t0, PRV_M >>> - xori t0, t0, PRV_M >>> - beq t0, zero, _trap_handler_m_mode >>> - >>> - /* We came from S-mode or U-mode */ >>> -_trap_handler_s_mode: >>> - /* Set T0 to original SP */ >>> - add t0, sp, zero >>> - >>> - /* Setup exception stack */ >>> - add sp, tp, -(SBI_TRAP_REGS_SIZE) >>> - >>> - /* Jump to code common for all modes */ >>> - j _trap_handler_all_mode >>> - >>> - /* We came from M-mode */ >>> -_trap_handler_m_mode: >>> - /* Set T0 to original SP */ >>> - add t0, sp, zero >>> - >>> - /* Re-use current SP as exception stack */ >>> - add sp, sp, -(SBI_TRAP_REGS_SIZE) >>> - >>> -_trap_handler_all_mode: >>> - /* Save original SP (from T0) on stack */ >>> - REG_S t0, SBI_TRAP_REGS_OFFSET(sp)(sp) >>> + srl t0, t0, (MSTATUS_MPP_SHIFT + 1) >>> + and t0, t0, 0x1 >>> + xor sp, sp, tp >>> + mul t0, t0, sp >>> + xor sp, sp, tp >>> + xor t0, tp, t0 >>> + >>> + /* Save original SP on exception stack */ >>> + REG_S sp, (SBI_TRAP_REGS_OFFSET(sp) - SBI_TRAP_REGS_SIZE)(t0) >>> + >>> + /* Set SP to exception stack and make room for trap registers */ >>> + add sp, t0, -(SBI_TRAP_REGS_SIZE) >>> >>> /* Restore T0 from scratch space */ >>> REG_L t0, SBI_SCRATCH_TMP0_OFFSET(tp) >>> -- >>> 2.25.1 >>> >>> >>> -- >>> opensbi mailing list >>> opensbi@lists.infradead.org >>> http://lists.infradead.org/mailman/listinfo/opensbi
On Tue, Aug 11, 2020 at 6:05 PM Jessica Clarke <jrtc27@jrtc27.com> wrote: > > On 11 Aug 2020, at 04:10, Anup Patel <anup@brainfault.org> wrote: > > > > On Mon, Aug 10, 2020 at 7:35 PM Jessica Clarke <jrtc27@jrtc27.com> wrote: > >> > >> On 10 Aug 2020, at 14:52, Anup Patel <Anup.Patel@wdc.com> wrote: > >>> > >>> Currently, the low-level trap handler (i.e. _trap_handler()) uses > >>> branch instructions to conditionally setup exception stack based > >>> on which mode trap occured. > >>> > >>> This patch implements exception stack setup using xor instructions > >>> which is 2 instructions less compared to previous approach and faster > >>> due to lack of branch instructions. > >>> > >>> The new exception stack setup approach can be best described by the > >>> following pseudocode: > >>> > >>> Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1 > >> > >> This relies on PRIV 0b10 currently being unallocated, which doesn't > >> make me feel warm and fuzzy inside. > > > > The PRIV 0b10 mode is not used by any of the upcoming extensions. In future, > > we will require really strong justification in future to use the > > unused 0b10 mode. > > Even H-extension has cleanly avoided using PRIV 0b10 mode. > > > > In any case, we should document this assumption. > > > >> > >>> Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP)) > >> > >> This relies on MUL with 0/1 being fast, but that may well be a different > > > > The MUL can be replaced with AND instruction but it was taking one > > additional instruction. See below. > > > > /* > > * Set T0 to appropriate exception stack > > * > > * Came_From_M_Mode = ((MSTATUS.MPP & 0x2) >> 1) - 1 > > * Exception_Stack = SP ^ (Came_From_M_Mode & (TP ^ SP)) > > * > > * Came_From_M_Mode = -1UL ==> Exception_Stack = TP > > * Came_From_M_Mode = 0UL ==> Exception_Stack = SP > > */ > > csrr t0, CSR_MSTATUS > > srl t0, t0, (MSTATUS_MPP_SHIFT + 1) > > and t0, t0, 0x1 > > add t0, t0, -1 > > xor tp, tp, sp > > and t0, t0, tp > > xor tp, tp, sp > > xor t0, sp, t0 > > > >> pipeline in your processor (in an OoO core) with fewer available, or a > >> higher-latency path to an M-box in a simpler in-order pipeline. The MUL > >> is also not compressible. You probably do want a branch, but in a way > >> that macro-op fusion will hopefully turn it into a conditional move: > >> > >> mv Exception_Stack, tp > >> beqz Came_From_M_Mode, 1f > >> mv Exception_Stack, sp > >> 1: > > > > Actually, I had started off with branches but code without branches (this > > patch) is performing better on QEMU (because fewer TCG blocks required > > for translating _trap_handler()). This code (this patch) will be much faster > > on real HW because branch misses can cause bubbles in the instruction > > pipeline. > > I don't think you properly read what I wrote. I specifically made it so > there was a single branch to skip a single instruction and therefore a > macro-op fusion candidate. That means the branch+move is turned into a > conditional move, with no need to branch predict (and potentially > suffer from a miss). I don't know whether QEMU recognises the pattern, > but if not that's just a deficiency in QEMU itself, and fixing that > would suddenly make my proposed solution faster than the one involving > multiplication. You have also neglected to address the fact that my > proposed code is more compressible than yours. I don't think you understand the problem here. We have to choose a an appropriate exception stack without modifying current SP because current SP will have to be first saved on exception stack. That's why we: 1. Compute and set exception stack in T0 2. Save current SP on exception stack pointed by T0 3. Change current SP to T0 This patch is only trying to optimize point1. Also, the place where the exception stack is chosen, we can only clobber T0 for computation to preserve state of other general purpose registers. Your proposal if using branch+move won't work because you need two general purpose registers one for Exception_Stack and second for Came_From_M_Mode. > > Optimising for QEMU is also not the right approach in my opinion; we > should optimise for hardware first and foremost if we ever want RISC-V > to actually succeed in the real world. Real hardware does macro-op > fusion in many cases, and even in the cases it doesn't, multiplication > is not cheap. In a classic 5-stage pipeline, multiplication is done a > stage after execute, and may well be multi-cycle even for multiplying > by 0/1, which would cause the pipeline to *always* stall, rather than > *sometimes* in the case of a branch mispredict if macro-op fusion is > not implemented. Given it's a direct branch to a nearby instruction, > I'd expect both the misprediction and the corrected prediction to be L1 > I-cache hits, so an in-order core should see very little latency. A > good out-of-order core like SonicBOOM will correctly decode it to a > conditional move (in fact I believe it decodes short conditional > branches as being a series of predicated instructions). Nope, we are not optimizing just for QEMU. Replacing branches with simple sequential 1-cyle ALU instructions (not MUL instruction) is faster on simple in-order CPUs as well. The biggest problem with branches is the wrong path taken by branch predictor which causes pipeline flush. > > Also, if MUL with 0/1 were faster, do you not think compilers would > already be emitting that for RISC-V? Because both GCC and LLVM will use > (compressed) branch + (compressed) move to do a conditional move I am not pushing for MUL instruction here. I am very well aware of the cost of MUL instructions. Please look at my alternate proposal where I use AND instruction in-place of MUL with slight modifications. Also, we can't compare the low-level trap_handler with general C context in which GCC/LLVM optimizes code where stack pointer is already set. The challenging part here is to select exception stack using only one temporary (T0) general purpose register. > > > The code with branches is as follows: > > csrr t0, CSR_MSTATUS > > srl t0, t0, (MSTATUS_MPP_SHIFT + 1) > > and t0, t0, 0x1 > > beq t0, zero, _trap_stack_s_mode > > _trap_stack_m_mode: > > add t0, sp, zero > > j _trap_stack_done > > _trap_stack_s_mode: > > add t0, tp, zero > > _trap_stack_done: > > Again, that's not what I wrote. Mine was shorter with only a single > branch. I think you missed the context in-which this optimization is being done. Please see above. Let me send V2 with some of your comments addressed. Regards, Anup > > Jess > > >> > >> Same number of instructions, but all compressible, no multiplications > >> required and the lone branch can be macro-op fused. > > > > Overall, the key thing is having no (or fewer) branches in _trap_handler() > > with the same number of (or fewer) instructions. > > > > Regards, > > Anup > > > >> > >> Jess > >> > >>> Came_From_M_Mode = 0 ==> Exception_Stack = TP > >>> Came_From_M_Mode = 1 ==> Exception_Stack = SP > >>> > >>> Signed-off-by: Anup Patel <anup.patel@wdc.com> > >>> --- > >>> firmware/fw_base.S | 49 ++++++++++++++++++++-------------------------- > >>> 1 file changed, 21 insertions(+), 28 deletions(-) > >>> > >>> diff --git a/firmware/fw_base.S b/firmware/fw_base.S > >>> index b66ac41..53a9a48 100644 > >>> --- a/firmware/fw_base.S > >>> +++ b/firmware/fw_base.S > >>> @@ -490,35 +490,28 @@ _trap_handler: > >>> /* Save T0 in scratch space */ > >>> REG_S t0, SBI_SCRATCH_TMP0_OFFSET(tp) > >>> > >>> - /* Check which mode we came from */ > >>> + /* > >>> + * Set T0 to appropriate exception stack > >>> + * > >>> + * Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1 > >>> + * Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP)) > >>> + * > >>> + * Came_From_M_Mode = 0 ==> Exception_Stack = TP > >>> + * Came_From_M_Mode = 1 ==> Exception_Stack = SP > >>> + */ > >>> csrr t0, CSR_MSTATUS > >>> - srl t0, t0, MSTATUS_MPP_SHIFT > >>> - and t0, t0, PRV_M > >>> - xori t0, t0, PRV_M > >>> - beq t0, zero, _trap_handler_m_mode > >>> - > >>> - /* We came from S-mode or U-mode */ > >>> -_trap_handler_s_mode: > >>> - /* Set T0 to original SP */ > >>> - add t0, sp, zero > >>> - > >>> - /* Setup exception stack */ > >>> - add sp, tp, -(SBI_TRAP_REGS_SIZE) > >>> - > >>> - /* Jump to code common for all modes */ > >>> - j _trap_handler_all_mode > >>> - > >>> - /* We came from M-mode */ > >>> -_trap_handler_m_mode: > >>> - /* Set T0 to original SP */ > >>> - add t0, sp, zero > >>> - > >>> - /* Re-use current SP as exception stack */ > >>> - add sp, sp, -(SBI_TRAP_REGS_SIZE) > >>> - > >>> -_trap_handler_all_mode: > >>> - /* Save original SP (from T0) on stack */ > >>> - REG_S t0, SBI_TRAP_REGS_OFFSET(sp)(sp) > >>> + srl t0, t0, (MSTATUS_MPP_SHIFT + 1) > >>> + and t0, t0, 0x1 > >>> + xor sp, sp, tp > >>> + mul t0, t0, sp > >>> + xor sp, sp, tp > >>> + xor t0, tp, t0 > >>> + > >>> + /* Save original SP on exception stack */ > >>> + REG_S sp, (SBI_TRAP_REGS_OFFSET(sp) - SBI_TRAP_REGS_SIZE)(t0) > >>> + > >>> + /* Set SP to exception stack and make room for trap registers */ > >>> + add sp, t0, -(SBI_TRAP_REGS_SIZE) > >>> > >>> /* Restore T0 from scratch space */ > >>> REG_L t0, SBI_SCRATCH_TMP0_OFFSET(tp) > >>> -- > >>> 2.25.1 > >>> > >>> > >>> -- > >>> opensbi mailing list > >>> opensbi@lists.infradead.org > >>> http://lists.infradead.org/mailman/listinfo/opensbi >
diff --git a/firmware/fw_base.S b/firmware/fw_base.S index b66ac41..53a9a48 100644 --- a/firmware/fw_base.S +++ b/firmware/fw_base.S @@ -490,35 +490,28 @@ _trap_handler: /* Save T0 in scratch space */ REG_S t0, SBI_SCRATCH_TMP0_OFFSET(tp) - /* Check which mode we came from */ + /* + * Set T0 to appropriate exception stack + * + * Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1 + * Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP)) + * + * Came_From_M_Mode = 0 ==> Exception_Stack = TP + * Came_From_M_Mode = 1 ==> Exception_Stack = SP + */ csrr t0, CSR_MSTATUS - srl t0, t0, MSTATUS_MPP_SHIFT - and t0, t0, PRV_M - xori t0, t0, PRV_M - beq t0, zero, _trap_handler_m_mode - - /* We came from S-mode or U-mode */ -_trap_handler_s_mode: - /* Set T0 to original SP */ - add t0, sp, zero - - /* Setup exception stack */ - add sp, tp, -(SBI_TRAP_REGS_SIZE) - - /* Jump to code common for all modes */ - j _trap_handler_all_mode - - /* We came from M-mode */ -_trap_handler_m_mode: - /* Set T0 to original SP */ - add t0, sp, zero - - /* Re-use current SP as exception stack */ - add sp, sp, -(SBI_TRAP_REGS_SIZE) - -_trap_handler_all_mode: - /* Save original SP (from T0) on stack */ - REG_S t0, SBI_TRAP_REGS_OFFSET(sp)(sp) + srl t0, t0, (MSTATUS_MPP_SHIFT + 1) + and t0, t0, 0x1 + xor sp, sp, tp + mul t0, t0, sp + xor sp, sp, tp + xor t0, tp, t0 + + /* Save original SP on exception stack */ + REG_S sp, (SBI_TRAP_REGS_OFFSET(sp) - SBI_TRAP_REGS_SIZE)(t0) + + /* Set SP to exception stack and make room for trap registers */ + add sp, t0, -(SBI_TRAP_REGS_SIZE) /* Restore T0 from scratch space */ REG_L t0, SBI_SCRATCH_TMP0_OFFSET(tp)
Currently, the low-level trap handler (i.e. _trap_handler()) uses branch instructions to conditionally setup exception stack based on which mode trap occured. This patch implements exception stack setup using xor instructions which is 2 instructions less compared to previous approach and faster due to lack of branch instructions. The new exception stack setup approach can be best described by the following pseudocode: Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1 Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP)) Came_From_M_Mode = 0 ==> Exception_Stack = TP Came_From_M_Mode = 1 ==> Exception_Stack = SP Signed-off-by: Anup Patel <anup.patel@wdc.com> --- firmware/fw_base.S | 49 ++++++++++++++++++++-------------------------- 1 file changed, 21 insertions(+), 28 deletions(-)