Message ID | 20240813113436.831-4-zhiwei_liu@linux.alibaba.com |
---|---|
State | New |
Headers | show |
Series | tcg/riscv: Add support for vector | expand |
On 8/13/24 21:34, LIU Zhiwei wrote: > From: TANG Tiancheng<tangtiancheng.ttc@alibaba-inc.com> > > When allocating registers for input and output, ensure they match > the available registers to avoid allocating illeagal registers. > > We should respect RISC-V vector extension's variable-length registers > and LMUL-based register grouping. Coordinate with tcg_target_available_regs > initialization tcg_target_init (behind this commit) to ensure proper > handling of vector register constraints. > > Note: While mov_vec doesn't have constraints, dup_vec and other IRs do. > We need to strengthen constraints for all IRs except mov_vec, and this > is sufficient. > > Signed-off-by: TANG Tiancheng<tangtiancheng.ttc@alibaba-inc.com> > Fixes: 29f5e92502 (tcg: Introduce paired register allocation) > Reviewed-by: Liu Zhiwei<zhiwei_liu@linux.alibaba.com> > --- > tcg/tcg.c | 20 +++++++++++++------- > 1 file changed, 13 insertions(+), 7 deletions(-) > > diff --git a/tcg/tcg.c b/tcg/tcg.c > index 34e3056380..d26b42534d 100644 > --- a/tcg/tcg.c > +++ b/tcg/tcg.c > @@ -4722,8 +4722,10 @@ static void tcg_reg_alloc_dup(TCGContext *s, const TCGOp *op) > return; > } > > - dup_out_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[0].regs; > - dup_in_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[1].regs; > + dup_out_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[0].regs & > + tcg_target_available_regs[ots->type]; > + dup_in_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[1].regs & > + tcg_target_available_regs[its->type]; > Why would you ever have constraints that resolve to unavailable registers? If you don't want to fix this in the backend, then the next best place is in process_op_defs(), so that we take care of this once at startup, and never have to think about it again. r~
On 2024/8/13 19:52, Richard Henderson wrote: > On 8/13/24 21:34, LIU Zhiwei wrote: >> From: TANG Tiancheng<tangtiancheng.ttc@alibaba-inc.com> >> >> When allocating registers for input and output, ensure they match >> the available registers to avoid allocating illeagal registers. >> >> We should respect RISC-V vector extension's variable-length registers >> and LMUL-based register grouping. Coordinate with >> tcg_target_available_regs >> initialization tcg_target_init (behind this commit) to ensure proper >> handling of vector register constraints. >> >> Note: While mov_vec doesn't have constraints, dup_vec and other IRs do. >> We need to strengthen constraints for all IRs except mov_vec, and this >> is sufficient. >> >> Signed-off-by: TANG Tiancheng<tangtiancheng.ttc@alibaba-inc.com> >> Fixes: 29f5e92502 (tcg: Introduce paired register allocation) >> Reviewed-by: Liu Zhiwei<zhiwei_liu@linux.alibaba.com> >> --- >> tcg/tcg.c | 20 +++++++++++++------- >> 1 file changed, 13 insertions(+), 7 deletions(-) >> >> diff --git a/tcg/tcg.c b/tcg/tcg.c >> index 34e3056380..d26b42534d 100644 >> --- a/tcg/tcg.c >> +++ b/tcg/tcg.c >> @@ -4722,8 +4722,10 @@ static void tcg_reg_alloc_dup(TCGContext *s, >> const TCGOp *op) >> return; >> } >> - dup_out_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[0].regs; >> - dup_in_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[1].regs; >> + dup_out_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[0].regs & >> + tcg_target_available_regs[ots->type]; >> + dup_in_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[1].regs & >> + tcg_target_available_regs[its->type]; > > Why would you ever have constraints that resolve to unavailable > registers? > > If you don't want to fix this in the backend, then the next best place > is in process_op_defs(), so that we take care of this once at startup, > and never have to think about it again. Hi Richard, The constraints provided in process_op_defs() are static and tied to the IR operations. For example, if we create constraints for add_vec, the same constraints will apply to all types of add_vec operations (TCG_TYPE_V64, TCG_TYPE_V128, TCG_TYPE_V256). This means the constraints don't change based on the specific type of operation being performed. In contrast, RISC-V's LMUL (Length Multiplier) can change at runtime depending on the type of IR operation. Different LMUL values affect which vector registers are available for use in RISC-V. Let's consider an example where the host's vector register width is 128 bits: For an add_vec operation on v256 (256-bit vectors), only even-numbered vector registers like 0, 2, 4 can be used. However, for an add_vec operation on v128 (128-bit vectors), all vector registers (0, 1, 2, etc.) are available. Thus if we want to use all registers of vectors, we have to add a dynamic constraint on register allocation based on IR types. Thanks, Zhiwei > > > r~
On 8/14/24 10:58, LIU Zhiwei wrote: > Thus if we want to use all registers of vectors, we have to add a dynamic constraint on > register allocation based on IR types. My comment vs patch 4 is that you can't do that, at least not without large changes to TCG. In addition, I said that the register pressure on vector regs is not high enough to justify such changes. There is, so far, little benefit in having more than 4 or 5 vector registers, much less 32. Thus 7 (lmul 4, omitting v0) is sufficient. r~
On 2024/8/14 10:04, Richard Henderson wrote: > On 8/14/24 10:58, LIU Zhiwei wrote: >> Thus if we want to use all registers of vectors, we have to add a >> dynamic constraint on register allocation based on IR types. > > My comment vs patch 4 is that you can't do that, at least not without > large changes to TCG. > > In addition, I said that the register pressure on vector regs is not > high enough to justify such changes. There is, so far, little benefit > in having more than 4 or 5 vector registers, much less 32. Thus 7 > (lmul 4, omitting v0) is sufficient. At least on QEMU, SVE can support 2048 bit vector length with 'sve-default-vector-length=256'. Software optimized with SVE, such as X264 can benefit with long SVE length in less dynamic A64 instructions. We want to expose all host vector ability. Thus the largest TCG_TYPE_V256 is not enough, as 128-bit RVV can give 8*128=1024 width operation. We have expand TCG_TYPE_V512/1024/2048 types(not in this patch set, but intend to upstream later). With large TCG_TYPE_V1024/2048, we get better performance on RISC-V board with much less translated RISC-V vector instructions. We can give a more detailed experiment result if needed. However, we will only have 3 vector register when support TCG_TYPE_V1024. And even less for TCG_TYPE_V2048. Current approach will give more vectors TCG_TYPE_V128 even with support TCG_TYPE_V1024, which will relax some guest NEON register pressure. Thanks, Zhiwei > > > r~
On 8/14/24 12:27, LIU Zhiwei wrote: > > On 2024/8/14 10:04, Richard Henderson wrote: >> On 8/14/24 10:58, LIU Zhiwei wrote: >>> Thus if we want to use all registers of vectors, we have to add a dynamic constraint on >>> register allocation based on IR types. >> >> My comment vs patch 4 is that you can't do that, at least not without large changes to TCG. >> >> In addition, I said that the register pressure on vector regs is not high enough to >> justify such changes. There is, so far, little benefit in having more than 4 or 5 >> vector registers, much less 32. Thus 7 (lmul 4, omitting v0) is sufficient. > > At least on QEMU, SVE can support 2048 bit vector length with 'sve-default-vector- > length=256'. Software optimized with SVE, such as X264 can benefit with long SVE length > in less dynamic A64 instructions. > > We want to expose all host vector ability. Thus the largest TCG_TYPE_V256 is not enough, > as 128-bit RVV can give 8*128=1024 width operation. We have expand TCG_TYPE_V512/1024/2048 > types(not in this patch set, but intend to upstream later). > With large TCG_TYPE_V1024/2048, we get better performance on RISC-V board with much less > translated RISC-V vector instructions. We can give a more detailed experiment result if > needed. > > However, we will only have 3 vector register when support TCG_TYPE_V1024. And even less > for TCG_TYPE_V2048. Current approach will give more vectors TCG_TYPE_V128 even with > support TCG_TYPE_V1024, which will relax some guest NEON register pressure. Then you will have to teach TCG about one operand consuming and clobbering N hard registers, so that you get the spills and fills done correctly. But you haven't done that in this patch set, so will currently generate incorrect code. I think you should make longer vector operations a longer term project, and start with something simpler. r~
On 2024/8/14 11:08, Richard Henderson wrote: > On 8/14/24 12:27, LIU Zhiwei wrote: >> >> On 2024/8/14 10:04, Richard Henderson wrote: >>> On 8/14/24 10:58, LIU Zhiwei wrote: >>>> Thus if we want to use all registers of vectors, we have to add a >>>> dynamic constraint on register allocation based on IR types. >>> >>> My comment vs patch 4 is that you can't do that, at least not >>> without large changes to TCG. >>> >>> In addition, I said that the register pressure on vector regs is not >>> high enough to justify such changes. There is, so far, little >>> benefit in having more than 4 or 5 vector registers, much less 32. >>> Thus 7 (lmul 4, omitting v0) is sufficient. >> >> At least on QEMU, SVE can support 2048 bit vector length with >> 'sve-default-vector- length=256'. Software optimized with SVE, such >> as X264 can benefit with long SVE length in less dynamic A64 >> instructions. >> >> We want to expose all host vector ability. Thus the largest >> TCG_TYPE_V256 is not enough, as 128-bit RVV can give 8*128=1024 width >> operation. We have expand TCG_TYPE_V512/1024/2048 types(not in this >> patch set, but intend to upstream later). >> With large TCG_TYPE_V1024/2048, we get better performance on RISC-V >> board with much less translated RISC-V vector instructions. We can >> give a more detailed experiment result if needed. >> >> However, we will only have 3 vector register when support >> TCG_TYPE_V1024. And even less for TCG_TYPE_V2048. Current approach >> will give more vectors TCG_TYPE_V128 even with support >> TCG_TYPE_V1024, which will relax some guest NEON register pressure. > > Then you will have to teach TCG about one operand consuming and > clobbering N hard registers, so that you get the spills and fills done > correctly. I think we have done this in patch 6. > > But you haven't done that in this patch set, so will currently > generate incorrect code. > > I think you should make longer vector operations a longer term project, Does longer vector operations implementation deserves to upstream? We can contribute it sooner as it is ready. > and start with something simpler. Agree if you insist. 🙂 Zhiwei > > > r~
On 8/14/24 13:30, LIU Zhiwei wrote: > > On 2024/8/14 11:08, Richard Henderson wrote: >> On 8/14/24 12:27, LIU Zhiwei wrote: >>> >>> On 2024/8/14 10:04, Richard Henderson wrote: >>>> On 8/14/24 10:58, LIU Zhiwei wrote: >>>>> Thus if we want to use all registers of vectors, we have to add a dynamic constraint >>>>> on register allocation based on IR types. >>>> >>>> My comment vs patch 4 is that you can't do that, at least not without large changes to >>>> TCG. >>>> >>>> In addition, I said that the register pressure on vector regs is not high enough to >>>> justify such changes. There is, so far, little benefit in having more than 4 or 5 >>>> vector registers, much less 32. Thus 7 (lmul 4, omitting v0) is sufficient. >>> >>> At least on QEMU, SVE can support 2048 bit vector length with 'sve-default-vector- >>> length=256'. Software optimized with SVE, such as X264 can benefit with long SVE >>> length in less dynamic A64 instructions. >>> >>> We want to expose all host vector ability. Thus the largest TCG_TYPE_V256 is not >>> enough, as 128-bit RVV can give 8*128=1024 width operation. We have expand >>> TCG_TYPE_V512/1024/2048 types(not in this patch set, but intend to upstream later). >>> With large TCG_TYPE_V1024/2048, we get better performance on RISC-V board with much >>> less translated RISC-V vector instructions. We can give a more detailed experiment >>> result if needed. >>> >>> However, we will only have 3 vector register when support TCG_TYPE_V1024. And even >>> less for TCG_TYPE_V2048. Current approach will give more vectors TCG_TYPE_V128 even >>> with support TCG_TYPE_V1024, which will relax some guest NEON register pressure. >> >> Then you will have to teach TCG about one operand consuming and clobbering N hard >> registers, so that you get the spills and fills done correctly. > I think we have done this in patch 6. No, you have not. There are no modifications to tcg_reg_alloc, and there are no additional calls to tcg_reg_free, which is where spills are generated. There would also need to be changes on the fill side, temp_load. >> I think you should make longer vector operations a longer term project, > > Does longer vector operations implementation deserves to upstream? We can contribute it > sooner as it is ready. Sure. r~
On 2024/8/14 12:18, Richard Henderson wrote: > On 8/14/24 13:30, LIU Zhiwei wrote: >> >> On 2024/8/14 11:08, Richard Henderson wrote: >>> On 8/14/24 12:27, LIU Zhiwei wrote: >>>> >>>> On 2024/8/14 10:04, Richard Henderson wrote: >>>>> On 8/14/24 10:58, LIU Zhiwei wrote: >>>>>> Thus if we want to use all registers of vectors, we have to add a >>>>>> dynamic constraint on register allocation based on IR types. >>>>> >>>>> My comment vs patch 4 is that you can't do that, at least not >>>>> without large changes to TCG. >>>>> >>>>> In addition, I said that the register pressure on vector regs is >>>>> not high enough to justify such changes. There is, so far, little >>>>> benefit in having more than 4 or 5 vector registers, much less 32. >>>>> Thus 7 (lmul 4, omitting v0) is sufficient. >>>> >>>> At least on QEMU, SVE can support 2048 bit vector length with >>>> 'sve-default-vector- length=256'. Software optimized with SVE, >>>> such as X264 can benefit with long SVE length in less dynamic A64 >>>> instructions. >>>> >>>> We want to expose all host vector ability. Thus the largest >>>> TCG_TYPE_V256 is not enough, as 128-bit RVV can give 8*128=1024 >>>> width operation. We have expand TCG_TYPE_V512/1024/2048 types(not >>>> in this patch set, but intend to upstream later). >>>> With large TCG_TYPE_V1024/2048, we get better performance on RISC-V >>>> board with much less translated RISC-V vector instructions. We can >>>> give a more detailed experiment result if needed. >>>> >>>> However, we will only have 3 vector register when support >>>> TCG_TYPE_V1024. And even less for TCG_TYPE_V2048. Current >>>> approach will give more vectors TCG_TYPE_V128 even with support >>>> TCG_TYPE_V1024, which will relax some guest NEON register pressure. >>> >>> Then you will have to teach TCG about one operand consuming and >>> clobbering N hard registers, so that you get the spills and fills >>> done correctly. >> I think we have done this in patch 6. > > No, you have not. > > There are no modifications to tcg_reg_alloc, and there are no > additional calls to tcg_reg_free, which is where spills are generated. > There would also need to be changes on the fill side, temp_load. Thanks. I choose the simple design as you suggest for this patch set. And We will fix this problem when send the longer vector operations implementation. > > >>> I think you should make longer vector operations a longer term project, >> >> Does longer vector operations implementation deserves to upstream? We >> can contribute it sooner as it is ready. > > Sure. Good news! Thanks, Zhiwei > > > r~
diff --git a/tcg/tcg.c b/tcg/tcg.c index 34e3056380..d26b42534d 100644 --- a/tcg/tcg.c +++ b/tcg/tcg.c @@ -4722,8 +4722,10 @@ static void tcg_reg_alloc_dup(TCGContext *s, const TCGOp *op) return; } - dup_out_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[0].regs; - dup_in_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[1].regs; + dup_out_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[0].regs & + tcg_target_available_regs[ots->type]; + dup_in_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[1].regs & + tcg_target_available_regs[its->type]; /* Allocate the output register now. */ if (ots->val_type != TEMP_VAL_REG) { @@ -4876,7 +4878,7 @@ static void tcg_reg_alloc_op(TCGContext *s, const TCGOp *op) reg = ts->reg; i_preferred_regs = 0; - i_required_regs = arg_ct->regs; + i_required_regs = arg_ct->regs & tcg_target_available_regs[ts->type]; allocate_new_reg = false; copyto_new_reg = false; @@ -5078,6 +5080,7 @@ static void tcg_reg_alloc_op(TCGContext *s, const TCGOp *op) /* satisfy the output constraints */ for(k = 0; k < nb_oargs; k++) { + TCGRegSet o_required_regs; i = def->args_ct[k].sort_index; arg = op->args[i]; arg_ct = &def->args_ct[i]; @@ -5085,17 +5088,19 @@ static void tcg_reg_alloc_op(TCGContext *s, const TCGOp *op) /* ENV should not be modified. */ tcg_debug_assert(!temp_readonly(ts)); + o_required_regs = arg_ct->regs & + tcg_target_available_regs[ts->type]; switch (arg_ct->pair) { case 0: /* not paired */ if (arg_ct->oalias && !const_args[arg_ct->alias_index]) { reg = new_args[arg_ct->alias_index]; } else if (arg_ct->newreg) { - reg = tcg_reg_alloc(s, arg_ct->regs, + reg = tcg_reg_alloc(s, o_required_regs, i_allocated_regs | o_allocated_regs, output_pref(op, k), ts->indirect_base); } else { - reg = tcg_reg_alloc(s, arg_ct->regs, o_allocated_regs, + reg = tcg_reg_alloc(s, o_required_regs, o_allocated_regs, output_pref(op, k), ts->indirect_base); } break; @@ -5104,12 +5109,13 @@ static void tcg_reg_alloc_op(TCGContext *s, const TCGOp *op) if (arg_ct->oalias) { reg = new_args[arg_ct->alias_index]; } else if (arg_ct->newreg) { - reg = tcg_reg_alloc_pair(s, arg_ct->regs, + reg = tcg_reg_alloc_pair(s, o_required_regs, i_allocated_regs | o_allocated_regs, output_pref(op, k), ts->indirect_base); } else { - reg = tcg_reg_alloc_pair(s, arg_ct->regs, o_allocated_regs, + reg = tcg_reg_alloc_pair(s, o_required_regs, + o_allocated_regs, output_pref(op, k), ts->indirect_base); }