diff mbox series

[v1,03/15] tcg: Fix register allocation constraints

Message ID 20240813113436.831-4-zhiwei_liu@linux.alibaba.com
State New
Headers show
Series tcg/riscv: Add support for vector | expand

Commit Message

LIU Zhiwei Aug. 13, 2024, 11:34 a.m. UTC
From: TANG Tiancheng <tangtiancheng.ttc@alibaba-inc.com>

When allocating registers for input and output, ensure they match
the available registers to avoid allocating illeagal registers.

We should respect RISC-V vector extension's variable-length registers
and LMUL-based register grouping. Coordinate with tcg_target_available_regs
initialization tcg_target_init (behind this commit) to ensure proper
handling of vector register constraints.

Note: While mov_vec doesn't have constraints, dup_vec and other IRs do.
We need to strengthen constraints for all IRs except mov_vec, and this
is sufficient.

Signed-off-by: TANG Tiancheng <tangtiancheng.ttc@alibaba-inc.com>
Fixes: 29f5e92502 (tcg: Introduce paired register allocation)
Reviewed-by: Liu Zhiwei <zhiwei_liu@linux.alibaba.com>
---
 tcg/tcg.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

Comments

Richard Henderson Aug. 13, 2024, 11:52 a.m. UTC | #1
On 8/13/24 21:34, LIU Zhiwei wrote:
> From: TANG Tiancheng<tangtiancheng.ttc@alibaba-inc.com>
> 
> When allocating registers for input and output, ensure they match
> the available registers to avoid allocating illeagal registers.
> 
> We should respect RISC-V vector extension's variable-length registers
> and LMUL-based register grouping. Coordinate with tcg_target_available_regs
> initialization tcg_target_init (behind this commit) to ensure proper
> handling of vector register constraints.
> 
> Note: While mov_vec doesn't have constraints, dup_vec and other IRs do.
> We need to strengthen constraints for all IRs except mov_vec, and this
> is sufficient.
> 
> Signed-off-by: TANG Tiancheng<tangtiancheng.ttc@alibaba-inc.com>
> Fixes: 29f5e92502 (tcg: Introduce paired register allocation)
> Reviewed-by: Liu Zhiwei<zhiwei_liu@linux.alibaba.com>
> ---
>   tcg/tcg.c | 20 +++++++++++++-------
>   1 file changed, 13 insertions(+), 7 deletions(-)
> 
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index 34e3056380..d26b42534d 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -4722,8 +4722,10 @@ static void tcg_reg_alloc_dup(TCGContext *s, const TCGOp *op)
>           return;
>       }
>   
> -    dup_out_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[0].regs;
> -    dup_in_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[1].regs;
> +    dup_out_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[0].regs &
> +                                    tcg_target_available_regs[ots->type];
> +    dup_in_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[1].regs &
> +                                    tcg_target_available_regs[its->type];
>   

Why would you ever have constraints that resolve to unavailable registers?

If you don't want to fix this in the backend, then the next best place is in 
process_op_defs(), so that we take care of this once at startup, and never have to think 
about it again.


r~
LIU Zhiwei Aug. 14, 2024, 12:58 a.m. UTC | #2
On 2024/8/13 19:52, Richard Henderson wrote:
> On 8/13/24 21:34, LIU Zhiwei wrote:
>> From: TANG Tiancheng<tangtiancheng.ttc@alibaba-inc.com>
>>
>> When allocating registers for input and output, ensure they match
>> the available registers to avoid allocating illeagal registers.
>>
>> We should respect RISC-V vector extension's variable-length registers
>> and LMUL-based register grouping. Coordinate with 
>> tcg_target_available_regs
>> initialization tcg_target_init (behind this commit) to ensure proper
>> handling of vector register constraints.
>>
>> Note: While mov_vec doesn't have constraints, dup_vec and other IRs do.
>> We need to strengthen constraints for all IRs except mov_vec, and this
>> is sufficient.
>>
>> Signed-off-by: TANG Tiancheng<tangtiancheng.ttc@alibaba-inc.com>
>> Fixes: 29f5e92502 (tcg: Introduce paired register allocation)
>> Reviewed-by: Liu Zhiwei<zhiwei_liu@linux.alibaba.com>
>> ---
>>   tcg/tcg.c | 20 +++++++++++++-------
>>   1 file changed, 13 insertions(+), 7 deletions(-)
>>
>> diff --git a/tcg/tcg.c b/tcg/tcg.c
>> index 34e3056380..d26b42534d 100644
>> --- a/tcg/tcg.c
>> +++ b/tcg/tcg.c
>> @@ -4722,8 +4722,10 @@ static void tcg_reg_alloc_dup(TCGContext *s, 
>> const TCGOp *op)
>>           return;
>>       }
>>   -    dup_out_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[0].regs;
>> -    dup_in_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[1].regs;
>> +    dup_out_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[0].regs &
>> + tcg_target_available_regs[ots->type];
>> +    dup_in_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[1].regs &
>> + tcg_target_available_regs[its->type];
>
> Why would you ever have constraints that resolve to unavailable 
> registers?
>
> If you don't want to fix this in the backend, then the next best place 
> is in process_op_defs(), so that we take care of this once at startup, 
> and never have to think about it again.

Hi Richard,

The constraints provided in process_op_defs() are static and tied to the 
IR operations. For example, if we create constraints for add_vec, the 
same constraints will apply to all types of add_vec operations 
(TCG_TYPE_V64, TCG_TYPE_V128, TCG_TYPE_V256). This means the constraints 
don't change based on the specific type of operation being performed.
In contrast, RISC-V's LMUL (Length Multiplier) can change at runtime 
depending on the type of IR operation. Different LMUL values affect 
which vector registers are available for use in RISC-V. Let's consider 
an example where the host's vector register width is 128 bits:

For an add_vec operation on v256 (256-bit vectors), only even-numbered 
vector registers like 0, 2, 4 can be used.
However, for an add_vec operation on v128 (128-bit vectors), all vector 
registers (0, 1, 2, etc.) are available.

Thus if we want to use all registers of vectors, we have to add a 
dynamic constraint on register allocation based on IR types.

Thanks,
Zhiwei

>
>
> r~
Richard Henderson Aug. 14, 2024, 2:04 a.m. UTC | #3
On 8/14/24 10:58, LIU Zhiwei wrote:
> Thus if we want to use all registers of vectors, we have to add a dynamic constraint on 
> register allocation based on IR types.

My comment vs patch 4 is that you can't do that, at least not without large changes to TCG.

In addition, I said that the register pressure on vector regs is not high enough to 
justify such changes.  There is, so far, little benefit in having more than 4 or 5 vector 
registers, much less 32.  Thus 7 (lmul 4, omitting v0) is sufficient.


r~
LIU Zhiwei Aug. 14, 2024, 2:27 a.m. UTC | #4
On 2024/8/14 10:04, Richard Henderson wrote:
> On 8/14/24 10:58, LIU Zhiwei wrote:
>> Thus if we want to use all registers of vectors, we have to add a 
>> dynamic constraint on register allocation based on IR types.
>
> My comment vs patch 4 is that you can't do that, at least not without 
> large changes to TCG.
>
> In addition, I said that the register pressure on vector regs is not 
> high enough to justify such changes.  There is, so far, little benefit 
> in having more than 4 or 5 vector registers, much less 32.  Thus 7 
> (lmul 4, omitting v0) is sufficient.

At least on QEMU, SVE can support 2048 bit vector length with 
'sve-default-vector-length=256'.  Software optimized with SVE, such as 
X264 can benefit with long SVE length in less dynamic A64 instructions.

We want to expose all host vector ability. Thus the largest 
TCG_TYPE_V256 is not enough, as 128-bit RVV can give 8*128=1024 width 
operation. We have expand TCG_TYPE_V512/1024/2048 types(not in this 
patch set, but intend to upstream later).
With large TCG_TYPE_V1024/2048, we get better performance on RISC-V 
board with much less translated RISC-V vector instructions. We can give 
a more detailed experiment result if needed.

However, we will only have 3 vector register when support 
TCG_TYPE_V1024.  And even less for TCG_TYPE_V2048.  Current approach 
will give more vectors TCG_TYPE_V128 even with support TCG_TYPE_V1024, 
which will relax some guest NEON register pressure.

Thanks,
Zhiwei

>
>
> r~
Richard Henderson Aug. 14, 2024, 3:08 a.m. UTC | #5
On 8/14/24 12:27, LIU Zhiwei wrote:
> 
> On 2024/8/14 10:04, Richard Henderson wrote:
>> On 8/14/24 10:58, LIU Zhiwei wrote:
>>> Thus if we want to use all registers of vectors, we have to add a dynamic constraint on 
>>> register allocation based on IR types.
>>
>> My comment vs patch 4 is that you can't do that, at least not without large changes to TCG.
>>
>> In addition, I said that the register pressure on vector regs is not high enough to 
>> justify such changes.  There is, so far, little benefit in having more than 4 or 5 
>> vector registers, much less 32.  Thus 7 (lmul 4, omitting v0) is sufficient.
> 
> At least on QEMU, SVE can support 2048 bit vector length with 'sve-default-vector- 
> length=256'.  Software optimized with SVE, such as X264 can benefit with long SVE length 
> in less dynamic A64 instructions.
> 
> We want to expose all host vector ability. Thus the largest TCG_TYPE_V256 is not enough, 
> as 128-bit RVV can give 8*128=1024 width operation. We have expand TCG_TYPE_V512/1024/2048 
> types(not in this patch set, but intend to upstream later).
> With large TCG_TYPE_V1024/2048, we get better performance on RISC-V board with much less 
> translated RISC-V vector instructions. We can give a more detailed experiment result if 
> needed.
> 
> However, we will only have 3 vector register when support TCG_TYPE_V1024.  And even less 
> for TCG_TYPE_V2048.  Current approach will give more vectors TCG_TYPE_V128 even with 
> support TCG_TYPE_V1024, which will relax some guest NEON register pressure.

Then you will have to teach TCG about one operand consuming and clobbering N hard 
registers, so that you get the spills and fills done correctly.

But you haven't done that in this patch set, so will currently generate incorrect code.

I think you should make longer vector operations a longer term project, and start with 
something simpler.


r~
LIU Zhiwei Aug. 14, 2024, 3:30 a.m. UTC | #6
On 2024/8/14 11:08, Richard Henderson wrote:
> On 8/14/24 12:27, LIU Zhiwei wrote:
>>
>> On 2024/8/14 10:04, Richard Henderson wrote:
>>> On 8/14/24 10:58, LIU Zhiwei wrote:
>>>> Thus if we want to use all registers of vectors, we have to add a 
>>>> dynamic constraint on register allocation based on IR types.
>>>
>>> My comment vs patch 4 is that you can't do that, at least not 
>>> without large changes to TCG.
>>>
>>> In addition, I said that the register pressure on vector regs is not 
>>> high enough to justify such changes.  There is, so far, little 
>>> benefit in having more than 4 or 5 vector registers, much less 32.  
>>> Thus 7 (lmul 4, omitting v0) is sufficient.
>>
>> At least on QEMU, SVE can support 2048 bit vector length with 
>> 'sve-default-vector- length=256'.  Software optimized with SVE, such 
>> as X264 can benefit with long SVE length in less dynamic A64 
>> instructions.
>>
>> We want to expose all host vector ability. Thus the largest 
>> TCG_TYPE_V256 is not enough, as 128-bit RVV can give 8*128=1024 width 
>> operation. We have expand TCG_TYPE_V512/1024/2048 types(not in this 
>> patch set, but intend to upstream later).
>> With large TCG_TYPE_V1024/2048, we get better performance on RISC-V 
>> board with much less translated RISC-V vector instructions. We can 
>> give a more detailed experiment result if needed.
>>
>> However, we will only have 3 vector register when support 
>> TCG_TYPE_V1024.  And even less for TCG_TYPE_V2048.  Current approach 
>> will give more vectors TCG_TYPE_V128 even with support 
>> TCG_TYPE_V1024, which will relax some guest NEON register pressure.
>
> Then you will have to teach TCG about one operand consuming and 
> clobbering N hard registers, so that you get the spills and fills done 
> correctly.
I think we have done this in patch 6.
>
> But you haven't done that in this patch set, so will currently 
> generate incorrect code.
>
> I think you should make longer vector operations a longer term project, 

Does longer vector operations implementation deserves to upstream? We 
can contribute it sooner as it is ready.

> and start with something simpler.

Agree if you insist. 🙂

Zhiwei

>
>
> r~
Richard Henderson Aug. 14, 2024, 4:18 a.m. UTC | #7
On 8/14/24 13:30, LIU Zhiwei wrote:
> 
> On 2024/8/14 11:08, Richard Henderson wrote:
>> On 8/14/24 12:27, LIU Zhiwei wrote:
>>>
>>> On 2024/8/14 10:04, Richard Henderson wrote:
>>>> On 8/14/24 10:58, LIU Zhiwei wrote:
>>>>> Thus if we want to use all registers of vectors, we have to add a dynamic constraint 
>>>>> on register allocation based on IR types.
>>>>
>>>> My comment vs patch 4 is that you can't do that, at least not without large changes to 
>>>> TCG.
>>>>
>>>> In addition, I said that the register pressure on vector regs is not high enough to 
>>>> justify such changes.  There is, so far, little benefit in having more than 4 or 5 
>>>> vector registers, much less 32. Thus 7 (lmul 4, omitting v0) is sufficient.
>>>
>>> At least on QEMU, SVE can support 2048 bit vector length with 'sve-default-vector- 
>>> length=256'.  Software optimized with SVE, such as X264 can benefit with long SVE 
>>> length in less dynamic A64 instructions.
>>>
>>> We want to expose all host vector ability. Thus the largest TCG_TYPE_V256 is not 
>>> enough, as 128-bit RVV can give 8*128=1024 width operation. We have expand 
>>> TCG_TYPE_V512/1024/2048 types(not in this patch set, but intend to upstream later).
>>> With large TCG_TYPE_V1024/2048, we get better performance on RISC-V board with much 
>>> less translated RISC-V vector instructions. We can give a more detailed experiment 
>>> result if needed.
>>>
>>> However, we will only have 3 vector register when support TCG_TYPE_V1024.  And even 
>>> less for TCG_TYPE_V2048.  Current approach will give more vectors TCG_TYPE_V128 even 
>>> with support TCG_TYPE_V1024, which will relax some guest NEON register pressure.
>>
>> Then you will have to teach TCG about one operand consuming and clobbering N hard 
>> registers, so that you get the spills and fills done correctly.
> I think we have done this in patch 6.

No, you have not.

There are no modifications to tcg_reg_alloc, and there are no additional calls to 
tcg_reg_free, which is where spills are generated. There would also need to be changes on 
the fill side, temp_load.


>> I think you should make longer vector operations a longer term project, 
> 
> Does longer vector operations implementation deserves to upstream? We can contribute it 
> sooner as it is ready.

Sure.


r~
LIU Zhiwei Aug. 14, 2024, 7:47 a.m. UTC | #8
On 2024/8/14 12:18, Richard Henderson wrote:
> On 8/14/24 13:30, LIU Zhiwei wrote:
>>
>> On 2024/8/14 11:08, Richard Henderson wrote:
>>> On 8/14/24 12:27, LIU Zhiwei wrote:
>>>>
>>>> On 2024/8/14 10:04, Richard Henderson wrote:
>>>>> On 8/14/24 10:58, LIU Zhiwei wrote:
>>>>>> Thus if we want to use all registers of vectors, we have to add a 
>>>>>> dynamic constraint on register allocation based on IR types.
>>>>>
>>>>> My comment vs patch 4 is that you can't do that, at least not 
>>>>> without large changes to TCG.
>>>>>
>>>>> In addition, I said that the register pressure on vector regs is 
>>>>> not high enough to justify such changes.  There is, so far, little 
>>>>> benefit in having more than 4 or 5 vector registers, much less 32. 
>>>>> Thus 7 (lmul 4, omitting v0) is sufficient.
>>>>
>>>> At least on QEMU, SVE can support 2048 bit vector length with 
>>>> 'sve-default-vector- length=256'.  Software optimized with SVE, 
>>>> such as X264 can benefit with long SVE length in less dynamic A64 
>>>> instructions.
>>>>
>>>> We want to expose all host vector ability. Thus the largest 
>>>> TCG_TYPE_V256 is not enough, as 128-bit RVV can give 8*128=1024 
>>>> width operation. We have expand TCG_TYPE_V512/1024/2048 types(not 
>>>> in this patch set, but intend to upstream later).
>>>> With large TCG_TYPE_V1024/2048, we get better performance on RISC-V 
>>>> board with much less translated RISC-V vector instructions. We can 
>>>> give a more detailed experiment result if needed.
>>>>
>>>> However, we will only have 3 vector register when support 
>>>> TCG_TYPE_V1024.  And even less for TCG_TYPE_V2048.  Current 
>>>> approach will give more vectors TCG_TYPE_V128 even with support 
>>>> TCG_TYPE_V1024, which will relax some guest NEON register pressure.
>>>
>>> Then you will have to teach TCG about one operand consuming and 
>>> clobbering N hard registers, so that you get the spills and fills 
>>> done correctly.
>> I think we have done this in patch 6.
>
> No, you have not.
>
> There are no modifications to tcg_reg_alloc, and there are no 
> additional calls to tcg_reg_free, which is where spills are generated. 
> There would also need to be changes on the fill side, temp_load.
Thanks. I choose the simple design as you suggest for this patch set. 
And We will fix this problem when send the longer vector operations 
implementation.
>
>
>>> I think you should make longer vector operations a longer term project, 
>>
>> Does longer vector operations implementation deserves to upstream? We 
>> can contribute it sooner as it is ready.
>
> Sure.

Good news!

Thanks,
Zhiwei

>
>
> r~
diff mbox series

Patch

diff --git a/tcg/tcg.c b/tcg/tcg.c
index 34e3056380..d26b42534d 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -4722,8 +4722,10 @@  static void tcg_reg_alloc_dup(TCGContext *s, const TCGOp *op)
         return;
     }
 
-    dup_out_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[0].regs;
-    dup_in_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[1].regs;
+    dup_out_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[0].regs &
+                                    tcg_target_available_regs[ots->type];
+    dup_in_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[1].regs &
+                                    tcg_target_available_regs[its->type];
 
     /* Allocate the output register now.  */
     if (ots->val_type != TEMP_VAL_REG) {
@@ -4876,7 +4878,7 @@  static void tcg_reg_alloc_op(TCGContext *s, const TCGOp *op)
 
         reg = ts->reg;
         i_preferred_regs = 0;
-        i_required_regs = arg_ct->regs;
+        i_required_regs = arg_ct->regs & tcg_target_available_regs[ts->type];
         allocate_new_reg = false;
         copyto_new_reg = false;
 
@@ -5078,6 +5080,7 @@  static void tcg_reg_alloc_op(TCGContext *s, const TCGOp *op)
 
         /* satisfy the output constraints */
         for(k = 0; k < nb_oargs; k++) {
+            TCGRegSet o_required_regs;
             i = def->args_ct[k].sort_index;
             arg = op->args[i];
             arg_ct = &def->args_ct[i];
@@ -5085,17 +5088,19 @@  static void tcg_reg_alloc_op(TCGContext *s, const TCGOp *op)
 
             /* ENV should not be modified.  */
             tcg_debug_assert(!temp_readonly(ts));
+            o_required_regs = arg_ct->regs &
+                              tcg_target_available_regs[ts->type];
 
             switch (arg_ct->pair) {
             case 0: /* not paired */
                 if (arg_ct->oalias && !const_args[arg_ct->alias_index]) {
                     reg = new_args[arg_ct->alias_index];
                 } else if (arg_ct->newreg) {
-                    reg = tcg_reg_alloc(s, arg_ct->regs,
+                    reg = tcg_reg_alloc(s, o_required_regs,
                                         i_allocated_regs | o_allocated_regs,
                                         output_pref(op, k), ts->indirect_base);
                 } else {
-                    reg = tcg_reg_alloc(s, arg_ct->regs, o_allocated_regs,
+                    reg = tcg_reg_alloc(s, o_required_regs, o_allocated_regs,
                                         output_pref(op, k), ts->indirect_base);
                 }
                 break;
@@ -5104,12 +5109,13 @@  static void tcg_reg_alloc_op(TCGContext *s, const TCGOp *op)
                 if (arg_ct->oalias) {
                     reg = new_args[arg_ct->alias_index];
                 } else if (arg_ct->newreg) {
-                    reg = tcg_reg_alloc_pair(s, arg_ct->regs,
+                    reg = tcg_reg_alloc_pair(s, o_required_regs,
                                              i_allocated_regs | o_allocated_regs,
                                              output_pref(op, k),
                                              ts->indirect_base);
                 } else {
-                    reg = tcg_reg_alloc_pair(s, arg_ct->regs, o_allocated_regs,
+                    reg = tcg_reg_alloc_pair(s, o_required_regs,
+                                             o_allocated_regs,
                                              output_pref(op, k),
                                              ts->indirect_base);
                 }