[v1,03/15] tcg: Fix register allocation constraints

Message ID	20240813113436.831-4-zhiwei_liu@linux.alibaba.com
State	New
Headers
From: LIU Zhiwei <zhiwei_liu@linux.alibaba.com>
To: qemu-devel@nongnu.org
Cc: qemu-riscv@nongnu.org, palmer@dabbelt.com, alistair.francis@wdc.com, dbarboza@ventanamicro.com, liwei1518@gmail.com, bmeng.cn@gmail.com, zhiwei_liu@linux.alibaba.com, richard.henderson@linaro.org, TANG Tiancheng <tangtiancheng.ttc@alibaba-inc.com>
Subject: [PATCH v1 03/15] tcg: Fix register allocation constraints
Date: Tue, 13 Aug 2024 19:34:24 +0800
In-Reply-To: <20240813113436.831-1-zhiwei_liu@linux.alibaba.com>
References: <20240813113436.831-1-zhiwei_liu@linux.alibaba.com>
Series	tcg/riscv: Add support for vector
[v1,00/15] tcg/riscv: Add support for vector
[v1,01/15] util: Add RISC-V vector extension probe in cpuinfo
[v1,02/15] tcg/op-gvec: Fix iteration step in 32-bit operation
[v1,03/15] tcg: Fix register allocation constraints
[v1,04/15] tcg/riscv: Add basic support for vector
[v1,05/15] tcg/riscv: Add riscv vset{i}vli support
[v1,06/15] tcg/riscv: Implement vector load/store
[v1,07/15] tcg/riscv: Implement vector mov/dup{m/i}
[v1,08/15] tcg/riscv: Add support for basic vector opcodes
[v1,09/15] tcg/riscv: Implement vector cmp ops
[v1,10/15] tcg/riscv: Implement vector not/neg ops
[v1,11/15] tcg/riscv: Implement vector sat/mul ops
[v1,12/15] tcg/riscv: Implement vector min/max ops
[v1,13/15] tcg/riscv: Implement vector shs/v ops
[v1,14/15] tcg/riscv: Implement vector roti/v/x shi ops
[v1,15/15] tcg/riscv: Enable vector TCG host-native

LIU Zhiwei Aug. 13, 2024, 11:34 a.m. UTC

From: TANG Tiancheng <tangtiancheng.ttc@alibaba-inc.com>

When allocating registers for inputs and outputs, ensure they match
the available registers to avoid allocating illegal registers.

We should respect the RISC-V vector extension's variable-length registers
and LMUL-based register grouping. Coordinate with the
tcg_target_available_regs initialization in tcg_target_init (in a
following commit) to ensure proper handling of vector register constraints.

Note: While mov_vec doesn't have constraints, dup_vec and other IRs do.
We need to strengthen constraints for all IRs except mov_vec, and this
is sufficient.

Signed-off-by: TANG Tiancheng <tangtiancheng.ttc@alibaba-inc.com>
Fixes: 29f5e92502 (tcg: Introduce paired register allocation)
Reviewed-by: Liu Zhiwei <zhiwei_liu@linux.alibaba.com>
---
 tcg/tcg.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

Richard Henderson Aug. 13, 2024, 11:52 a.m. UTC | #1

On 8/13/24 21:34, LIU Zhiwei wrote:
> From: TANG Tiancheng<tangtiancheng.ttc@alibaba-inc.com>
> 
> When allocating registers for input and output, ensure they match
> the available registers to avoid allocating illegal registers.
> 
> We should respect RISC-V vector extension's variable-length registers
> and LMUL-based register grouping. Coordinate with tcg_target_available_regs
> initialization tcg_target_init (behind this commit) to ensure proper
> handling of vector register constraints.
> 
> Note: While mov_vec doesn't have constraints, dup_vec and other IRs do.
> We need to strengthen constraints for all IRs except mov_vec, and this
> is sufficient.
> 
> Signed-off-by: TANG Tiancheng<tangtiancheng.ttc@alibaba-inc.com>
> Fixes: 29f5e92502 (tcg: Introduce paired register allocation)
> Reviewed-by: Liu Zhiwei<zhiwei_liu@linux.alibaba.com>
> ---
>   tcg/tcg.c | 20 +++++++++++++-------
>   1 file changed, 13 insertions(+), 7 deletions(-)
> 
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index 34e3056380..d26b42534d 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -4722,8 +4722,10 @@ static void tcg_reg_alloc_dup(TCGContext *s, const TCGOp *op)
>           return;
>       }
>   
> -    dup_out_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[0].regs;
> -    dup_in_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[1].regs;
> +    dup_out_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[0].regs &
> +                                    tcg_target_available_regs[ots->type];
> +    dup_in_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[1].regs &
> +                                    tcg_target_available_regs[its->type];
>   

Why would you ever have constraints that resolve to unavailable registers?

If you don't want to fix this in the backend, then the next best place is in 
process_op_defs(), so that we take care of this once at startup, and never have to think 
about it again.


r~

LIU Zhiwei Aug. 14, 2024, 12:58 a.m. UTC | #2

On 2024/8/13 19:52, Richard Henderson wrote:
> On 8/13/24 21:34, LIU Zhiwei wrote:
>> From: TANG Tiancheng<tangtiancheng.ttc@alibaba-inc.com>
>>
>> When allocating registers for input and output, ensure they match
>> the available registers to avoid allocating illegal registers.
>>
>> We should respect RISC-V vector extension's variable-length registers
>> and LMUL-based register grouping. Coordinate with 
>> tcg_target_available_regs
>> initialization tcg_target_init (behind this commit) to ensure proper
>> handling of vector register constraints.
>>
>> Note: While mov_vec doesn't have constraints, dup_vec and other IRs do.
>> We need to strengthen constraints for all IRs except mov_vec, and this
>> is sufficient.
>>
>> Signed-off-by: TANG Tiancheng<tangtiancheng.ttc@alibaba-inc.com>
>> Fixes: 29f5e92502 (tcg: Introduce paired register allocation)
>> Reviewed-by: Liu Zhiwei<zhiwei_liu@linux.alibaba.com>
>> ---
>>   tcg/tcg.c | 20 +++++++++++++-------
>>   1 file changed, 13 insertions(+), 7 deletions(-)
>>
>> diff --git a/tcg/tcg.c b/tcg/tcg.c
>> index 34e3056380..d26b42534d 100644
>> --- a/tcg/tcg.c
>> +++ b/tcg/tcg.c
>> @@ -4722,8 +4722,10 @@ static void tcg_reg_alloc_dup(TCGContext *s, const TCGOp *op)
>>           return;
>>       }
>>
>> -    dup_out_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[0].regs;
>> -    dup_in_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[1].regs;
>> +    dup_out_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[0].regs &
>> +                                    tcg_target_available_regs[ots->type];
>> +    dup_in_regs = tcg_op_defs[INDEX_op_dup_vec].args_ct[1].regs &
>> +                                    tcg_target_available_regs[its->type];
>
> Why would you ever have constraints that resolve to unavailable 
> registers?
>
> If you don't want to fix this in the backend, then the next best place 
> is in process_op_defs(), so that we take care of this once at startup, 
> and never have to think about it again.

Hi Richard,

The constraints set up in process_op_defs() are static and tied to the
IR operation. For example, if we create constraints for add_vec, the
same constraints apply to every type of add_vec operation
(TCG_TYPE_V64, TCG_TYPE_V128, TCG_TYPE_V256); they do not change with
the type of the operation being performed.
In contrast, RISC-V's LMUL (length multiplier) can change at runtime
depending on the type of the IR operation, and different LMUL values
change which vector registers are available. Consider an example where
the host's vector register width is 128 bits:

For an add_vec operation on v256 (256-bit vectors), only even-numbered
vector registers such as 0, 2, 4 can be used.
However, for an add_vec operation on v128 (128-bit vectors), all vector
registers (0, 1, 2, etc.) are available.

Thus, if we want to use all of the vector registers, we have to add a
dynamic constraint on register allocation based on the IR type.
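
The restriction described above can be sketched in C as follows. This is
only an illustration: the helper names (rvv_group_base_mask, rvv_lmul_for,
narrow_constraint) and the 32-bit mask layout are assumptions and are not
QEMU functions. It also assumes power-of-two widths and that LMUL is the
ratio of the operation width to the host vector register width.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bitmask of vector registers that may start a
 * register group for a given LMUL. Bit n set means vn is a legal base.
 */
static uint32_t rvv_group_base_mask(unsigned lmul)
{
    uint32_t mask = 0;

    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    for (unsigned reg = 0; reg < 32; reg += lmul) {
        mask |= UINT32_C(1) << reg;
    }
    return mask;
}

/*
 * Hypothetical helper: LMUL needed to hold an operation of type_bits
 * on a host whose vector registers are vlen_bits wide.
 */
static unsigned rvv_lmul_for(unsigned type_bits, unsigned vlen_bits)
{
    return type_bits <= vlen_bits ? 1 : type_bits / vlen_bits;
}

/*
 * Narrow a static per-opcode register set to the registers that are
 * legal for this particular operation width.
 */
static uint32_t narrow_constraint(uint32_t static_regs,
                                  unsigned type_bits, unsigned vlen_bits)
{
    return static_regs & rvv_group_base_mask(rvv_lmul_for(type_bits, vlen_bits));
}
```

With a 128-bit host register, a 256-bit operation keeps only the
even-numbered registers, while a 128-bit operation keeps the full set.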

Thanks,
Zhiwei

>
>
> r~

Richard Henderson Aug. 14, 2024, 2:04 a.m. UTC | #3

On 8/14/24 10:58, LIU Zhiwei wrote:
> Thus if we want to use all registers of vectors, we have to add a dynamic constraint on 
> register allocation based on IR types.

My comment vs patch 4 is that you can't do that, at least not without large changes to TCG.

In addition, I said that the register pressure on vector regs is not high enough to 
justify such changes.  There is, so far, little benefit in having more than 4 or 5 vector 
registers, much less 32.  Thus 7 (lmul 4, omitting v0) is sufficient.
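
As a rough illustration of where the figure of 7 comes from, the sketch
below counts the aligned register groups for a given LMUL and leaves out
the group that contains v0. The function name and constant are
hypothetical and do not appear in QEMU.

```c
#include <assert.h>

enum { RVV_NUM_REGS = 32 };  /* architectural vector registers v0..v31 */

/*
 * Hypothetical helper: number of register groups left for allocation
 * when the 32 registers are split into aligned groups of 'lmul' and
 * the group containing v0 is reserved.
 */
static unsigned rvv_allocatable_groups(unsigned lmul)
{
    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    return RVV_NUM_REGS / lmul - 1;
}
```

For LMUL 4 this gives 32 / 4 - 1 = 7 groups.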

r~

LIU Zhiwei Aug. 14, 2024, 2:27 a.m. UTC | #4

On 2024/8/14 10:04, Richard Henderson wrote:
> On 8/14/24 10:58, LIU Zhiwei wrote:
>> Thus if we want to use all registers of vectors, we have to add a 
>> dynamic constraint on register allocation based on IR types.
>
> My comment vs patch 4 is that you can't do that, at least not without 
> large changes to TCG.
>
> In addition, I said that the register pressure on vector regs is not 
> high enough to justify such changes.  There is, so far, little benefit 
> in having more than 4 or 5 vector registers, much less 32.  Thus 7 
> (lmul 4, omitting v0) is sufficient.

At least on QEMU, SVE can support a 2048-bit vector length with
'sve-default-vector-length=256'.  Software optimized for SVE, such as
X264, can benefit from a long SVE vector length by executing fewer
dynamic A64 instructions.

We want to expose the full vector capability of the host. The largest
type, TCG_TYPE_V256, is therefore not enough, as 128-bit RVV can provide
an 8*128=1024-bit wide operation. We have added TCG_TYPE_V512/1024/2048
types (not in this patch set, but we intend to upstream them later).
With the large TCG_TYPE_V1024/2048 types, we get better performance on a
RISC-V board, with far fewer translated RISC-V vector instructions. We
can provide more detailed experimental results if needed.

However, we will have only 3 vector registers when supporting
TCG_TYPE_V1024, and even fewer for TCG_TYPE_V2048.  The current approach
still provides more vector registers for TCG_TYPE_V128 even when
TCG_TYPE_V1024 is supported, which relieves some guest NEON register
pressure.

Thanks,
Zhiwei

>
>
> r~

Richard Henderson Aug. 14, 2024, 3:08 a.m. UTC | #5

On 8/14/24 12:27, LIU Zhiwei wrote:
> 
> On 2024/8/14 10:04, Richard Henderson wrote:
>> On 8/14/24 10:58, LIU Zhiwei wrote:
>>> Thus if we want to use all registers of vectors, we have to add a dynamic constraint on 
>>> register allocation based on IR types.
>>
>> My comment vs patch 4 is that you can't do that, at least not without large changes to TCG.
>>
>> In addition, I said that the register pressure on vector regs is not high enough to 
>> justify such changes.  There is, so far, little benefit in having more than 4 or 5 
>> vector registers, much less 32.  Thus 7 (lmul 4, omitting v0) is sufficient.
> 
> At least on QEMU, SVE can support 2048 bit vector length with
> 'sve-default-vector-length=256'.  Software optimized with SVE, such as X264 can benefit
> with long SVE length in less dynamic A64 instructions.
> 
> We want to expose all host vector ability. Thus the largest TCG_TYPE_V256 is not enough, 
> as 128-bit RVV can give 8*128=1024 width operation. We have expand TCG_TYPE_V512/1024/2048 
> types(not in this patch set, but intend to upstream later).
> With large TCG_TYPE_V1024/2048, we get better performance on RISC-V board with much less 
> translated RISC-V vector instructions. We can give a more detailed experiment result if 
> needed.
> 
> However, we will only have 3 vector register when support TCG_TYPE_V1024.  And even less 
> for TCG_TYPE_V2048.  Current approach will give more vectors TCG_TYPE_V128 even with 
> support TCG_TYPE_V1024, which will relax some guest NEON register pressure.

Then you will have to teach TCG about one operand consuming and clobbering N hard 
registers, so that you get the spills and fills done correctly.

But you haven't done that in this patch set, so will currently generate incorrect code.
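
To illustrate what "one operand consuming and clobbering N hard
registers" implies for the allocator, here is a minimal sketch. The names
(rvv_group_footprint, rvv_regs_to_spill) and the live-register bitmask
are assumptions made for this example; they are not the tcg_reg_alloc or
tcg_reg_free interfaces.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: every hard register covered by a group that
 * starts at 'base' and spans 'lmul' consecutive registers.
 */
static uint32_t rvv_group_footprint(unsigned base, unsigned lmul)
{
    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    assert(base < 32 && base % lmul == 0);
    return ((UINT32_C(1) << lmul) - 1) << base;
}

/*
 * Hypothetical helper: live registers that would be overwritten if the
 * group is used as an output, so each of them needs a spill first.
 */
static uint32_t rvv_regs_to_spill(uint32_t live_regs,
                                  unsigned base, unsigned lmul)
{
    return live_regs & rvv_group_footprint(base, lmul);
}
```

An allocator that only tracks the base register would miss live values in
the other registers of the group, which is the incorrect-code risk raised
here.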

I think you should make longer vector operations a longer term project, and start with 
something simpler.


r~

LIU Zhiwei Aug. 14, 2024, 3:30 a.m. UTC | #6

On 2024/8/14 11:08, Richard Henderson wrote:
> On 8/14/24 12:27, LIU Zhiwei wrote:
>>
>> On 2024/8/14 10:04, Richard Henderson wrote:
>>> On 8/14/24 10:58, LIU Zhiwei wrote:
>>>> Thus if we want to use all registers of vectors, we have to add a 
>>>> dynamic constraint on register allocation based on IR types.
>>>
>>> My comment vs patch 4 is that you can't do that, at least not 
>>> without large changes to TCG.
>>>
>>> In addition, I said that the register pressure on vector regs is not 
>>> high enough to justify such changes.  There is, so far, little 
>>> benefit in having more than 4 or 5 vector registers, much less 32.  
>>> Thus 7 (lmul 4, omitting v0) is sufficient.
>>
>> At least on QEMU, SVE can support 2048 bit vector length with 
>> 'sve-default-vector-length=256'.  Software optimized with SVE, such 
>> as X264 can benefit with long SVE length in less dynamic A64 
>> instructions.
>>
>> We want to expose all host vector ability. Thus the largest 
>> TCG_TYPE_V256 is not enough, as 128-bit RVV can give 8*128=1024 width 
>> operation. We have expand TCG_TYPE_V512/1024/2048 types(not in this 
>> patch set, but intend to upstream later).
>> With large TCG_TYPE_V1024/2048, we get better performance on RISC-V 
>> board with much less translated RISC-V vector instructions. We can 
>> give a more detailed experiment result if needed.
>>
>> However, we will only have 3 vector register when support 
>> TCG_TYPE_V1024.  And even less for TCG_TYPE_V2048.  Current approach 
>> will give more vectors TCG_TYPE_V128 even with support 
>> TCG_TYPE_V1024, which will relax some guest NEON register pressure.
>
> Then you will have to teach TCG about one operand consuming and 
> clobbering N hard registers, so that you get the spills and fills done 
> correctly.
I think we have done this in patch 6.
>
> But you haven't done that in this patch set, so will currently 
> generate incorrect code.
>
> I think you should make longer vector operations a longer term project, 

Is the longer vector operations implementation worth upstreaming? We
can contribute it as soon as it is ready.

> and start with something simpler.

Agreed, if you insist. 🙂

Zhiwei

>
>
> r~

Richard Henderson Aug. 14, 2024, 4:18 a.m. UTC | #7

On 8/14/24 13:30, LIU Zhiwei wrote:
> 
> On 2024/8/14 11:08, Richard Henderson wrote:
>> On 8/14/24 12:27, LIU Zhiwei wrote:
>>>
>>> On 2024/8/14 10:04, Richard Henderson wrote:
>>>> On 8/14/24 10:58, LIU Zhiwei wrote:
>>>>> Thus if we want to use all registers of vectors, we have to add a dynamic constraint 
>>>>> on register allocation based on IR types.
>>>>
>>>> My comment vs patch 4 is that you can't do that, at least not without large changes to 
>>>> TCG.
>>>>
>>>> In addition, I said that the register pressure on vector regs is not high enough to 
>>>> justify such changes.  There is, so far, little benefit in having more than 4 or 5 
>>>> vector registers, much less 32. Thus 7 (lmul 4, omitting v0) is sufficient.
>>>
>>> At least on QEMU, SVE can support 2048 bit vector length with
>>> 'sve-default-vector-length=256'.  Software optimized with SVE, such as X264 can benefit with long SVE 
>>> length in less dynamic A64 instructions.
>>>
>>> We want to expose all host vector ability. Thus the largest TCG_TYPE_V256 is not 
>>> enough, as 128-bit RVV can give 8*128=1024 width operation. We have expand 
>>> TCG_TYPE_V512/1024/2048 types(not in this patch set, but intend to upstream later).
>>> With large TCG_TYPE_V1024/2048, we get better performance on RISC-V board with much 
>>> less translated RISC-V vector instructions. We can give a more detailed experiment 
>>> result if needed.
>>>
>>> However, we will only have 3 vector register when support TCG_TYPE_V1024.  And even 
>>> less for TCG_TYPE_V2048.  Current approach will give more vectors TCG_TYPE_V128 even 
>>> with support TCG_TYPE_V1024, which will relax some guest NEON register pressure.
>>
>> Then you will have to teach TCG about one operand consuming and clobbering N hard 
>> registers, so that you get the spills and fills done correctly.
> I think we have done this in patch 6.

No, you have not.

There are no modifications to tcg_reg_alloc, and there are no additional calls to 
tcg_reg_free, which is where spills are generated. There would also need to be changes on 
the fill side, temp_load.


>> I think you should make longer vector operations a longer term project, 
> 
> Does longer vector operations implementation deserves to upstream? We can contribute it 
> sooner as it is ready.

Sure.


r~

LIU Zhiwei Aug. 14, 2024, 7:47 a.m. UTC | #8

On 2024/8/14 12:18, Richard Henderson wrote:
> On 8/14/24 13:30, LIU Zhiwei wrote:
>>
>> On 2024/8/14 11:08, Richard Henderson wrote:
>>> On 8/14/24 12:27, LIU Zhiwei wrote:
>>>>
>>>> On 2024/8/14 10:04, Richard Henderson wrote:
>>>>> On 8/14/24 10:58, LIU Zhiwei wrote:
>>>>>> Thus if we want to use all registers of vectors, we have to add a 
>>>>>> dynamic constraint on register allocation based on IR types.
>>>>>
>>>>> My comment vs patch 4 is that you can't do that, at least not 
>>>>> without large changes to TCG.
>>>>>
>>>>> In addition, I said that the register pressure on vector regs is 
>>>>> not high enough to justify such changes.  There is, so far, little 
>>>>> benefit in having more than 4 or 5 vector registers, much less 32. 
>>>>> Thus 7 (lmul 4, omitting v0) is sufficient.
>>>>
>>>> At least on QEMU, SVE can support 2048 bit vector length with 
>>>> 'sve-default-vector-length=256'.  Software optimized with SVE, 
>>>> such as X264 can benefit with long SVE length in less dynamic A64 
>>>> instructions.
>>>>
>>>> We want to expose all host vector ability. Thus the largest 
>>>> TCG_TYPE_V256 is not enough, as 128-bit RVV can give 8*128=1024 
>>>> width operation. We have expand TCG_TYPE_V512/1024/2048 types(not 
>>>> in this patch set, but intend to upstream later).
>>>> With large TCG_TYPE_V1024/2048, we get better performance on RISC-V 
>>>> board with much less translated RISC-V vector instructions. We can 
>>>> give a more detailed experiment result if needed.
>>>>
>>>> However, we will only have 3 vector register when support 
>>>> TCG_TYPE_V1024.  And even less for TCG_TYPE_V2048.  Current 
>>>> approach will give more vectors TCG_TYPE_V128 even with support 
>>>> TCG_TYPE_V1024, which will relax some guest NEON register pressure.
>>>
>>> Then you will have to teach TCG about one operand consuming and 
>>> clobbering N hard registers, so that you get the spills and fills 
>>> done correctly.
>> I think we have done this in patch 6.
>
> No, you have not.
>
> There are no modifications to tcg_reg_alloc, and there are no 
> additional calls to tcg_reg_free, which is where spills are generated. 
> There would also need to be changes on the fill side, temp_load.
Thanks. I will go with the simple design you suggest for this patch set,
and we will fix this problem when we send the longer vector operations
implementation.
>
>
>>> I think you should make longer vector operations a longer term project, 
>>
>> Does longer vector operations implementation deserves to upstream? We 
>> can contribute it sooner as it is ready.
>
> Sure.

Good news!

Thanks,
Zhiwei

>
>
> r~
