[v3,08/14] RISC-V: Adding T-Head MemPair extension

Message ID 20230124195945.181842-9-christoph.muellner@vrull.eu
State New
Series Add support for the T-Head vendor extensions

Commit Message

Christoph Müllner Jan. 24, 2023, 7:59 p.m. UTC
From: Christoph Müllner <christoph.muellner@vrull.eu>

This patch adds support for the T-Head MemPair instructions.
The patch uses the T-Head specific decoder and translation.

Co-developed-by: LIU Zhiwei <zhiwei_liu@linux.alibaba.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
Changes in v2:
- Add ISA_EXT_DATA_ENTRY()
- Use single decoder for XThead extensions
- Use get_address() to calculate addresses

 target/riscv/cpu.c                         |  2 +
 target/riscv/cpu.h                         |  1 +
 target/riscv/insn_trans/trans_xthead.c.inc | 88 ++++++++++++++++++++++
 target/riscv/translate.c                   |  2 +-
 target/riscv/xthead.decode                 | 13 ++++
 5 files changed, 105 insertions(+), 1 deletion(-)

Comments

Richard Henderson Jan. 24, 2023, 8:44 p.m. UTC | #1
On 1/24/23 09:59, Christoph Muellner wrote:
> +static bool gen_loadpair_tl(DisasContext *ctx, arg_th_pair *a, MemOp memop,
> +                            int shamt)
> +{
> +    TCGv rd1 = dest_gpr(ctx, a->rd1);
> +    TCGv rd2 = dest_gpr(ctx, a->rd2);
> +    TCGv addr1 = tcg_temp_new();
> +    TCGv addr2 = tcg_temp_new();
> +
> +    addr1 = get_address(ctx, a->rs, a->sh2 << shamt);
> +    if ((memop & MO_SIZE) == MO_64) {
> +        addr2 = get_address(ctx, a->rs, 8 + (a->sh2 << shamt));
> +    } else {
> +        addr2 = get_address(ctx, a->rs, 4 + (a->sh2 << shamt));
> +    }
> +
> +    tcg_gen_qemu_ld_tl(rd1, addr1, ctx->mem_idx, memop);
> +    tcg_gen_qemu_ld_tl(rd2, addr2, ctx->mem_idx, memop);
> +    gen_set_gpr(ctx, a->rd1, rd1);
> +    gen_set_gpr(ctx, a->rd2, rd2);

Since dest_gpr may return cpu_gpr[n], this may update rd1 before recognizing the
exception that the second load may generate.  Is that correct?

The manual says that rd1, rd2, and rs1 must not be the same, but you do not check this.


r~
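
For reference, a minimal sketch of the temp-based ordering (v4 resolves this
along these lines, per the follow-ups below; get_address()/gen_set_gpr() are
the helpers already used in the patch, memop_size() comes from exec/memop.h,
and this is a sketch rather than the actual v4 hunk):

    static bool gen_loadpair_tl(DisasContext *ctx, arg_th_pair *a, MemOp memop,
                                int shamt)
    {
        /* Plain temporaries instead of dest_gpr(): no architectural
         * register is touched until both loads have completed. */
        TCGv t1 = tcg_temp_new();
        TCGv t2 = tcg_temp_new();
        int imm = a->sh2 << shamt;
        TCGv addr1 = get_address(ctx, a->rs, imm);
        TCGv addr2 = get_address(ctx, a->rs, memop_size(memop) + imm);

        tcg_gen_qemu_ld_tl(t1, addr1, ctx->mem_idx, memop);
        tcg_gen_qemu_ld_tl(t2, addr2, ctx->mem_idx, memop);
        /* If either load faults, the exception is raised before the
         * writeback below, so rd1 stays architecturally intact. */
        gen_set_gpr(ctx, a->rd1, t1);
        gen_set_gpr(ctx, a->rd2, t2);

        tcg_temp_free(t1);
        tcg_temp_free(t2);
        return true;
    }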
LIU Zhiwei Jan. 30, 2023, 2:03 a.m. UTC | #2
On 2023/1/25 4:44, Richard Henderson wrote:
> On 1/24/23 09:59, Christoph Muellner wrote:
>> +static bool gen_loadpair_tl(DisasContext *ctx, arg_th_pair *a, MemOp 
>> memop,
>> +                            int shamt)
>> +{
>> +    TCGv rd1 = dest_gpr(ctx, a->rd1);
>> +    TCGv rd2 = dest_gpr(ctx, a->rd2);
>> +    TCGv addr1 = tcg_temp_new();
>> +    TCGv addr2 = tcg_temp_new();
>> +
>> +    addr1 = get_address(ctx, a->rs, a->sh2 << shamt);
>> +    if ((memop & MO_SIZE) == MO_64) {
>> +        addr2 = get_address(ctx, a->rs, 8 + (a->sh2 << shamt));
>> +    } else {
>> +        addr2 = get_address(ctx, a->rs, 4 + (a->sh2 << shamt));
>> +    }
>> +
>> +    tcg_gen_qemu_ld_tl(rd1, addr1, ctx->mem_idx, memop);
>> +    tcg_gen_qemu_ld_tl(rd2, addr2, ctx->mem_idx, memop);
>> +    gen_set_gpr(ctx, a->rd1, rd1);
>> +    gen_set_gpr(ctx, a->rd2, rd2);
>
> Since dest_gpr may return cpu_gpr[n], this may update rd1 before
> recognizing the exception that the second load may generate.  Is that
> correct?

Thanks. It's a bug. We should load all the memory data into local TCG
temps first.

Do you think we should probe all the memory addresses for the store pair 
instructions? If so, can we avoid the use of a helper function?

>
> The manual says that rd1, rd2, and rs1 must not be the same, but you 
> do not check this.

The main reason is that the assembler can do this check. Is it necessary to
check this in QEMU?

Best Regards,
Zhiwei

>
>
> r~
Richard Henderson Jan. 30, 2023, 5:43 a.m. UTC | #3
On 1/29/23 16:03, LIU Zhiwei wrote:
> Thanks. It's a bug. We should load all the memory data into local TCG temps first.
> 
> Do you think we should probe all the memory addresses for the store pair instructions? If 
> so, can we avoid the use of a helper function?

Depends on what the hardware does.  Even with a trap in the middle the stores are 
restartable, since no register state changes.

But if you'd like both stores verified before any state changes, for this case you can pack
the pair into a larger data type: TCGv_i64 for a pair of 32-bit values, and TCGv_i128 for a
pair of 64-bit values. Patches for TCGv_i128 [1] are just finishing review; patches to
describe atomicity of the larger operation are also on the list [2]. Anyway, the idea is
that you issue one TCG memory operation, the entire operation is validated, and then the
stores happen.


> The main reason is that the assembler can do this check. Is it necessary to check this in QEMU?

Yes.  Consider what happens when the insn is encoded with .long.  Does the hardware trap
an illegal instruction?  Is the behavior simply unspecified?  The manual could be improved
to specify, akin to the Arm terms: UNDEFINED, CONSTRAINED UNPREDICTABLE, IMPLEMENTATION
DEFINED, etc.


r~

[1] https://patchew.org/QEMU/20230126043824.54819-1-richard.henderson@linaro.org/
[2] https://patchew.org/QEMU/20221118094754.242910-1-richard.henderson@linaro.org/
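
For the 32-bit store pair, the packing could look roughly like this (a sketch
only: it assumes the little-endian RISC-V guest layout, reuses data1/data2 and
get_address() from the hunk under review, and the TCGv_i128 variant for the
64-bit pair would additionally need the series in [1]):

    TCGv_i64 pair = tcg_temp_new_i64();
    TCGv addr = get_address(ctx, a->rs, a->sh2 << shamt);

    /* data1 goes into the low half, which lands at the lower address
     * for a little-endian (MO_TEUQ) store. */
    tcg_gen_concat_tl_i64(pair, data1, data2);
    /* One 8-byte access: the whole access is validated before any byte
     * is written, so a fault cannot leave half of the pair stored. */
    tcg_gen_qemu_st_i64(pair, addr, ctx->mem_idx, MO_TEUQ);
    tcg_temp_free_i64(pair);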
LIU Zhiwei Jan. 30, 2023, 8:41 a.m. UTC | #4
On 2023/1/30 13:43, Richard Henderson wrote:
> On 1/29/23 16:03, LIU Zhiwei wrote:
>> Thanks. It's a bug. We should load all the memory data into local TCG
>> temps first.
>>
>> Do you think we should probe all the memory addresses for the store 
>> pair instructions? If so, can we avoid the use of a helper function?
>
> Depends on what the hardware does.  Even with a trap in the middle the 
> stores are restartable, since no register state changes.

I referred to the specification of LDP and STP on AArch64. The
specification allows

"any access performed before the exception was taken is repeated".

In detail,

"If, according to these rules, an instruction is executed as a sequence of accesses, exceptions, including interrupts,
can be taken during that sequence, regardless of the memory type being accessed. If any of these exceptions are
returned from using their preferred return address, the instruction that generated the sequence of accesses is
re-executed, and so any access performed before the exception was taken is repeated. See also Taking an interrupt
during a multi-access load or store on page D1-4664."

However, I see that the implementations of LDP and STP in QEMU take
different approaches. LDP only writes the first register once it is sure
there is no trap in the second access.

So I have two questions here.

1) One about the QEMU implementation of LDP. Can we implement LDP as
two direct loads into cpu registers instead of into local TCG temps?

2) One about the comment. Why do register state changes make the
instruction non-restartable? Do you mean that if the first register
changes, it may influence the address calculation after the trap?

"Even with a trap in the middle the stores are restartable, since no register state changes."

>
> But if you'd like both stores verified before any state changes, for this
> case you can pack the pair into a larger data type: TCGv_i64 for a pair
> of 32-bit values, and TCGv_i128 for a pair of 64-bit values.
> Patches for TCGv_i128 [1] are just finishing review; patches to
> describe atomicity of the larger operation are also on the list [2].
> Anyway, the idea is that you issue one TCG memory operation, the
> entire operation is validated, and then the stores happen.
>
>
>> The main reason is that the assembler can do this check. Is it necessary
>> to check this in QEMU?
>
> Yes.  Consider what happens when the insn is encoded with .long. Does
> the hardware trap an illegal instruction?  Is the behavior simply
> unspecified?  The manual could be improved to specify, akin to the Arm
> terms: UNDEFINED, CONSTRAINED UNPREDICTABLE, IMPLEMENTATION DEFINED, etc.
>
>
Thanks, I will fix the manual.

Best Regards,
Zhiwei

> r~
>
> [1] 
> https://patchew.org/QEMU/20230126043824.54819-1-richard.henderson@linaro.org/
> [2] 
> https://patchew.org/QEMU/20221118094754.242910-1-richard.henderson@linaro.org/
Richard Henderson Jan. 30, 2023, 7:03 p.m. UTC | #5
On 1/29/23 22:41, LIU Zhiwei wrote:
> 
> On 2023/1/30 13:43, Richard Henderson wrote:
>> On 1/29/23 16:03, LIU Zhiwei wrote:
>>> Thanks. It's a bug. We should load all the memory data into local TCG temps first.
>>>
>>> Do you think we should probe all the memory addresses for the store pair instructions? 
>>> If so, can we avoid the use of a helper function?
>>
>> Depends on what the hardware does.  Even with a trap in the middle the stores are 
>> restartable, since no register state changes.
> 
> I referred to the specification of LDP and STP on AArch64. The specification allows
> 
> "any access performed before the exception was taken is repeated".
> 
> In detail,
> 
> "If, according to these rules, an instruction is executed as a sequence of accesses, exceptions, including interrupts,
> can be taken during that sequence, regardless of the memory type being accessed. If any of these exceptions are
> returned from using their preferred return address, the instruction that generated the sequence of accesses is
> re-executed, and so any access performed before the exception was taken is repeated. See also Taking an interrupt
> during a multi-access load or store on page D1-4664."
> 
> However, I see that the implementations of LDP and STP in QEMU take different
> approaches. LDP only writes the first register once it is sure there is no trap
> in the second access.
> 
> So I have two questions here.
> 
> 1) One about the QEMU implementation of LDP. Can we implement LDP as two direct
> loads into cpu registers instead of into local TCG temps?

For the T-Head specification, where rd1 != rs1 (and you enforce it), then yes, I suppose
you could load directly into the cpu registers, because on restart rs1 would be unmodified.

For AArch64, which you quote above, there is no constraint that the destinations do not 
overlap the address register, so we must implement "LDP r0, r1, [r0]" as a load into temps.


> 2) One about the comment. Why do register state changes make the instruction
> non-restartable? Do you mean that if the first register changes, it may influence
> the address calculation after the trap?

Yes, that's what I mean about non-restartable -- if any of the input registers are changed 
before the trap is recognized.


>> Yes.  Consider what happens when the insn is encoded with .long. Does the hardware trap
>> an illegal instruction?  Is the behavior simply unspecified?  The manual could be
>> improved to specify, akin to the Arm terms: UNDEFINED, CONSTRAINED UNPREDICTABLE,
>> IMPLEMENTATION DEFINED, etc.
>>
>>
> Thanks, I will fix the manual.

Excellent, thanks.


r~
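
In trans_* terms, the operand check can be a few lines at the top of the
generator (a sketch; returning false from a trans function falls through to
the illegal-instruction path):

    /* Manual constraint: rd1, rd2 and rs1 must be pairwise distinct.
     * Without it, restarting a load pair after a trap could compute
     * the addresses from an already-clobbered rs1. */
    if (a->rs == a->rd1 || a->rs == a->rd2 || a->rd1 == a->rd2) {
        return false;
    }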
LIU Zhiwei Jan. 31, 2023, 2:34 a.m. UTC | #6
On 2023/1/31 3:03, Richard Henderson wrote:
> On 1/29/23 22:41, LIU Zhiwei wrote:
>>
>> On 2023/1/30 13:43, Richard Henderson wrote:
>>> On 1/29/23 16:03, LIU Zhiwei wrote:
>>>> Thanks. It's a bug. We should load all the memory data into local
>>>> TCG temps first.
>>>>
>>>> Do you think we should probe all the memory addresses for the store 
>>>> pair instructions? If so, can we avoid the use of a helper function?
>>>
>>> Depends on what the hardware does.  Even with a trap in the middle 
>>> the stores are restartable, since no register state changes.
>>
>> I referred to the specification of LDP and STP on AArch64. The
>> specification allows
>>
>> "any access performed before the exception was taken is repeated".
>>
>> In detail,
>>
>> "If, according to these rules, an instruction is executed as a 
>> sequence of accesses, exceptions, including interrupts,
>> can be taken during that sequence, regardless of the memory type 
>> being accessed. If any of these exceptions are
>> returned from using their preferred return address, the instruction 
>> that generated the sequence of accesses is
>> re-executed, and so any access performed before the exception was 
>> taken is repeated. See also Taking an interrupt
>> during a multi-access load or store on page D1-4664."
>>
>> However, I see that the implementations of LDP and STP in QEMU take
>> different approaches. LDP only writes the first register once it is
>> sure there is no trap in the second access.
>>
>> So I have two questions here.
>>
>> 1) One about the QEMU implementation of LDP. Can we implement LDP
>> as two direct loads into cpu registers instead of into local TCG temps?
>
> For the T-Head specification, where rd1 != rs1 (and you enforce it),
> then yes, I suppose you could load directly into the cpu registers,
> because on restart rs1 would be unmodified.
>
> For AArch64, which you quote above, there is no constraint that the 
> destinations do not overlap the address register, so we must implement 
> "LDP r0, r1, [r0]" as a load into temps.
>
Got it. Thanks.
>
>> 2) One about the comment. Why do register state changes make the
>> instruction non-restartable? Do you mean that if the first register
>> changes, it may influence the address calculation after the trap?
>
> Yes, that's what I mean about non-restartable -- if any of the input 
> registers are changed before the trap is recognized.
>
>
Thanks for the clarification.

I once thought the reason for non-restartability was the side effects of
repeated execution, which may cause a watchpoint to match twice or an
MMIO device to be accessed twice.

>>> Yes.  Consider what happens when the insn is encoded with .long.
>>> Does the hardware trap an illegal instruction?  Is the behavior
>>> simply unspecified?  The manual could be improved to specify, akin
>>> to the Arm terms: UNDEFINED, CONSTRAINED UNPREDICTABLE,
>>> IMPLEMENTATION DEFINED, etc.
>>>
>>>
>> Thanks, I will fix the manual.

The manual has been fixed by Christoph.  Thanks.

Best Regards,
Zhiwei

>
> Excellent, thanks.
>
>
> r~
Christoph Müllner Jan. 31, 2023, 6:01 p.m. UTC | #7
On Tue, Jan 24, 2023 at 9:44 PM Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 1/24/23 09:59, Christoph Muellner wrote:
> > +static bool gen_loadpair_tl(DisasContext *ctx, arg_th_pair *a, MemOp memop,
> > +                            int shamt)
> > +{
> > +    TCGv rd1 = dest_gpr(ctx, a->rd1);
> > +    TCGv rd2 = dest_gpr(ctx, a->rd2);
> > +    TCGv addr1 = tcg_temp_new();
> > +    TCGv addr2 = tcg_temp_new();
> > +
> > +    addr1 = get_address(ctx, a->rs, a->sh2 << shamt);
> > +    if ((memop & MO_SIZE) == MO_64) {
> > +        addr2 = get_address(ctx, a->rs, 8 + (a->sh2 << shamt));
> > +    } else {
> > +        addr2 = get_address(ctx, a->rs, 4 + (a->sh2 << shamt));
> > +    }
> > +
> > +    tcg_gen_qemu_ld_tl(rd1, addr1, ctx->mem_idx, memop);
> > +    tcg_gen_qemu_ld_tl(rd2, addr2, ctx->mem_idx, memop);
> > +    gen_set_gpr(ctx, a->rd1, rd1);
> > +    gen_set_gpr(ctx, a->rd2, rd2);
>
> Since dest_gpr may return cpu_gpr[n], this may update rd1 before recognizing the
> exception that the second load may generate.  Is that correct?

Solved in v4 by using temporaries.

>
> The manual says that rd1, rd2, and rs1 must not be the same, but you do not check this.

Fixed in v4.

Thank you!

>
>
> r~

Patch

diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
index 2ce8eb6a6f..e3a10f782c 100644
--- a/target/riscv/cpu.c
+++ b/target/riscv/cpu.c
@@ -115,6 +115,7 @@  static const struct isa_ext_data isa_edata_arr[] = {
     ISA_EXT_DATA_ENTRY(xtheadcmo, true, PRIV_VERSION_1_11_0, ext_xtheadcmo),
     ISA_EXT_DATA_ENTRY(xtheadcondmov, true, PRIV_VERSION_1_11_0, ext_xtheadcondmov),
     ISA_EXT_DATA_ENTRY(xtheadmac, true, PRIV_VERSION_1_11_0, ext_xtheadmac),
+    ISA_EXT_DATA_ENTRY(xtheadmempair, true, PRIV_VERSION_1_11_0, ext_xtheadmempair),
     ISA_EXT_DATA_ENTRY(xtheadsync, true, PRIV_VERSION_1_11_0, ext_xtheadsync),
     ISA_EXT_DATA_ENTRY(xventanacondops, true, PRIV_VERSION_1_12_0, ext_XVentanaCondOps),
 };
@@ -1084,6 +1085,7 @@  static Property riscv_cpu_extensions[] = {
     DEFINE_PROP_BOOL("xtheadcmo", RISCVCPU, cfg.ext_xtheadcmo, false),
     DEFINE_PROP_BOOL("xtheadcondmov", RISCVCPU, cfg.ext_xtheadcondmov, false),
     DEFINE_PROP_BOOL("xtheadmac", RISCVCPU, cfg.ext_xtheadmac, false),
+    DEFINE_PROP_BOOL("xtheadmempair", RISCVCPU, cfg.ext_xtheadmempair, false),
     DEFINE_PROP_BOOL("xtheadsync", RISCVCPU, cfg.ext_xtheadsync, false),
     DEFINE_PROP_BOOL("xventanacondops", RISCVCPU, cfg.ext_XVentanaCondOps, false),
 
diff --git a/target/riscv/cpu.h b/target/riscv/cpu.h
index 55aea777a0..4f5f3b2c20 100644
--- a/target/riscv/cpu.h
+++ b/target/riscv/cpu.h
@@ -479,6 +479,7 @@  struct RISCVCPUConfig {
     bool ext_xtheadcmo;
     bool ext_xtheadcondmov;
     bool ext_xtheadmac;
+    bool ext_xtheadmempair;
     bool ext_xtheadsync;
     bool ext_XVentanaCondOps;
 
diff --git a/target/riscv/insn_trans/trans_xthead.c.inc b/target/riscv/insn_trans/trans_xthead.c.inc
index 1c583ea8ec..7ab2a7a48e 100644
--- a/target/riscv/insn_trans/trans_xthead.c.inc
+++ b/target/riscv/insn_trans/trans_xthead.c.inc
@@ -52,6 +52,12 @@ 
     }                                            \
 } while (0)
 
+#define REQUIRE_XTHEADMEMPAIR(ctx) do {          \
+    if (!ctx->cfg_ptr->ext_xtheadmempair) {      \
+        return false;                            \
+    }                                            \
+} while (0)
+
 #define REQUIRE_XTHEADSYNC(ctx) do {             \
     if (!ctx->cfg_ptr->ext_xtheadsync) {         \
         return false;                            \
@@ -382,6 +388,88 @@  static bool trans_th_mulsw(DisasContext *ctx, arg_th_mulsw *a)
     return gen_th_mac(ctx, a, tcg_gen_sub_tl, NULL);
 }
 
+/* XTheadMemPair */
+
+static bool gen_loadpair_tl(DisasContext *ctx, arg_th_pair *a, MemOp memop,
+                            int shamt)
+{
+    TCGv rd1 = dest_gpr(ctx, a->rd1);
+    TCGv rd2 = dest_gpr(ctx, a->rd2);
+    TCGv addr1 = tcg_temp_new();
+    TCGv addr2 = tcg_temp_new();
+
+    addr1 = get_address(ctx, a->rs, a->sh2 << shamt);
+    if ((memop & MO_SIZE) == MO_64) {
+        addr2 = get_address(ctx, a->rs, 8 + (a->sh2 << shamt));
+    } else {
+        addr2 = get_address(ctx, a->rs, 4 + (a->sh2 << shamt));
+    }
+
+    tcg_gen_qemu_ld_tl(rd1, addr1, ctx->mem_idx, memop);
+    tcg_gen_qemu_ld_tl(rd2, addr2, ctx->mem_idx, memop);
+    gen_set_gpr(ctx, a->rd1, rd1);
+    gen_set_gpr(ctx, a->rd2, rd2);
+
+    tcg_temp_free(addr1);
+    tcg_temp_free(addr2);
+    return true;
+}
+
+static bool trans_th_ldd(DisasContext *ctx, arg_th_pair *a)
+{
+    REQUIRE_XTHEADMEMPAIR(ctx);
+    REQUIRE_64BIT(ctx);
+    return gen_loadpair_tl(ctx, a, MO_TESQ, 4);
+}
+
+static bool trans_th_lwd(DisasContext *ctx, arg_th_pair *a)
+{
+    REQUIRE_XTHEADMEMPAIR(ctx);
+    return gen_loadpair_tl(ctx, a, MO_TESL, 3);
+}
+
+static bool trans_th_lwud(DisasContext *ctx, arg_th_pair *a)
+{
+    REQUIRE_XTHEADMEMPAIR(ctx);
+    return gen_loadpair_tl(ctx, a, MO_TEUL, 3);
+}
+
+static bool gen_storepair_tl(DisasContext *ctx, arg_th_pair *a, MemOp memop,
+                             int shamt)
+{
+    TCGv data1 = get_gpr(ctx, a->rd1, EXT_NONE);
+    TCGv data2 = get_gpr(ctx, a->rd2, EXT_NONE);
+    TCGv addr1 = tcg_temp_new();
+    TCGv addr2 = tcg_temp_new();
+
+    addr1 = get_address(ctx, a->rs, a->sh2 << shamt);
+    if ((memop & MO_SIZE) == MO_64) {
+        addr2 = get_address(ctx, a->rs, 8 + (a->sh2 << shamt));
+    } else {
+        addr2 = get_address(ctx, a->rs, 4 + (a->sh2 << shamt));
+    }
+
+    tcg_gen_qemu_st_tl(data1, addr1, ctx->mem_idx, memop);
+    tcg_gen_qemu_st_tl(data2, addr2, ctx->mem_idx, memop);
+
+    tcg_temp_free(addr1);
+    tcg_temp_free(addr2);
+    return true;
+}
+
+static bool trans_th_sdd(DisasContext *ctx, arg_th_pair *a)
+{
+    REQUIRE_XTHEADMEMPAIR(ctx);
+    REQUIRE_64BIT(ctx);
+    return gen_storepair_tl(ctx, a, MO_TESQ, 4);
+}
+
+static bool trans_th_swd(DisasContext *ctx, arg_th_pair *a)
+{
+    REQUIRE_XTHEADMEMPAIR(ctx);
+    return gen_storepair_tl(ctx, a, MO_TESL, 3);
+}
+
 /* XTheadSync */
 
 static bool trans_th_sfence_vmas(DisasContext *ctx, arg_th_sfence_vmas *a)
diff --git a/target/riscv/translate.c b/target/riscv/translate.c
index 5be1c9da69..27bab07994 100644
--- a/target/riscv/translate.c
+++ b/target/riscv/translate.c
@@ -133,7 +133,7 @@  static bool has_xthead_p(DisasContext *ctx  __attribute__((__unused__)))
     return ctx->cfg_ptr->ext_xtheadba || ctx->cfg_ptr->ext_xtheadbb ||
            ctx->cfg_ptr->ext_xtheadbs || ctx->cfg_ptr->ext_xtheadcmo ||
            ctx->cfg_ptr->ext_xtheadcondmov || ctx->cfg_ptr->ext_xtheadmac ||
-           ctx->cfg_ptr->ext_xtheadsync;
+           ctx->cfg_ptr->ext_xtheadmempair || ctx->cfg_ptr->ext_xtheadsync;
 }
 
 #define MATERIALISE_EXT_PREDICATE(ext)  \
diff --git a/target/riscv/xthead.decode b/target/riscv/xthead.decode
index 696de6cecf..ff2a83b56d 100644
--- a/target/riscv/xthead.decode
+++ b/target/riscv/xthead.decode
@@ -11,16 +11,21 @@ 
 
 # Fields:
 %rd        7:5
+%rd1       7:5
+%rs        15:5
 %rs1       15:5
+%rd2       20:5
 %rs2       20:5
 %sh5       20:5
 %sh6       20:6
+%sh2       25:2
 
 # Argument sets
 &r         rd rs1 rs2                               !extern
 &r2        rd rs1                                   !extern
 &shift     shamt rs1 rd                             !extern
 &th_bfext  msb lsb rs1 rd
+&th_pair   rd1 rs rd2 sh2
 
 # Formats
 @sfence_vm  ....... ..... .....   ... ..... ....... %rs1
@@ -30,6 +35,7 @@ 
 @th_bfext   msb:6  lsb:6  .....  ... ..... .......  &th_bfext %rs1 %rd
 @sh5        ....... ..... .....  ... ..... .......  &shift  shamt=%sh5      %rs1 %rd
 @sh6        ...... ...... .....  ... ..... .......  &shift shamt=%sh6 %rs1 %rd
+@th_pair    ..... .. ..... ..... ... ..... .......  &th_pair %rd1 %rs %rd2 %sh2
 
 # XTheadBa
 # Instead of defining a new encoding, we simply use the decoder to
@@ -96,6 +102,13 @@  th_muls          00100 01 ..... ..... 001 ..... 0001011 @r
 th_mulsh         00101 01 ..... ..... 001 ..... 0001011 @r
 th_mulsw         00100 11 ..... ..... 001 ..... 0001011 @r
 
+# XTheadMemPair
+th_ldd           11111 .. ..... ..... 100 ..... 0001011 @th_pair
+th_lwd           11100 .. ..... ..... 100 ..... 0001011 @th_pair
+th_lwud          11110 .. ..... ..... 100 ..... 0001011 @th_pair
+th_sdd           11111 .. ..... ..... 101 ..... 0001011 @th_pair
+th_swd           11100 .. ..... ..... 101 ..... 0001011 @th_pair
+
 # XTheadSync
 th_sfence_vmas   0000010 ..... ..... 000 00000 0001011 @rs2_s
 th_sync          0000000 11000 00000 000 00000 0001011
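
With the property wired up as above, the extension should be selectable on the
command line in the usual way, e.g. "-cpu rv64,xtheadmempair=true" (assuming
the generic rv64 CPU model; sketched from the DEFINE_PROP_BOOL entry, not
verified against this tree).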