This is an archive of the discontinued LLVM Phabricator instance.

[LoongArch][test] (6/6) Add encoding and mnemonics tests
ClosedPublic

Authored by SixWeining on Dec 16 2021, 12:53 AM.

Details

Summary

With the benefit of D88392, instruction encoding and mnemonic testing can be
achieved within MIR files before AsmParser is ready. This patch adds such
tests which cover all basic integer instructions we defined in previous patch.
Similarly those tests will be rewrote by .s and moved to test/MC/LoongArch.

Depends on D115861 and D116100

Diff Detail

Event Timeline

SixWeining requested review of this revision.Dec 16 2021, 12:53 AM
SixWeining created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptDec 16 2021, 12:53 AM
SixWeining added a reviewer: myhsu.

rename the pathes number from n to 5

SixWeining retitled this revision from [LoongArch][test] (5/n) Add encoding and mnemonics tests to [LoongArch][test] (5/5) Add encoding and mnemonics tests.Dec 17 2021, 7:56 PM
xen0n added a subscriber: xen0n.Dec 20 2021, 2:25 AM

These encoding format names are atrocious to look at and use, especially the "non-standard" ones not mentioned in the manuals yet still have to be named. The names mostly start with a number so cannot be used directly as identifiers, and the irregular formats are named after instructions themselves -- what if some instructions are added later that happen to share the same encoding, like renaming the format is okay?

For fixing these problems once and for good, please consider adopting the unambiguous notation defined by the loongarch-opcodes project. While you may or may not like the idea of renaming a few instruction mnemonics, the instruction format naming is fixed for you, and I really hope the official manuals could follow suit and just ditch the xRIxx names.

xen0n added a comment.Dec 20 2021, 2:26 AM

Also please separate the changes adding --little-endian into a separate patch, so we only tackle one concern in one commit.

myhsu added a comment.Dec 20 2021, 3:18 AM

Just want to double check: AsmParser is not implemented yet, is that correct?
Also, can you rename the test filename to be more expressive and straightforward.

llvm/utils/extract-section.py
62 ↗(On Diff #395256)

nit: can we use --endian=<little/big> instead with default value being "big"?

Adding owners of recent new targets to help onboarding this new target.

These encoding format names are atrocious to look at and use, especially the "non-standard" ones not mentioned in the manuals yet still have to be named. The names mostly start with a number so cannot be used directly as identifiers, and the irregular formats are named after instructions themselves -- what if some instructions are added later that happen to share the same encoding, like renaming the format is okay?

For fixing these problems once and for good, please consider adopting the unambiguous notation defined by the loongarch-opcodes project. While you may or may not like the idea of renaming a few instruction mnemonics, the instruction format naming is fixed for you, and I really hope the official manuals could follow suit and just ditch the xRIxx names.

@xen0n Thanks for your suggestion. We have read the notation defined in the repo you referenced. Yes, one name can exactly specifies one encoding. But there are a couple of issues left to adopt it in llvm. For example:

  • immediate operand

Some immediate operands usually have specail meanings like the msbd and lsbd of the bstrpick.d instruction. This instruction will be named with DJUk6Um6 under the natation you proposed and thus lost the meaning of msbd and lsbd because we have to use names like imm1 and imm2 in instruction definition.

In current impl:

// FmtBSTR_D
// <opcode | msbd | lsbd | rj | rd>
class FmtBSTR_D<bits<10> op, dag outs, dag ins, string asmstr,
                list<dag> pattern = []> : LAInst<outs, ins, asmstr, pattern> {
  bits<6> msbd;
  bits<6> lsbd;
  bits<5> rj; 
  bits<5> rd; 

  let Inst{31-22} = op; 
  let Inst{21-16} = msbd;
  let Inst{15-10} = lsbd;
  let Inst{9-5} = rj; 
  let Inst{4-0} = rd; 
}
...

class ALU_BSTRD<bits<10> op, string opstr, Operand ImmOpnd>
    : FmtBSTR_D<op, (outs GPR:$rd), (ins GPR:$rj, ImmOpnd:$msb, ImmOpnd:$lsb),
                !strconcat(opstr, "\t$rd, $rj, $msb, $lsb")>;

...

def BSTRPICK_D : ALU_BSTRD<0b0000000011, "bstrpick.d", uimm6>;

Wth your proposal:

// DJUk6Um6
// <opcode | imm1 | imm2 | rj | rd>
class DJUk6Um6<bits<10> op, dag outs, dag ins, string asmstr,
               list<dag> pattern = []> : LAInst<outs, ins, asmstr, pattern> {
  bits<6> imm1; 
  bits<6> imm2;  
  bits<5> rj;                                              
  bits<5> rd;
    
  let Inst{31-22} = op;
  let Inst{21-16} = imm1;
  let Inst{15-10} = imm2;
  let Inst{9-5} = rj;
  let Inst{4-0} = rd;
}

...

def BSTRPICK_D : DJUk6Um6<0b0000000011, (outs GPR:$rd), (ins GPR:$rj, uimm6:$imm1, uimm6:$imm2), !strconcat(opstr, "\t$rd, $rj, $imm1, $imm2")>;
  • Understanding cost

Inoder to understand what the name like DJUk6Um6 means people may have to spend more time than the current impl.

But we do not totally disagree this proposal and let's wait more reviewers' comments about that. Thanks.

Also please separate the changes adding --little-endian into a separate patch, so we only tackle one concern in one commit.

Reasonable. I will update that. Thanks.

Just want to double check: AsmParser is not implemented yet, is that correct?
Also, can you rename the test filename to be more expressive and straightforward.

Yes. AsmParser is not implemented yet which is in our next plan. Until that these MIR tests would be removed and rewrote with .s MC tests and put to llvm/tests/MC/LoongArch like the M68K target does.
About the tes filename, we'll update them until the comments from @xen0n (about the instruction format names) are addressed.

Address comments of @myhsu . Thanks.

llvm/utils/extract-section.py
62 ↗(On Diff #395256)

No problem. I will update that. Thanks.

xen0n added a comment.Dec 20 2021, 9:33 PM

These encoding format names are atrocious to look at and use, especially the "non-standard" ones not mentioned in the manuals yet still have to be named. The names mostly start with a number so cannot be used directly as identifiers, and the irregular formats are named after instructions themselves -- what if some instructions are added later that happen to share the same encoding, like renaming the format is okay?

For fixing these problems once and for good, please consider adopting the unambiguous notation defined by the loongarch-opcodes project. While you may or may not like the idea of renaming a few instruction mnemonics, the instruction format naming is fixed for you, and I really hope the official manuals could follow suit and just ditch the xRIxx names.

@xen0n Thanks for your suggestion. We have read the notation defined in the repo you referenced. Yes, one name can exactly specifies one encoding. But there are a couple of issues left to adopt it in llvm. For example:

  • immediate operand

Some immediate operands usually have [special] meanings like the msbd and lsbd of the bstrpick.d instruction. This instruction will be named with DJUk6Um6 under the [notation] you proposed and thus lost the meaning of msbd and lsbd because we have to use names like imm1 and imm2 in instruction definition.

In current impl:

[snip]

[With] your proposal:

[snip]

  • Understanding cost

[In order] to understand what the name like DJUk6Um6 means people may have to spend more time than the current impl.

But we do not totally disagree this proposal and let's wait more reviewers' comments about that. Thanks.

Hi,

Thanks for your detailed reply; although I'd like to clarify a bit. Instruction formats are for describing fixed and variable bit patterns as recognized by the decoder hardware; they are very low-level details that should not be concerned with higher-level semantics (instead, opcodes and mnemonics do).

Take the bstr* family for example: there is nothing stopping you from creating completely unrelated insns also taking two registers and two 6-bit immediates, and with LoongArch encoding conventions (slots placed from LSB to MSB mostly sequentially) this would almost definitely just be DJUk6Um6. With your current approach, the new insns would be of BSTR_D format, yet they don't belong to bstr* family at all.

In addition to that, there's no reason to only "give meaning" to special formats alone; why not rename "2R" to "BINARY_ARITH", "2RI12" to "MEM" and "I26" to "LONG_BRANCH"? Again this is just exposing inconsistencies and double-standards in the current approach; some semantics just have to be looked up in the manuals and I believe this is acceptable for compiler maintainers.

Plus, with the current naming scheme, names for all "regular" formats begin with a number, thus unusable as identifiers in most programming languages. This forces people to come up with ad-hoc names every time they have to implement LoongArch machine code in a project. Indeed, from what I see, the LoongArch port of binutils, llvm and qemu all use different names from each other, presumably because the Loongson people behind each port all come from different (sub)teams; this is not good for maintenance in the long run.

xen0n added a reviewer: xen0n.Dec 20 2021, 9:33 PM

These encoding format names are atrocious to look at and use, especially the "non-standard" ones not mentioned in the manuals yet still have to be named. The names mostly start with a number so cannot be used directly as identifiers, and the irregular formats are named after instructions themselves -- what if some instructions are added later that happen to share the same encoding, like renaming the format is okay?

For fixing these problems once and for good, please consider adopting the unambiguous notation defined by the loongarch-opcodes project. While you may or may not like the idea of renaming a few instruction mnemonics, the instruction format naming is fixed for you, and I really hope the official manuals could follow suit and just ditch the xRIxx names.

@xen0n Thanks for your suggestion. We have read the notation defined in the repo you referenced. Yes, one name can exactly specifies one encoding. But there are a couple of issues left to adopt it in llvm. For example:

  • immediate operand

Some immediate operands usually have [special] meanings like the msbd and lsbd of the bstrpick.d instruction. This instruction will be named with DJUk6Um6 under the [notation] you proposed and thus lost the meaning of msbd and lsbd because we have to use names like imm1 and imm2 in instruction definition.

In current impl:

[snip]

[With] your proposal:

[snip]

  • Understanding cost

[In order] to understand what the name like DJUk6Um6 means people may have to spend more time than the current impl.

But we do not totally disagree this proposal and let's wait more reviewers' comments about that. Thanks.

Hi,

Thanks for your detailed reply; although I'd like to clarify a bit. Instruction formats are for describing fixed and variable bit patterns as recognized by the decoder hardware; they are very low-level details that should not be concerned with higher-level semantics (instead, opcodes and mnemonics do).

Take the bstr* family for example: there is nothing stopping you from creating completely unrelated insns also taking two registers and two 6-bit immediates, and with LoongArch encoding conventions (slots placed from LSB to MSB mostly sequentially) this would almost definitely just be DJUk6Um6. With your current approach, the new insns would be of BSTR_D format, yet they don't belong to bstr* family at all.

In addition to that, there's no reason to only "give meaning" to special formats alone; why not rename "2R" to "BINARY_ARITH", "2RI12" to "MEM" and "I26" to "LONG_BRANCH"? Again this is just exposing inconsistencies and double-standards in the current approach; some semantics just have to be looked up in the manuals and I believe this is acceptable for compiler maintainers.

Plus, with the current naming scheme, names for all "regular" formats begin with a number, thus unusable as identifiers in most programming languages. This forces people to come up with ad-hoc names every time they have to implement LoongArch machine code in a project. Indeed, from what I see, the LoongArch port of binutils, llvm and qemu all use different names from each other, presumably because the Loongson people behind each port all come from different (sub)teams; this is not good for maintenance in the long run.

Different compiler communities use different names. This is the reality not only for LoongArch but also other targets.
Yes, I agree with you that binutils, llvm, qemu using same names is good for maintenace although this has not been achieved by other targets (like RISCV).
So, are you happy with the names defined for LoongArch in binutils?

In binutils-gdb/opcodes/loongarch-opc.c

{ 0x00c00000, 0xffc00000, "bstrpick.d",   "r0:5,r5:5,u16:6,u10:6",    0,          0,  0,  0 },

We may consider using R0_5_R5_5_U16_6_U10_6 as the instruction format name for the bstrpick.d instruction.

Different compiler communities use different names. This is the reality not only for LoongArch but also other targets.
Yes, I agree with you that binutils, llvm, qemu using same names is good for maintenace although this has not been achieved by other targets (like RISCV).

Yeah maybe others didn't do this, but we're a new target, so we might as well do better. In the end we could even figure out the best practice for others to possibly follow, so I still think such an approach is feasible.

Also note that RISC-V actually has the centralized riscv-opcodes project for instruction encodings; and their insn format names are sane, unlike LoongArch, so in fact the RISC-V ports from different projects still can standardize naming.

So, are you happy with the names defined for LoongArch in binutils?

In binutils-gdb/opcodes/loongarch-opc.c

{ 0x00c00000, 0xffc00000, "bstrpick.d",   "r0:5,r5:5,u16:6,u10:6",    0,          0,  0,  0 },

We may consider using R0_5_R5_5_U16_6_U10_6 as the instruction format name for the bstrpick.d instruction.

This is even worse than ad-hoc BSTR_D names, because the name becomes too long and unwieldy to pronounce and use. (With names like DJUk6Um6 or FdFjFkCa you can at least spell every letter out individually.) And a lot of details are redundant, like the "r" prefix implying the slot is 5-bit wide, and so on.

Address comments from @xen0n and @myhsu separating the changes to extract-section into a separate patch D116100

SixWeining retitled this revision from [LoongArch][test] (5/5) Add encoding and mnemonics tests to [LoongArch][test] (6/6) Add encoding and mnemonics tests.Dec 21 2021, 4:27 AM
SixWeining edited the summary of this revision. (Show Details)

@xen0n @myhsu are you happy with this patch?

In theory, this is the last one to approve, so if we approve this one, the entire series can land.

I'm happy with the remaining patches, but I didn't follow the discussion in this one as closely as you two, so I'll leave for you to approve.

Thanks!

myhsu added a comment.Feb 8 2022, 4:09 AM

@xen0n @myhsu are you happy with this patch?

In theory, this is the last one to approve, so if we approve this one, the entire series can land.

I'm happy with the remaining patches, but I didn't follow the discussion in this one as closely as you two, so I'll leave for you to approve.

Thanks!

My comment had been addressed and I'm satisfied with this patch too. @xen0n do you have any further comments?

xen0n added a comment.EditedFeb 8 2022, 4:40 AM

Well, as the patch splitting is already done, and it's already decided in the 4th patch's discussion thread that the instruction format renaming is not happening at this time, I have no further comment regarding this; and I think this patch (and the initial series) can be accepted. More refactoring could always come later.

Let the fun begin!

(EDIT: it was the 4th patch's thread, not the 5th.)

myhsu accepted this revision.Feb 8 2022, 5:35 AM

Cheers

This revision is now accepted and ready to land.Feb 8 2022, 5:35 AM
This revision was landed with ongoing or failed builds.Feb 10 2022, 2:24 AM
This revision was automatically updated to reflect the committed changes.