Matches sp3 behavior. op_sel[0:1] must be 0. abs and neg src modifiers for bf16
Details
Diff Detail
Unit Tests
Event Timeline
I agree with the problem description in the commit message, but I'm not sure this patch fully fixes the problem. There are no tests actually using the op_sel operand. I think those tests would reveal that the op_sel value is not being propagated from src_modifiers to the op_sel operand. We will need something similar to convertVOP3PDPPInst
In addition, OPSEL[1:0] are ignored by the hardware for these instructions. That should be reflected in the instruction definition. If it doesn't impact the dpp implementation, it could be a separate patch.
This patch does enable assembly and disassembly of v_dot2_f16_f16_e64_dpp and v_dot2_bf16_bf16_e64_dpp (with op_sel=0). Currently, these instructions are simply rejected as if they are not defined.
I agree that this patch does not fix the op_sel issue, but that is another problem that may be addressed separately.
The problem with doing the work separately is that codegen will be affected by this patch as well. Currently, dot2_bf16_bf16_e64_dpp does not exist, so GCNDPPCombine will simply fail to generate that instruction. But after this patch we will generate it. Now this test I tried appears to code generate correctly, but the problem is we cannot assemble or disassemble the resulting instructions. So I would recommend the full asm/disasm implementation and including this codegen test too.
RUN: llc -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs -o - %s | FileCheck %s -check-prefix=GCN
name: dot2_bf16_bf16
tracksRegLiveness: true
body: |
bb.0: liveins: $vgpr0, $vgpr1, $vgpr2 %0:vgpr_32 = COPY $vgpr0 %1:vgpr_32 = COPY $vgpr1 %2:vgpr_32 = COPY $vgpr2 %3:vgpr_32 = IMPLICIT_DEF %4:vgpr_32 = V_MOV_B32_dpp %3, %1, 1, 15, 15, 1, implicit $exec %5:vgpr_32 = V_DOT2_BF16_BF16_e64 8, %4, 4, %3, 4, %2, 0, implicit $exec S_ENDPGM 0, implicit %5
...
Yields
0xd6677000 0x040a00fa 0xff080101; v_dot2_bf16_bf16 v0.h, v1, v0, v2.h quad_perm:[1,0,0,0] bound_ctrl:1
op_sel value is not being propagated from src_modifiers to the op_sel operand
I was convinced that op_sel value is not important for instruction printing. Non dpp version does not even have op_sel operand and still gets printed correctly, helper prints op_sel[a,b,c] from src_modifiers.
I missed to test assembler, missing part is setting op_sel bit in src_modifiers, done via cvtVOP3P.
I will double check both.
llvm/lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp | ||
---|---|---|
8832 | The conditional check here should really be checking for is DOT2_BF16_BF16_e64_dpp_gfx11 || DOT2_BF16_BF16_e64_dpp8_gfx11 || DOT2_F16_F16 .... | |
llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp | ||
763 ↗ | (On Diff #442382) | This will set op_sel to the wrong value if it was input as 1. We need to call convertVOP3PDPPInst to copy the operand correctly, or something equivalent. |
llvm/lib/Target/AMDGPU/VOP3Instructions.td | ||
705 | DOT2_BF16_BF16_e64_dpp should disallow modifiers on src0 and src1, though DOT2_BF16_BF16_e64 allows them. They should allow op_sel on dst, but this may be an issue since the bit for that is in src0 modifiers. Setting HasSrcMods effects codegen, and dpp16. Can you please include the codegen test I proposed or some variation on it? Can you also include asm and disasm tests for dpp16? |
Matches sp3 behavior. op_sel[0:1] must be 0. abs and neg src modifiers for bf16
FixMe(wip, requires some td file changes) clang uses i16 for bf16, I used f16 to get abs and neg parsing support.
llvm/lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp | ||
---|---|---|
4199 | Is it possible for an opcode to be both VOP3 and VOP3P? | |
llvm/lib/Target/AMDGPU/VOPInstructions.td | ||
273–274 | Bits op_sel[1:0] are ignored, so opcodes with these bits set to 1 are legal. Using ? instead of 0 would allow decoding of such opcodes. | |
1341 | Looks like this is not necessary. | |
1378–1379 | Ditto. | |
1401–1402 | Ditto. |
llvm/lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp | ||
---|---|---|
4199 | afaik no, I was surprised that VOP3P instructions also have VOP3 flag, VOP3 instructions don't have VOP3P flag (maybe a bug?). I could check opcodes (I think that there should be 6 opcodes). | |
llvm/lib/Target/AMDGPU/VOPInstructions.td | ||
273–274 | What is desired behavior for 'ignored bits' then? |
llvm/lib/Target/AMDGPU/VOPInstructions.td | ||
---|---|---|
273–274 | I think that assembler should be strict and report an error if op_sel[1:0] bits are not 0. Disassembler should be able to decode instructions with ignored bits to aid in binary code analysis (ignored bits may be displayed as 0 in op_sel, this is fine). |
llvm/test/MC/Disassembler/AMDGPU/gfx11_dasm_vop3_dpp8.txt | ||
---|---|---|
5274 | Like this? Ignored bits set to zero (0x18), check diff with previous in History tab there are a few more. |
Oh, sorry. I see that you have added them. Probably these tests need a comment to make clear what is being tested.
Overall it looks pretty good. The main thing I don't know is if making the bf16 operands float16 will work correctly with codegen. Hopefully @rampitec can answer.
llvm/include/llvm/IR/IntrinsicsAMDGPU.td | ||
---|---|---|
2072 ↗ | (On Diff #446138) | @rampitec should look at these intrinsic changes. I am not familiar enough with other BF16 support to know if this is reasonable. |
llvm/lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp | ||
4199 |
It should not be, IMO. Those fields designate the encoding, and the instruction should only have one encoding. It could be considered a bug but a long standing one. I have been having similar discussions about an instruction being both VOPC and VOP3. @foad | |
llvm/lib/Target/AMDGPU/VOPInstructions.td | ||
272 | Make the parent class of this VOP3OpSel_gfx11. No functional change, but it seems better. | |
llvm/test/MC/AMDGPU/gfx11_asm_dpp16.s | ||
1–2 | Can you add --implicit-check-not=error to the runlines? | |
llvm/test/MC/AMDGPU/gfx11_asm_dpp8.s | ||
1–2 | Can you add --implicit-check-not=error to the runlines? |
llvm/include/llvm/IR/IntrinsicsAMDGPU.td | ||
---|---|---|
2072 ↗ | (On Diff #446138) | This only gets desired behavior in testing. The problem is that there is no bf16 in clang, i16 is used instead. Our td files rely on types. I am looking if can bypass type checks. It would be much cleaner if we had the bf16 type. |
No, this is incorrect. bf16 is not the same as f16. One need a cast to/from integer type to use bf16.
llvm/include/llvm/IR/IntrinsicsAMDGPU.td | ||
---|---|---|
2072 ↗ | (On Diff #446138) | We cannot have bf16 type. This is not a question of compiler support, this is the support which does not exist in HW. If you can only do dots on the type you cannot generally support it. |
Can you add disassembler tests with op_sel:[1,1,1,1] and show the bit is ignored? Otherwise LGTM
llvm/test/MC/Disassembler/AMDGPU/gfx11_dasm_vop3_dpp8.txt | ||
---|---|---|
5287 | Ok, I was confused because the AsmPrinter prints what the hardware will do with that instruction, not what the bits actually disassemble to. This matches sp3 and there are plenty of warnings how those bits are treated. I think this is good. |
LGTM with a nit.
llvm/lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp | ||
---|---|---|
4198 | Please add parenthesis around binary operations. |
Please add parenthesis around binary operations.