This provides a uniform way to lower into the relevant instructions across all generations.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Title is confusing, this isn't adding new intrinsics
llvm/lib/Target/AMDGPU/VOP3PInstructions.td | ||
---|---|---|
437–453 | I don't understand how these cases are different, the intrinsic name is just slightly different from the instruction name? |
llvm/lib/Target/AMDGPU/VOP3PInstructions.td | ||
---|---|---|
437–453 | On all other targets with 8-bit and 4-bit signed dot, we codegen for int_amdgcn_sdot4 and int_amdgcn_sdot8. However, we don't support these on gfx1100 -- instead, gfx1100 has int_amdgcn_sudot4 / int_amdgcn_sudot8. The result is that users of these intrinsics must always check the target and pick the corresponding one (sudot4 for gfx1100, and sdot4 for all others). This patch removes that responsibility from the user, so they can use sdot4 across all targets and still get the corresponding instructions. |
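To make the burden described above concrete, here is a minimal sketch of the caller-side choice before and after the patch. The helper names are hypothetical, and the instruction table lists only the two targets named in this thread; it is not the actual selection logic in VOP3PInstructions.td.

```python
# Hypothetical helpers for illustration only; the intrinsic and instruction
# names are taken from this review thread, everything else is assumed.

def intrinsic_before_patch(target: str) -> str:
    """Caller-side choice that users had to make per target."""
    return "llvm.amdgcn.sudot4" if target == "gfx1100" else "llvm.amdgcn.sdot4"


def intrinsic_after_patch(target: str) -> str:
    """With the patch, sdot4 is usable regardless of target."""
    return "llvm.amdgcn.sdot4"


def selected_instruction(target: str) -> str:
    """Instruction the backend is expected to pick for sdot4.

    Only the two targets mentioned in the thread are listed.
    """
    table = {"gfx90a": "V_DOT4_I32_I8", "gfx1100": "V_DOT4_I32_IU8"}
    return table[target]
```

The point of the sketch is that the target check moves from the user into instruction selection.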
llvm/lib/Target/AMDGPU/VOP3PInstructions.td | ||
---|---|---|
437–453 | Are there unit tests for these somewhere? I don't really know the full history of these instructions and I'm worried there was some random edge case behavior change |
Properly handle neg modifier
llvm/lib/Target/AMDGPU/VOP3PInstructions.td | ||
---|---|---|
437–453 | Apologies, my mistake may have caused confusion. The main difference between V_DOT4_I32_IU8 on gfx1100 and V_DOT4_I32_I8 on gfx90a (for example) is that V_DOT4_I32_IU8 can be either signed or unsigned, depending on the NEG bit in the operand modifier. This target-specific feature is probably why there is special handling. See llvm.amdgcn.sudot4 for unit tests. |
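As a rough reference model of the signedness selection described above, the sketch below treats each operand as four packed 8-bit lanes and lets a per-operand flag choose signed or unsigned interpretation. The function names are hypothetical. Wrapping the result to 32 bits is an assumption, and the clamp operand is not modelled, so this is not a statement of the exact hardware behaviour.

```python
def lanes8(word, signed):
    """Split a 32-bit word into four 8-bit lanes, lowest byte first."""
    out = []
    for i in range(4):
        byte = (word >> (8 * i)) & 0xFF
        # Reinterpret the byte as two's complement when the operand is signed.
        out.append(byte - 256 if signed and byte >= 128 else byte)
    return out


def to_i32(x):
    """Wrap to a signed 32-bit value (assumption: no clamping)."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def dot4_iu8(a_signed, a, b_signed, b, acc):
    """Model of a dot4 whose operands are individually signed or unsigned."""
    prods = sum(x * y for x, y in zip(lanes8(a, a_signed), lanes8(b, b_signed)))
    return to_i32(prods + acc)


def sdot4(a, b, acc):
    """Signed dot4 expressed as the both-operands-signed case."""
    return dot4_iu8(True, a, True, b, acc)


print(sdot4(0xFF010203, 0x01010101, 0))                   # lanes 3, 2, 1, -1
print(dot4_iu8(False, 0xFF010203, False, 0x01010101, 0))  # lanes 3, 2, 1, 255
```

Inputs such as 0x80808080 (every lane at the signed minimum) are the kind of edge case worth checking with a model like this.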
llvm/lib/Target/AMDGPU/VOP3PInstructions.td | ||
---|---|---|
437–453 | I mean tests that actually execute, not lit tests |
llvm/lib/Target/AMDGPU/VOP3PInstructions.td | ||
---|---|---|
437–453 | So I've tracked down some unit tests. These calls are implemented as calls to __ockl_sdot4, which is, in turn, implemented as calls to target-specific builtins. For gfx1100, this lowers to the __builtin_amdgcn_sudot4 builtin. If you want, I can hack a compiler to lower __builtin_amdgcn_sudot4 into int_amdgcn_sdot4 and find a way to run these tests. |
llvm/lib/Target/AMDGPU/VOP3PInstructions.td | ||
---|---|---|
437–453 | It's probably worth mentioning that I have been validating correctness using the CK 8-bit and 16-bit test suite, which -- due to https://reviews.llvm.org/D155995 -- has many existing tests that lower into int_amdgcn_sdot4 for gfx1100. |
llvm/lib/Target/AMDGPU/VOP3PInstructions.td | ||
---|---|---|
437–453 | Ugh, this test is bad. It barely tests that it compiles. Really, these should test all the edge cases. |
437–453 | So apparently we have overlapping intrinsics. We should probably canonicalize llvm.amdgcn.sudot4 cases representable with sdot/udot in AMDGPUInstCombineIntrinsic |
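The canonicalization suggested above could follow a decision rule like the one sketched here. This is a hypothetical illustration of the idea, not code from AMDGPUInstCombineIntrinsic, and it does not model whether the resulting intrinsic is available on a given target.

```python
def canonical_dot4(a_signed: bool, b_signed: bool) -> str:
    """Pick the narrowest intrinsic that can represent a sudot4 call.

    Hypothetical helper: maps the two sign flags of llvm.amdgcn.sudot4
    to the intrinsic name a canonicalizing fold might prefer.
    """
    if a_signed and b_signed:
        return "llvm.amdgcn.sdot4"   # both operands signed
    if not a_signed and not b_signed:
        return "llvm.amdgcn.udot4"   # both operands unsigned
    return "llvm.amdgcn.sudot4"      # mixed signedness has no simpler form
```

Only the mixed-sign cases need the sudot4 form; the other two overlap with the existing intrinsics.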
llvm/lib/Target/AMDGPU/VOP3PInstructions.td | ||
---|---|---|
437–453 | Not even that; this is barely a front-end test. The optimizer can delete almost all of this. |
llvm/docs/AMDGPUUsage.rst | ||
---|---|---|
1062 | Think this needs a new line separator |