This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Support sdot4 / sdot8 intrinsics on gfx11
ClosedPublic

Authored by jrbyrnes on Aug 21 2023, 5:21 PM.

Details

Summary

This provides a uniform way to lower into the relevant instructions across all generations.

Diff Detail

Unit TestsFailed

Event Timeline

jrbyrnes created this revision.Aug 21 2023, 5:21 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 21 2023, 5:21 PM
jrbyrnes requested review of this revision.Aug 21 2023, 5:21 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 21 2023, 5:21 PM

Title is confusing, this isn't adding new intrinsics

llvm/lib/Target/AMDGPU/VOP3PInstructions.td
437–453

I don't understand how these cases are different, the intrinsic name is just slightly different from the instruction name?

jrbyrnes retitled this revision from [AMDGPU] Add sdot4 / sdot8 intrinsics for gfx11 to [AMDGPU] Support sdot4 / sdot8 intrinsics on gfx11.Aug 22 2023, 11:09 AM
jrbyrnes added inline comments.Aug 22 2023, 11:23 AM
llvm/lib/Target/AMDGPU/VOP3PInstructions.td
437–453

On all other targets with 8bit and 4bit signed dot, we codegen for int_amdgcn_sdot4 and int_amdgcn_sdot8. However, we don't support these on gfx1100 -- instead, gfx100 has int_amdgcn_sUdot4 / int_amdgcn_sUdot8. The result is that users of these intrinsics must always check the target to use the corresponding one (sudot4 for gfx1100, and sdot4 for all others).

This removes that responsibility from the user, so they are able to use sdot4 across all targets and generate the corresponding instructions.

arsenm added inline comments.Aug 22 2023, 12:05 PM
llvm/lib/Target/AMDGPU/VOP3PInstructions.td
437–453

Are there unit tests for these somewhere? I don't really know the full history of these instructions and I'm worried there was some random edge case behavior change

jrbyrnes updated this revision to Diff 552466.Aug 22 2023, 12:17 PM

Properly handle neg modifier

llvm/lib/Target/AMDGPU/VOP3PInstructions.td
437–453

Apologies, It is my mistake potentially causing confusion.

The main difference between V_DOT4_I32_IU8 on gfx1100 and V_DOT4_I32_I8 on gfx90a (for example), is that V_DOT4_I32_IU8 can be either signed or unsigned depending on NEG bit in operand modifier. This target specific feature is probably why there is special handling.

See llvm.amdgcn.sudot4 for unit tests.

arsenm added inline comments.Aug 22 2023, 12:24 PM
llvm/lib/Target/AMDGPU/VOP3PInstructions.td
437–453

I mean tests that actually execute, not lit tests

jrbyrnes added inline comments.Aug 23 2023, 9:45 AM
llvm/lib/Target/AMDGPU/VOP3PInstructions.td
437–453

So I've tracked down some unit tests.

https://github.com/ROCm-Developer-Tools/HIP/blob/b8965f1f3d58d7adf7d702c09e75ebf3dd718f8c/tests/src/deviceLib/hipTestDotFunctions.cpp#L34

These calls are implemented as calls to __ockl_sdot4:

https://github.com/ROCm-Developer-Tools/clr/blob/5914ac3c6e9b3848023a7fa25e19e560b1c38541/hipamd/include/hip/amd_detail/amd_math_functions.h#L148C60-L148C60

Which is, in turn, implemented as calls to target specific builtins:

https://github.com/RadeonOpenCompute/ROCm-Device-Libs/blob/46939af92ad91238c878a82aad2220822073ffa1/ockl/src/dots.cl#L124

For gfx1100, this lowers to __builtin_amdgcn_sudot4 builtin.

If you want, I can hack a compiler to lower the __builtin_amdgcn_sudot4 into int_amdgcn_sdot4 and find a way to run these tests.

jrbyrnes added inline comments.Aug 23 2023, 9:48 AM
llvm/lib/Target/AMDGPU/VOP3PInstructions.td
437–453

Probably worth mentioning is that I have been validating correctness using CK 8 bit and 16 bit test suite, which -- due to https://reviews.llvm.org/D155995 -- has many existing tests that lower into int_amdgcn_sdot4 for gfx1100.

arsenm accepted this revision.Aug 23 2023, 2:43 PM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/VOP3PInstructions.td
437–453

Ugh, this test is bad. It barely tests it compiles. Really these should test all the edge cases

437–453

So apparently we have overlapping intrinsics. We should probably canonicalize llvm.amdgcn.sudot4 cases representable with sdot/udot in AMDGPUInstCombineIntrinsic

This revision is now accepted and ready to land.Aug 23 2023, 2:43 PM
arsenm added inline comments.Aug 23 2023, 2:44 PM
llvm/lib/Target/AMDGPU/VOP3PInstructions.td
437–453

Not even that, this is barely a front end test. the optimizer can delete most all of this

Can you also document these intrinsics in AMDGPUUsage?

jrbyrnes updated this revision to Diff 553184.Aug 24 2023, 10:35 AM

Add note for intrinsics

jrbyrnes updated this revision to Diff 553196.Aug 24 2023, 10:50 AM

Include comments about dot*c, remove unintended changes to test

arsenm accepted this revision.Aug 24 2023, 12:45 PM
arsenm added inline comments.
llvm/docs/AMDGPUUsage.rst
1062

Think this needs a new line separator

Joe_Nash accepted this revision.Aug 24 2023, 1:10 PM
Joe_Nash added a reviewer: Petar.Avramovic.

Seems like a good convenience feature. LGTM

This revision was landed with ongoing or failed builds.Aug 25 2023, 11:46 AM
This revision was automatically updated to reflect the committed changes.