This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Support idot2 pattern.
Closed, Public

Authored by FarhanaAleen on Jul 30 2018, 3:47 PM.

Details

Summary

Transform

  add (mul ((i32)S0.x, (i32)S1.x),
       add (mul ((i32)S0.y, (i32)S1.y), (i32)S3))
  => i/udot2((v2i16)S0, (v2i16)S1, (i32)S3)
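In LLVM IR terms, the pattern being matched looks roughly like the following hand-written sketch (value names are illustrative; the udot2 form is the same with zext instead of sext):

  ; Sketch of the signed idot2 pattern (names are illustrative).
  ; %src0/%src1 are the <2 x i16> sources S0/S1; %acc is the i32 accumulator S3.
  define i32 @idot2_sketch(<2 x i16> %src0, <2 x i16> %src1, i32 %acc) {
  entry:
    %s0.x = extractelement <2 x i16> %src0, i64 0
    %s1.x = extractelement <2 x i16> %src1, i64 0
    %s0.y = extractelement <2 x i16> %src0, i64 1
    %s1.y = extractelement <2 x i16> %src1, i64 1
    ; Each element is sign-extended to i32 before the multiply.
    %s0.x.ext = sext i16 %s0.x to i32
    %s1.x.ext = sext i16 %s1.x to i32
    %s0.y.ext = sext i16 %s0.y to i32
    %s1.y.ext = sext i16 %s1.y to i32
    %mul.x = mul i32 %s0.x.ext, %s1.x.ext
    %mul.y = mul i32 %s0.y.ext, %s1.y.ext
    ; add(mul.x, add(mul.y, acc)) is the chain that folds into a dot2.
    %add.inner = add i32 %mul.y, %acc
    %add.outer = add i32 %mul.x, %add.inner
    ret i32 %add.outer
  }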

Diff Detail

Event Timeline

FarhanaAleen created this revision. Jul 30 2018, 3:47 PM

This seems to be a lot of code to try different commutations. Why can't you just use TableGen patterns, which would generate these for you?

Also, this seems to be missing a lot of hasOneUse checks

lib/Target/AMDGPU/SIISelLowering.cpp
4949–4954

I would split the intrinsics into a separate patch

test/CodeGen/AMDGPU/idot2.ll
271–272

Avoid using function calls in tests that aren't specifically testing calls

arsenm added inline comments. Jul 31 2018, 6:25 AM
lib/Target/AMDGPU/SIISelLowering.cpp
7354–7355

return the SDValue results rather than bool + out argument?

7358

Don't need explicit initializers

arsenm added inline comments. Jul 31 2018, 6:27 AM
lib/Target/AMDGPU/SIISelLowering.cpp
4949–4954

The intrinsic definition also seems to be missing from the patch

kzhuravl added inline comments. Jul 31 2018, 8:30 AM
lib/Target/AMDGPU/SIISelLowering.cpp
4949–4954

Intrinsic definitions were already committed.

Thanks Matt.

This seems to be a lot of code to try different commutations. Why can't you just use TableGen patterns, which would generate these for you?

We can't use TableGen patterns because all the types have already been promoted by then. Also, recognizing the vector elements would be very complex.

Also, this seems to be missing a lot of hasOneUse checks

They are not missing; I intentionally did not add them. In general, we care about throughput on the GPU, right? With multiple uses, generating the dot instruction saves 2 instructions in the best case (1 op having multiple uses) and 0 instructions in the worst case (all 3 ops having multiple uses).

As for latency, I was thinking the dot instruction should be as fast as one MAD, since it performs the multiplication on packed data, which can be as fast as one 16-bit multiplication. With multiple uses, if the dot is not generated, we will be generating 2 MADs. So it should be better to generate 1 dot instead of 2 MADs, right?
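To illustrate the multiple-uses case being discussed, here is a hand-written IR sketch (not taken from the patch): one product feeds both the add chain and an unrelated second use, so it stays live whether or not a dot instruction is formed.

  ; One product has a second use, so it must be computed and kept in a
  ; register even if the add chain is folded into a dot instruction.
  define i32 @idot2_multi_use(<2 x i16> %src0, <2 x i16> %src1, i32 %acc) {
  entry:
    %s0.x = extractelement <2 x i16> %src0, i64 0
    %s1.x = extractelement <2 x i16> %src1, i64 0
    %s0.x.ext = sext i16 %s0.x to i32
    %s1.x.ext = sext i16 %s1.x to i32
    %mul.x = mul i32 %s0.x.ext, %s1.x.ext
    ; Second, unrelated use of %mul.x.
    %reuse = shl i32 %mul.x, 1
    %s0.y = extractelement <2 x i16> %src0, i64 1
    %s1.y = extractelement <2 x i16> %src1, i64 1
    %s0.y.ext = sext i16 %s0.y to i32
    %s1.y.ext = sext i16 %s1.y to i32
    %mul.y = mul i32 %s0.y.ext, %s1.y.ext
    %add.inner = add i32 %mul.y, %acc
    %add.outer = add i32 %mul.x, %add.inner
    %combined = add i32 %add.outer, %reuse
    ret i32 %combined
  }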

  • Removed the SDValue initialization
  • Removed function calls from the testcases
  • Returned SDValue instead of bool + out argument

Thanks Matt.

This seems to be a lot of code to try different commutations. Why can't you just use TableGen patterns, which would generate these for you?

We can't use TableGen patterns because all the types have already been promoted by then. Also, recognizing the vector elements would be very complex.

But all the types are legal already?

We should think about treating the vector extracts as legal, and then cleaning up the mess from selecting those later. I think at this point doing what we do is causing more problems than it's worth

Also, this seems to be missing a lot of hasOneUse checks

They are not missing; I intentionally did not add them. In general, we care about throughput on the GPU, right? With multiple uses, generating the dot instruction saves 2 instructions in the best case (1 op having multiple uses) and 0 instructions in the worst case (all 3 ops having multiple uses).

Register pressure also matters, probably more. In the case where every instruction has another use, you're just increasing the register count without saving anything. I would need to think more carefully about the other cases.

Either way, there should be tests with the different operands with multiple uses.

As for latency, I was thinking the dot instruction should be as fast as one MAD, since it performs the multiplication on packed data, which can be as fast as one 16-bit multiplication. With multiple uses, if the dot is not generated, we will be generating 2 MADs. So it should be better to generate 1 dot instead of 2 MADs, right?

Are these really full rate instructions? They are marked as such now, but I thought these were quarter rate.

test/CodeGen/AMDGPU/idot2.ll
25

What happens if everything is done in i16? Can this still be matched?

Supported the transformation using TableGen patterns.

But all the types are legal already?

Yes, they are legal types, but we have packed instructions that can operate on a pair of 16-bit values, so a packed type can be treated as a 32-bit scalar type.
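Put differently, a <2 x i16> value occupies the same 32 bits of a register as an i32, so the combine can reinterpret the packed source as a plain scalar. A minimal sketch:

  ; A <2 x i16> vector and an i32 scalar are the same 32 bits of a
  ; register, so the packed source can be treated as a 32-bit value.
  define i32 @packed_as_scalar(<2 x i16> %v) {
    %bits = bitcast <2 x i16> %v to i32
    ret i32 %bits
  }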

We should think about treating the vector extracts as legal, and then cleaning up the mess from selecting those later. I think at this point doing what we do is causing more problems than it's worth

I agree. I also realized that we don't need to worry about the exact types. We should treat them as 32-bit since we have packed instructions.
test/CodeGen/AMDGPU/idot2.ll
25

The pattern will not be matched if everything is done in i16. I will support it in a separate patch.

If all the operations happen in 16 bits, the pattern is now detected. Added a testcase for that pattern.
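A sketch of the all-16-bit variant (hand-written; the actual testcase added to idot2.ll may differ in its details):

  ; Everything stays in i16: the multiplies and the adds are all
  ; 16 bits wide, with no extension to i32 anywhere in the chain.
  define i16 @idot2_i16(<2 x i16> %src0, <2 x i16> %src1, i16 %acc) {
  entry:
    %s0.x = extractelement <2 x i16> %src0, i64 0
    %s1.x = extractelement <2 x i16> %src1, i64 0
    %s0.y = extractelement <2 x i16> %src0, i64 1
    %s1.y = extractelement <2 x i16> %src1, i64 1
    %mul.x = mul i16 %s0.x, %s1.x
    %mul.y = mul i16 %s0.y, %s1.y
    %add.inner = add i16 %mul.y, %acc
    %add.outer = add i16 %mul.x, %add.inner
    ret i16 %add.outer
  }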

arsenm added inline comments. Aug 10 2018, 1:03 AM
lib/Target/AMDGPU/VOP3PInstructions.td
180–181

Matching sext_inreg directly in patterns is kind of unusual, especially for a specific size. I would expect this to be a known-sign-bits check?

test/CodeGen/AMDGPU/idot2.ll
10–13

These tests are a bit thin. Perhaps use update_llc_test_checks?

87–91

Should include a test with the explicit sext_inreg patterns done on i32, and with different bit widths than i16

arsenm added inline comments. Aug 10 2018, 1:07 AM
test/CodeGen/AMDGPU/idot2.ll
3

Should also include run lines with a CI and VI target to make sure those don't break
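For example, run lines along these lines would cover the older targets (a sketch; the exact CPUs and check prefixes in the committed test may differ):

  ; RUN: llc -mtriple=amdgcn -mcpu=hawaii -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,CI %s
  ; RUN: llc -mtriple=amdgcn -mcpu=fiji -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,VI %s
  ; RUN: llc -mtriple=amdgcn -mcpu=gfx906 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX9 %s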

Added:
  • a testcase with sign_extend_inreg happening on 32-bit from 8-bit
  • checks for SI and VI
  • update_llc_test_checks
lib/Target/AMDGPU/VOP3PInstructions.td
180–181

It's not unusual; I see all the other targets doing this sext_inreg matching with a specific size, especially for vectors. For vectors lying in a 32-bit register, we need to make sure that each element lies at a specific location inside the register.

I feel like doing the known-sign-bits check would be redundant, since the DAG combiner already performed this check before generating sign_extend_inreg. Also, performing this check does not provide any benefit in our vector case, since we cannot allow other sizes to be sign-extended to 16 bits or wider. The value has to come exactly from the lower/upper 16 bits of a 32-bit register, unless we rearrange the data layout inside the 32-bit register.
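For context, the sign_extend_inreg node typically arises from a shift pair like the one below at the IR level; the DAG combiner folds the shl/ashr pair into sign_extend_inreg before instruction selection (an illustrative snippet, not from the patch):

  ; shl+ashr by 16 sign-extends the low 16 bits of %x in place; the
  ; DAG combiner canonicalizes this to sign_extend_inreg i32, i16.
  define i32 @sext_low16_in_place(i32 %x) {
    %shl = shl i32 %x, 16
    %sext = ashr i32 %shl, 16
    ret i32 %sext
  }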

It looks like there are no further comments. In that case, I will go ahead and check it in.

arsenm accepted this revision. Aug 21 2018, 8:11 AM

LGTM

This revision is now accepted and ready to land. Aug 21 2018, 8:11 AM
This revision was automatically updated to reflect the committed changes.