Transform add (mul ((i32)S0.x, (i32)S1.x),
add( mul ((i32)S0.y, (i32)S1.y), (i32)S3) => i/udot2((v2i16)S0, (v2i16)S1, (i32)S3)
Paths
| Differential D50024
[AMDGPU] Support idot2 pattern. ClosedPublic Authored by FarhanaAleen on Jul 30 2018, 3:47 PM.
Details Summary Transform add (mul ((i32)S0.x, (i32)S1.x), add( mul ((i32)S0.y, (i32)S1.y), (i32)S3) => i/udot2((v2i16)S0, (v2i16)S1, (i32)S3)
Diff Detail Event TimelineHerald added subscribers: t-tye, tpr, dstuttard and 5 others. · View Herald TranscriptJul 30 2018, 3:47 PM Comment Actions This seems to be a lot of code to try different commutations. Why can't you just use table gen patterns which would generate these for you? Also seem to be missing a lot of hasOneUse checks
Comment Actions Thanks Matt.
We can't use table gen patterns because all the types are promoted by then. Also, recognition of vector elements would be very complex.
They are not missing, I did not add them intentionally. In general, we care about throughput on GPU, right? With multiple uses, generating dot instruction saves 2 instructions in the best case(1 op having multiple uses), 0 instructions in the worse case(all 3 ops having multiple uses). As per latency, I was thinking dot instruction should be as fast as one MAD since it will be performing the multiplication on packed data which can be as fast as performing one 16bit multiplication. With multiple uses, if dot is not generated, we will be generating 2 MADs. So, it should be better to generate 1 dot instead of 2 MADs, right? Comment Actions
Comment Actions
But all the types are legal already? We should think about treating the vector extracts as legal, and then cleaning up the mess from selecting those later. I think at this point doing what we do is causing more problems than it's worth
Register pressure also matters, probably more. In the every instruction has another use case, you're just increasing the register count without saving anything. I would need to think more carefully about the other cases. Either way, there should be tests with the different operands with multiple uses.
Are these really full rate instructions? They are marked as such now, but I thought these were quarter rate.
Comment Actions
Yes, they are legal types but we have packed instructions that can operation on a pair of 16bits, therefore packed types can be treated as 32bit scalar type.
I agree. I also realized that we don't need to worry about the exact types. We should treat them as 32 bit since we have packed instructions.
Comment Actions If all the operations happen in 16bit, the pattern is detected. Added a testcase of that pattern.
Comment Actions Added - a testcase with sign_extend_inreg happening on 32bit from 8bit.
Comment Actions It looks like there are no further comments. In that case, I will go ahead and check it in. This revision is now accepted and ready to land.Aug 21 2018, 8:11 AM Closed by commit rL340295: [AMDGPU] Support idot2 pattern. (authored by faaleen). · Explain WhyAug 21 2018, 9:22 AM This revision was automatically updated to reflect the committed changes.
Revision Contents
Diff 160646 lib/Target/AMDGPU/AMDGPUInstructions.td
lib/Target/AMDGPU/VOP3PInstructions.td
test/CodeGen/AMDGPU/idot2.ll
|
Patterns matching sext_inreg directly is kind of unusual, especially for a specific size. I would expect this to be a number of known sign bits check?