Page MenuHomePhabricator

[AMDGPU] Improve Codegen for build_vector
ClosedPublic

Authored by jpages on Mar 5 2021, 3:07 PM.

Details

Summary

Improve the code generation of build_vector.
Use the v_pack_b32_f16 instruction instead of
v_and_b32 + v_lshl_or_b32

Diff Detail

Event Timeline

jpages created this revision.Mar 5 2021, 3:07 PM
jpages requested review of this revision.Mar 5 2021, 3:07 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 5 2021, 3:07 PM
arsenm added inline comments.Mar 5 2021, 3:09 PM
llvm/lib/Target/AMDGPU/SIInstructions.td
2284–2286

This isn't a simple bitpacking, this has FP output effects like flushing

arsenm requested changes to this revision.Mar 8 2021, 5:54 AM
This revision now requires changes to proceed.Mar 8 2021, 5:54 AM
jpages updated this revision to Diff 332745.Mar 23 2021, 12:07 PM

It appears that this instruction is harder to select than expected on integers.

One suggested way in offline discussions was to use isCanonicalized on the inputs.
With these modifications the instruction will be matched if and only if the inputs have been flushed/quieted already (for example by some floating point operations before).

Let me know it that seems reasonable or if I'm missing something.

arsenm added inline comments.Mar 23 2021, 12:13 PM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2702–2704 ↗(On Diff #332745)

We have some stripBitcast and other helpers to simplify this

foad added a comment.Mar 24 2021, 4:22 AM

Please fix the test first, and then we can iterate on the code changes required to select v_pack_b32_f16.

As a general comment I was hoping that you could still implement this using patterns in SIInstructions.td, but using a PatFrag like HasOneUseUnaryOp to call into some C++ code just to check the isCanonicalized condition. The advantage of this is that it should work for GlobalISel as well as SelectionDAG (as long as you set GISelPredicateCode in the PatFrag).

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
2690–2691 ↗(On Diff #332745)

"bitcast" not "bitconvert".

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
9624

If we're going to define isCanonicalized on integer types (which seems reasonable to me) then how about changing this to just:

case ISD::BITCAST:
  return isCanonicalized(DAG, Src, MaxDepth - 1);

... and moving the "hack round the mess" code into case ISD::TRUNCATE?

llvm/test/CodeGen/AMDGPU/v_pack.ll
3

Maybe add a second RUN line with "llc -global-isel ..."?

31–36

I think you are testing the wrong thing here. The result of "fadd float" is canonicalized, but if you extract the low 16 bits and bitcast it to half, then you get a meaningless value that is not canonicalized.

Instead you should use two "fadd half" instructions and insert the results into a <2 x half>, with no bitcasting.

jpages updated this revision to Diff 343426.Thu, May 6, 8:55 AM

Rebased to it in Tablegen, it should be cleaner now. I removed the too complex pattern with the conversions from i16 to f16.
Instead this v_pack_b32_f16 instruction will be used only for f16 when we are sure the two inputs have already been flushed if needed.

I tried to do it for global isel, but there are several problems with the legalization of build_vector in this version. build_vector seems to be expanded into g_build_vector_trunc or other variations depending on the test. I'm not sure of what needs to be done in gisel, so I let a TODO for later.

foad added inline comments.Fri, May 7, 2:16 AM
llvm/lib/Target/AMDGPU/AMDGPUInstructions.td
204–205

In MIR, instructions have dst operands followed by src operands. You probably need to check operand 1 and operand 2 here.

foad added inline comments.Fri, May 7, 2:19 AM
llvm/lib/Target/AMDGPU/AMDGPUInstructions.td
188

This frag matches: a binary operator whose inputs are both canonicalized.

I think would be cleaner to have a frag (maybe a PatLeaf?) that matches just: a node that is canonicalized. Then instead of a pattern like "is_canonicalized<build_vector> src0, src1" you would write "build_vector (is_canonicalized src0), (is_canonicalized src1)". Unfortunately my TableGen skills are not great, so I don't know exactly how to implement this.

arsenm added inline comments.Fri, May 7, 6:27 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
2284–2286

I believe source modifiers should work as normal, so you can use the VOP3Mods complex patterns for the sources

llvm/test/CodeGen/AMDGPU/v_pack.ll
149

Can you add some cases with source modifier combinations?

jpages updated this revision to Diff 344222.Mon, May 10, 3:45 PM

Added the VOP3Mods in the pattern and associated tests, thanks for the suggestion!

jpages marked 3 inline comments as done.Mon, May 10, 3:46 PM
jpages added inline comments.Mon, May 10, 3:58 PM
llvm/lib/Target/AMDGPU/AMDGPUInstructions.td
188

I tried to do it as you suggested. My tableGen skills are not great so maybe there is the way but I couldn't find it.

From my understanding:

  • A PatLeaf can not have an argument so I can't write is_canonicalized $src
  • The only solution seems to use a PatFrag instead
  • I have to give a "type" to such a PatFrag, as well as specifying the number of parameters. I tried to do something generic but the unary pattern looked like that is_canonicalized_unary<BitConvert> or some other SDNode with only one operand.
arsenm accepted this revision.Mon, May 10, 4:09 PM

LGTM with test nit

llvm/test/CodeGen/AMDGPU/v_pack.ll
8

s/v2half/v2f16 for consistency

This revision is now accepted and ready to land.Mon, May 10, 4:09 PM
jpages updated this revision to Diff 344532.Tue, May 11, 12:15 PM
jpages marked an inline comment as done.
This revision was automatically updated to reflect the committed changes.