This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU/SDAG: Refine the fold to v_mad_[iu]64_[iu]32
ClosedPublic

Authored by nhaehnle on Apr 14 2022, 10:39 PM.

Details

Summary

Only fold for uniform values on pre-GFX9 chips. GFX9+ allow us
to keep the calculation entirely on the SALU.

We never fold if the mul has multiple uses, since that effectively
duplicates the (expensive) multiply.

Finally, we expand 64x32 and 64x64 multiplies here as well, if they
feed into an addition. This results in better code generation than
the generic expansion for such multiplies because we end up using
the accumulator of the MAD instructions.
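
As a rough illustration of the 64x32 case (not the code in this patch; the helper names below are made up), the multiply-add decomposes into two MAD-shaped operations whose accumulators absorb both the addend and the carry between halves:

#include <cstdint>

// 32x32 -> 64 multiply plus a 64-bit accumulator, the basic v_mad_u64_u32 shape.
static uint64_t mad_u64_u32(uint32_t a, uint32_t b, uint64_t acc) {
  return (uint64_t)a * b + acc;
}

// (64-bit a) * (32-bit b) + (64-bit c), written so the addend c and the high
// partial product both flow through the accumulator instead of separate adds.
static uint64_t mul64x32_add(uint64_t a, uint32_t b, uint64_t c) {
  uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
  uint64_t lo = mad_u64_u32(a_lo, b, c);         // low product + addend
  uint64_t hi = mad_u64_u32(a_hi, b, lo >> 32);  // high product + carried high half
  return ((uint64_t)(uint32_t)hi << 32) | (uint32_t)lo;
}

The 64x64 case follows the same pattern with one more cross product.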

Diff Detail

Event Timeline

nhaehnle created this revision. Apr 14 2022, 10:39 PM
Herald added a project: Restricted Project. · View Herald Transcript · Apr 14 2022, 10:39 PM
nhaehnle requested review of this revision. Apr 14 2022, 10:39 PM
Herald added a subscriber: wdng. · View Herald Transcript

I would have expected this to put this back together after the generic multiply expansion though.

Also it would be nice to get this one ported to GlobalISel

llvm/test/CodeGen/AMDGPU/mad_64_32.ll
535–536

This is a regression? It looks to be the same cycle count for more code size

rampitec added inline comments. Apr 15 2022, 10:20 AM
llvm/test/CodeGen/AMDGPU/mad_64_32.ll
535–536

Actually, since gfx90a, v_mad_u64/i64 is full rate, so it is even more cycles in that case.

I would have expected this to put this back together after the generic multiply expansion though.

So the original code is (add (mul x, y), z). The generic multiply expansion turns that into (add M, z), where M is expanded into something along the lines of:

lo,hi = umul_lohi (trunc x), (trunc y)
a = mul (trunc x), (shr y)
b = mul (shr x), (trunc y)
M = or (shl (zext (add3 hi, a, b)), 32), (zext lo)

The umul_lohi becomes v_mad_u64_u32 but it's used only as an addition, as you can see in the pre-change lit tests. Reliably reassociating the code sequence above into a form where the pre-change performAddCombine would trigger seems pretty complicated.
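
For concreteness, a plain C++ equivalent of what that sequence computes for the low 64 bits (variable names chosen to mirror the DAG values above, not taken from the compiler):

#include <cstdint>

// Low 64 bits of a 64x64 multiply, mirroring the umul_lohi/mul/add3 sequence.
static uint64_t generic_mul64_lo(uint64_t x, uint64_t y) {
  uint32_t xl = (uint32_t)x, xh = (uint32_t)(x >> 32);
  uint32_t yl = (uint32_t)y, yh = (uint32_t)(y >> 32);
  uint64_t p = (uint64_t)xl * yl;              // lo,hi = umul_lohi (trunc x), (trunc y)
  uint32_t lo = (uint32_t)p, hi = (uint32_t)(p >> 32);
  uint32_t a = xl * yh;                        // a = mul (trunc x), (shr y)
  uint32_t b = xh * yl;                        // b = mul (shr x), (trunc y)
  return ((uint64_t)(hi + a + b) << 32) | lo;  // M = or (shl (zext (add3 hi, a, b)), 32), (zext lo)
}

The z addend is then added to M as a whole, so without reassociating across this entire sequence it never reaches the MAD's accumulator input.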

Also it would be nice to get this one ported to GlobalISel

Agreed, but it looks like GlobalISel doesn't generate mad_64_32 at all at the moment. So it's more than just porting this particular tweak.

llvm/test/CodeGen/AMDGPU/mad_64_32.ll
535–536

Good to know, I'm going to rework that part.

Agreed, but it looks like GlobalISel doesn't generate mad_64_32 at all at the moment. So it's more than just porting this particular tweak.

Right, that's the point. Might as well start porting combines as they're looked at again

nhaehnle updated this revision to Diff 423758. Apr 19 2022, 3:59 PM
  • use hasSMulHi()
  • tweak the heuristics for when not to fold based on multiple uses:
    • always fold when multiplication is fast
    • otherwise, fold unless there is a non-foldable use or at least three foldable uses; this is aimed at reducing cycle counts with code density as a tie-breaker (a small sketch of this rule follows below)
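
A rough sketch of that rule in isolation (illustrative only; the function and parameter names are not taken from the patch):

// Decide whether to fold (add (mul x, y), z) into a MAD when the multiply has
// several users. "Foldable" means the user is itself an add that could absorb
// the multiply.
static bool shouldFoldMulIntoMad(bool MulIsFullRate, unsigned FoldableUses,
                                 unsigned NonFoldableUses) {
  if (MulIsFullRate)
    return true;            // duplicating the multiply is cheap anyway
  if (NonFoldableUses > 0)
    return false;           // the multiply has to be emitted regardless
  return FoldableUses < 3;  // past that, code growth outweighs the cycles saved
}
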
arsenm added inline comments. Apr 25 2022, 1:30 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
10750

s/number of multiply/number of multiplies/

10756

I don't understand why you're checking this if you bail on anything that is not ISD::ADD. I guess it would make sense if you were handling the carry-out adds in a separate patch?

10803

Could just hardcode this to i32 instead of going through getShiftAmountConstant

10808–10813

A comment with the DAG formed here would be helpful

10822

Ditto

foad added a comment. Apr 26 2022, 3:06 AM

Only fold for uniform values on pre-GFX9 chips.

This made my brain hurt. I think it might be clearer as "For uniform values, only fold on pre-GFX9 chips." English is really imprecise.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
10756

LHS is the MUL here, not an ADD, so there's really no need to check ResNo.

10804

I don't know if it makes any practical difference, but other code like AMDGPUTargetLowering::LowerUDIVREM64 uses EXTRACT_ELEMENT to split an i64 into a pair of i32s, and BITCAST(BUILD_VECTOR ...) to reassemble them.
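
For reference, that idiom looks roughly like this (a sketch only; the helper names are invented, and only the EXTRACT_ELEMENT / getBuildVector / getBitcast calls mirror what LowerUDIVREM64 does):

// Split an i64 SDValue into its two i32 halves without shifts.
static std::pair<SDValue, SDValue> splitI64(SelectionDAG &DAG, const SDLoc &DL,
                                            SDValue V) {
  SDValue Zero = DAG.getConstant(0, DL, MVT::i32);
  SDValue One = DAG.getConstant(1, DL, MVT::i32);
  SDValue Lo = DAG.getNode(ISD::EXTRACT_ELEMENT, DL, MVT::i32, V, Zero);
  SDValue Hi = DAG.getNode(ISD::EXTRACT_ELEMENT, DL, MVT::i32, V, One);
  return {Lo, Hi};
}

// Reassemble an i64 from two i32 halves via BITCAST(BUILD_VECTOR Lo, Hi).
static SDValue joinI64(SelectionDAG &DAG, const SDLoc &DL,
                       SDValue Lo, SDValue Hi) {
  return DAG.getBitcast(MVT::i64,
                        DAG.getBuildVector(MVT::v2i32, DL, {Lo, Hi}));
}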

arsenm added inline comments. Apr 26 2022, 6:25 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
10804

Using the shift adds extra steps. The combine on 64-bit shifts will turn this into the vector build.

nhaehnle updated this revision to Diff 426533. May 2 2022, 3:38 PM
nhaehnle marked 8 inline comments as done.

Address review comments

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
10756

Makes sense, thanks.

10804

I'm rearranging the code slightly to use EXTRACT_ELEMENT and BUILD_VECTOR more directly. However, I'm keeping some TRUNCATEs around (instead of extracting the low part) because that turns out to result in better code generation in some tests. I also noticed a crash while doing so and fixed it.

10808–10813

Ok.

foad accepted this revision. May 4 2022, 2:13 AM

LGTM.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
10734

Nit: use a range-based for loop over uses()?
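
That is, something along these lines (illustrative fragment; N stands for the node whose users are being inspected, and the ISD::ADD check is just an example):

// Range-based iteration over the users of the node.
for (SDNode *User : N->uses()) {
  // Bail if any user is not an add that could absorb the multiply.
  if (User->getOpcode() != ISD::ADD)
    return SDValue();
}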

This revision is now accepted and ready to land. May 4 2022, 2:13 AM
This revision was landed with ongoing or failed builds. May 10 2022, 7:16 AM
This revision was automatically updated to reflect the committed changes.