This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Implement widening multiplies with v_mad_i64_i32/v_mad_u64_u32
ClosedPublic

Authored by foad on Nov 16 2021, 4:26 AM.

Details

Summary

Select SelectionDAG ops smul_lohi/umul_lohi to
v_mad_i64_i32/v_mad_u64_u32 respectively, with an addend of 0.
v_mul_lo, v_mul_hi and v_mad_i64/u64 are all quarter-rate instructions
so it is better to use one instruction than two.

Further improvements are possible to make better use of the addend
operand, but this is already a strict improvement over what we have
now.

Diff Detail

Event Timeline

foad created this revision.Nov 16 2021, 4:26 AM
foad requested review of this revision.Nov 16 2021, 4:26 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 16 2021, 4:26 AM
foad added a comment.Nov 16 2021, 4:28 AM

The change in the instruction mix in the generated lit tests looks like this:

$ git show | awk '/^-/&&$3~/[sv]_/{a[$3]--}/^+/&&$3~/[sv]_/{a[$3]++}END{for(i in a)if(a[i])printf"%+d %s\n",a[i],i}' | sort -n
-604 v_mul_hi_u32
-567 v_mul_lo_u32
-31 s_mul_i32
-15 v_mul_hi_i32
-10 s_and_b32
-4 v_cndmask_b32_e32
-2 s_mov_b32
-2 v_cmp_ne_u32_e32
-2 v_lshlrev_b32_e32
-2 v_lshrrev_b32_e32
-2 v_mov_b32_e32
+15 v_mad_i64_i32
+1 v_mul_hi_i32_i24_e32
+20 v_addc_u32_e32
+21 v_add_i32_e32
+2 s_waitcnt
+2 v_cmp_ne_u32_e64
+2 v_mul_i32_i24_e64
+2 v_mul_u32_u24_e64
+4 v_cndmask_b32_e64
+4 v_mul_i32_i24_e32
+4 v_mul_u32_u24_e32
+617 v_mad_u64_u32

So it is mostly replacing v_mul_hi_u32+v_mul_lo_u32 with v_mad_u64_u32.

GlobalISel version?

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
1026

Could this try to do better than 0 if the source is an add?

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
812–813

Seems missing a subtarget check, I'm pretty sure gfx6 didn't have mad_u64_u32

foad added a comment.Nov 16 2021, 6:23 AM

GlobalISel version?

Yes that would be nice, but GlobalISel doesn't seem to have a GMIR opcode corresponding to [su]mul_lohi, at least not yet, so I'm not sure what the best approach would be. Anyway this patch is big enough already so I'd prefer to keep it for SelectionDAG and work on GlobalISel later.

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
1026

No, not if the source is an add. It would have to be if the result is used in an add, like what SITargetLowering::performAddCombine does for i64 mul. That's the further improvement I alluded to in the commit message.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
812–813

Good point, will fix.

foad updated this revision to Diff 388893.Nov 22 2021, 6:00 AM

Add missing subtarget check.

arsenm accepted this revision.Nov 23 2021, 3:04 PM
This revision is now accepted and ready to land.Nov 23 2021, 3:04 PM