This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU][GlobalISel] Code quality: Combine V_RSQ
ClosedPublic

Authored by matejam on Sep 20 2021, 8:16 AM.

Details

Summary

Combine V_RCP and V_SQRT into V_RSQ on AMDGPU for GlobalISel.
A similar combiner already exists for SDAG.

Diff Detail

Event Timeline

matejam created this revision.Sep 20 2021, 8:16 AM
matejam requested review of this revision.Sep 20 2021, 8:16 AM
arsenm added inline comments.Sep 20 2021, 2:07 PM
llvm/lib/Target/AMDGPU/SIInstructions.td
830

I don't understand this change. Are you saying this is a dead selection pattern for the DAG? Should we be doing this in the combiner instead and just delete this? That way we could consider the fast math flags and not rely on the function attribute

matejam added inline comments.Sep 21 2021, 5:54 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
830

I am, with or without this pattern SDAG combines v_sqrt + v_rcp into v_rsq. I'm not sure which would be better to leave this as a pattern or write a combiner for this. In fact SDAG doesn't even need any flags to combine into v_rsq.

arsenm added inline comments.Sep 21 2021, 8:59 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
830

If this is a dead pattern in the DAG, I would just delete it. When you say without flags, I assume you mean with the unsafe attribute? I'm a bit worried this pattern is just broken as-is. This depends on the denormal mode, and also could be augmented to use the per-instruction flags. I think it's safer to move this to a combine.

matejam added inline comments.Oct 4 2021, 6:20 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
830

I tried deleting the SDAG combiner (SITargetLowering::performRcpCombine()) for v_rsq and then SDAG uses this pattern instead. So I assume it's either this pattern without the SDAG rcp combiner or the SDAG rcp combiner + new GlobalISel combiner?

matejam updated this revision to Diff 377812.Oct 7 2021, 5:42 AM

Instead of a pattern, use a combiner on AMDGPU for GlobalISel.

matejam edited the summary of this revision. (Show Details)Oct 7 2021, 5:43 AM
arsenm requested changes to this revision.Oct 20 2021, 4:56 PM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/SIInstructions.td
832

Can probably delete the definition of RsqPat too

llvm/test/CodeGen/AMDGPU/GlobalISel/combine-rsq.mir
15–17

This looks like it lost the fast math flags

This revision now requires changes to proceed.Oct 20 2021, 4:56 PM
matejam updated this revision to Diff 383014.Oct 28 2021, 6:12 AM

Delete RsqPat pattern definition and uses and copy the flags from the original instruction to the newly built instruction (fast math flags...).

arsenm accepted this revision.Oct 28 2021, 7:03 AM
This revision is now accepted and ready to land.Oct 28 2021, 7:03 AM
foad added a comment.Oct 28 2021, 7:09 AM

Do we also need to handle:

  • sqrt(rcp(x)) as well as rcp(sqrt(x)) ?
  • 1.0 / x as well as llvm.amdgcn.rcp(x) ?
matejam updated this revision to Diff 384418.Nov 3 2021, 6:31 AM

Added implementation for all possible cases which should be combined into rsq (rcp(sqrt(x)), sqrt(rcp(x)), 1/sqrt(x), sqrt(1/x)).

matejam updated this revision to Diff 384483.Nov 3 2021, 9:03 AM

Formatting.

matejam updated this revision to Diff 384511.Nov 3 2021, 10:27 AM

Formatting.

foad added a comment.Nov 4 2021, 4:47 AM

Added implementation for all possible cases which should be combined into rsq (rcp(sqrt(x)), sqrt(rcp(x)), 1/sqrt(x), sqrt(1/x)).

I thought this would be two separate combines:

  1. (1.0 / x) -> (rcp x)
  2. (sqrt (rcp x)) or (rcp (sqrt x)) -> (rsq x)

Is there some reason we don't implement the first combine, e.g. because of the precision of the rcp instruction is not good enough? What does SelectionDAG do?

Added implementation for all possible cases which should be combined into rsq (rcp(sqrt(x)), sqrt(rcp(x)), 1/sqrt(x), sqrt(1/x)).

I thought this would be two separate combines:

  1. (1.0 / x) -> (rcp x)
  2. (sqrt (rcp x)) or (rcp (sqrt x)) -> (rsq x)

Is there some reason we don't implement the first combine, e.g. because of the precision of the rcp instruction is not good enough? What does SelectionDAG do?

If we run an .ll test which has (1.0 / x), by the time it gets to the amdgpu-postlegalizer-combiner it will be combined into rcp, just like SDAG.
This is a 'fake' case of a .mir test, where we put the (1.0 / x) in the test and let the combiner take care of that.

foad added inline comments.Nov 25 2021, 4:41 AM
llvm/lib/Target/AMDGPU/AMDGPUPostLegalizerCombiner.cpp
217

I still think it's wrong to handle G_FDIV here.

  • it's unnecessary, because we are running post-legalizer and G_FDIV will always get legalized to something else.
  • even if G_FDIV did appear here, I don't think it should be combined into an rsq instruction without checking for all the fast/unsafe math flags, like in AMDGPULegalizerInfo::legalizeFastUnsafeFDIV.

I think we just need an IR test to check that fdiv float 1.0, %x1 with appropriate fast math flags get combined with @llvm.fsqrt to generate a v_rsq instruction.

matejam updated this revision to Diff 390062.Nov 26 2021, 7:52 AM

Added .ll test. Don't cover the G_FDIV + G_FSQRT case, only with rcp intrinsic (by the time it gets to the postlegalizer it will be transformed to that).

foad accepted this revision.Nov 26 2021, 8:30 AM

LGTM, thanks!

llvm/lib/Target/AMDGPU/AMDGPUPostLegalizerCombiner.cpp
234

I'm not sure whether it's best to copy flags from MI or RcpSrcMI or somehow combine both. I guess this is fine for now.