Combine V_RCP and V_SQRT into V_RSQ on AMDGPU for GlobalISel.
A similar combiner already exists for SDAG.
Details
- Reviewers
foad arsenm Petar.Avramovic mbrkusanin
llvm/lib/Target/AMDGPU/SIInstructions.td | ||
---|---|---|
830 | I don't understand this change. Are you saying this is a dead selection pattern for the DAG? Should we be doing this in the combiner instead and just delete this? That way we could consider the fast math flags and not rely on the function attribute |
llvm/lib/Target/AMDGPU/SIInstructions.td | ||
---|---|---|
830 | I am. With or without this pattern, SDAG combines v_sqrt + v_rcp into v_rsq. I'm not sure whether it would be better to leave this as a pattern or to write a combiner for it. In fact, SDAG doesn't even need any flags to combine into v_rsq. |
llvm/lib/Target/AMDGPU/SIInstructions.td | ||
---|---|---|
830 | If this is a dead pattern in the DAG, I would just delete it. When you say without flags, I assume you mean with the unsafe attribute? I'm a bit worried this pattern is just broken as-is. This depends on the denormal mode, and also could be augmented to use the per-instruction flags. I think it's safer to move this to a combine. |
llvm/lib/Target/AMDGPU/SIInstructions.td | ||
---|---|---|
830 | I tried deleting the SDAG combiner (SITargetLowering::performRcpCombine()) for v_rsq and then SDAG uses this pattern instead. So I assume it's either this pattern without the SDAG rcp combiner or the SDAG rcp combiner + new GlobalISel combiner? |
Deleted the RsqPat pattern definition and its uses, and copied the flags (fast math flags, etc.) from the original instruction to the newly built instruction.
Do we also need to handle:
- sqrt(rcp(x)) as well as rcp(sqrt(x)) ?
- 1.0 / x as well as llvm.amdgcn.rcp(x) ?
Added implementation for all possible cases which should be combined into rsq (rcp(sqrt(x)), sqrt(rcp(x)), 1/sqrt(x), sqrt(1/x)).
I thought this would be two separate combines:
- (1.0 / x) -> (rcp x)
- (sqrt (rcp x)) or (rcp (sqrt x)) -> (rsq x)
Is there some reason we don't implement the first combine, e.g. because the precision of the rcp instruction is not good enough? What does SelectionDAG do?
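To make the two-combine split above concrete, here is a minimal, self-contained sketch on a toy expression tree. It is not LLVM code: `Expr`, `Op`, `combineFDivToRcp`, `combineToRsq` and `simplify` are hypothetical names invented for illustration, and the sketch ignores everything the real combine has to check (types, fast math flags, denormal mode, number of uses).

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Toy expression node for illustration only; NOT an LLVM MachineInstr.
enum class Op { Var, Const, FDiv, Sqrt, Rcp, Rsq };

struct Expr {
  Op op = Op::Var;
  double value = 0.0;              // used only by Op::Const
  std::string name;                // used only by Op::Var
  std::shared_ptr<Expr> lhs, rhs;  // rhs is used only by Op::FDiv
};
using ExprPtr = std::shared_ptr<Expr>;

ExprPtr var(std::string n) {
  auto e = std::make_shared<Expr>();
  e->op = Op::Var;
  e->name = std::move(n);
  return e;
}

ExprPtr constant(double v) {
  auto e = std::make_shared<Expr>();
  e->op = Op::Const;
  e->value = v;
  return e;
}

ExprPtr unary(Op op, ExprPtr a) {
  assert(a && "unary op needs an operand");
  auto e = std::make_shared<Expr>();
  e->op = op;
  e->lhs = std::move(a);
  return e;
}

ExprPtr fdiv(ExprPtr a, ExprPtr b) {
  auto e = std::make_shared<Expr>();
  e->op = Op::FDiv;
  e->lhs = std::move(a);
  e->rhs = std::move(b);
  return e;
}

// Combine 1: (1.0 / x) -> (rcp x). Returns e unchanged if it does not match.
ExprPtr combineFDivToRcp(const ExprPtr &e) {
  if (e->op == Op::FDiv && e->lhs->op == Op::Const && e->lhs->value == 1.0)
    return unary(Op::Rcp, e->rhs);
  return e;
}

// Combine 2: (rcp (sqrt x)) or (sqrt (rcp x)) -> (rsq x).
ExprPtr combineToRsq(const ExprPtr &e) {
  bool rcpOfSqrt = e->op == Op::Rcp && e->lhs->op == Op::Sqrt;
  bool sqrtOfRcp = e->op == Op::Sqrt && e->lhs->op == Op::Rcp;
  if (rcpOfSqrt || sqrtOfRcp)
    return unary(Op::Rsq, e->lhs->lhs);
  return e;
}

// Rewrite operands first, then try both combines at the current node.
ExprPtr simplify(const ExprPtr &e) {
  if (!e)
    return e;
  auto copy = std::make_shared<Expr>(*e);
  copy->lhs = simplify(e->lhs);
  copy->rhs = simplify(e->rhs);
  return combineToRsq(combineFDivToRcp(copy));
}

// Render an expression as text so results are easy to compare.
std::string print(const ExprPtr &e) {
  switch (e->op) {
  case Op::Var:   return e->name;
  case Op::Const: return std::to_string(e->value);
  case Op::FDiv:  return "fdiv(" + print(e->lhs) + ", " + print(e->rhs) + ")";
  case Op::Sqrt:  return "sqrt(" + print(e->lhs) + ")";
  case Op::Rcp:   return "rcp(" + print(e->lhs) + ")";
  case Op::Rsq:   return "rsq(" + print(e->lhs) + ")";
  }
  return "";
}
```

With the first combine in place, the second one only has to recognise the two rcp/sqrt nestings; the `1.0 / x` spellings fall out of running both in sequence.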
If we run an .ll test that contains (1.0 / x), by the time it reaches the amdgpu-postlegalizer-combiner the division will already have been combined into rcp, just like in SDAG.
The G_FDIV handling only covers a 'fake' .mir test case, where we put (1.0 / x) directly into the test and let the combiner take care of it.
llvm/lib/Target/AMDGPU/AMDGPUPostLegalizerCombiner.cpp | ||
---|---|---|
219 | I still think it's wrong to handle G_FDIV here.
I think we just need an IR test to check that fdiv float 1.0, %x1 with appropriate fast math flags gets combined with @llvm.sqrt to generate a v_rsq instruction. |
Added an .ll test. The combine no longer covers the G_FDIV + G_FSQRT case, only the rcp intrinsic (by the time the code reaches the post-legalizer combiner, the fdiv will already have been transformed into it).
LGTM, thanks!
llvm/lib/Target/AMDGPU/AMDGPUPostLegalizerCombiner.cpp | ||
---|---|---|
234 | I'm not sure whether it's best to copy flags from MI or RcpSrcMI or somehow combine both. I guess this is fine for now. |
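For reference, the three options mentioned in the comment above can be sketched on a toy bitmask. The flag names and bit values below are hypothetical and are not LLVM's MachineInstr flag encoding; which policy is right for the real combine is exactly the open question, so this only illustrates what each choice would mean.

```cpp
#include <cstdint>

// Hypothetical fast-math flag bits for illustration; not LLVM's real values.
enum FastMathFlag : std::uint8_t {
  FmNone = 0,
  FmNoNans = 1 << 0,
  FmNoInfs = 1 << 1,
  FmAllowReciprocal = 1 << 2,
  FmApproxFunc = 1 << 3,
};

using Flags = std::uint8_t;

// Option A: copy the flags of the outer (matched) instruction.
constexpr Flags flagsFromOuter(Flags outer, Flags /*inner*/) { return outer; }

// Option B: copy the flags of the inner (source) instruction.
constexpr Flags flagsFromInner(Flags /*outer*/, Flags inner) { return inner; }

// Option C: keep only the flags present on both instructions. This is the
// conservative choice: the result never claims a relaxation that one of the
// two replaced instructions did not carry.
constexpr Flags flagsIntersect(Flags outer, Flags inner) {
  return static_cast<Flags>(outer & inner);
}
```

Options A and B can end up asserting a flag on the merged instruction that only one of the originals had; option C cannot, at the cost of sometimes dropping a flag.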
clang-format: please reformat the code