This patch reduces the number of adds in the test case of https://github.com/RadeonOpenCompute/ROCm/issues/488.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Event Timeline
Even though this patch reduces the number of adds in the added combine-srl-add.ll test, it introduces a copy from vcc_hi. I.e. we get:
v_add_co_u32_e32 v0, vcc, v0, v1
v_mov_b32_e32 v1, vcc_hi
v_add_co_u32_e32 v0, vcc, vcc_lo, v2
v_addc_co_u32_e32 v1, vcc, v3, v1, vcc
The problem is that:
%7:vgpr_32, %8:sreg_64 = V_ADD_CO_U32_e64 %0:vgpr_32, %1:vgpr_32, 0, implicit $exec
%9:vreg_64 = V_ADD_U64_PSEUDO killed %13:vreg_64, killed %8:sreg_64, implicit-def dead $vcc, implicit-def dead $exec, implicit $exec
after finalize-isel becomes:
%7:vgpr_32, %8:sreg_64 = V_ADD_CO_U32_e64 %0:vgpr_32, %1:vgpr_32, 0, implicit $exec
%18:vgpr_32 = COPY %13.sub0:vreg_64
%19:sgpr_32 = COPY %8.sub0:sreg_64
%20:vgpr_32 = COPY %13.sub1:vreg_64
%21:sgpr_32 = COPY %8.sub1:sreg_64
%14:vgpr_32, %16:sreg_64_xexec = V_ADD_CO_U32_e64 %18:vgpr_32, %19:sgpr_32, 0, implicit $exec
%21:sgpr_32 = COPY %8.sub1:sreg_64 is created by adding SrcReg1Sub1 to the MachineInstr HiHalf in the expansion of V_ADD_U64_PSEUDO in SITargetLowering::EmitInstrWithCustomInserter(). This is responsible for the vcc_hi copy. How can we eliminate the copy correctly? I only found clumsy ways to do this :(
This seems like a target-independent optimization that InstCombine could do on IR, using the uadd_with_overflow intrinsic.
Yes. I think this pattern can also appear as a result of legalization expansions, so could also be a generic combine
llvm/test/CodeGen/AMDGPU/combine-srl-add.ll | ||
---|---|---|
13 ↗ | (On Diff #359273) | This could use a lot more tests, like making sure the number of bits in the shift match, and that the sources are zexts. Also what if tere are multiple uses of the add? Also should cover sub, and stress both vector and scalar versions. Does this also work with sext if the shift is ashr? |
This isn't a correct expansion. You cannot interpret the carry out directly as the source. This needs a v_cndmask_b32 to get a vector source
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp | ||
---|---|---|
3256 ↗ | (On Diff #359273) | Just initialize above and avoid the weird !(= |
3262–3263 ↗ | (On Diff #359273) | Move this check first |
3285–3286 ↗ | (On Diff #359273) | I don't understand why you are looking at the uses, you probably should just skip this if !N0.hasOnEUse |
I was trying to get this:
v_add_co_u32_e32 v0, vcc, v0, v1
v_add_co_u32_e32 v0, vcc, vcc_lo, v2
v_addc_co_u32_e32 v1, vcc, 0, v3, vcc
by disabling the hi-sub-register access of VCC in SIIselLowering's handling of
V_ADD_U64_PSEUDO, but I guess this is going in the wrong direction since I
misunderstood the handling of VCC
Before I address the comments on the code, test-case, coverage etc. we need to be on the same page on where and how this should be done. I can see 2 approaches, but I'm unable to decide between the 2: Approach A ---------- Continue with the current approach, which transforms the DAG: t16: i64 = add t9, t15 ... t15: i64 = srl t12, Constant:i32<32> t12: i64 = add t10, t11 t10: i64 = zero_extend t2 t2: i32,ch = CopyFromReg t0, Register:i32 %0 t11: i64 = zero_extend t4 t4: i32,ch = CopyFromReg t0, Register:i32 %1 To: t16: i64 = add t9, t40:1 ... t40: i32,i64 = addcarry t2, t4, Constant:i64<0> t2: i32,ch = CopyFromReg t0, Register:i32 %0 t4: i32,ch = CopyFromReg t0, Register:i32 %1 I'm not sure what we can do from the DAG-combiner's perspective to emit v_cndmask_b32. But instead, we can fix the expansion (in this case, V_ADD_U64_PSEUDO) to handle "carry-out as a source" with v_cndmask_b32. Approach B ---------- Go with @foad's suggestion of using inst-combine and transform the input to: define i64 @uaddCarry(i32 %a, i32 %b, i64 %c) { entry: %uaddCarry = call {i32, i1} @llvm.uadd.with.overflow.i32(i32 %a, i32 %b) %uadd = extractvalue {i32, i1} %uaddCarry, 0 %carry = extractvalue {i32, i1} %uaddCarry, 1 %carry.zext = zext i1 %carry to i64 %add = add i64 %c, %carry.zext ret i64 %add } declare {i32, i1} @llvm.uadd.with.overflow.i32(i32 %a, i32 %b) which would be lowered to: v_add_co_u32_e32 v0, vcc, v0, v1 v_cndmask_b32_e64 v0, 0, 1, vcc v_add_co_u32_e32 v0, vcc, v2, v0 v_addc_co_u32_e32 v1, vcc, 0, v3, vcc We can keep this transformation target independent, but which approach should we go with: Approach A (DAG-combine) or Approach B (inst-combine)?
These aren't mutually exclusive. A huge number of things like this should be implemented in both. I would lean towards the DAG version being more important at the moment
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp | ||
---|---|---|
3285–3286 ↗ | (On Diff #359273) | The add I'm handling in https://github.com/RadeonOpenCompute/ROCm/issues/488 is used by a truncate, so we need to check for it to improve coverage. https://alive2.llvm.org/ce/z/Vv4DSr is what I'm trying to cover here. |
llvm/test/CodeGen/AMDGPU/combine-srl-add.ll | ||
13 ↗ | (On Diff #359273) | I would like this patch to just handle (srl (add (zext))) kinds. I'll improve the coverage for ashr, vector, sub etc. in different patches.
You mean like adding test-cases that shows no transformation for unfavourable shift value, (srl (add (sext))) kinds? |
s/ISD::ADDCARRY/ISD::UADDO seems to generate the v_cndmask_b32_e64. I'm thinking of making this target independent in a different patch later since I'm not confident if this will benefit all targets in LLVM at the moment.
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp | ||
---|---|---|
3285–3286 ↗ | (On Diff #359273) | If the transform can only be done from a trunc root, the combine should start by looking at the trunc, not the add |
llvm/test/CodeGen/AMDGPU/combine-srl-add.ll | ||
13 ↗ | (On Diff #359273) | Yes, negative tests where this transform is invalid |
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp | ||
---|---|---|
3285–3286 ↗ | (On Diff #359273) | We need to handle both N0.hasOneUse() case (i.e. only srl uses the add) and !N0.hasOneUse() iff the other uses are truncates to the type of the UADDO.add that we intend to create. If we do the combine rooted at trunc, we miss out the N0.hasOneUse() case. I think it makes more sense to do this rooted at srl since it's at srl's perspective we see that the add can be a uaddo. I'm not happy with the traversal of the uses of add since I like DAG-combine functions to only worry about the operands directions (i.e. towards the leaves) and not worry about the uses direction (i.e. towards the root), but to get the coverage I'm looking for, I can't see a better way to do this. |
I still think there is something wrong if you need to consider the truncated and non truncated case. Are we missing an add width reduction case?
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp | ||
---|---|---|
3285–3286 ↗ | (On Diff #359273) | Is it actually profitable to handle the multiple use case? |
llvm/test/CodeGen/AMDGPU/combine-srl-add.ll | ||
39 ↗ | (On Diff #360853) | Cases with both sext and zext would be good |
- Made the transformation target independent.
- Covered SRA case.
- Added more tests.
- Rebased.
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp | ||
---|---|---|
3285–3286 ↗ | (On Diff #359273) | I'm not sure if it will be profitable in all the cases, but we need to handle it to cover https://github.com/RadeonOpenCompute/ROCm/issues/488 |
I haven't found an input where this transformation can go wrong, I'm not sure if we missed a corner case.
The width reduction is implied in the basicXXX and truncUse functions in the added test.
I moved this to the target independent DAG-combiner.
There is a regression in X86/addcarry.ll's d() function. I haven't looked into why we're getting 2 more instructions there, let me know if there's anything we can do in this patch to avoid it. I also added newer tests in X86/addcarry.ll that shows lesser instructions, i.e. the f() is 2 instructions less and g() is 3 instructions less with this transformation.
I have realized that this works for sra as well (https://alive2.llvm.org/ce/z/ZtGz-o), so I made the necessary changes.
Two comments:
- the transform you want is https://alive2.llvm.org/ce/z/NFDE9C i.e. you need to start from lshr i64 %z, 32, match it's operands, and do *not* look at it's uses.
- This is really something that should go into instcombine
We're not looking at lshr's uses here, but that of the add in lshr(add()). I would like this to cover truncUse() in AMDGPU/combine-srx-add.ll (g() in X86/addcarry.ll). I'm not sure if we can do that without looking at the add's uses.
@foad mentioned the same. While I understand that target-specific combines should go to lib/target/<target> with setTargetDAGCombine(), I'm not able to determine if the target independent ones should go to target independent version of inst-combine, DAG-combine, or both. I understand that keeping at both has the advantage of better coverage since inst-combine might catch the pattern that DAG-combine might not be able to catch or vice-versa. I'm not able to see any motivation in *moving* this to inst-combine, but I do see a motivation in *copying* this to inst-combine.
This is a very straight-forward fold for a simple IR pattern. There is absolutely nothing target-specific there.
Unless said pattern can be produced by the backend lowering, it should be done in the middle-end, not here.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
8743 | Will we hit the DAG.SignBitIsZero(N0) case on line 8736 before we get a chance to match this pattern? | |
8763 | If we keep this in DAGCombine, we should use LegalOperations rather than Level >= AfterLegalizeVectorOps | |
8818 | N0.getValue(0) and UAddO.getValue(0) don't have the same type do they? |
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
8818 | Correct. |
The DAG-combine we're doing here is target independent. I'm still not sure about moving this to inst-combine. We can move and/or copy this to inst-combine once we see cases where the DAG-combine approach is underperforming. I hope that's OK.
We can move and/or copy this to inst-combine once we see cases where the DAG-combine approach is underperforming. I hope that's OK.
This is backwards.
*If* the pattern in question can somehow appear after the last time instcombine pass run in pipeline,
then it's justified to have a similar DAG combine for it. But in general it should be in instcombine.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
8818 | If the types are different then calling ReplaceAllUsesOfValueWith seems incorrect. What am I missing? |
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
8818 | I'm sorry, that ReplaceAllUsesOfValueWith() should not have been here, I have removed it. I have added some more comments since the update of uses here is tricky. |
@lebedev.ri Can you confirm the revision you require in this? Is there an issue with the code in this patch and/or do you want us to move this to inst-combine? I still agree with @arsenm that this sort of thing might need to be done in both the places, in which case I don't see any issue in the order in which we push for it.
Is there an issue with the code in this patch and/or do you want us to move this to inst-combine?
I still agree with @arsenm that this sort of thing might need to be done in both the places, in which case I don't see any issue in the order in which we push for it.
I agree that we might need to do this in both the places, emphasis is mine,
but mainly this should be done in middle-end, and afterwards the back-end fold
should be motivated by an end-to-end test showing the missed optimization.
That makes sense. My hesitation in moving this to inst-combine immediately is because it works fine in DAG-combine. Given the fact that I'm new to LLVM and that this is my first pull request, my hesitation might be going in the wrong direction. I'll initiate a pull request for the same in inst-combine. I'll address the review comments here as well since we might need this in the future.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
8735 | I had no idea that the ashr inputs were getting combined from visitSRL() instead of visitSRA(). I can see that SRA is changed to SRL in TargetLowering::SimplifyDemandedBits(). From looking at the handling of SRA in SimplifyDemandedBits(), I can see that it transforms it to SRL whenever it can claim that the input's sign-bit will always be zero. I'm not sure if we'll ever need this transformation in visistSRA(). I have removed it now. | |
8812 | I have made the exit condition to check LegalTypes instead of LegalOperations. In that case, do we need the (!TLI.isOperationLegalOrCustom(ISD::UADDO)) check above? |
Do we usually create uadd.with.overflow in instcombine? I thought we moved all of that to CodeGenPrep as is was causing regressions otherwise.
(I mean creating uadd.with.overflow from instructions that didn't start out as different add.with.overflow, generating them from plain llvm instructions. That might have been problems with the particular transform, but from what I remember instcombine was creating cross-basic-block overflow values, that the backend then couldn't handle very well. I'm not sure that was the exact reason for moving it though, I just remember codesize regressions).
This has gone back and forth. I don't think we have a consistent story yet. We clearly do create intrinsics in some cases like:
https://github.com/llvm/llvm-project/blob/0f4b41e038537ab2ab6fa2aa048e55c28a03ab68/llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp#L1127
...but I have knowingly stayed away from some of those because of the backend problems that we've seen.
Side note: I was just looking at a buggy overflow fold in D106983 where it's not clear to me that there is a universal correct answer.
I pushed the inst-combine version of this at https://reviews.llvm.org/D107552.
The inst-combine approach had some advantages like: the traversal was depth-first so I can visit the ancestor lshr after the uaddo is generated; I'm concerned with the cross-basic-block overflow use that @dmgreen mentioned above, I could add a bail out check for this since inst-combine works on LLVM-IR and not on a per basic-block DAG like the DAG-combine invoked from SelectionDAGISel::SelectBasicBlock(). But I'm still concerned that we're generating an intrinsic in inst-combine, I couldn't find any TLI.isOperationLegalOrCustom() version in LLVM-IR.
We clearly do create intrinsics in some cases like:
https://github.com/llvm/llvm-project/blob/0f4b41e038537ab2ab6fa2aa048e55c28a03ab68/llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp#L1127
Yeah. The saturate instrinsics I consider differently because they don't produce two results. That at least keeps them simpler. That also allows vectorization (which is something that might not be needed in the future if the vectorizer learns to fold vector patterns) but at the moment is pretty important.
I should have phrased by question better. Is there any reason to check this in visitSRA or can we just let it get converted to SRL and matched in visitSRL?