
[DAGCombine] Combine srX of add that intends to get the carry as uaddo
Needs Review · Public

Authored by abinavpp on Jul 16 2021, 3:28 AM.

Details

Summary

This patch reduces the number of adds in the test case of https://github.com/RadeonOpenCompute/ROCm/issues/488.

Diff Detail

Event Timeline

abinavpp created this revision.Jul 16 2021, 3:28 AM
abinavpp requested review of this revision.Jul 16 2021, 3:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 16 2021, 3:28 AM
abinavpp edited the summary of this revision. (Show Details)Jul 16 2021, 3:37 AM
abinavpp edited the summary of this revision. (Show Details)

Even though this patch reduces the number of adds in the added combine-srl-add.ll test, it introduces a copy from vcc_hi. I.e. we get:
v_add_co_u32_e32 v0, vcc, v0, v1
v_mov_b32_e32 v1, vcc_hi
v_add_co_u32_e32 v0, vcc, vcc_lo, v2
v_addc_co_u32_e32 v1, vcc, v3, v1, vcc

The problem is that:
%7:vgpr_32, %8:sreg_64 = V_ADD_CO_U32_e64 %0:vgpr_32, %1:vgpr_32, 0, implicit $exec
%9:vreg_64 = V_ADD_U64_PSEUDO killed %13:vreg_64, killed %8:sreg_64, implicit-def dead $vcc, implicit-def dead $exec, implicit $exec

after finalize-isel becomes:
%7:vgpr_32, %8:sreg_64 = V_ADD_CO_U32_e64 %0:vgpr_32, %1:vgpr_32, 0, implicit $exec
%18:vgpr_32 = COPY %13.sub0:vreg_64
%19:sgpr_32 = COPY %8.sub0:sreg_64
%20:vgpr_32 = COPY %13.sub1:vreg_64
%21:sgpr_32 = COPY %8.sub1:sreg_64
%14:vgpr_32, %16:sreg_64_xexec = V_ADD_CO_U32_e64 %18:vgpr_32, %19:sgpr_32, 0, implicit $exec

%21:sgpr_32 = COPY %8.sub1:sreg_64 is created by adding SrcReg1Sub1 to the MachineInstr HiHalf in the expansion of V_ADD_U64_PSEUDO in SITargetLowering::EmitInstrWithCustomInserter(). This is responsible for the vcc_hi copy. How can we eliminate the copy correctly? I only found clumsy ways to do this :(

foad added a comment.Jul 16 2021, 6:08 AM

This seems like a target-independent optimization that InstCombine could do on IR, using the uadd_with_overflow intrinsic.

> This seems like a target-independent optimization that InstCombine could do on IR, using the uadd_with_overflow intrinsic.

Yes. I think this pattern can also appear as a result of legalization expansions, so it could also be a generic combine.

llvm/test/CodeGen/AMDGPU/combine-srl-add.ll
13 ↗(On Diff #359273)

This could use a lot more tests, like making sure the number of bits in the shift matches, and that the sources are zexts. Also, what if there are multiple uses of the add?

Also should cover sub, and stress both vector and scalar versions. Does this also work with sext if the shift is ashr?

> Even though this patch reduces the number of adds in the added combine-srl-add.ll test, it introduces a copy from vcc_hi. [...] How can we eliminate the copy correctly? I only found clumsy ways to do this :(

This isn't a correct expansion. You cannot interpret the carry out directly as the source. This needs a v_cndmask_b32 to get a vector source.

arsenm added inline comments.Jul 19 2021, 5:17 PM
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
3256 ↗(On Diff #359273)

Just initialize above and avoid the weird !(=

3262–3263 ↗(On Diff #359273)

Move this check first

3285–3286 ↗(On Diff #359273)

I don't understand why you are looking at the uses; you probably should just skip this if !N0.hasOneUse()

> Even though this patch reduces the number of adds in the added combine-srl-add.ll test, it introduces a copy from vcc_hi. [...] How can we eliminate the copy correctly? I only found clumsy ways to do this :(

> This isn't a correct expansion. You cannot interpret the carry out directly as the source. This needs a v_cndmask_b32 to get a vector source

I was trying to get this:
v_add_co_u32_e32 v0, vcc, v0, v1
v_add_co_u32_e32 v0, vcc, vcc_lo, v2
v_addc_co_u32_e32 v1, vcc, 0, v3, vcc

by disabling the hi-sub-register access of VCC in SIISelLowering's handling of
V_ADD_U64_PSEUDO, but I guess this is going in the wrong direction since I
misunderstood the handling of VCC.

Before I address the comments on the code, test case, coverage, etc., we need
to be on the same page about where and how this should be done. I can see two
approaches, but I'm unable to decide between them:

Approach A
----------
Continue with the current approach, which transforms the DAG:
t16: i64 = add t9, t15
  ...
  t15: i64 = srl t12, Constant:i32<32>
    t12: i64 = add t10, t11
      t10: i64 = zero_extend t2
        t2: i32,ch = CopyFromReg t0, Register:i32 %0
      t11: i64 = zero_extend t4
        t4: i32,ch = CopyFromReg t0, Register:i32 %1

To:
t16: i64 = add t9, t40:1
  ...
  t40: i32,i64 = addcarry t2, t4, Constant:i64<0>
    t2: i32,ch = CopyFromReg t0, Register:i32 %0
    t4: i32,ch = CopyFromReg t0, Register:i32 %1

I'm not sure what we can do from the DAG-combiner's perspective to emit
v_cndmask_b32. Instead, we can fix the expansion (in this case,
V_ADD_U64_PSEUDO) to handle "carry-out as a source" with v_cndmask_b32.

Approach B
----------
Go with @foad's suggestion of using inst-combine and transform the input to:
define i64 @uaddCarry(i32 %a, i32 %b, i64 %c) {
entry:
  %uaddCarry = call {i32, i1} @llvm.uadd.with.overflow.i32(i32 %a, i32 %b)
  %uadd = extractvalue {i32, i1} %uaddCarry, 0
  %carry = extractvalue {i32, i1} %uaddCarry, 1
  %carry.zext = zext i1 %carry to i64
  %add = add i64 %c, %carry.zext
  ret i64 %add
}
declare {i32, i1} @llvm.uadd.with.overflow.i32(i32 %a, i32 %b)

which would be lowered to:
v_add_co_u32_e32 v0, vcc, v0, v1
v_cndmask_b32_e64 v0, 0, 1, vcc
v_add_co_u32_e32 v0, vcc, v2, v0
v_addc_co_u32_e32 v1, vcc, 0, v3, vcc

We can keep this transformation target independent, but which approach should we
go with: Approach A (DAG-combine) or Approach B (inst-combine)?

> We can keep this transformation target independent, but which approach should we
> go with: Approach A (DAG-combine) or Approach B (inst-combine)?

These aren't mutually exclusive. A huge number of things like this should be implemented in both. I would lean towards the DAG version being more important at the moment

abinavpp updated this revision to Diff 360508.Jul 21 2021, 10:18 AM
  • s/addcarry/uaddo
  • Addressed some review comments
abinavpp marked 2 inline comments as done.Jul 21 2021, 10:31 AM
abinavpp added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
3285–3286 ↗(On Diff #359273)

The add I'm handling in https://github.com/RadeonOpenCompute/ROCm/issues/488 is used by a truncate, so we need to check for it to improve coverage. https://alive2.llvm.org/ce/z/Vv4DSr is what I'm trying to cover here.

llvm/test/CodeGen/AMDGPU/combine-srl-add.ll
13 ↗(On Diff #359273)

I would like this patch to just handle (srl (add (zext))) kinds. I'll improve the coverage for ashr, vector, sub etc. in different patches.

> like making sure the number of bits in the shift match, and that the sources are zexts. Also what if there are multiple uses of the add?

You mean adding test cases that show no transformation for unfavourable shift values, (srl (add (sext))) kinds?

s/ISD::ADDCARRY/ISD::UADDO/ seems to generate the v_cndmask_b32_e64. I'm thinking of making this target independent in a different patch later, since I'm not confident that this will benefit all targets in LLVM at the moment.

arsenm added inline comments.Jul 21 2021, 3:07 PM
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
3285–3286 ↗(On Diff #359273)

If the transform can only be done from a trunc root, the combine should start by looking at the trunc, not the add

llvm/test/CodeGen/AMDGPU/combine-srl-add.ll
13 ↗(On Diff #359273)

Yes, negative tests where this transform is invalid

abinavpp updated this revision to Diff 360853.Jul 22 2021, 9:13 AM

Rebased; Added negative tests

abinavpp marked an inline comment as done.Jul 22 2021, 9:36 AM
abinavpp added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
3285–3286 ↗(On Diff #359273)

We need to handle both the N0.hasOneUse() case (i.e. only the srl uses the add) and the !N0.hasOneUse() case iff the other uses are truncates to the type of the UADDO add that we intend to create. If we do the combine rooted at the trunc, we miss the N0.hasOneUse() case. I think it makes more sense to root this at the srl, since it's from the srl's perspective that we see the add can be a uaddo. I'm not happy with traversing the uses of the add, since I prefer DAG-combine functions to only look in the operands direction (i.e. towards the leaves) and not in the uses direction (i.e. towards the root), but to get the coverage I'm looking for, I can't see a better way to do this.

abinavpp retitled this revision from [AMDGPU] Combine srl of add that intends to get the carry of the add as addcarry to [AMDGPU] Combine srl of add that intends to get the carry as uaddo.Jul 22 2021, 10:22 AM

I still think there is something wrong if you need to consider the truncated and non-truncated cases. Are we missing an add width reduction case?

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
3285–3286 ↗(On Diff #359273)

Is it actually profitable to handle the multiple use case?

llvm/test/CodeGen/AMDGPU/combine-srl-add.ll
39 ↗(On Diff #360853)

Cases with both sext and zext would be good

abinavpp updated this revision to Diff 361607.Jul 26 2021, 3:20 AM
  • Made the transformation target independent.
  • Covered SRA case.
  • Added more tests.
  • Rebased.
abinavpp marked an inline comment as done.Jul 26 2021, 3:23 AM
abinavpp added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
3285–3286 ↗(On Diff #359273)

I'm not sure if it will be profitable in all cases, but we need to handle it to cover https://github.com/RadeonOpenCompute/ROCm/issues/488

abinavpp retitled this revision from [AMDGPU] Combine srl of add that intends to get the carry as uaddo to [DAGCombine] Combine srX of add that intends to get the carry as uaddo.Jul 26 2021, 3:24 AM

> I still think there is something wrong if you need to consider the truncated and non truncated case.

I haven't found an input where this transformation can go wrong, but I'm not sure if we missed a corner case.

> Are we missing an add width reduction case?

The width reduction is covered by the basicXXX and truncUse functions in the added test.

I moved this to the target independent DAG-combiner.

There is a regression in X86/addcarry.ll's d() function. I haven't looked into why we're getting 2 more instructions there; let me know if there's anything we can do in this patch to avoid it. I also added new tests in X86/addcarry.ll that show fewer instructions, i.e. f() is 2 instructions shorter and g() is 3 instructions shorter with this transformation.

I have realized that this works for sra as well (https://alive2.llvm.org/ce/z/ZtGz-o), so I made the necessary changes.

Two comments:

  1. The transform you want is https://alive2.llvm.org/ce/z/NFDE9C, i.e. you need to start from lshr i64 %z, 32, match its operands, and do *not* look at its uses.
  2. This is really something that should go into instcombine.

> 1. the transform you want is https://alive2.llvm.org/ce/z/NFDE9C i.e. you need to start from lshr i64 %z, 32, match its operands, and do *not* look at its uses.

We're not looking at the lshr's uses here, but at those of the add in lshr(add()). I would like this to cover truncUse() in AMDGPU/combine-srx-add.ll (g() in X86/addcarry.ll). I'm not sure if we can do that without looking at the add's uses.

> 2. This is really something that should go into instcombine

@foad mentioned the same. While I understand that target-specific combines should go into lib/Target/<target> with setTargetDAGCombine(), I'm not able to determine whether the target-independent ones should go into the target-independent version of inst-combine, DAG-combine, or both. I understand that keeping this in both has the advantage of better coverage, since inst-combine might catch a pattern that DAG-combine cannot, or vice versa. I don't see any motivation in *moving* this to inst-combine, but I do see a motivation in *copying* it to inst-combine.

> I'm not able to see any motivation in *moving* this to inst-combine, but I do see a motivation in *copying* this to inst-combine.

This is a very straightforward fold for a simple IR pattern. There is absolutely nothing target-specific there. Unless said pattern can be produced by the backend lowering, it should be done in the middle-end, not here.

craig.topper added inline comments.Jul 26 2021, 9:55 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
8748

Will we hit the DAG.SignBitIsZero(N0) case on line 8736 before we get a chance to match this pattern?

8765

If we keep this in DAGCombine, we should use LegalOperations rather than Level >= AfterLegalizeVectorOps

8820

N0.getValue(0) and UAddO.getValue(0) don't have the same type do they?

abinavpp updated this revision to Diff 361973.Jul 27 2021, 4:27 AM

Rebased; Addressed review comments.

abinavpp marked 2 inline comments as done.Jul 27 2021, 4:29 AM
abinavpp added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
8820

Correct.

> This is a very straight-forward fold for a simple IR pattern. There is absolutely nothing target-specific there. Unless said pattern can be produced by the backend lowering, it should be done in the middle-end, not here.

The DAG-combine we're doing here is target independent. I'm still not sure about moving this to inst-combine. We can move and/or copy this to inst-combine once we see cases where the DAG-combine approach is underperforming. I hope that's OK.

> The DAG-combine we're doing here is target independent. I'm still not sure about moving this to inst-combine. We can move and/or copy this to inst-combine once we see cases where the DAG-combine approach is underperforming. I hope that's OK.

This is backwards.
*If* the pattern in question can somehow appear after the last time the instcombine pass runs in the pipeline,
then it's justified to have a similar DAG combine for it. But in general it should be in instcombine.

craig.topper added inline comments.Jul 27 2021, 9:00 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
8820

If the types are different then calling ReplaceAllUsesOfValueWith seems incorrect. What am I missing?

abinavpp updated this revision to Diff 362068.Jul 27 2021, 9:35 AM

Fixed the use replacement and added more comments there.

abinavpp added inline comments.Jul 27 2021, 9:39 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
8820

I'm sorry, that ReplaceAllUsesOfValueWith() should not have been here; I have removed it. I have also added some more comments, since the update of uses here is tricky.

lebedev.ri requested changes to this revision.Jul 27 2021, 9:39 AM
This revision now requires changes to proceed.Jul 27 2021, 9:39 AM

@lebedev.ri Can you clarify the changes you require here? Is there an issue with the code in this patch, and/or do you want us to move this to inst-combine? I still agree with @arsenm that this sort of thing might need to be done in both places, in which case I don't see any issue with the order in which we push for it.

craig.topper added inline comments.Jul 27 2021, 9:44 PM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
8740

I should have phrased my question better. Is there any reason to check this in visitSRA, or can we just let it get converted to SRL and matched in visitSRL?

8814

i1 is likely not a legal type. Do we need to check that we're before legalize types?

> @lebedev.ri Can you confirm the revision you require in this? Is there an issue with the code in this patch and/or do you want us to move this to inst-combine? I still agree with @arsenm that this sort of thing might need to be done in both the places, in which case I don't see any issue in the order in which we push for it.

I agree that we might need to do this in both places (emphasis mine), but mainly this should be done in the middle-end, and afterwards the back-end fold should be motivated by an end-to-end test showing the missed optimization.

abinavpp updated this revision to Diff 362364.Jul 28 2021, 6:46 AM
abinavpp marked an inline comment as done.

Rebased; Addressed review comments.

> I agree that we might need to do this in both the places ... but mainly this should be done in middle-end, and afterwards the back-end fold should be motivated by an end-to-end test showing the missed optimization.

That makes sense. My hesitation in moving this to inst-combine immediately is because it works fine in DAG-combine. Given the fact that I'm new to LLVM and that this is my first pull request, my hesitation might be going in the wrong direction. I'll initiate a pull request for the same in inst-combine. I'll address the review comments here as well since we might need this in the future.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
8740

I had no idea that the ashr inputs were getting combined from visitSRL() instead of visitSRA(). I can see that SRA is changed to SRL in TargetLowering::SimplifyDemandedBits(): from looking at the handling of SRA there, it transforms SRA to SRL whenever it can prove that the input's sign bit will always be zero. I'm not sure we'll ever need this transformation in visitSRA(), so I have removed it.

8814

I have made the exit condition check LegalTypes instead of LegalOperations. In that case, do we still need the (!TLI.isOperationLegalOrCustom(ISD::UADDO)) check above?

dmgreen added a subscriber: dmgreen.

> I agree that we might need to do this in both the places ... but mainly this should be done in middle-end, and afterwards the back-end fold should be motivated by an end-to-end test showing the missed optimization.

Do we usually create uadd.with.overflow in instcombine? I thought we moved all of that to CodeGenPrepare as it was causing regressions otherwise.
(I mean creating uadd.with.overflow from instructions that didn't start out as a different add.with.overflow, generating them from plain LLVM instructions. That might have been a problem with the particular transform, but from what I remember instcombine was creating cross-basic-block overflow values that the backend then couldn't handle very well. I'm not sure that was the exact reason for moving it, though; I just remember codesize regressions.)

> Do we usually create uadd.with.overflow in instcombine? I thought we moved all of that to CodeGenPrep as it was causing regressions otherwise. [...]

This has gone back and forth. I don't think we have a consistent story yet. We clearly do create intrinsics in some cases like:
https://github.com/llvm/llvm-project/blob/0f4b41e038537ab2ab6fa2aa048e55c28a03ab68/llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp#L1127
...but I have knowingly stayed away from some of those because of the backend problems that we've seen.
Side note: I was just looking at a buggy overflow fold in D106983 where it's not clear to me that there is a universal correct answer.

I pushed the inst-combine version of this at https://reviews.llvm.org/D107552.

> from what I remember instcombine was creating cross-basic-block overflow values, that the backend then couldn't handle very well. I'm not sure that was the exact reason for moving it though, I just remember codesize regressions.

The inst-combine approach had some advantages: the traversal is depth-first, so I can visit the ancestor lshr after the uaddo is generated. I'm concerned about the cross-basic-block overflow use that @dmgreen mentioned above; I could add a bail-out check for this, since inst-combine works on LLVM IR and not on a per-basic-block DAG like the DAG-combine invoked from SelectionDAGISel::SelectBasicBlock(). But I'm still concerned that we're generating an intrinsic in inst-combine; I couldn't find any LLVM-IR equivalent of TLI.isOperationLegalOrCustom().

> We clearly do create intrinsics in some cases like:
> https://github.com/llvm/llvm-project/blob/0f4b41e038537ab2ab6fa2aa048e55c28a03ab68/llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp#L1127

Yeah. I consider the saturating intrinsics differently because they don't produce two results, which at least keeps them simpler. That also allows vectorization, which at the moment is pretty important (though it might not be needed in the future if the vectorizer learns to fold vector patterns).