Page MenuHomePhabricator

[TargetLowering] Simplify expansion of S{ADD,SUB}O
ClosedPublic

Authored by rogfer01 on Jun 7 2018, 10:34 PM.

Details

Summary

ISD::SADDO uses the suggested sequence described in the section §2.4 of the RISCV Spec v2.2. ISD::SSUBO uses the dual approach but checking for positive.

Diff Detail

Repository
rL LLVM

Event Timeline

rogfer01 created this revision.Jun 7 2018, 10:34 PM

It would probably make sense to change SelectionDAGLegalize::ExpandNode to use this lowering for SADDO/SSUBO, instead of making this target-specific. The target-independent version uses approximately the same operations anyway, just in a less efficient way.

The lowering for UADDO/USUBO appears to be essentially identical to the lowering in SelectionDAGLegalize::ExpandNode, so there isn't really any point.

lib/Target/RISCV/RISCVISelLowering.cpp
553 ↗(On Diff #150448)

The XOR here is a bit weird; could you just use SETGE instead?

Thanks for the review @efriedma

I'll look into making this target-independent.

lenary added a subscriber: lenary.Jul 31 2019, 7:45 AM

@rogfer01 have you managed to implement this in a target-independent way?

rogfer01 updated this revision to Diff 213056.Aug 2 2019, 8:44 AM
rogfer01 retitled this revision from [RISCV] Custom lower ISD::{U,S}{ADD,SUB}O nodes to [WIP][TargetLowering] Simplify expansion of S{ADD,SUB}O.
rogfer01 edited the summary of this revision. (Show Details)
rogfer01 added a reviewer: efriedma.
rogfer01 set the repository for this revision to rG LLVM Github Monorepo.

ChangeLog:

  • Remove RISC-V dependent expansions
  • Simplify current target-independent expansion of S{ADD,SUB}O. Not considering U{ADD,SUB}O anymore.
  • Update tests that saw codegen changes after this

For now marking this as WIP because I want to test this some more.

Also test CodeGen/AMDGPU/saddo.ll needs updating but I'm rather clueless when it comes to that target.

Herald added a project: Restricted Project. · View Herald TranscriptAug 2 2019, 8:44 AM

Also test CodeGen/AMDGPU/saddo.ll needs updating but I'm rather clueless when it comes to that target.

I've committed rL367698 - you should be able to use the update script now

The changes for vector arithmetic look nice.

llvm/test/CodeGen/RISCV/arith-with-overflow.ll
34 ↗(On Diff #213056)

It's probably worth adding a testcase where the i1 result is used as the operand to a branch, to show we correctly fold the xor into the branch. (See also https://bugs.llvm.org/show_bug.cgi?id=42876 .)

nikic added a subscriber: nikic.Aug 2 2019, 2:32 PM

I've committed rL367698 - you should be able to use the update script now

Thanks @RKSimon I'll update this change shortly.

rogfer01 updated this revision to Diff 213359.Aug 5 2019, 7:25 AM

ChangeLog:

  • Update AMDGPU test.
  • Add RISC-V tests to show that we fold the xor in the branch.

The X86 changes LGTM - does anyone else have any comments?

one minor - @efriedma / @lebedev.ri anything to add?

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
6494 ↗(On Diff #213359)

Not sure if its necessary, but you could merge the comments and just have:

SDValue ConditionRHS = DAG.getSetCC(dl, OType, RHS, Zero, IsAdd ? ISD::SETLT : ISD::SETGT);

Correct in general case:

----------------------------------------
Name: sadd
  %t = sadd_overflow i8 %LHS, %RHS
  %v0 = extractvalue {i8, i1} %t, 0
  %v1 = extractvalue {i8, i1} %t, 1
=>
  %t = sadd_overflow i8 %LHS, %RHS
  %v0 = add i8 %LHS, %RHS
  %ResultLowerThanLHS = icmp slt i8 %v0, %LHS
  %ConditionRHS = icmp slt i8 %RHS, 0
  %v1 = xor i1 %ConditionRHS, %ResultLowerThanLHS

Done: 1
Optimization is correct!

----------------------------------------
Name: ssub
  %t = ssub_overflow i8 %LHS, %RHS
  %v0 = extractvalue {i8, i1} %t, 0
  %v1 = extractvalue {i8, i1} %t, 1
=>
  %t = ssub_overflow i8 %LHS, %RHS
  %v0 = sub i8 %LHS, %RHS
  %ResultLowerThanLHS = icmp slt i8 %v0, %LHS
  %ConditionRHS = icmp sgt i8 %RHS, 0
  %v1 = xor i1 %ConditionRHS, %ResultLowerThanLHS

Done: 1
Optimization is correct!

Invalid for undef:

----------------------------------------
Name: sadd
  %t = sadd_overflow i8 %LHS, %RHS
  %v0 = extractvalue {i8, i1} %t, 0
  %v1 = extractvalue {i8, i1} %t, 1
=>
  %t = sadd_overflow i8 %LHS, %RHS
  %v0 = add i8 %LHS, %RHS
  %ResultLowerThanLHS = icmp slt i8 %v0, %LHS
  %ConditionRHS = icmp slt i8 %RHS, 0
  %v1 = xor i1 %ConditionRHS, %ResultLowerThanLHS

ERROR: Value mismatch for i1 %v1

Example:
i8 %LHS = #x00 (0)
i8 %RHS = undef
{i8, i1} %t = { #x00 (0), #x0 (0) }     [based on undef value]
i8 %v0 = undef
i1 %ResultLowerThanLHS = undef
i1 %ConditionRHS = undef
Source value: #x0 (0)
Target value: #x1 (1)


----------------------------------------
Name: ssub
  %t = ssub_overflow i8 %LHS, %RHS
  %v0 = extractvalue {i8, i1} %t, 0
  %v1 = extractvalue {i8, i1} %t, 1
=>
  %t = ssub_overflow i8 %LHS, %RHS
  %v0 = sub i8 %LHS, %RHS
  %ResultLowerThanLHS = icmp slt i8 %v0, %LHS
  %ConditionRHS = icmp sgt i8 %RHS, 0
  %v1 = xor i1 %ConditionRHS, %ResultLowerThanLHS

ERROR: Value mismatch for i1 %v1

Example:
i8 %LHS = undef
i8 %RHS = #x00 (0)
{i8, i1} %t = { #x00 (0), #x0 (0) }     [based on undef value]
i8 %v0 = undef
i1 %ResultLowerThanLHS = undef
i1 %ConditionRHS = #x0 (0)
Source value: #x0 (0)
Target value: #x1 (1)
lebedev.ri accepted this revision.Aug 8 2019, 1:44 PM

Though the current expansion is just as incorrect for undef:

----------------------------------------
Name: sadd
  %t = sadd_overflow i8 %LHS, %RHS
  %v0 = extractvalue {i8, i1} %t, 0
  %v1 = extractvalue {i8, i1} %t, 1
=>
  %t = sadd_overflow i8 %LHS, %RHS
  %v0 = add i8 %LHS, %RHS
  %LHSSign = icmp sge i8 %LHS, 0
  %RHSSign = icmp sge i8 %RHS, 0
  %SignsMatch = icmp eq i1 %LHSSign, %RHSSign
  %SumSign = icmp sge i8 %v0, 0
  %SumSignNE = icmp ne i1 %LHSSign, %SumSign
  %v1 = and i1 %SignsMatch, %SumSignNE

Done: 1
Optimization is correct!

----------------------------------------
Name: ssub
  %t = ssub_overflow i8 %LHS, %RHS
  %v0 = extractvalue {i8, i1} %t, 0
  %v1 = extractvalue {i8, i1} %t, 1
=>
  %t = ssub_overflow i8 %LHS, %RHS
  %v0 = sub i8 %LHS, %RHS
  %LHSSign = icmp sge i8 %LHS, 0
  %RHSSign = icmp sge i8 %RHS, 0
  %SignsMatch = icmp ne i1 %LHSSign, %RHSSign
  %SumSign = icmp sge i8 %v0, 0
  %SumSignNE = icmp ne i1 %LHSSign, %SumSign
  %v1 = and i1 %SignsMatch, %SumSignNE

Done: 1
Optimization is correct!
----------------------------------------
Name: sadd
  %t = sadd_overflow i8 %LHS, %RHS
  %v0 = extractvalue {i8, i1} %t, 0
  %v1 = extractvalue {i8, i1} %t, 1
=>
  %t = sadd_overflow i8 %LHS, %RHS
  %v0 = add i8 %LHS, %RHS
  %LHSSign = icmp sge i8 %LHS, 0
  %RHSSign = icmp sge i8 %RHS, 0
  %SignsMatch = icmp eq i1 %LHSSign, %RHSSign
  %SumSign = icmp sge i8 %v0, 0
  %SumSignNE = icmp ne i1 %LHSSign, %SumSign
  %v1 = and i1 %SignsMatch, %SumSignNE

ERROR: Value mismatch for i1 %v1

Example:
i8 %LHS = #x00 (0)
i8 %RHS = undef
{i8, i1} %t = { #x00 (0), #x0 (0) }     [based on undef value]
i8 %v0 = undef
i1 %LHSSign = #x1 (1)
i1 %RHSSign = undef
i1 %SignsMatch = undef
i1 %SumSign = undef
i1 %SumSignNE = undef
Source value: #x0 (0)
Target value: #x1 (1)


----------------------------------------
Name: ssub
  %t = ssub_overflow i8 %LHS, %RHS
  %v0 = extractvalue {i8, i1} %t, 0
  %v1 = extractvalue {i8, i1} %t, 1
=>
  %t = ssub_overflow i8 %LHS, %RHS
  %v0 = sub i8 %LHS, %RHS
  %LHSSign = icmp sge i8 %LHS, 0
  %RHSSign = icmp sge i8 %RHS, 0
  %SignsMatch = icmp ne i1 %LHSSign, %RHSSign
  %SumSign = icmp sge i8 %v0, 0
  %SumSignNE = icmp ne i1 %LHSSign, %SumSign
  %v1 = and i1 %SignsMatch, %SumSignNE

ERROR: Value mismatch for i1 %v1

Example:
i8 %LHS = undef
i8 %RHS = #x00 (0)
{i8, i1} %t = { #x00 (0), #x0 (0) }     [based on undef value]
i8 %v0 = undef
i1 %LHSSign = undef
i1 %RHSSign = #x1 (1)
i1 %SignsMatch = undef
i1 %SumSign = undef
i1 %SumSignNE = undef
Source value: #x0 (0)
Target value: #x1 (1)

So LG i guess..

This revision is now accepted and ready to land.Aug 8 2019, 1:44 PM

Hmm, while there, just to point out the obvious, the another approach here would be to teach DAGCombine/TargetLowering::SimplifySetCC() about these folds.
That is explicitly one of a few valid reasons to add optimizations into backend 'instead' of middle-end.

rogfer01 updated this revision to Diff 214310.Aug 8 2019, 11:39 PM
rogfer01 retitled this revision from [WIP][TargetLowering] Simplify expansion of S{ADD,SUB}O to [TargetLowering] Simplify expansion of S{ADD,SUB}O.

ChangeLog:

  • Combine if-else logic.
  • Adjust subtraction comment to is (non-zero) positive to avoid ambiguity.

Hmm, while there, just to point out the obvious, the another approach here would be to teach DAGCombine/TargetLowering::SimplifySetCC() about these folds.
That is explicitly one of a few valid reasons to add optimizations into backend 'instead' of middle-end.

Sorry, I'm not sure to understand your comment: do you mean, as an alternative, SimplifySetCC can be extended to simplify the original overflow detection?

Hmm, while there, just to point out the obvious, the another approach here would be to teach DAGCombine/TargetLowering::SimplifySetCC() about these folds.
That is explicitly one of a few valid reasons to add optimizations into backend 'instead' of middle-end.

Sorry, I'm not sure to understand your comment: do you mean, as an alternative, SimplifySetCC can be extended to simplify the original overflow detection?

Yes

Yes

Thanks for the clarification Roman.

unless there is a lot of interest from others on landing this, I'll look into your suggestion first.

nikic added a comment.Aug 30 2019, 6:02 AM

IMHO this should land as-is, and setcc folds can be implemented additionally if there are other places where they would be useful. My rationale would be that it is better to directly perform a simpler lowering than a complex lowering that then gets optimized. (Basically: If you can reduce the size of the implementing code and get a better result, then I think we should always be doing that.)

rogfer01 updated this revision to Diff 221468.Mon, Sep 23, 11:14 PM

ChangeLog:

  • Refresh RISC-V and X86 tests

IMHO this should land as-is, and setcc folds can be implemented additionally if there are other places where they would be useful. My rationale would be that it is better to directly perform a simpler lowering than a complex lowering that then gets optimized. (Basically: If you can reduce the size of the implementing code and get a better result, then I think we should always be doing that.)

Hi @nikic, thanks for your comments. It also makes sense to me to do this.

If there aren't any further comments I pland to land this in the next days.

This revision was automatically updated to reflect the committed changes.