This is an archive of the discontinued LLVM Phabricator instance.

Support {S,U}REMEqFold before legalization
ClosedPublic

Authored by nagisa on Oct 3 2020, 11:33 AM.

Details

Summary

This allows these optimisations to apply to e.g. urem i16 directly
before urem is promoted to i32 on architectures where i16 operations
are not intrinsically legal (such as on Aarch64). The legalization then
later can happen more directly and generated code gets a chance to avoid
wasting time on computing results in types wider than necessary, in the end.

Seems like mostly an improvement in terms of results at least as far as x86_64 and aarch64 are concerned, with a few regressions here and there. It also helps in preventing regressions in changes like D87976: Support the division-by-constant strength reduction for more integer types.

Diff Detail

Event Timeline

nagisa created this revision.Oct 3 2020, 11:33 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 3 2020, 11:33 AM
nagisa requested review of this revision.Oct 3 2020, 11:33 AM
nagisa edited the summary of this revision. (Show Details)Oct 3 2020, 11:37 AM
nagisa updated this revision to Diff 295999.Oct 3 2020, 12:49 PM

Dont attempt to force legalization of non-trivial ops

(Mostly VSELECT)

nagisa added inline comments.Oct 3 2020, 1:09 PM
llvm/test/CodeGen/AArch64/urem-seteq-vec-tautological.ll
77 ↗(On Diff #295999)

This is a regression.

Originally urem-seteq fold did not fire because isOperationLegalOrCustom(ISD::XOR, SETCCVT) was returning false and thus we never actually executed the lowering this is supposed to be testing in the first place.

Instead DAG would move on to lower UREM via BuildDIV (i.e. mul-shift strength reduction) – which, turns out, produces significantly better code in this particular instance.

I suspect that most of the regressions seen here will be due to a similar cause as this one.

nagisa updated this revision to Diff 296002.Oct 3 2020, 1:11 PM

Don't force lowering of weird xors for aarch64

nagisa edited the summary of this revision. (Show Details)Oct 3 2020, 1:19 PM
nagisa added a comment.Oct 3 2020, 6:03 PM

The build error appears to be unrelated to the changes here – other differentials fail in much the same way.

Is there anything I could do to make this proceed?

nagisa edited the summary of this revision. (Show Details)Oct 25 2020, 8:36 AM

This seems like an improvement overall, however there seem to be some regressions, at least for aarch64.
Can we live with them?

nagisa added a comment.EditedMar 9 2021, 11:30 AM

In cases where we do end up with a instruction count regression, the reason appears to be because lowering would fall back to BuildUDIV (and the MULHU/SHIFT strength reduction it implements). The BuildUDIV reduction is slightly shorter in instruction count, but it also depends on the target's ability to cheaply compute the higher half of the multiplication result.

I think the regressions here can be mitigated by checking if we have a cheap MULHU for the type and letting the codegen to fall-through to lowering via BuildUDIV. I will try to fiddle with ideas here, if the regression turns out to be blocking this change. But ideally, if possible, I'd like the experiments in that direction be independent of this diff.

(EDIT: turns out the strategy I was thinking of does not really work)

RKSimon added inline comments.Mar 9 2021, 2:24 PM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
5310

Doesn't this mean we're opening this to work on illegal types? At the very least we'd need decent test coverage for that case.

nagisa added inline comments.Mar 9 2021, 2:40 PM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
5310

The idea here was that its better to apply this fold, regardless of whether the simple types/operations are legal or not, and then let the type/operation legalization take care of legalizing the strength reduced graph, instead.

This to me makes a lot of sense, because as far as I can tell, making the ISD::MUL, ISD::SUB, ISD::ADD & ISD::ROTR operations legal ought to be significantly easier (and result in simpler code) than making a ISD::{U,S}REM legal. Furthermore, if these simpler operations are not legal, it is most likely the case that the ISD::{U,S}REM itself won't be either and legalization of some sort will have to occur regardless.

I see what you're saying about the test coverage – the current test suite only really covers Aarch64 and x86. I will look into porting these test cases to more of the other targets.

Thank you for the review.

nikic added inline comments.Mar 9 2021, 2:50 PM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
5310

I think @RKSimon's point about the test coverage is less about other targets, and more about the kind of types you test. From a cursory look, the current test changes in this patch are for legal vector types with illegal operations. However, this patch will also make the transform apply to i57 and <3 x i17>, which are untested.

nagisa updated this revision to Diff 330236.Mar 12 2021, 7:08 AM
nagisa marked 2 inline comments as done.

Rebase onto newly added tests for illegal types

I went through the list of what looked like regressions and attempted to root-cause and validate them. So far there seem to be a few major differences between old/new:

  1. In some cases the old code would lower to a shorter sequence of instructions via BuildUDIV and BuildSDIV strength reduction lowering. While the code is shorter in these cases, llvm-mca seems to suggest that the throughput of the rem-seteq lowering is better despite the fact that rem-seteq strength reduction produces more instructions overall;
  2. For the newly added illegal-types tests, a number of regressions are occurring due to and masking necessary to clear out the upper bits in registers between operations that rem-seteq lowering produces. This is a genuine regression and generally results in us producing 1 or 2 extra and masking operations;
  3. The new lowering needs more constants to operate, which on some backends requires more instructions to prepare the registers holding said constants (especially on targets such as MIPS which needs multiple instructions to ready one such register).

See the inline comments to see what the specific changes look like. Hopefully these address your concern about codegen regressions in AArch64, @lebedev.ri.

llvm/test/CodeGen/AArch64/srem-seteq-vec-nonsplat.ll
235

This (and the one above) instruction count regression looks like a combination of the "more constants" and "previously used lowering through BuildSDIV". In particular we have here what seems like 6 additional instructions for moving constants into registers.

llvm-mca claims a better throughput (lower cycle count) for the new version on all CPUs I tried: apple-a7, cortex-a53, exynos-m3, cortex-x1.

llvm/test/CodeGen/AArch64/srem-seteq-vec-splat.ll
45

Looks like a combination of "needs more constants" and "previously used lowering through BuildSDIV".

llvm/test/CodeGen/AArch64/urem-seteq-vec-splat.ll
39

This instruction count regression appears to be occurring because codegen used to fall-back to the strength reduction implemented by BuildUDIV, and then compare to 0. The rem-seteq lowering needs additional constants.

It is however, an improvement in cycle count as reported by llvm-mca (on apple-a10), primarily due to increased IPC:

CMD: llvm-mca -mtriple=aarch64 -mcpu=apple-a10

OLD:
Iterations:        100
Instructions:      1300
Total Cycles:      1712
Total uOps:        1400
Dispatch Width:    6
uOps Per Cycle:    0.82
IPC:               0.76
Block RThroughput: 3.3

NEW:
Iterations:        100
Instructions:      1400
Total Cycles:      1312
Total uOps:        1600
Dispatch Width:    6
uOps Per Cycle:    1.22
IPC:               1.07
Block RThroughput: 3.0
llvm/test/CodeGen/ARM/urem-seteq-illegal-types.ll
388 ↗(On Diff #330236)

I have no idea why the backend decided to construct the constants through code here, rather than loading like for the other targets.

This and similar changes below also make it quite difficult to compare which of the old or new code is better.

770 ↗(On Diff #330236)

Was calling __umoddi3 before, multiplying inline now.

llvm/test/CodeGen/Mips/urem-seteq-illegal-types.ll
135 ↗(On Diff #330236)

As far as I can tell this is instruction count regression is coming from a combination of the new lowering needing more constants (which are put into registers) as well as the fact that bits need to be masked out in between the intermediate operations that we lowered down to.

217 ↗(On Diff #330236)

This would just call i128 __umodti3 intrinsic before, and now is unrolling the multiplication over 66-bit integers inline.

llvm/test/CodeGen/PowerPC/urem-seteq-illegal-types.ll
273 ↗(On Diff #330236)

Used to call __umodti3, multiplying inline now.

llvm/test/CodeGen/Thumb2/urem-seteq-illegal-types.ll
11 ↗(On Diff #330236)

This and next regression due to bit masking between intermediate operations.

llvm/test/CodeGen/X86/urem-seteq-illegal-types.ll
54 ↗(On Diff #330236)

A regression that hasn't been present before rem-seteq fold allowed illegal types through, caused mostly by the backend being forced to legalize the i27 at each intermediate step.

I believe this regression we can live with, given its nature and how unlikely it is to occur in practice.

@nagisa I haven't explicitly checked, but I'm hoping that D98857 might help with some of the regressions you've seeing here.

nagisa added a comment.EditedMar 18 2021, 3:37 PM

@RKSimon I cherry-picked your patch on top of mine and there were no changes in test output (as reported by check-llvm), sadly. If anything your differential makes improvements introduced by this differential slightly less impressive.

I think that makes sense, though, because {S,U}REM-SetEq folds down to a regular multiplication, and not MULHU/MULHS.

Please can you rebase against trunk now that D98857 has landed?

The x86 changes look good, I think we can live with the illegal type regression.

lebedev.ri added inline comments.Mar 29 2021, 3:25 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
5309
5485

What about this one?

5495

What about this one?

5730–5733

What about these?

nagisa updated this revision to Diff 333824.Mar 29 2021, 5:12 AM
nagisa marked an inline comment as done.

Address review comments, rebase

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
5485

Letting these through has resulted in some very mixed results, as seen in this comparison between older versions of this differential.

I have decided to back these out to improve chances of this differential landing without major hurdles. I may consider looking into these operations again in a followup differential.

RKSimon added inline comments.Mar 31 2021, 4:16 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
5485

Please can you add TODO/NOTE comments to each case explaining this.

lebedev.ri accepted this revision.Mar 31 2021, 4:34 AM

LGTM once TODO/FIXME comments are added.

This revision is now accepted and ready to land.Mar 31 2021, 4:34 AM
craig.topper added inline comments.Mar 31 2021, 9:58 AM
llvm/test/CodeGen/RISCV/rvv/vmulh-sdnode-rv32.ll
5 ↗(On Diff #333824)

Is this comment wrong now? Certainly seems like something is kicking in now.

16 ↗(On Diff #333824)

Something silly happened here. The vrsub.vi+vmv.vi appears to be calculating -1 without constant folding. Probably DAG combine not constant folding scalable vectors?

nagisa added inline comments.Mar 31 2021, 12:45 PM
llvm/test/CodeGen/RISCV/rvv/vmulh-sdnode-rv32.ll
5 ↗(On Diff #333824)

Yeah, looks like you're right. I think only the last sentence went wrong, though. The test is still testing that prepareSREMEqFold is not crashing… I think.

16 ↗(On Diff #333824)

Should I look into this? Seems somewhat out of scope for this diff. Should I file an issue instead?

craig.topper added inline comments.Mar 31 2021, 12:46 PM
llvm/test/CodeGen/RISCV/rvv/vmulh-sdnode-rv32.ll
16 ↗(On Diff #333824)

File an issue and assign it to me.

nagisa updated this revision to Diff 334521.Mar 31 2021, 1:38 PM

Address review commends, rebase

nagisa marked 5 inline comments as done.Mar 31 2021, 1:40 PM
craig.topper added inline comments.Mar 31 2021, 1:49 PM
llvm/test/CodeGen/RISCV/rvv/vmulh-sdnode-rv32.ll
16 ↗(On Diff #333824)

I think https://reviews.llvm.org/D99682 will fix this.

llvm/test/CodeGen/RISCV/urem-seteq-illegal-types.ll
577 ↗(On Diff #334521)

I think this test needs to be regenerated. Some of the vset(i)vli instructions should be optimized out.

This revision was landed with ongoing or failed builds.Mar 31 2021, 3:35 PM
This revision was automatically updated to reflect the committed changes.
nagisa marked an inline comment as done.
nagisa added inline comments.Mar 31 2021, 3:35 PM
llvm/test/CodeGen/RISCV/urem-seteq-illegal-types.ll
577 ↗(On Diff #334521)

Yeah, I will take care to ensure the tests pass right before I land. Thanks for looking out!

llvm/test/CodeGen/X86/omit-urem-of-power-of-two-or-zero-when-comparing-with-zero.ll