This is an archive of the discontinued LLVM Phabricator instance.

[DAG] Attempt shl narrowing in SimplifyDemandedBits
ClosedPublic

Authored by RKSimon on Jul 17 2023, 8:27 AM.

Details

Summary

If a shl node leaves the upper half bits zero / undemanded, then see if we can profitably perform this with a half-width shl and a free trunc/zext.

Followup to D146121

Diff Detail

Event Timeline

RKSimon created this revision.Jul 17 2023, 8:27 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 17 2023, 8:27 AM
RKSimon requested review of this revision.Jul 17 2023, 8:27 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 17 2023, 8:27 AM
arsenm added a subscriber: arsenm.Jul 23 2023, 4:34 PM
arsenm added inline comments.
llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll
4710

I haven't managed to spot where the 64-bit shift that got removed is, but getting rid of them is really good

goldstein.w.n added inline comments.Jul 23 2023, 4:40 PM
llvm/test/CodeGen/X86/2009-05-30-ISelBug.ll
15

This seems like a slight regression.

llvm/test/CodeGen/X86/atomic-rm-bit-test-64.ll
1226

hmm?

llvm/test/CodeGen/X86/divmod128.ll
441

Slight regression here.

llvm/test/CodeGen/X86/lsr-loop-exit-cond.ll
73

Another here. It seems some transform that does shr; AGEN is breaking down a bit.

RKSimon added inline comments.Jul 24 2023, 2:08 AM
llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll
4710

Yes, a lot of the AMDGPU improvements in this patch and D146121 appear to be from better handling of i64 arithmetic.

llvm/test/CodeGen/X86/atomic-rm-bit-test-64.ll
1226

we end up with zero_extend(truncate(assertzext(x))) in X86DAGToDAGISel which is too late to perform any combines to fold it all away, we'll need a peephole (or a workaround in getNode())

llvm/test/CodeGen/X86/lsr-loop-exit-cond.ll
73

Yes, the problem we have is X86DAGToDAGISel::matchAddressRecursively isn't currently setup to properly see through zext extensions, we just have a few special cases we handle. Ideally the recursion would peek through zext nodes, and we'd hopefully get rid of promoteExtBeforeAdd entirely as well (sext is much less of a problem and easier to handle).

goldstein.w.n added inline comments.Aug 17 2023, 2:27 PM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
1792

I'd argue the MakeValueIsZero check should be after all the other checks (including the profitability ones) as the recursion can be expensive. Although its probably not a huge deal either way.

RKSimon updated this revision to Diff 551508.Aug 18 2023, 7:36 AM

rebase (still WIP)

RKSimon marked an inline comment as done.Aug 18 2023, 7:36 AM

@pengfei, @Peter told me about how these test cases can potentially be fuzzed to check for equivalence, and I've made a little script to compare generated X86 assemblies of simple function pairs (e.g. those without ptr parameters) by randomizing initial register values, emulating each function pair separately, and comparing final register values. Each function pair is emulated 1000 times using Unicorn Engine, although any difference in register value usually comes up in the first iteration. No function pair I've tested so far has presented a discrepancy in return value (%rax), though some do present a difference in caller-saved GPR contents (%rdi, %rcx, %rsi, %r8). The list of functions that I were able to verify is as follows:

llvm/test/CodeGen/X86/bswap.ll

  • not_bswap
  • not_useful_bswap
  • finally_useful_bswap

llvm/test/CodeGen/X86/cmp-concat.ll

  • cmp_anybits_concat_shl_shl_i16
  • cmp_anybits_concat_shl_shl_i16_commute

llvm/test/CodeGen/X86/combine-bitreverse.ll

  • test_bitreverse_shli_bitreverse_i64

llvm/test/CodeGen/X86/const-shift-of-constmasked.ll

  • test_i64_2147483647_mask_shl_1

llvm/test/CodeGen/X86/dagcombine-shifts.ll

  • fun7
  • fun8
  • fun11
  • fun12

llvm/test/CodeGen/X86/divmod128.ll

  • urem_i128_12 (x86-64)
  • urem_i128_12 (win64)

llvm/test/CodeGen/X86/extract-bits.ll

  • c2_i64

llvm/test/CodeGen/X86/lea-dagdag.ll

  • and_i32_zext_shl_add_i64_overshift

llvm/test/CodeGen/X86/lea-opt2.ll

  • test9

llvm/test/CodeGen/X86/parity.ll

  • parity_64_shift (nopopcnt)

llvm/test/CodeGen/X86/select.ll

  • select_pow2_diff_neg_invert

llvm/test/CodeGen/X86/selectcc-to-shiftand.ll

  • sel_shift_bool_i64

llvm/test/CodeGen/X86/setcc.ll

  • t3

llvm/test/CodeGen/X86/shift-combine.ll

  • test_lshr_and

llvm/test/CodeGen/X86/shift-pair.ll

  • test

llvm/test/CodeGen/X86/zext-shl.ll

  • i64_zext_shift_i16_zext_i8
  • i128_zext_shift_i64_zext_i8
  • i128_zext_shift_i64_zext_i16

@oakrc Nice! By caller-saved GPR contents do you mean intermediate register values? That should be expected as SimplifyDemandedBits can lead to undemanded bits from each register value changing.

oakrc added a comment.Aug 20 2023, 2:39 PM

@oakrc Nice! By caller-saved GPR contents do you mean intermediate register values? That should be expected as SimplifyDemandedBits can lead to undemanded bits from each register value changing.

Yeah, the discrepancy just shows that the script works, and it doesn't say there's anything wrong with the lowering code.

RKSimon updated this revision to Diff 556412.Sep 11 2023, 3:40 AM
RKSimon retitled this revision from [DAG] Attempt shl narrowing in SimplifyDemandedBits (WIP) to [DAG] Attempt shl narrowing in SimplifyDemandedBits.
RKSimon edited the summary of this revision. (Show Details)

Ready for review

kaz7 added a subscriber: kaz7.Sep 11 2023, 3:43 AM

FYI, after following patch, several files are not compilable with x86 clang because of out of memory after 1 hours infinity compilation.

commit a8cef6b58e2d41f04ed4fa63c3f628eac1a28925
Author: Simon Pilgrim <llvm-dev@redking.me.uk>
Date:   Mon Sep 11 10:17:28 2023 +0100

Check https://lab.llvm.org/buildbot/#/builders/91/builds/18672 for more details.

Our buildbot has following ongoing compilations.

$ ps axww|grep clang
 14038 ?        R     30:14 /scratch/buildbot/bothome/clang-ve-ninja/build/build_llvm/./bin/clang++ --target=x86_64-unknown-linux-gnu -DDEBUG_PREFIX="PluginInterface" -DGTEST_HAS_RTTI=0 -DLIBOMPTARGET_JIT_VE -DLIBOMPTARGET_JIT_X86 -DTARGET_NAME="PluginInterface" -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/scratch/buildbot/bothome/clang-ve-ninja/llvm-project/llvm/include -I/scratch/buildbot/bothome/clang-ve-ninja/build/build_llvm/include -I/scratch/buildbot/bothome/clang-ve-ninja/llvm-project/openmp/libomptarget/include -I/scratch/buildbot/bothome/clang-ve-ninja/llvm-project/openmp/libomptarget/plugins-nextgen/common/elf_common -I/scratch/buildbot/bothome/clang-ve-ninja/llvm-project/openmp/libomptarget/plugins-nextgen/common/MemoryManager -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -Wall -Wcast-qual -Wformat-pedantic -Wimplicit-fallthrough -Wsign-compare -Wno-enum-constexpr-conversion -Wno-extra -Wno-pedantic -O2 -g -DNDEBUG -fPIC -fvisibility=protected -fno-exceptions -funwind-tables -fno-rtti -std=c++17 -MD -MT openmp/libomptarget/plugins-nextgen/common/PluginInterface/CMakeFiles/PluginInterface.dir/PluginInterface.cpp.o -MF openmp/libomptarget/plugins-nextgen/common/PluginInterface/CMakeFiles/PluginInterface.dir/PluginInterface.cpp.o.d -o openmp/libomptarget/plugins-nextgen/common/PluginInterface/CMakeFiles/PluginInterface.dir/PluginInterface.cpp.o -c /scratch/buildbot/bothome/clang-ve-ninja/llvm-project/openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp
 14177 ?        R     30:12 /scratch/buildbot/bothome/clang-ve-ninja/build/build_llvm/./bin/clang++ --target=x86_64-unknown-linux-gnu -DGTEST_HAS_RTTI=0 -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/scratch/buildbot/bothome/clang-ve-ninja/llvm-project/llvm/include -I/scratch/buildbot/bothome clang-ve-ninja/build/build_llvm/include -I/scratch/buildbot/bothome/clang-ve-ninja/llvm-project/openmp/libomptarget/include -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -Wall -Wcast-qual -Wformat-pedantic -Wimplicit-fallthrough -Wsign-compare -Wno-enum-constexpr-conversion -Wno-extra -Wno-pedantic -O2 -g -DNDEBUG -fPIC -fno-exceptions -funwind-tables -fno-rtti -std=c++17 -MD -MT openmp/libomptarget/src/CMakeFiles/omptarget.dir/device.cpp.o -MF openmp/libomptarget/src/CMakeFiles/omptarget.dir/device.cpp.o.d -o openmp/libomptarget/src/CMakeFiles/omptarget.dir/device.cpp.o -c /scratch/buildbot/bothome/clang-ve-ninja/llvm-project/openmp/libomptarget/src/device.cpp
 14186 ?        R     26:02 /scratch/buildbot/bothome/clang-ve-ninja/build/build_llvm/./bin/clang++ --target=x86_64-unknown-linux-gnu -DGTEST_HAS_RTTI=0 -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/scratch/buildbot/bothome/clang-ve-ninja/llvm-project/llvm/include -I/scratch/buildbot/bothome clang-ve-ninja/build/build_llvm/include -I/scratch/buildbot/bothome/clang-ve-ninja/llvm-project/openmp/libomptarget/include -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -Wall -Wcast-qual -Wformat-pedantic -Wimplicit-fallthrough -Wsign-compare -Wno-enum-constexpr-conversion -Wno-extra -Wno-pedantic -O2 -g -DNDEBUG -fPIC -fno-exceptions -funwind-tables -fno-rtti -std=c++17 -MD -MT openmp/libomptarget/src/CMakeFiles/omptarget.dir/rtl.cpp.o -MF openmp/libomptarget/src/CMakeFiles/omptarget.dir/rtl.cpp.o.d -o openmp/libomptarget/src/CMakeFiles/omptarget.dir/rtl.cpp.o -c /scratch/buildbot/bothome/clang-ve-ninja/llvm-project/openmp/libomptarget/src/rtl.cpp

Let me revert rGa8cef6b58e2d for now

foad added a comment.Sep 11 2023, 3:57 AM

AMDGPU changes look fine as far as I can tell, thanks.

kaz7 added a comment.Sep 11 2023, 4:09 AM

Let me revert rGa8cef6b58e2d for now

Thanks. And thank you for your efforts.

pengfei added inline comments.Sep 11 2023, 6:40 AM
llvm/test/CodeGen/X86/lea-opt2.ll
196–197

There can also change to 32-bit instructions, maybe improve in the future.

llvm/test/CodeGen/X86/pr22970.ll
18–19 ↗(On Diff #556412)

Regression?

41–42 ↗(On Diff #556412)

ditto.

llvm/test/CodeGen/X86/pr38217.ll
32–34 ↗(On Diff #556412)

Looks like regression?

llvm/test/CodeGen/X86/shift-combine.ll
106–107

Will addl %esi, %esi better?

llvm/test/CodeGen/X86/vector-shuffle-variable-128.ll
258–280

The change is not easy for manually check, but actually doesn't do any change expect for the register order. It would be better if we can avoid to generate such difference.

RKSimon planned changes to this revision.Sep 11 2023, 9:53 AM

Still a few more regressions to address - I'll be back :)

llvm/test/CodeGen/X86/pr22970.ll
18–19 ↗(On Diff #556412)

Yes, I missed these - it looks like we're losing NSW/NUW flags on the ADD when it gets truncated.

llvm/test/CodeGen/X86/pr38217.ll
32–34 ↗(On Diff #556412)

Similar issue - we lose the NUW flag on the shl(x,1) on truncation and value tracking can't recover later on.

llvm/test/CodeGen/X86/shift-combine.ll
106–107

lshr not shl

llvm/test/CodeGen/X86/vector-shuffle-variable-128.ll
258–280

I'll see if I can isolate the change - I'm not certain if its something to do with LowerBUILD_VECTORAsVariablePermute or something more generic.

RKSimon updated this revision to Diff 556910.Sep 17 2023, 6:28 AM

Add NSW/NUW flags for truncated SHL node when we can

goldstein.w.n added inline comments.Sep 19 2023, 11:23 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
1797

nit: I would make a variable HalfWidth = BitWidth / 2 to avoid duplicating it alot in the condition (and needing parens and such).

1802

So you essentially end up doing recursive computeKnownBits 3 times here. Once in MaskedValueIsZero, once for ComputeNumSignBits and once outright.
I'd say after all the legalization / basic checks it would make more sense to compute knownbits once and do MaskValueIsZero / ComputeNumSignBits by hand with the knownbits.

RKSimon added inline comments.Sep 20 2023, 7:44 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
1802

Yes I can merge the computeKnownBits / MaskedValueIsZero.

I don't understand why you think computeNumSignBits can be merged as well though? It only internally falls back to computeKnownBits sometimes.

goldstein.w.n added inline comments.Sep 20 2023, 9:17 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
1802

Yes I can merge the computeKnownBits / MaskedValueIsZero.

I don't understand why you think computeNumSignBits can be merged as well though? It only internally falls back to computeKnownBits sometimes.

You're right. Although you could probably edit computeNumSignBits to output KnownBits as well (as it seems we either have a constant or check against computeKnownBits).
But yeah feel free to skip for now.

RKSimon updated this revision to Diff 557309.Sep 25 2023, 9:26 AM

rebase - the only remaining codegen diff I'm not happy with is the or-address.ll change - we have different oneuse handling for SHL and ZEXT(SHL) cases which is going to take a bit longer to cleanup

RKSimon updated this revision to Diff 557484.Sep 29 2023, 8:09 AM

Fix remaining or-address.ll regression

Can the changes to X86ISelDAGToDAG.cpp and X86ISelLowering.cpp that are there to handle the regressions be split to a prior patch?
They look properly standalone.

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
1799

Nit: Took me a bit to find this. Would think it would be more natural to put just ahead of NewShift (where it's used).

1803

Hmm, so with the optimization to do the knownbits methods by hand, the code is definetly harder to follow. Can you add a few comments explaining the checks taking place?

RKSimon updated this revision to Diff 557522.Oct 1 2023, 10:22 AM

address feedback and rebase - I'm still investigating how to handle more of the X86-specific changes in pre-commits

address feedback and rebase - I'm still investigating how to handle more of the X86-specific changes in pre-commits

Do they cause regressions if committed independently?

The actual shift truncation code looks good. Would prefer if the other changes where an independent commit, but if thats not feasible that shouldn't be a blocker.

address feedback and rebase - I'm still investigating how to handle more of the X86-specific changes in pre-commits

Do they cause regressions if committed independently?

They currently cause no test changes against trunk - I need to create a few tests that I'm happy with and that should cover most of the X86 changes so I will be able to pre-commit those. The DAGCombiner.cpp change might have to stay though.

RKSimon updated this revision to Diff 557534.Oct 2 2023, 8:10 AM
RKSimon edited the summary of this revision. (Show Details)

rebase

RKSimon updated this revision to Diff 557564.Oct 3 2023, 9:34 AM
RKSimon edited the summary of this revision. (Show Details)

rebase - any more comments?

This revision is now accepted and ready to land.Oct 3 2023, 12:12 PM
This revision was automatically updated to reflect the committed changes.
nikic added a subscriber: nikic.Oct 4 2023, 5:39 AM
nikic added inline comments.
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
1799

You shouldn't be directly calling computeKnownBits in SimplifyDemandedBits. Your new code needs to be moved after the SimplifyDemandedBits call below and use the Known it already computes for Op0.

RKSimon added inline comments.Oct 4 2023, 5:40 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
1799

I'll look into it - there are a number of cases where we don't do this though.

this is causing check-clang failures in a x86-64 stage 2 build, e.g. clang/test/Parser/cxx2b-lambdas.cpp. it showed up on a bot and I managed to repro locally, so it's likely reproducible

a lot of the crashes seem to be on this line, perhaps that code has UB? could we revert this while investigating?

yubing added a subscriber: yubing.Thu, Nov 23, 1:55 AM

there is a case that has peformance regression
ISEL trying to select load into VMOVSDZrm_alt by calculating base, index, scale, disp.

before this patch, it can easily find t116 as index:
t95: i64,ch = load<(load (s16) from %ir.5, !tbaa !33), zext from i16> t0, t6, undef:i64
t116: i64 = X86ISD::MUL_IMM t95, Constant:i64<5> index = t116
t117: i64 = shl t116, Constant:i8<3>
scale = 8
t54: i64 = add t16, t117 base = t16
t55: i64 = add nuw t54, Constant:i64<16>
disp=16 t56: f64,ch = load<(load (s64) from %ir.34, !tbaa !37)> t0, t55, undef:i64

>

t56: f64,ch = VMOVSDZrm_alt<Mem:(load (s64) from %ir.34, !tbaa !37)> t16, TargetConstant:i8<8>, t116, TargetConstant:i32<16>, Register:i16 $noreg, t0

after this patch, ISEL can only find t127 as index instead of t125 since ISEL is not sure the high 32bits of t125 is zero. Thus, extra instruction are created.

t95: i64,ch = load<(load (s16) from %ir.5, !tbaa !33), zext from i16> t0, t6, undef:i64
t125: i64 = X86ISD::MUL_IMM t95, Constant:i64<5>
t127: i32 = truncate t125 index= " t140: i64 = zero_extend t127 is created"
t128: i32 = shl nuw nsw t127, Constant:i8<3>
scale=8
t129: i64 = zero_extend t128
t54: i64 = add t16, t129 base=t16 t55: i64 = add nuw t54, Constant:i64<16> disp=16
t56: f64,ch = load<(load (s64) from %ir.34, !tbaa !37)> t0, t55, undef:i64

>

t56: f64,ch = VMOVSDZrm_alt<Mem:(load (s64) from %ir.34, !tbaa !37)> t16, TargetConstant:i8<8>, t140, TargetConstant:i32<16>, Register:i16 $noreg, t0

i guess writing td's pattern is not quite flexible and it cannot use SImplifyDemandedBits. @RKSimon do you have any idea to handle such case?

@yubing Try stepping through X86DAGToDAGISel::matchAddressRecursively - its likely we're missing a case now that a zext has been inserted someplace.

llvm/test/CodeGen/AMDGPU/vgpr-liverange-ir.ll