This is an archive of the discontinued LLVM Phabricator instance.

[DAG]Introduce llvm::processShuffleMasks and use it for shuffles in DAG Type Legalizer.
ClosedPublic

Authored by ABataev on Dec 13 2021, 10:42 AM.

Details

Summary

We can process the long shuffles (working across several actual
vector registers) in the best way if we take the actual register
represantion into account. We can build more correct representation of
register shuffles, improve number of recognised buildvector sequences.
Also, same function can be used to improve the cost model for the
shuffles. in future patches.

Part of D100486

Diff Detail

Event Timeline

ABataev created this revision.Dec 13 2021, 10:42 AM
ABataev requested review of this revision.Dec 13 2021, 10:42 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 13 2021, 10:42 AM
craig.topper added inline comments.Dec 16 2021, 4:18 PM
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
2171

If you change the case here. Use DL. As far as I know SelectionDAG uses DL or dl.

2205–2206

InputUsed no longer exists

2464

Is this only used for DAG?

2471

Can this use SDValue::operator bool() to see if it has already been assigned instead of using a separate flag.

llvm/lib/Target/ARM/ARMISelLowering.cpp
394 ↗(On Diff #394987)

Does ARM already have code that just does the right thing for these types?

ABataev added inline comments.Dec 17 2021, 6:24 AM
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
2171

Ok, will rename it to DL

2205–2206

Will be fixed.

2464

Yes

2471

I'll check if it can be improved

llvm/lib/Target/ARM/ARMISelLowering.cpp
394 ↗(On Diff #394987)

Actually, need advice here. Without this change, there is an infinite loop for ARM tests for v2f64 type. VECTOR_SHUFFLE is assigned Expand action, so it is expanded, then it is transformed back to vector (setOperationAction(ISD::SCALAR_TO_VECTOR, MVT::v2f64, Legal);), then again expanded, etc. I assume that it is legal for v2f64 type if ISD::SCALAR_TO_VECTOR is legal. But better to have some input here from ARM folks.

ABataev updated this revision to Diff 395121.Dec 17 2021, 6:49 AM

Address comments

Adding @dmgreen who might be able to advise on the MVE codegen regressions and ARM infinite loops.

Sorry for the delay - I hadn't realised this was something I should look at. I can make the change blocking you here. Some of the other results look larger, but I'm not sure I understand how the patch is changing things enough to know why. The Thumb2 tests that change are not super important in themselves (other than they don't crash!). They are not expected to come up commonly in practice, but it might show something unexpected is happening. Either with this patch or the way Arm tries to lower buildvectors/shuffles.

llvm/lib/Analysis/VectorUtils.cpp
529

This FirstReg doesn't appear to be used anywhere.

llvm/lib/Target/ARM/ARMISelLowering.cpp
394 ↗(On Diff #394987)

The change looks like it will help, especially for i64 types that can now go via ARMISD::BUILD_VECTOR. We should make this change separately though. I'll see if I can do that today.

ABataev updated this revision to Diff 406431.Feb 7 2022, 6:34 AM

Rebase + address comments

I've read this patch a few times now but don't think I understand it enough to give great feedback I'm afraid, other than can we try and explain the details in the "2 or more" case a little more? This constructs a triple nested vector, most of which is expected to be empty?

I like the idea of sharing processShuffleMasks between the codegen and costmodel. That sounds like a nice way of doing things.

llvm/lib/Analysis/VectorUtils.cpp
531

I don't think splitted is a word. I think it would just be split, if my english isn't letting me down.

550–552

I'm not sure I understand this code very well, and the comment isn't very clear to me. Can you try and expand the comments a bit? And maybe giving It a better name, that kind of thing.

ABataev updated this revision to Diff 410401.Feb 21 2022, 3:14 PM

Address comments

Thanks for the update.

It looks like the Arm code is just getting stuck in various inefficient ARMISD::BUILD_VECTOR missing folds. It was previously going through the useBuildVector=true path so creating BUILDVECTOR earlier. Considering what they are testing, that seems fine.

I have no objections to this, if the X86 folks are OK with this.

@ABataev Please can you rebase? I'm not sure if my recent shuffle combine changes will have affected this

@ABataev Please can you rebase? I'm not sure if my recent shuffle combine changes will have affected this

Sure, will do later today

Herald added a project: Restricted Project. · View Herald TranscriptMar 7 2022, 8:57 AM

I had never looked at the existing code, and I find that confusing already, so some high-level questions:

  1. For legalization, it seems like the code was written with the potential to split by more than a factor of 2, but that can't happen with the current interface (hard-coded to output Lo and Hi halves). Is the new code intended to allow splitting by >2?
  2. In SDAG, we always have equal size input/output vector for shuffle node. Is the new code planned to be used with the IR version of shuffles (where input vectors may have different size than output)?
  3. IIUC, the interesting case is when we need to shuffle together from >2 inputs. Would it make sense to handle the easy cases (0,1,2 inputs) in a separate function?

For example for #3, I think we could handle the simplest (NoInput) pattern before the main loop with something like:

unsigned MaskSize = N->getMask().size();
ArrayRef<int> LoMask = N->getMask().drop_back(NewElts);
if (all_of(LoMask, [MaskSize](int MaskElt) { return (unsigned)MaskElt >= MaskSize * 2; })) {
  Lo = DAG.getUNDEF(NewVT);
}
ArrayRef<int> HiMask = N->getMask().drop_front(NewElts);
if (all_of(HiMask, [MaskSize](int MaskElt) { return (unsigned)MaskElt >= MaskSize * 2; })) {
  Hi = DAG.getUNDEF(NewVT);
}

I had never looked at the existing code, and I find that confusing already, so some high-level questions:

  1. For legalization, it seems like the code was written with the potential to split by more than a factor of 2, but that can't happen with the current interface (hard-coded to output Lo and Hi halves). Is the new code intended to allow splitting by >2?

Yes, that was the idea.

  1. In SDAG, we always have equal size input/output vector for shuffle node. Is the new code planned to be used with the IR version of shuffles (where input vectors may have different size than output)?

Yes.

  1. IIUC, the interesting case is when we need to shuffle together from >2 inputs. Would it make sense to handle the easy cases (0,1,2 inputs) in a separate function?

For example for #3, I think we could handle the simplest (NoInput) pattern before the main loop with something like:

unsigned MaskSize = N->getMask().size();
ArrayRef<int> LoMask = N->getMask().drop_back(NewElts);
if (all_of(LoMask, [MaskSize](int MaskElt) { return (unsigned)MaskElt >= MaskSize * 2; })) {
  Lo = DAG.getUNDEF(NewVT);
}
ArrayRef<int> HiMask = N->getMask().drop_front(NewElts);
if (all_of(HiMask, [MaskSize](int MaskElt) { return (unsigned)MaskElt >= MaskSize * 2; })) {
  Hi = DAG.getUNDEF(NewVT);
}

If I understood your question correctly, NoInputAction, SingleInputAction and ManyInputsAction params are exactly for this.

If I understood your question correctly, NoInputAction, SingleInputAction and ManyInputsAction params are exactly for this.

I'm wondering when the NoInput or SingleInput behavior would be something other than return of undef or a unary shuffle. If we never diverge from that, then we don't need to complicate the interface with those parameters?

If I understood your question correctly, NoInputAction, SingleInputAction and ManyInputsAction params are exactly for this.

I'm wondering when the NoInput or SingleInput behavior would be something other than return of undef or a unary shuffle. If we never diverge from that, then we don't need to complicate the interface with those parameters?

The idea is to unify interface not only here, but also use the same function for the cost of shuffles. Generally speaking, I can merge these actions into single one, but just would like to simplify the functionality of the actions for the user.

@lebedev.ri Some of the x86 interleaved codegen sees some regressions from this - are you at all familiar with any of it?

@lebedev.ri Some of the x86 interleaved codegen sees some regressions from this

Yep, so i've seen. This is why we *desperately* *desperately* need 'live' costmodel that would be automatically produced from current codegen.

are you at all familiar with any of it?

Not particularly. I can only state the obvious: we are probably missing a lot of combines to recover from whatever the changes were to the legalization.

@lebedev.ri Some of the x86 interleaved codegen sees some regressions from this - are you at all familiar with any of it?

Will try to investigate it later this week

What about if we split the addition of VectorUtils processShuffleMasks off from the DAG legalization code - that way we can get D100486 moving again and address the DAG regressions separately?

What about if we split the addition of VectorUtils processShuffleMasks off from the DAG legalization code - that way we can get D100486 moving again and address the DAG regressions separately?

Anyway it is good to investigate regressions before that, may help improve final code. I'll try to make it ASAP. Better doublecheck rather than doing reverts again and again.

Some topics that may need to be considered for cost modelling purposes:
What about X86 AVX512 2-input shuffles?
What about the available ISA set? (=vector size, shuffle granularity (element size))
What about other legalization problems? (AVX2 pshub can't pick from another XMM half of YMM input.)

ABataev updated this revision to Diff 416610.Mar 18 2022, 2:32 PM

Improve shuffling

ABataev updated this revision to Diff 417431.Mar 22 2022, 4:26 PM

Some extra improvements

Thanks @ABataev this looks like its almost there - please can you do a run of the test suite to see if there are any noteworthy changes?

Thanks @ABataev this looks like its almost there - please can you do a run of the test suite to see if there are any noteworthy changes?

Sure, will try to do it ASAP.

Thanks @ABataev this looks like its almost there - please can you do a run of the test suite to see if there are any noteworthy changes?

Did some changes in the analysis. Still need to improve the split process and instead bisecting implement immediate splitting to the register sizes. This should fix the regressions.
Some results. Did not gather perf changes, the system is too busy, instead code size changes.

march=native(Skylake), -O3+LTO

Metric: size..text

Program                                                                                                           results  results0 diff
                                              test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/CoMD/CoMD.test    45610    45866  0.6%
                                                       test-suite :: SingleSource/Benchmarks/SmallPT/smallpt.test     5907     5939  0.5%
                                          test-suite :: SingleSource/UnitTests/Vector/SSE/Vector-sse.stepfft.test     8166     8198  0.4%
                                    test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   201680   202256  0.3%
                                                 test-suite :: MultiSource/Benchmarks/McCat/12-IOtest/iotest.test     9155     9171  0.2%
                                                 test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test   441951   442523  0.1%
                                                   test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test   589569   590305  0.1%
                                                      test-suite :: MultiSource/Applications/ClamAV/clamscan.test   585848   586296  0.1%
                                                      test-suite :: SingleSource/UnitTests/matrix-types-spec.test   220022   220118  0.0%
                                                    test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  1122554  1123018  0.0%
                                      test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test    41667    41683  0.0%
                                             test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test    41674    41690  0.0%
                                                     test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   405280   405360  0.0%
                                                     test-suite :: MultiSource/Applications/JM/lencod/lencod.test   901096   901256  0.0%
                                                        test-suite :: MultiSource/Applications/oggenc/oggenc.test   192635   192667  0.0%
                                           test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2116993  2117265  0.0%
                                                          test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   317150   317182  0.0%
                                                      test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test   170427   170443  0.0%
                                            test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test   236760   236776  0.0%
                                            test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1141289  1141353  0.0%
                                           test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  1165233  1165297  0.0%
                                   test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test   301095   301111  0.0%
                                        test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1422763  1422811  0.0%
                                         test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1422763  1422811  0.0%
                                   test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  2129557  2129605  0.0%
                                    test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  2129557  2129605  0.0%
                                         test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12493380 12493572  0.0%

Regressions:

 test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   611440   611408 -0.0%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   611440   611408 -0.0%

Some extra loads of the patterns, I do not expect significant perf regressions here.

                       test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test   100502   100486 -0.0%
                test-suite :: MultiSource/Benchmarks/TSVC/ControlFlow-flt/ControlFlow-flt.test   107636   107604 -0.0%
                test-suite :: MultiSource/Benchmarks/TSVC/ControlFlow-dbl/ControlFlow-dbl.test   107156   107124 -0.0%
          test-suite :: MultiSource/Benchmarks/TSVC/GlobalDataFlow-dbl/GlobalDataFlow-dbl.test   103066   103034 -0.0%
      test-suite :: MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl.test   102066   102034 -0.0%
                    test-suite :: MultiSource/Benchmarks/TSVC/Expansion-dbl/Expansion-dbl.test   101516   101484 -0.0%
              test-suite :: MultiSource/Benchmarks/TSVC/ControlLoops-dbl/ControlLoops-dbl.test   100904   100872 -0.0%
            test-suite :: MultiSource/Benchmarks/TSVC/Equivalencing-dbl/Equivalencing-dbl.test   100613   100581 -0.0%
  test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl.test   100609   100577 -0.0%
    test-suite :: MultiSource/Benchmarks/TSVC/LoopRestructuring-dbl/LoopRestructuring-dbl.test   100480   100448 -0.0%
                    test-suite :: MultiSource/Benchmarks/TSVC/Expansion-flt/Expansion-flt.test   100352   100320 -0.0%
    test-suite :: MultiSource/Benchmarks/TSVC/InductionVariable-dbl/InductionVariable-dbl.test   100344   100312 -0.0%
    test-suite :: MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt.test   100212   100180 -0.0%
            test-suite :: MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl.test    98220    98188 -0.0%
                    test-suite :: MultiSource/Benchmarks/TSVC/Symbolics-dbl/Symbolics-dbl.test    98136    98104 -0.0%
  test-suite :: MultiSource/Benchmarks/TSVC/IndirectAddressing-dbl/IndirectAddressing-dbl.test    98136    98104 -0.0%
                        test-suite :: MultiSource/Benchmarks/TSVC/Packing-dbl/Packing-dbl.test    97784    97752 -0.0%
            test-suite :: MultiSource/Benchmarks/TSVC/LoopRerolling-dbl/LoopRerolling-dbl.test    97448    97416 -0.0%
      test-suite :: MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt.test    96918    96886 -0.0%
                test-suite :: MultiSource/Benchmarks/TSVC/Recurrences-dbl/Recurrences-dbl.test    96700    96668 -0.0%
test-suite :: MultiSource/Benchmarks/TSVC/StatementReordering-dbl/StatementReordering-dbl.test    96428    96396 -0.0%
  test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt.test    96229    96197 -0.0%
                    test-suite :: MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl.test    95896    95864 -0.0%
    test-suite :: MultiSource/Benchmarks/TSVC/InductionVariable-flt/InductionVariable-flt.test    94844    94812 -0.0%
              test-suite :: MultiSource/Benchmarks/TSVC/ControlLoops-flt/ControlLoops-flt.test    94396    94364 -0.0%
            test-suite :: MultiSource/Benchmarks/TSVC/Equivalencing-flt/Equivalencing-flt.test    94169    94137 -0.0%
            test-suite :: MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt.test    91872    91840 -0.0%
                    test-suite :: MultiSource/Benchmarks/TSVC/Symbolics-flt/Symbolics-flt.test    91692    91660 -0.0%
  test-suite :: MultiSource/Benchmarks/TSVC/IndirectAddressing-flt/IndirectAddressing-flt.test    91564    91532 -0.0%
                        test-suite :: MultiSource/Benchmarks/TSVC/Packing-flt/Packing-flt.test    91324    91292 -0.0%
                test-suite :: MultiSource/Benchmarks/TSVC/Recurrences-flt/Recurrences-flt.test    90208    90176 -0.0%
test-suite :: MultiSource/Benchmarks/TSVC/StatementReordering-flt/StatementReordering-flt.test    90048    90016 -0.0%
                    test-suite :: MultiSource/Benchmarks/TSVC/Searching-flt/Searching-flt.test    89390    89358 -0.0%
                                       test-suite :: MultiSource/Benchmarks/nbench/nbench.test    75643    75611 -0.0%
                  test-suite :: MultiSource/Benchmarks/TSVC/Reductions-dbl/Reductions-dbl.test    99871    99823 -0.0%
                  test-suite :: MultiSource/Benchmarks/TSVC/Reductions-flt/Reductions-flt.test    93429    93381 -0.1%
            test-suite :: MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt.test    91676    91612 -0.1%

TSVC - inner loop body has less instructions, the preheader is larger because of movement of some patterns loading.
nbench - no significant changes
jpeg/jpeg-6a - no significant changes

Generic, -O3+LTO

Metric: size..text
                                          test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test    88982    89094  0.1%
                                         test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1332828  1333276  0.0%
                                        test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1332828  1333276  0.0%
                                                   test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test   502161   502177  0.0%
                                         test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12015748 12015828  0.0%
                                                       test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test   154226   154226  0.0%

The perf changes are about 1-2% perf improvements (433.milc has ~10% but I would not rely on it).

Regressions:

                      test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   348030   347966 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test   102840   102632 -0.2%
      test-suite :: SingleSource/UnitTests/Vector/SSE/Vector-sse.stepfft.test     3361     3345 -0.5%

Bullet - 3 extra instructions in the loop, need to improve final lowering but should not affect performance too much (2 unpkl are replaced by 4 shufps and 1 movq)
consumer-jpeg - no significant changes actually, just changed the order of shuffles after combining extractvector instructions.
Vector-sse.stepfft - 2 unpck replaced by 4 shufps, no perf changes expected

The updated patch will be uploaded later today.

ABataev updated this revision to Diff 420632.Apr 5 2022, 2:19 PM

Rebase + improvements

I think this is ready now - I've made a note of the changes in pr44976.ll and splat-for-size.ll and I'm intending to address them in x86 shuffle combining.

Any more comments?

I'm not sure I understand all the ramifications of this, but the Arm/AArch64 tests look OK to me (or I have plans to do something to improve them).

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
2213

SDValue can usually be passed by value just fine.

ABataev updated this revision to Diff 421271.Apr 7 2022, 10:27 AM

Pass SDValue by value.

ABataev marked an inline comment as done.Apr 7 2022, 10:27 AM
RKSimon accepted this revision.Apr 8 2022, 2:10 AM

LGTM

This revision is now accepted and ready to land.Apr 8 2022, 2:10 AM

Sure, will check/update all the tests before pushing the patch, thanks. Hope to do it this week (travelling).

This revision was landed with ongoing or failed builds.Apr 20 2022, 5:50 AM
This revision was automatically updated to reflect the committed changes.
llvm/test/CodeGen/X86/vector-interleaved-store-i16-stride-4.ll