This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Treat SelectInsts as reduction values.
ClosedPublic

Authored by chatur01 on Oct 21 2015, 10:12 AM.

Details

Summary

Certain workloads, in particular sum-of-absdiff loops, can be vectorized using SLP if it can treat select instructions as reduction values.

The test case is a bit awkward. The AArch64 cost model needs some tuning to not be so pessimistic about selects. I've had to tweak the SLP threshold here.

Diff Detail

Repository
rL LLVM

Event Timeline

chatur01 retitled this revision from to [SLP] Treat SelectInsts as reduction values..
chatur01 updated this object.
chatur01 set the repository for this revision to rL LLVM.
chatur01 added a subscriber: llvm-commits.

The patch looks okay. The problem of adding new instructions to the list of ‘reductions’ is that it increases the compile time of the SLP-vectorizer. Do you know how effective is this pattern? Do you know how many times it hits when you compile the llvm test suite?

Do you know how effective is this pattern? Do you know how many times it hits when you compile the llvm test suite?

My testing has been on ARM and AArch64 targets, this patch has no effect on performance yet for the benchmarks I have access to. To vectorize the more general pattern, there are few more changes I need to propose, including searching for reductions in multi-block loops, and trying different reduction widths. I was going to go through them one at a time.

With the other changes, the biggest improvements are in MPEG encoders in third party benchmarks. I haven't yet measured the compile time impact of this particular change, I will get some numbers for this. Thanks for your review!

--Charlie.

Hi Nadav, I've anaysed my patches in more detail now. Sorry for how long it took me to do it!

This patch doesn't affect compile-time performance in any significant way.

My testing methodology for LNT was to run the following with and without my changes, pegged on just one A57 CPU:

$ taskset -c 5 ./bin/lnt runtest nt --sandbox $SANDBOX --cc=$COMPILER --cxx=$COMPILER '--cflag=-mcpu=cortex-a57 -mllvm -slp-vectorize-hor -mllvm -slp-vectorize-hor-store -Wl,--allow-multiple-definition' --test-suite $TEST_SUITE_DIR --no-timestamp --make-param=ARCH=AArch64 --make-param=ENDIAN=little --make-param=RUNTIMELIMIT=7200 '--make-param=RUNUNDER=taskset -c 5' --multisample=1 --threads 1 --build-threads 1 --benchmarking-only --use-perf 1

I then collected all the compile time results from the two runs and compared them. The differences were all within noise.

The methodology for SPEC{2000,2006} was to use their runspec drivers, and to time each benchmark's build "action" from a clean installation with and without my patch. There were also no significant differences in these benchmarks.

There are no significant differences in performance with this patch across SPEC{2000,2006} and LNT.

nadav edited edge metadata.Oct 26 2015, 9:18 AM
nadav added a subscriber: nadav.

Okay. Please go ahead and commit this patch. I do have one minor comment:

BinaryOperator *Next = dyn_cast<BinaryOperator>(NextV);
if (Next)
  Stack.push_back(std::make_pair(Next, 0));

+ else if (SelectInst *SelI = dyn_cast<SelectInst>(NextV))
+ Stack.push_back(std::make_pair(SelI, 0));

  else if (NextV != Phi)
    return false;
}

You can rewrite this code using ‘isa<BinaryOperator>(I) || isa<SelectInst>(I)’ and use a single push_back instruction.

It is also a good idea to add some comments to this code.

-Nadav

chatur01 edited edge metadata.

Address review comments.

I have another change to pass by review before I start landing these changes.
It's about choosing reduction widths. I'll put it up for review tomorrow. It
might be the wrong approach, so I don't want land half-finished chains of
thought :-)

Thanks a lot for your time!

--Charlie.

This revision was automatically updated to reflect the committed changes.