This is an archive of the discontinued LLVM Phabricator instance.

Detecte vector reduction operations just before instruction selection.
ClosedPublic

Authored by congh on Dec 4 2015, 3:43 PM.

Details

Summary

This patch detects vector reductions before instruction selection. Vector reductions are vectorized reduction operations, and for such operations we have freedom to reorganize the elements of the result as long as the reduction of them stay unchanged. This will enable some reduction pattern recognition during instruction combine such as SAD/dot-product on X86. A flag is added to SDNodeFlags to mark those vector reduction nodes to be checked during instruction combine.

To detect those vector reductions, we search def-use chains starting from the given instruction, and check if all uses fall into two categories:

  1. Reduction with another vector.
  2. Reduction on all elements.

in which 2 is detected by recognizing the pattern that the loop vectorizer generates to reduce all elements in the vector outside of the loop, which includes several ShuffleVector and one ExtractElement instructions.

Please checkout http://lists.llvm.org/pipermail/llvm-dev/2015-November/092379.html for discussions on this topic.

Diff Detail

Repository
rL LLVM

Event Timeline

congh updated this revision to Diff 41954.Dec 4 2015, 3:43 PM
congh updated this revision to Diff 41956.
congh retitled this revision from to Detecte vector reduction operations just before instruction selection..
congh updated this object.
congh added reviewers: hfinkel, spatel, RKSimon, davidxl.
congh added a subscriber: llvm-commits.

Fix a small comment format.

hfinkel added inline comments.Dec 10 2015, 11:35 AM
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2333 ↗(On Diff #41956)

We need to also grab FAdd and FMul when appropriate fast-math flags are present.

2377 ↗(On Diff #41956)

For PHIs, don't we need to check that the initial value is the operation identify value (0 for add, 1 for multiply, etc.)? And that there are only two unique incoming blocks?

2410 ↗(On Diff #41956)

We should also catch the way that this is often programmed "by hand":

typedef float v4f __attribute__((vector_size(16)));
v4f foo(void);
float bar() {
  v4f phi = { 0, 0, 0, 0 };
  for (int i = 0; i < 1600; ++i)
    phi += foo();

  return phi[0] + phi[1] + phi[2] + phi[3];
}

thus, we have multiple extracts instead of shuffles. One might argue that we should canonicalize this to the shuffle form, but we don't have anything that does this currently (AFAIK). What do you think?

2457 ↗(On Diff #41956)

You need to set this flag on the PHI node too (meaning the associate CopyFromReg, etc. instructions), right?

test/CodeGen/Generic/vector-redux.ll
2 ↗(On Diff #41956)

You need to add:

; REQUIRES: asserts

here because you're checking debug output which is available only in +Asserts builds.

congh marked an inline comment as done.Dec 10 2015, 2:53 PM
congh added inline comments.
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2377 ↗(On Diff #41956)

I think this may not be necessary. We only care about how the values in the vector are used or how they flow to other defs. They can flow to phi nodes, but we need to check all uses of those phi nodes. Otherwise before reduction on elements, they can only flow to other defs of the same associative operation (e.g. add). As long as the reduction on elements is the only def where they are flowing to, it is safe to assume the operation is a vector-reduction one.

2410 ↗(On Diff #41956)

OK, as long as this pattern often appears. After all, programs made by hand can have many different forms. For example, we can use target specific intrinsics (like movehl or hadd on X86) in the final reduction. As we don't have canonicalization now, I agree that we should make this function open to support more patterns.

2457 ↗(On Diff #41956)

Yes, we can do this, but I am wondering how we use the flag on PHI node during instruction combine (in my case I only check flags on operation nodes, but I think there may be other cases in which reduction PHI node can help?).

test/CodeGen/Generic/vector-redux.ll
2 ↗(On Diff #41956)

OK. Thanks for pointing it out!

hfinkel added inline comments.Dec 10 2015, 3:21 PM
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2377 ↗(On Diff #41956)

Sounds good.

2410 ↗(On Diff #41956)

Thinking about it, I think it is better to canonicalize these in instcombine. The shuffle form will use fewer instructions in all cases (except the two-element case). Don't bother matching it here, we should add canonicalization to instcombine in a separate patch.

FWIW, however, yes, I've seen code like this from several users at our facility.

2457 ↗(On Diff #41956)

Okay; don't bother then. I was thinking you needed it for isel, but if not, wait until we have a concrete use case.

congh updated this revision to Diff 42479.Dec 10 2015, 4:18 PM

Update the patch according to Hal's comments.

FAdd and FMul are now supported when fast-math is used.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2410 ↗(On Diff #41956)

I agree. Then I will not modify this part now.

spatel added inline comments.Dec 11 2015, 9:13 AM
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2410 ↗(On Diff #42479)

For reference, I filed this as PR25808 so I could link it to some other reduction bugs:
https://llvm.org/bugs/show_bug.cgi?id=25808

congh added inline comments.Dec 11 2015, 10:45 AM
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2410 ↗(On Diff #42479)

Thanks for filing the bug!

suyog added a subscriber: suyog.Dec 14 2015, 12:15 AM
suyog added inline comments.
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2410 ↗(On Diff #42479)
congh added a comment.Dec 14 2015, 8:45 PM

Ping?

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2410 ↗(On Diff #42479)

Yes, this is a special case of reduction-on-elements operations.

Ping again?

RKSimon edited edge metadata.Jan 5 2016, 10:22 AM

One minor think I noticed, but overall I don't know this code well enough to review, sorry.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2338 ↗(On Diff #42479)

return false here or add a 'fall through' comment here to make it clear that its intentional

congh updated this revision to Diff 44149.Jan 6 2016, 12:52 PM
congh edited edge metadata.

Added a fall-through comment as suggested by Simon.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2338 ↗(On Diff #42479)

OK, thanks!

congh added a comment.Jan 6 2016, 12:53 PM

One minor think I noticed, but overall I don't know this code well enough to review, sorry.

Thanks for the review, Simon! I will let Hal or others to approve this patch.

hfinkel added inline comments.Jan 15 2016, 6:16 AM
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2304 ↗(On Diff #44149)

'reorganize' sounds a bit weak here. How about reorganize -> alter

2305 ↗(On Diff #44149)

stay -> stays

2336 ↗(On Diff #44149)

'same arithmetic' sounds odd. Maybe just say 'same opcode'.

2341 ↗(On Diff #44149)

do -> does a

2347 ↗(On Diff #44149)

reduction -> a reduction

2349 ↗(On Diff #44149)

insturctions (typo)

2349 ↗(On Diff #44149)

Remove 'supposed to be'

2367 ↗(On Diff #44149)

We also need to check the fast-math flags on fadd/fmul here.

2367 ↗(On Diff #44149)

We can allow selects here for the same reason we can allow phis, right? If so, we should.

2375 ↗(On Diff #44149)

Do we need to check that we don't have ElemNumToReduce == 1 here?

2398 ↗(On Diff #44149)

When ElemNumToReduce == 1 here, do we need to check that the only user is an ExtractElementInst? I'm concerned that it could be a another reduction operation before the extract.

hfinkel added inline comments.Jan 15 2016, 6:21 AM
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2367 ↗(On Diff #44149)

(Specifically, I mean selects with a scalar condition; I assume that selects with a vector condition would not work here).

RKSimon resigned from this revision.Jan 16 2016, 2:08 PM
RKSimon removed a reviewer: RKSimon.
congh marked 8 inline comments as done.Jan 19 2016, 12:56 PM
congh added inline comments.
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2367 ↗(On Diff #44149)

When I wrote this pattern recognition, I am more concerned about the compiler vectorized code, which normally won't have selects. We detect phi because it can appear in the beginning of the loop, which is not the case of selects. Do you think if we should make this pattern recognition complicated enough to catch all cases that are manually composed?

2375 ↗(On Diff #44149)

I am not sure if it could happen, but I added such a check just in case.

2398 ↗(On Diff #44149)

I think even there is another reduction operation just before extract, it is still OK. We only care about if all values in vector are reduced to one element and this is still the case.

congh updated this revision to Diff 45294.Jan 19 2016, 12:58 PM

Update the patch according to Hal's comments.

hfinkel accepted this revision.Jan 26 2016, 3:18 PM
hfinkel edited edge metadata.

Please make sure the select case if handled (autovectorization test case provided below); otherwise, LGTM. Thanks!

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2367 ↗(On Diff #45294)

I understand, but this comes up in autovectorized code as well. Here's a quick example:

$ cat /tmp/v.c 
int foo(int * restrict a1, int * restrict a2, int * restrict a3, int * restrict a4, int * restrict a5,
        int * restrict a6, int * restrict a7, int * restrict a8, int * restrict a9, int * restrict a10,
        int * restrict a11, int * restrict a12, int * restrict a13, int * restrict a14, int * restrict a15,
        int * restrict a16, int * restrict a17, int * restrict a18, int * restrict a19, int * restrict a20,
        int * restrict a21, int * restrict a22, int * restrict a23, int * restrict a24, int * restrict a25,
        int * restrict a26, int * restrict a27, int * restrict a28, int * restrict a29, int * restrict a30,
        int * restrict b, int * restrict c, int x) {
  int r = 0;
  for (int i = 0; i < 1600; ++i)
    // Lots of other stuff to prevent loop unswitching from kicking in.
    r += a1[i] + a2[i] + a3[i] + a4[i] + a5[i] +
         a6[i] + a7[i] + a8[i] + a9[i] + a10[i] +
         a11[i] + a12[i] + a13[i] + a14[i] + a15[i] +
         a16[i] + a17[i] + a18[i] + a19[i] + a20[i] +
         a21[i] + a22[i] + a23[i] + a24[i] + a25[i] +
         a26[i] + a27[i] + a28[i] + a29[i] + a30[i] +
         b[i] + c[i] + (x > 5 ? b[i] : c[i]);

  return r;
}

Look at the IR from:

$ clang -target powerpc64 -mcpu=pwr7 -O3 -S -emit-llvm -fno-unroll-loops -o - /tmp/v.c 

and you'll see:

  %64 = select i1 %cmp93, <4 x i32> %wide.load170, <4 x i32> %wide.load171
...
  %93 = add <4 x i32> %92, %wide.load168
%94 = add <4 x i32> %93, %wide.load169
%95 = add <4 x i32> %94, %wide.load170
%96 = add <4 x i32> %95, %wide.load171
%97 = add <4 x i32> %96, %64
...

And we really should handle this case.

This revision is now accepted and ready to land.Jan 26 2016, 3:18 PM

Please make sure the select case if handled (autovectorization test case provided below); otherwise, LGTM. Thanks!

Sorry for replying your comments so late as I just finished a long vacation. In your proposed case, the definitions of reduction operations are never used by the select instruction, making the select instruction not an obstacle when detecting reduction operations (note that we only care how the definition of each reduction operation is used in the def-use chain).

This revision was automatically updated to reflect the committed changes.