This is an archive of the discontinued LLVM Phabricator instance.

AVX-512: shuffle mask vectors <2 x i1> .. <64 x i1>
ClosedPublic

Authored by delena on Sep 9 2015, 6:27 AM.

Details

Summary

Implemented lowering for shufflevector for i1 vectors.
AVX-512 does not provide an instruction that shuffles mask register. So I do the following way:

mask-2-simd , shuffle simd , simd-2-mask

I assume that some corner cases may be optimized. I implemented only a general case right now.

Diff Detail

Event Timeline

delena updated this revision to Diff 34325.Sep 9 2015, 6:27 AM
delena retitled this revision from to AVX-512: shuffle mask vectors <2 x i1> .. <64 x i1> .
delena updated this object.
delena added reviewers: chandlerc, Ayal, RKSimon.
delena set the repository for this revision to rL LLVM.
RKSimon edited edge metadata.Sep 9 2015, 2:26 PM

I'm surprised the patch is so small - I'd have expected more problems with computeZeroableShuffleElements and VT.getSizeInBits().

../lib/Target/X86/X86ISelLowering.cpp
10800

shullfe -> shuffle

../test/CodeGen/X86/vector-shuffle-v1.ll
3

In many other tests we've been moving away from -mcpu and using -mattr instead., is this an option here? It doesn't seem right to mix check prefixes that refer to instruction sets (AVX512F) and cpus (SKX)

delena updated this revision to Diff 34416.Sep 10 2015, 12:17 AM
delena edited edge metadata.
delena marked 2 inline comments as done.

The patch is small because I don't try to optimize the shuffles on the i1 level. I convert it to integer SIMD shuffle, where all optimizations are done.
Conversion to SIMD is a general way recommended by our architects.
I assume that some kinds of shuffles may be resolved without this conversion, but it can be done later. Now the i1 shuffles just fail on compilation.

The patch is small because I don't try to optimize the shuffles on the i1 level. I convert it to integer SIMD shuffle, where all optimizations are done.
Conversion to SIMD is a general way recommended by our architects.
I assume that some kinds of shuffles may be resolved without this conversion, but it can be done later. Now the i1 shuffles just fail on compilation.

OK - I was just surprised that we hadn't made more assumptions in the lowerVectorShuffle code (which the v*i1 types will pass through first) about data sizes etc.

Please can you add some tests that shuffle with constants (inc. an all zeroable test if possible) - after that it should be OK.

../test/CodeGen/X86/vector-shuffle-v1.ll
7

Instead of test numbers it'd be clearer if you named them after the shuffle being tested (e.g. shuffle_v2ii_10) - this matches the other vector-shuffle-*.ll test files.

delena updated this revision to Diff 34443.Sep 10 2015, 7:23 AM
delena removed rL LLVM as the repository for this revision.

I fixed some failures and added more tests with constants.

Some of the tests' codegen looks really bad (shuf64i1_zero!) - I accept this is early days but it at least needs FIXME comments explaining the problem.

../test/CodeGen/X86/vector-shuffle-v1.ll
48

missing underscore

176

extra space

206

extra space

234

extra space

259

extra space

292

extra space

The code is bad because shuffle lowering of <64 x i8> is not implemented yet.

  • Elena

What I don't understand is why shuf64i1_zero isn't just returning zero - there is no need to extend to 64xi8 in that case, computeZeroableShuffleElements should catch it. The VL_BW_DQ case is bad enough, but the AVX512F case is awful.

I'm removing the "check" for v64i1 and v32i1 for AVX-512F. These types are just illegal for the target. And nobody will use them.

../test/CodeGen/X86/vector-shuffle-v1.ll
206

Not sure I understand where the extra space. I'll remove it between <8 x i32> <i32 10,
I did not hear that we have extra spaces for tests.

Very well but there should be a test that tests shuffles with constant/zero i1 vectors - there is something going on there.

../test/CodeGen/X86/vector-shuffle-v1.ll
206

I meant there is bad spacing in the shufflevector instruction - it should be "<8 x i1> %b" instead of "< 8 x i1>%b"

delena added a subscriber: delena.Sep 13 2015, 1:26 AM

Do you mean something like this?

define i8 @shuf8i1_const() {

%c = shufflevector <8 x i1> <i1 1, i1 1, i1 0, i1 0, i1 1, i1 1, i1 0, i1 0>, <8 x i1> zeroinitializer, <8 x i32> <i32 3, i32 6, i32 1, i32 0, i32 3, i32 7, i32 7, i32 0>
%c1 = bitcast <8 x i1>%c to i8
ret i8 %c1

}

  • Elena

Do you mean something like this?

define i8 @shuf8i1_const() {

%c = shufflevector <8 x i1> <i1 1, i1 1, i1 0, i1 0, i1 1, i1 1, i1 0, i1 0>, <8 x i1> zeroinitializer, <8 x i32> <i32 3, i32 6, i32 1, i32 0, i32 3, i32 7, i32 7, i32 0>
%c1 = bitcast <8 x i1>%c to i8
ret i8 %c1

}

That would be interesting in itself although I'd hope that the DAGCombiner would successfully constant fold that test before lowering (and constant fold the bitcast too) - does it?

But what I meant was tests that trigger the computeZeroableShuffleElements early out (and then have the codegen return a zero mask vector without the insanity that was in the AVX512F shuf64i1_zero case).

And tests that shuffle a non-constant i1 vector with a constant i1 vector to check what happens to that constant input (does it get generated as a i1 vector and is then extended for shuffling or does it first appear later in the shuffle code as an extended type?).

delena updated this revision to Diff 34878.Sep 16 2015, 4:10 AM
delena set the repository for this revision to rL LLVM.

Added more tests for cases when one of the operands is

  1. a constant vector
  2. all-zero vector
  3. all-ones vector
RKSimon accepted this revision.Sep 16 2015, 6:56 AM
RKSimon edited edge metadata.

Thanks, the tests now give some indication of where code gen needs to be improved. LGTM.

This revision is now accepted and ready to land.Sep 16 2015, 6:56 AM
This revision was automatically updated to reflect the committed changes.

Simon, thanks a lot for the review.

  • Elena