This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] psrl(w/d/q) and psll(w/d/q) bit shifts for SSE2
ClosedPublic

Authored by RKSimon on Dec 13 2014, 8:09 AM.

Details

Summary

Patch to match cases where shuffle masks can be reduced to bit shifts. Similar to byte shift shuffle matching from D5699.

For integer vector shuffles where lanes are being moved to the left/right in short groups and zeros are being inserted. Each integer type can be shifted safely using any wider type (so i8 -> i16/i32/i64, i16 -> i32/i64, i32 -> i64).

I have an upcoming patch that will fix the combine-or.ll domain mismatch.

I kept to just providing the immediate versions of the SSE2 logical bit shifts - there may be a case for adding support for AVX2 per-lane shifts but I don't have the hardware to test this.

Theoretically I think in the future this could be generalised (endian fixes) and moved to DAGCombine?

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon updated this revision to Diff 17256.Dec 13 2014, 8:09 AM
RKSimon retitled this revision from to [X86][SSE] psrl(w/d/q) and psll(w/d/q) bit shifts for SSE2.
RKSimon updated this object.
RKSimon edited the test plan for this revision. (Show Details)
RKSimon added reviewers: chandlerc, qcolombet, andreadb.
RKSimon set the repository for this revision to rL LLVM.
RKSimon added a subscriber: Unknown Object (MLST).
chandlerc added inline comments.Dec 15 2014, 10:56 AM
lib/Target/X86/X86ISelLowering.cpp
7762–7777

Is this clang-formatted? I would have expected a slightly different layout, but I don't want to quibble with whatever clang-format chooses.

7779–7790

Have you checkde whether we already have such a predicate? I thought we did, but maybe this is sligthly different. Either way, a comment clarifying the exact thing it is checking would be really helpful I think. At the very least, the argument set is quite surprising.

Also, for lambdas, I've been consistently naming them as variables, and so IsSequential.

7794

Does a range based for loop not work here?

7795

Please use early exit patterns to reduce indentation, and test the inverse and continue.

7797–7798

I would prefer to just use integers to represent numbers unless you explicitly want modular arithmetic (here and throughout this patch)

7800

I think it would be really helpful to actually comment on the algorithm you're using to search for bit-shift equivalent shuffle masks. It makes it a bit easier to check that the code is actually doing what is intended.

7814–7815

Rather than use variables, maybe sink this into the predicate function so that you can early-exit from the loops?

7819

80-columns.

Thanks Chandler - I'm creating an updated patch now.

lib/Target/X86/X86ISelLowering.cpp
7762–7777

I tried that but I thought clang-format's indentation was rather extreme - it pushes all the entries to the far right (all that whitespace!). But to avoid confusion I'll use it again.

7779–7790

Turns out there is a usable predicate (isSequentialOrUndefInRange) and whatever reason I didn't use it originally in lowerVectorShuffleAsByteShift was false - that patch went through a lot of edits. I've changed it for both lowerVectorShuffleAsByteShift and lowerVectorShuffleAsBitShift.

7794

Not sure what you mean here - have you got any examples?

7814–7815

This doesn't really work now that I can remove the predicated and use isSequentialOrUndefInRange directly.

RKSimon updated this revision to Diff 17407.Dec 17 2014, 11:04 AM

Updated patch based on Chandler's comments.

Replaced the custom isSequential predicate with isSequentialOrUndefInRange in both lowerVectorShuffleAsByteShift and lowerVectorShuffleAsBitShift.

RKSimon updated this revision to Diff 17474.Dec 18 2014, 2:44 PM

Updated + rebased shuffle -> bit shift lowering patch based on feedback.

Chandler - are there any outstanding issues you are concerned with please?

RKSimon updated this revision to Diff 17775.Jan 4 2015, 7:06 AM

Rebased - I've covered all of Chandler's comments now.

Quentin/Andrea - do you have any additional comments please?

chandlerc edited edge metadata.Jan 4 2015, 4:03 PM

Really sorry Simon, I lost track of this over the holidays as I became distracted with SROA stuff.

lib/Target/X86/X86ISelLowering.cpp
7709–7713

Please go ahead and commit these bits and the comment update as a separate change. This change should focus on adding bit-shift variants.

7779–7781

The first thing I think needs to be re-worked here is to change how you're computing the types to work with.

The table solution with a simple continue is really bad IMO. I had to spend some time thinking to see the best way to do this.

At first, I thought the best approach would be to build a function or an actual map from VT to the range of possible other vector types that we could use in combination with a shift. Then the outer loop would just be over the specific candidate types.

But then it occurred to me that you don't need to hard code any of these things.

Given an integer VT with N elements and M bits, you can just divid N by 2 and multiply M by 2 on each iteration as long as N > 1 and M <= 64. That will walk over all possible re-slicings of the vector type.

For x86 where the full permutation of types are legal, I would just add an assert that the resulting vector type is legal. This completely removes the need for a table or SSE2 vs AVX2 distinctions. And it leaves you with a single loop with no continues.

7787–7790

[IGNORE THIS: phabricator won't let me delete the comment]

7798–7813

You can merge all of these tests into a single predicate function which has an outer loop containing an inner loop and two calls to isSequentialOrUndefInRange.

Then I suspect you can merge this predicate with the Left variant by providing it as a parameter which half to examine [Scale - Shift, Scale) for right, and [0, Shift) for left. You might need one other parameter, but still, I think its worth folding them.

By putting them into a predicate function, you don't need 3 boolean variables, and you can early exit as soon as you determine "no".

7816–7821

Once you make everything a nice re-usable predicate for testing both right and left shifts, you can write this code once and just have a conditional to select which instruction (right or left) is used.

RKSimon updated this revision to Diff 17861.Jan 7 2015, 8:35 AM
RKSimon edited edge metadata.

Merged left / right shift matching code into single predicate - I'm a little dubious as to whether the predicate is actually needed at this point....

Improved some byte shift tests as they were matching against bit shifts instead and we were losing test coverage.

qcolombet edited edge metadata.Jan 21 2015, 10:51 AM

Hi Simon,

Chandler did all the work of the review so I’ll leave it the final LGTM.
Here are my comments.

Thanks for your patience.
-Quentin

test/CodeGen/X86/vec_insert-5.ll
68 ↗(On Diff #17861)

I guess you modified the input IR to have additional coverage.
If that is the case, I believe a new test case with that input would be preferable, i.e., having both tests.

77 ↗(On Diff #17861)

Same here.

RKSimon updated this revision to Diff 18600.Jan 22 2015, 4:24 AM
RKSimon edited edge metadata.

Thanks Quentin - I've added the extra tests to vec_insert-5.ll - the existing tests now match the bit shifts and the (readded) modified old tests match the byte shifts again.

Chandler / Quentin - are you happy with the changes?

Sorry I've not gotten back to this. I'm going to try to sweep through all
the outstanding vector shuffle reviews tomorrow.

This is much better, but I still think there is a simpler way to write the inner loops. Trying to explain this with prose isn't helping tjough, so I'll try to patch and modify it. If it works I can submit the hybrid result, and if not I'll understand much better why not. =]

It'd be great if you want to have a go at improving this - it does reek of being able to be reduced further but so far I've failed to find the right predicate magic.

The only other thing that I've thought of is adding all_of / any_of / none_of range tests to the (Small)BitVector classes to tidyup those zeroable tests, which are becoming more common in the X86 shuffle code now.

BTW - I've just put up a patch (D7256) which fixes the combine-or.ll issue in conjunction with this patch.

RKSimon updated this revision to Diff 19097.Jan 31 2015, 10:39 AM

I've had another go at this, and hope I''ve cracked it this time. I''ve refactored the predicate to actually create the bit shift node (+ surrounding casts) instead of just reporting whether a left/right shift is possible.

RKSimon updated this revision to Diff 19223.Feb 3 2015, 7:43 AM

Moved AVX2 bitshift matching after VPSHUFD detection to stop loss of a foldable shuffle

RKSimon updated this revision to Diff 19257.Feb 3 2015, 12:28 PM

rebased + updated tests

chandlerc accepted this revision.Feb 3 2015, 12:51 PM
chandlerc edited edge metadata.

I think this code is pretty nice.

I have a minor simplification of it, but I'm happy to just apply that in a follow-up patch. Go ahead and submit this one as-is.

This revision is now accepted and ready to land.Feb 3 2015, 12:51 PM

Are you working on committing this Simon? If not, I'd like to commit it for
you so that I don't then cause you a bunch of merge conflicts when updating
test cases as i flip defaults around on flags.

Thanks Chandler, I'll get this committed shortly - just got distracted with the release build compile warnings from my other commits tonight.

This revision was automatically updated to reflect the committed changes.