This is an archive of the discontinued LLVM Phabricator instance.

[AVX-512] Add support for commuting VPERMT2(B/W/D/Q/PS/PD) to/from VPERMI2(B/W/D/Q/PS/PD).
ClosedPublic

Authored by craig.topper on Oct 15 2016, 10:47 PM.

Details

Summary

The index and one of the table operands can be swapped by changing the opcode to the other version. Neither of these operands are the one that can load from memory so this can't be used to increase memory folding opportunities.

We need to handle the unmasked forms and the kz forms. Since the load operand isn't being commuted we can commute the load and broadcast instructions too.

Preprocessor macros are used to reduce the number of lines in the switches, but may have sacrificed some readability.

Diff Detail

Repository
rL LLVM

Event Timeline

craig.topper retitled this revision from to [AVX-512] Add support for commuting VPERMT2(B/W/D/Q/PS/PD) to/from VPERMI2(B/W/D/Q/PS/PD)..
craig.topper updated this object.
craig.topper added reviewers: delena, igorb, RKSimon.
craig.topper added a subscriber: llvm-commits.
RKSimon edited edge metadata.Oct 18 2016, 7:37 AM

This is pretty bulky - worth pulling out into a helper function like was done for getFMA3OpcodeToCommuteOperands ?

craig.topper edited edge metadata.

Move code into separate functions per Simon's suggestion.

RKSimon accepted this revision.Oct 19 2016, 4:32 AM
RKSimon edited edge metadata.

LGTM although it'd be better if you could test vpermi2 -> vpermt2 and some of the memory folding versions of the instruction as well.

This revision is now accepted and ready to land.Oct 19 2016, 4:32 AM
craig.topper edited edge metadata.

Turns out we weren't properly commuting masked instructions due to findCommutedOpIndices miscalculating for these instructions.

I've also added an additional new test(avx512-vpermv3-commute.ll) to cover commuting these instructions explicitly. Version without commuting with extra moves was commited in r284808. So this patch just shows the moves being removed due to commuting.

craig.topper requested a review of this revision.Oct 20 2016, 11:10 PM
craig.topper edited edge metadata.
delena edited edge metadata.Oct 24 2016, 11:39 AM

VPERMI2Q kills register with indices. I thought that VPERMT2Q is better for loop vectorizer, since we may want to repeat the same shuffle on multiple data.

Hopefully the passes that would ask for commuting would make an intelligent decision about which register is best to kill.

Hopefully the passes that would ask for commuting would make an intelligent decision about which register is best to kill.

I don't completely trust passes' intelligence. I'd prefer to keep the "vpermt" form by default.

Do you know of cases where passes are doing the wrong thing or is this paranoia?

Do you know of cases where passes are doing the wrong thing or is this paranoia?

I don't want to base any assumption to my paranoia. I'm adding Ayal and Farhana, who is working on optimization of interleaved memory accesses.
The question is what instruction is better by default VPERMT or VPERMI. I initially thought, that VPERMT is better inside loops, since VPERMI kills register with indices.
Using VPERMI will require reloading indices again and again for each iteration.
Ayal, Farhana, what do you think?

Commuting these does appear to be consistent with ICC behavior.

This still LGTM - has any reason been found not to trust the commutation logic?

RKSimon accepted this revision.Nov 21 2016, 2:08 AM
RKSimon edited edge metadata.

Context missing but still LGTM. Note - I added extra VBMI lowering VPERMI2B/VPERMT2B lowering/combines recently, they should add extra diffs to this patch.

This revision is now accepted and ready to land.Nov 21 2016, 2:08 AM
This revision was automatically updated to reflect the committed changes.