
[X86][AVX512] lowering shuffle i/f intrinsic - llvm part

Authored by jina.nahias on Oct 8 2017, 12:24 AM.

Diff Detail


Event Timeline

jina.nahias created this revision.Oct 8 2017, 12:24 AM
craig.topper added inline comments.Oct 8 2017, 1:27 AM
1284 ↗(On Diff #118150)

Can you write this more like the code below? I think it's clearer.

void decodeVSHUF64x2FamilyMask(MVT VT, unsigned Imm,
                               SmallVectorImpl<int> &ShuffleMask) {
  unsigned NumLanes = VT.getSizeInBits() / 128;
  unsigned NumElementsInLane = 128 / VT.getScalarSizeInBits();
  unsigned ControlBitsMask = NumLanes - 1;
  unsigned NumControlBits  = NumLanes / 2;

  for (unsigned l = 0; l != NumLanes; ++l) {
    unsigned LaneMask = (Imm >> (l * NumControlBits)) & ControlBitsMask;
    // We actually need the other source.
    if (l >= NumLanes / 2)
      LaneMask += NumLanes;
    for (unsigned i = 0; i != NumElementsInLane; ++i)
      ShuffleMask.push_back(LaneMask * NumElementsInLane + i);
  }
}

29084 ↗(On Diff #118150)

You should do this in combineBitcastForMaskedOp. It was already created to fix shuffles to work with masking.

RKSimon edited edge metadata.Oct 8 2017, 3:16 AM

Please can you add suitable -fast-isel.ll tests that match the code generated on the clang side in the avx512*-builtins.c tests.

Also, it's really pedantic but please can you prefix your patch subject titles with "[X86][AVX512]" - it makes it easier to spot them in all the llvm-commits/cfe-commits emails.

jina.nahias retitled this revision from lowering shuffle i/f intrinsic - llvm part to [X86][AVX512] lowering shuffle i/f intrinsic - llvm part.Oct 8 2017, 5:51 AM
jina.nahias edited the summary of this revision. (Show Details)Oct 9 2017, 12:14 AM
jina.nahias added inline comments.Oct 9 2017, 5:27 AM
29084 ↗(On Diff #118150)

This function is called only if the select operand is a bitcast, but the 64x2 intrinsics have no bitcast, so this function won't be called.

Changing AutoUpgrade.cpp and adding fast-isel tests.

craig.topper edited edge metadata.Oct 9 2017, 9:37 AM

For the 64-bit case it's probably still better to do the conversion from combineSelect, similar to what we do for the bitcasted case. It's somewhat unusual to walk the users in DAG combine.

Though it might make sense to turn all X86ISD::VPERM2X128 that we can into SHUF128 regardless of it being masked. Assuming they have the same latency and port bindings. This would open up the XMM16-XMM31 register space for them.

I believe the X86ISelLowering.cpp changes are no longer necessary after r317403 and r317410. I've taught lowering to prefer the EVEX encoded SHUF instructions by default and then EVEX->VEX will turn them back into VPERM2F128 if they don't end up being masked, using broadcast loads, or using XMM16-31.

I think there's still a lost load folding opportunity since the SHUF instructions aren't commutable like the VPERM instructions. We can probably recover that with some isel patterns if we want.

The X86ISelLowering.cpp changes are no longer necessary; these are the changes without them. @craig.topper

lsaba added inline comments.Nov 7 2017, 7:41 AM
7 ↗(On Diff #121708)

The first 3 tests should be corrected in the same way as noted in the comments on the Clang part.

lsaba accepted this revision.Nov 7 2017, 8:16 AM
This revision is now accepted and ready to land.Nov 7 2017, 8:16 AM
This revision was automatically updated to reflect the committed changes.