This is an archive of the discontinued LLVM Phabricator instance.

[X86] Fix EXTRACT_VECTOR_ELT with variable index.
ClosedPublic

Authored by igorb on Feb 7 2017, 3:16 PM.

Details

Summary

Fix EXTRACT_VECTOR_ELT with variable index from v32i16 and v64i8 vector.
Its more profitable to goes through memory (1 cycles throughput)
than use VMOVD + VPERMV/PSHUFB sequence ( 2/3 cycles throughput) to implement EXTRACT_VECTOR_ELT with variable index.
IACA tool was used to get performace estimation (https://software.intel.com/en-us/articles/intel-architecture-code-analyzer)
For example for var_shuffle_v16i8_v16i8_xxxxxxxxxxxxxxxx_i8 test from vector-shuffle-variable-128.ll I get 26 cycles vs 79 cycles.
Tests on real hardware get similar results.

Removing the VINSERT node , we don't need it any more.

First patch to fix Bug 31731

Diff Detail

Repository
rL LLVM

Event Timeline

igorb created this revision.Feb 7 2017, 3:16 PM
zvi edited edge metadata.Feb 7 2017, 11:53 PM

Does it make sense to split this into 2 patches?

  1. Replace X86ISD::VINSERT with BUILD_VECTOR
  2. Implement missing functionality for variable-index extractelt?
lib/Target/X86/X86ISelLowering.cpp
13658 ↗(On Diff #87532)

Make this a static function or consider making a member of X86Subtarget

igorb updated this revision to Diff 87639.Feb 8 2017, 5:39 AM
igorb retitled this revision from [AVX512] Fix EXTRACT_VECTOR_ELT for v2i1/v4i1/v32i1/v64i1 with variable index. to [AVX512] Improve EXTRACT_VECTOR_ELT with variable index. .
igorb edited the summary of this revision. (Show Details)
igorb marked an inline comment as done.
delena added inline comments.Feb 8 2017, 6:14 AM
test/CodeGen/X86/avx512-insert-extract.ll
1463 ↗(On Diff #87639)

You can use VPERMQ here.
Permute qwords in ymm3/m256/m64bcst using indexes in ymm2 and store the result in ymm1.
VPERMQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst

1595 ↗(On Diff #87639)

varible -> variable

1609 ↗(On Diff #87639)

You don't need stack. Use VPEMD, for the both KNL and SKX.

1719 ↗(On Diff #87639)

Please use VPSHUFB instead of stack.

RKSimon added a subscriber: RKSimon.Feb 8 2017, 6:36 AM
RKSimon added inline comments.
lib/Target/X86/X86ISelLowering.cpp
13658 ↗(On Diff #87639)

// VPERMD/PS

13664 ↗(On Diff #87639)

// VPERMW

igorb updated this revision to Diff 87661.Feb 8 2017, 8:07 AM
igorb marked 6 inline comments as done.
igorb added a reviewer: RKSimon.

Fix according to comments ,
Thanks!

RKSimon edited edge metadata.Feb 8 2017, 8:58 AM

Add the new tests to trunk now to show current codegen?

lib/Target/X86/X86ISelLowering.cpp
13698 ↗(On Diff #87661)

SmallVector<SDValue, 8> Ops(NumElts, DAG.getConstant(0, dl, MaskEltVT));

26982 ↗(On Diff #87661)

Haven't you've lost the hasVLX() check by using isVPERMVsupported?

zvi added a comment.Feb 8 2017, 10:16 PM

Add the new tests to trunk now to show current codegen?

According to pr31731 some cases should miscompile on ToT

delena added inline comments.Feb 8 2017, 11:03 PM
test/CodeGen/X86/avx512-insert-extract.ll
1505 ↗(On Diff #87661)

vmovslq + vmovq can be replaced with vmovd

1520 ↗(On Diff #87661)

It should be vpermpd.

1693 ↗(On Diff #87661)

I propose to extend v8i16 to v8i32 and use vpermd.

1723 ↗(On Diff #87661)

Again, you can extend to v16i32.

1780 ↗(On Diff #87661)

The result will be wrong. If you have '1' in MSB, the destination will be zeroed. You should mask the %index with 0xF:
and $15, %edi
vmovd %edi, %xmm

1840 ↗(On Diff #87661)

You can extend to v32i16 and use VPERMW

igorb updated this revision to Diff 87793.Feb 9 2017, 4:36 AM
igorb marked 2 inline comments as done.
igorb edited the summary of this revision. (Show Details)
igorb added inline comments.Feb 9 2017, 4:38 AM
lib/Target/X86/X86ISelLowering.cpp
26982 ↗(On Diff #87661)

I added patterns to TD file to handle 128/256 bit vector in case NoVLX using 512bit.
I think the check here only to insure VPERMV can be used.

test/CodeGen/X86/avx512-insert-extract.ll
1520 ↗(On Diff #87661)

Thanks!

1693 ↗(On Diff #87661)

Thanks Elena,
I prefer to handle all cases that can be extend in separate path.

1780 ↗(On Diff #87661)

Max 5 bits are used , according to LLVM doc "If idx exceeds the length of val, the results are undefined."

igorb added a comment.Feb 9 2017, 4:39 AM

Add the new tests to trunk now to show current codegen?

Done,
Thanks.

RKSimon added inline comments.Feb 9 2017, 9:48 AM
lib/Target/X86/X86InstrAVX512.td
4844 ↗(On Diff #87793)

Very minor but is it worth renaming this if its not just being used for shifts/rotates?

igorb updated this revision to Diff 89090.Feb 20 2017, 1:55 AM
igorb retitled this revision from [AVX512] Improve EXTRACT_VECTOR_ELT with variable index. to [AVX512] Fix EXTRACT_VECTOR_ELT with variable index. .
igorb edited the summary of this revision. (Show Details)
delena accepted this revision.Feb 20 2017, 2:50 AM

It's not AVX-512 specific any more, please re-title the review.
Please specify that you are removing the VINSERT together with the code that using it and we don't need this node any more.

LGTM after the minor changes in the comments.

lib/Target/X86/X86ISelLowering.cpp
13779 ↗(On Diff #89090)

It's more profitable to go ..
than using ..

This revision is now accepted and ready to land.Feb 20 2017, 2:50 AM
igorb retitled this revision from [AVX512] Fix EXTRACT_VECTOR_ELT with variable index. to [X86] Fix EXTRACT_VECTOR_ELT with variable index. .Feb 20 2017, 4:27 AM
igorb edited the summary of this revision. (Show Details)
zvi accepted this revision.Feb 20 2017, 5:15 AM

LGTM with a minor comment

lib/Target/X86/X86ISelLowering.cpp
13782 ↗(On Diff #89090)

Would be nice if you added in the comment the two generated code sequences you compared in IACA and their IACA analysis.

This revision was automatically updated to reflect the committed changes.