Page MenuHomePhabricator

[X86] Use shuffle to widen truncate of 128-bit and smaller vectors
Needs ReviewPublic

Authored by foad on Sep 18 2020, 9:04 AM.

Details

Summary

This uses BITCAST (and maybe CONCAT_VECTORS) and VECTOR_SHUFFLE instead
of scalarizing the vector with EXTRACT_VECTOR_ELTs and rebuilding it
with BUILD_VECTOR. This generates more efficient code in the majority of
affected test cases.

The original motivation for this was to avoid some redundant AND
instructions introduced by D87502, which were due to EXTRACT_VECTOR_ELT
sometimes being optimized to a constant, which makes it harder to
optimize away the scalarize-and-then-rebuild sequence.

Diff Detail

Event Timeline

foad created this revision.Sep 18 2020, 9:04 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 18 2020, 9:04 AM
foad requested review of this revision.Sep 18 2020, 9:04 AM
foad added inline comments.Sep 18 2020, 9:17 AM
llvm/test/CodeGen/X86/avx512-vec-cmp.ll
1489

Is this a regression? It's the same number of instructions but it looks like longer encoding, and it uses the constant pool. There are some similar cases in bitcast-and-setcc-512.ll where it uses shufps instead of packssdw. I haven't looked into them yet.

llvm/test/CodeGen/X86/bitcast-and-setcc-256.ll
383

This regression (and similar ones below in this file and in bitcast-and-setcc-512.ll) are due to the recursion depth limit in ComputeNumSignBits, and the fact that the unoptimized DAG that we generate is now deeper but narrower. By contrast, scalarizing with EXTRACT_VECTOR_ELT tends to generate wide shallow DAGs.

I did try a couple of things to mitigate this, but I'm really not sure how much effort it's worth putting in to work around the arbitrary depth limit:

  1. I tried adding a special case to generate PACKSS directly instead of BITCAST+VECTOR_SHUFFLE, if the types are appropriate and enough sign bits are known
  2. I tried changing DAGTypeLegalizer::WidenVecOp_CONCAT_VECTORS to generate a VECTOR_SHUFFLE instead of "a nasty build vector" in the common case where exactly two vectors are concatenated.

Together these fixed some but not all of the psllw regressions.

Sorry, I haven't had chance to look at this yet

RKSimon added a comment.EditedSun, Sep 27, 11:32 AM

I'm not convinced this should be necessary - although it does seem to show some missed opportunities in truncateVectorWithPACK because we bail if the destination vector size < 64 bits - fixing that would avoid many of ISD::TRUNCATE cases in ReplaceNodeResults