Page MenuHomePhabricator

[PowerPC] Canonicalize shuffles to match more single-instruction masks on LE

Authored by nemanjai on Apr 3 2020, 6:18 PM.



We currently miss a number of opportunities to emit single-instruction VMRG[LH][BHW] instructions for shuffles on little endian subtargets. Although this in itself is not a huge performance opportunity since loading the permute vector for a VPERM can always be pulled out of loops, producing such merge instructions is useful to downstream optimizations.
Since VPERM is essentially opaque to all subsequent optimizations, we want to avoid it as much as possible. Other permute instructions have semantics that can be reasoned about much more easily in later optimizations.

This patch does the following:

  • Canonicalize shuffles so that the first element comes from the first vector (since that's what most of the mask matching functions want)
  • Switch the elements that come from splat vectors so that they match the corresponding elements from the other vector (to allow for merges)
  • Adds debugging messages for when a shuffle is matched to a VPERM so that anyone interested in improving this further can get the info for their code

Diff Detail

Event Timeline

nemanjai created this revision.Apr 3 2020, 6:18 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 3 2020, 6:18 PM
nemanjai updated this revision to Diff 254978.Apr 3 2020, 6:52 PM

Updated to reflect changes in the pre-committed new test case.

amyk added a subscriber: amyk.Apr 19 2020, 11:30 PM
amyk added inline comments.

I see that our other combine functions have an assert in the beginning checking the opcode. It might make sense to have one here checking SVN->getOpcode() == ISD::VECTOR_SHUFFLE?


Is it possible to add a comment regarding the splats here? Or is the comment above supposed to explain this bit, too?



nemanjai marked an inline comment as done.Apr 24 2020, 1:48 PM
nemanjai added inline comments.

The reason for the asserts in other combines is that they do not take a specialized node. This takes a ShuffleVectorSDNode pointer. It is not possible for a shuffle vector node to have an opcode other than vector shuffle.

nemanjai updated this revision to Diff 260001.Apr 24 2020, 2:49 PM

Add handling for shuffles that are fed by SCALAR_TO_VECTOR to avoid the swap before the shuffle.

lei added a subscriber: lei.May 5 2020, 11:24 AM
lei added inline comments.

Can this and the for-loop below be a range based for-loop?

for (auto FirstOp : Op->op_values())
  if (!FirstOp.isUndef())

Maybe consider using existing naming used for scalar and vector ISD nodes: SCALAR_TO_VEC_PERMUTED

RolandF added inline comments.

I don't think it is safe to use a load and splat for a smaller than word sized load. The larger load might now cross a page boundary.


This code is only used on the early, conditional returns, and not on the final return. The function could be refactored such that the final return case is moved first, or the early returns could return the result of a function that generates the instruction, to prevent unused code.

RolandF added inline comments.May 15 2020, 12:40 PM

Looks like the code got worse here?


This code is also worse.


Okay it looks like there are a number of similar cases where the code is slightly worse. I will stop flagging them individually.

RolandF added inline comments.May 19 2020, 10:20 AM

Can we rename this opcode? S and V are too short to have meaning and permute is too general. Maybe SCALAR_TO_VECTOR_BE or _RIGHT or _UPPER or something?

NeHuang added inline comments.

nit: Is it possible to combine line 14068 and 14066 into one if check since the two operations are same?


nit: Little -> little

steven.zhang added inline comments.

Can we use llvm::any_of to make the code more compact ?

nemanjai marked 5 inline comments as done.Jun 8 2020, 7:43 PM
nemanjai added inline comments.

I think this code is a bit confusing because of the <=. I actually don't really know why I used <= rather than == since the memory value type can never be narrower than the splat width since we are checking for normal (unindexed, non-extending) loads.


How? I need to find the first operand that is *not* undef.


But then FirstOp is out of scope after the loop and we need it.


I am not sure this would be an improvement. The early exit conditions are the common case. The final return is only for the pattern:
(<n x Ty> (scalar_to_vector (Ty (extract_elt <n x Ty> %a, C)))).

I'll refactor it as per your suggestion and if it is more readable we'll go with that.
Unused nodes aren't really a problem - the SDAG will just get rid of them, but I agree that it is less than ideal to create nodes that will just be discarded.


SCALAR_TO_VECTOR_BE is close but the problem is that SCALAR_TO_VECTOR assumes the scalar is placed into vector element zero. However, (almost) none of the instructions we have actually put the scalar into vector element zero for either endianness. The value always goes to the least significant portion of the most significant doubleword. It is very hard to encapsulate that level of weirdness in a name.

I think I will opt for either SCALAR_TO_VECTOR_PERMUTED or SCALAR_TO_VECTOR_PPC with a more detailed comments:

/// PowerPC instructions that have SCALAR_TO_VECTOR semantics tend to
/// place the value into the least significant element of the most significant
/// doubleword in the vector. This is not element zero for anything smaller
/// than a doubleword on either endianness. This node has the same semantics
/// as SCALAR_TO_VECTOR except that the value remains in the
/// aforementioned location in the vector register.

If you have a preference for either of those, please let me know.

anil9 added a subscriber: anil9.Jun 9 2020, 12:01 AM
anil9 added inline comments.

Indentation discrepancy.


Indentation discrepancy.


Same as before.

nemanjai updated this revision to Diff 269454.Jun 9 2020, 2:28 AM
  • Fix some cases where code got worse
  • Refactor some code for mode readability
  • Fix some naming and comments
steven.zhang added inline comments.Jun 15 2020, 10:45 PM

Ah, sorry. Maybe, llvm::find_if_not is the right one. But it seems not help too much for this case. So, it depends on you.

RolandF accepted this revision.Jun 16 2020, 7:37 AM


This revision is now accepted and ready to land.Jun 16 2020, 7:37 AM
This revision was automatically updated to reflect the committed changes.

Hi! This patch causes a crash when compiling the Linux kernel with certain distribution configs:

$ git bisect log
# bad: [b885b1b92d3d9bfcc8229476f4ad4660305d6160] [mlir] Fix gcc build break due to previous commit
# good: [158e734af19d6be206f80c213a028b569c441b24] [ARM] Adjust AND/OR combines to not call isConstantSplat on i1 vectors. NFC.
git bisect start 'b885b1b92d3d9bfcc8229476f4ad4660305d6160' '158e734af19d6be206f80c213a028b569c441b24'
# good: [fcd67665a8de61223313e1e1582faf17d9ee76b8] [StackSafety] Add "Must Live" logic
git bisect good fcd67665a8de61223313e1e1582faf17d9ee76b8
# bad: [d3b752845df0331348dad48000fc8b82afb3de5b] [clang][test][NFC] Also test for serialization in AST dump tests, part 1/n.
git bisect bad d3b752845df0331348dad48000fc8b82afb3de5b
# bad: [bb480056602daab86fbcd6aac5c6bc92ce350bb3] [NFC] Make AST_BLOCK_HASH test more robust with downstream changes
git bisect bad bb480056602daab86fbcd6aac5c6bc92ce350bb3
# bad: [d938ec4509c47d461377527fc2877ae14b91275c] [AArch64] Avoid incompatibility between SLSBLR mitigation and BTI codegen.
git bisect bad d938ec4509c47d461377527fc2877ae14b91275c
# good: [9ca50e887db7f903c04a90593d2beed8a96794f1] [libTooling] Add parser for string representation of `RangeSelector`.
git bisect good 9ca50e887db7f903c04a90593d2beed8a96794f1
# bad: [1fed131660b2c5d3ea7007e273a7a5da80699445] [PowerPC] Canonicalize shuffles to match more single-instruction masks on LE
git bisect bad 1fed131660b2c5d3ea7007e273a7a5da80699445
# good: [9c9b71a2908d47ebd65cb7c2e1e499484aaa547e] [gn build] Port 9ca50e887db
git bisect good 9c9b71a2908d47ebd65cb7c2e1e499484aaa547e
# good: [8f3b2c8aa3175628128a32a6bcaecc67efd03514] AMDGPU/GlobalISel: Remove selection of MAD/MAC when not available
git bisect good 8f3b2c8aa3175628128a32a6bcaecc67efd03514
# first bad commit: [1fed131660b2c5d3ea7007e273a7a5da80699445] [PowerPC] Canonicalize shuffles to match more single-instruction masks on LE

To reproduce with the Linux kernel source tree:

$ mkdir -p out/ppc64le

$ curl -LSso out/ppc64le/.config ''

$ make -skj"$(nproc)" ARCH=powerpc CROSS_COMPILE=powerpc64le-linux-gnu- LD=powerpc64le-linux-gnu-ld LLVM=1 O=out/ppc64le OBJDUMP=powerpc64le-linux-gnu-objdump olddefconfig drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn20/display_mode_vba_20.o
fatal error: error in backend: Cannot select: t146: v16i8 = PPCISD::SCALAR_TO_VECTOR_PERMUTED t110, drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn20/display_mode_vba_20.c:2855:3 @[ drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn20/display_mode_vba_20.c:241:2 ]
  t110: i32 = any_extend t32, drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn20/display_mode_vba_20.c:2855:3 @[ drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn20/display_mode_vba_20.c:241:2 ]
    t32: i1,ch = CopyFromReg t0, Register:i1 %31, drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn20/display_mode_vba_20.c:2855:3 @[ drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn20/display_mode_vba_20.c:241:2 ]
      t31: i1 = Register %31
In function: dml20_recalculate
clang-11: error: clang frontend command failed with exit code 70 (use -v to see invocation)
ClangBuiltLinux clang version 11.0.0 ( b885b1b92d3d9bfcc8229476f4ad4660305d6160)

cvise spits out (not cleaned up or anything):

enum { a, b } c;
e() {
  _Bool f = a;
  for (; d; ++d)
    if (c)
      f = b;
  if (f)

Interestingness test and original preprocessed file available here:

Gentle ping. This is still an issue at 4c6548222b3c41d024581d28f42b3f02510bcfe3 and I have not heard anything about a fix. I know I sent that message on Friday night before the weekend so I am hoping that it did not get lost. If I need to file a bug report, I am more than happy to do so.

Gentle ping. This is still an issue at 4c6548222b3c41d024581d28f42b3f02510bcfe3 and I have not heard anything about a fix. I know I sent that message on Friday night before the weekend so I am hoping that it did not get lost. If I need to file a bug report, I am more than happy to do so.

I am really sorry I missed this message. Should be fixed in

@nemanjai thanks for the fix, I can confirm all of my PowerPC builds are successful with LLVM at 4772b99dffec4f87bb7bc9273495066058ac0186!

MaskRay added a subscriber: MaskRay.EditedTue, Jul 21, 5:33 PM

Our internal testing finds a correctness bug related to this patch (after the last known fix eafe7c14ea38946e8c1fb64d548effaee7614718)

The code is at

With the following patch it can build with -mcpu=pwr9 -O0. When I feed the executable dimsum_fuzz itself to dimsum_fuzz, it triggers __builtin_trap because hmin (horizontal min) on typedef signed char simd_char __attribute__((vector_size__(8))); computed result is different from a naive algorithm. I'll inform you when I can get a minimized reproduce or in the meantime if you can find problem that'd also be nice:)

diff --git i/ w/
index fafd1fa..b7188fd 100644
--- i/
+++ w/
@@ -532 +532,6 @@ extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
+int main() {
+  const long data[] = {0x00010102464c457f, 0x0000000000000000};
+  LLVMFuzzerTestOneInput((const uint8_t*)data, 16);
diff --git i/simd_vsx.h w/simd_vsx.h
index 5c59e2a..15a32fe 100644
--- i/simd_vsx.h
+++ w/simd_vsx.h
@@ -23,2 +23,6 @@
+inline __vector long vec_abs(__vector long a) {
+  return (__vector long)vec_abs((__vector long long)a);
 namespace dimsum {

cgdb session

2004x typename std::enable_if<(_SimdType::size() > 1 && is_simd<_SimdType>::value &&
2005x                          __floor_pow_of_2(_SimdType::size()) ==
2006x                              _SimdType::size()),
2007x                         typename _SimdType::value_type>::type
2008x __hmin(const _SimdType& __v) {
2009x   auto __arr = split_by<2>(__v);
2010x   return __hmin(min(__arr[0], __arr[1]));                                                                                                       
2011x }
2013x template <class _SimdType>
2014x typename std::enable_if<(_SimdType::size() > 1 &&
2015x                          !(is_simd<_SimdType>::value &&
2016x                            __floor_pow_of_2(_SimdType::size()) ==
t third_party/dimsum/simd.h:2079
0x0000000010083c40 in std::experimental::__hmin<std::experimental::simd<signed char, std::experimental::__simd_abi<(std::experimental::_StorageKind)2,
 16> > > (__v=...) at third_party/dimsum/simd.h:2010
Value returned is $2 = <incomplete type>
(gdb) s
std::experimental::__hmin<std::experimental::simd<signed char, std::experimental::__simd_abi<(std::experimental::_StorageKind)2, 8> > > (__v=...) at t
(gdb) n
(gdb) x/2xg &__v
0x7fffffffed88: 0x0000000000000000      0x00010102464c457f
(gdb) x/2xg &__arr
0x7fffffffecf8: 0x00007fff00000000      0x00007fffffffed88
(gdb) p sizeof(__arr)            # under the hood this is a signed char vector_size(8) type
$3 = 8


142x template <typename T>
143x void TestHMin(const uint8_t* data) {
144x   NativeSimd<T> simd;
145x   LoadFromRaw(data, &simd);
146x   if (!IsNormal(simd)) return;
147x   if (dimsum::simulated::hmin(simd) != dimsum::hmin(simd))     ######   dimsum::hmin(simd) is wrong                                                                                
148x     __builtin_trap();
149x }

Fixed in
I'll wait for it to go through the bots and then I'll open a PR to ask for a merge into 11.0. Thanks for reporting this and providing the repro.