Page MenuHomePhabricator

[SLP] Vectorize bit-parallel operations with SWAR.
AbandonedPublic

Authored by courbet on Jun 28 2018, 8:12 AM.

Details

Reviewers
RKSimon
ABataev
Summary

Consider the following code:

struct S {
  int32_t a;
  int32_t b;
  int64_t c;
  int32_t d;
};

S PartialCopy(const S& s) {
  S result;
  result.a = s.a;
  result.b = s.b;
  return result;
}

The two load/stores do not vectorize:

mov eax, dword ptr [rsi]
mov dword ptr [rdi], eax
mov eax, dword ptr [rsi + 4]
mov dword ptr [rdi + 4], eax
mov rax, rdi
ret

This is because the SLP vectorizer only considers 4xi32=i128 as a candidate,
because there exists such a vector register. It never considers 2xi32=i64,
because the only register that exists for this is a GPR.
However, all operations that only manipulate values as arrays of
bits (e.g. Load, Store, Bitcast, and potentially Xor/And/Or) do not
strictly require vector registers. Let's call these bit-parallel
operations.

This change lets the SLP vectorizer vectorize trees composed of only bit-parallel operations using the native GPR size.

The example above will vectorize to:

mov rax, qword ptr [rsi]
mov qword ptr [rdi], rax
mov rax, rdi
ret

For now this only handles the most trivial bit-parallel instructions (Load, Store, Bitcast), and only homogeneous types (it will not vectorize <4xi8, 1xi32>), but this can be added later.

Diff Detail

Event Timeline

courbet created this revision.Jun 28 2018, 8:12 AM
courbet edited the summary of this revision. (Show Details)Jun 28 2018, 8:19 AM
courbet added a subscriber: chandlerc.

Thanks for the pointers Sanjay ! I'll use this terminology and add a link to this in the code.

courbet updated this revision to Diff 153439.Jun 28 2018, 11:22 PM

Add pointers to SWAR, rename bit-parallel.ll to swar.ll.

courbet retitled this revision from [RFC][SLP] Vectorize bit-parallel operations with GPR. to [SLP] Vectorize bit-parallel operations with SWAR..Jun 28 2018, 11:22 PM

If we're only ever going to be using load/store + and/or/xor ops I wonder if we'd be better off doing this in the DAG alongside the LoadCombine handling? SLP is going to struggle with more general cases where the sizes of bundle elements differ.

My main interest in SWAR patterns was mainly for bitfield arithmetic cases such as PR34526 which I figured we could perform in InstCombine with some suitable overflow/demandedbits magic

  1. Do all targets support this kind of transformation? Are they aware of transformation of operations with small vectors into operations on GPR?
  2. You need to update the cost model for this kind of transformation.
  3. I think this should not the part of SLPVectorizer, looks like it is part of InstCombiner.

About the division of labor:

  1. I don't think instcombine can handle any of the cases shown here because it doesn't have the machinery to combine multiple independent values. So SLP or DAG are the options AFAIK.
  2. Instcombine could handle something like the xor example from https://bugs.llvm.org/show_bug.cgi?id=32119 , but it's probably better suited for AggressiveInstCombine because that's not a fixed pattern (we have to increase the matcher as the width of the value grows).

If we're only ever going to be using load/store + and/or/xor ops I wonder if we'd be better off doing this in the DAG alongside the LoadCombine handling? SLP is going to struggle with more general cases where the sizes of bundle elements differ.

If we're only ever going to be using load/store + and/or/xor ops I wonder if we'd be better off doing this in the DAG alongside the LoadCombine handling? SLP is going to struggle with more general cases where the sizes of bundle elements differ.

There are other advantages that we get from reusing the infrastructure of the SLP vectorizer. Besides load/stores and logicals we also get shuffles for free. Consider this code:

struct S {
  int32_t a;
  int32_t b;
  int64_t c;
  int32_t d;
};

S copy_2xi32(const S& s) {
  S result;
  result.a = s.b;
  result.b = s.a;
  return result;
}

Without the change this lowers to:

copy_2xi32(S): # @copy_2xi32(S)
  mov eax, dword ptr [rsp + 12]
  mov dword ptr [rdi], eax
  mov eax, dword ptr [rsp + 8]
  mov dword ptr [rdi + 4], eax
  mov rax, rdi
  ret

With the change this lowers to:

0000000000000000 <_Z10copy_2xi32RK1S>:
   0:	f3 0f 7e 06          	movq   (%rsi),%xmm0
   4:	66 0f 70 c0 e1       	pshufd $0xe1,%xmm0,%xmm0
   9:	66 0f d6 07          	movq   %xmm0,(%rdi)
   d:	48 89 f8             	mov    %rdi,%rax
  10:	c3                   	retq
  1. Do all targets support this kind of transformation? Are they aware of transformation of operations with small vectors into operations on GPR?

For this change I've restricted the range of operations to load/store and bitcast, which I presume is guaranteed to work efficiently on all targets. Then if the target can shuffle efficiently, the SLP vectorizer might decide to also do shuffles.

  1. You need to update the cost model for this kind of transformation.

Thanks. This is my first change to the SLP vectorizer, do you have any pointers to documentation on how to do this ?

  1. I think this should not the part of SLPVectorizer, looks like it is part of InstCombiner.

Would InstCombine be able to also do stuff like shuffles ? I like how we can leverage all that's been done in SLP to get all that functionality for free.

Currently, SLPVectorizer does not generate vector types smaller than TargetTransformInfo::getMinVectorRegisterBitWidth(). This is 128 on many targets, including x86. But that doesn't really make sense, in general; even if a target doesn't have 64-bit vector registers, it can emulate them using 128-bit vector registers. The loop vectorizer frequently takes advantage of this; the SLP vectorizer should also take advantage of this, independent of anything else.

There's also the possibility of emitting "vector" operations using GPRs. This generally makes sense; it's basically the same transform even if the available instructions are more limited. But this patch doesn't really do that: it emits IR operations using vector types. SelectionDAG legalization will generally prefer to emit vector operations to vector registers, if they're available, or just scalarize if there aren't any vector registers. There's basically one exception to that rule, which you've stumbled across; DAGCombine will transform a vector or float load+store into an integer load+store, if the loaded value doesn't have any other uses. But we shouldn't rely on that, I think; if we're doing cost modeling based on the cost of integer operations, we should explicitly emit integer operations in IR.

Currently, SLPVectorizer does not generate vector types smaller than TargetTransformInfo::getMinVectorRegisterBitWidth(). This is 128 on many targets, including x86. But that doesn't really make sense, in general; even if a target doesn't have 64-bit vector registers, it can emulate them using 128-bit vector registers. The loop vectorizer frequently takes advantage of this; the SLP vectorizer should also take advantage of this, independent of anything else.

There's also the possibility of emitting "vector" operations using GPRs. This generally makes sense; it's basically the same transform even if the available instructions are more limited. But this patch doesn't really do that: it emits IR operations using vector types. SelectionDAG legalization will generally prefer to emit vector operations to vector registers, if they're available, or just scalarize if there aren't any vector registers. There's basically one exception to that rule, which you've stumbled across; DAGCombine will transform a vector or float load+store into an integer load+store, if the loaded value doesn't have any other uses. But we shouldn't rely on that, I think; if we're doing cost modeling based on the cost of integer operations, we should explicitly emit integer operations in IR.

Thanks for the explanation; this is very useful.

courbet planned changes to this revision.Jul 2 2018, 1:40 AM

Thank you all for your comments.

So let me sum up the options from the various comments here:
A - Keep this change in the SLP vectorizer. This requires emitting GPR operations instead of vector operations, and updating the cost model.
B - Do this in DAG, inside or near to LoadCombine.

I'll explore both solutions and create a patch with (B) so that we can compare.

  1. You need to update the cost model for this kind of transformation.

Thanks. This is my first change to the SLP vectorizer, do you have any pointers to documentation on how to do this ?

Actually since I only do load/store now I think this already covers it: X86TTIImpl::getMemoryOpCost. Am I right ?

courbet updated this revision to Diff 153721.Jul 2 2018, 8:33 AM

Emit scalar values instead of vector values for SWAR.

Thank you all for your comments.

So let me sum up the options from the various comments here:
A - Keep this change in the SLP vectorizer. This requires emitting GPR operations instead of vector operations, and updating the cost model.

I've updated the change with a crude implementation.

Shuffles and extracts are disabled because we can no longer rely on the
DAG to transform shuffles and extracts into the appropriate operations.
If we want to support them, we will have to reimplement them as integer
operations.

So now that I've done this I think I understand what @efriedma was saying: another take on it is to say that (taking X86 as an example), 128 is not the smallest vector, because we can do partial load/stores.

B - Do this in DAG, inside or near to LoadCombine.

Actually I don't think the current case can be handled in the same way as MatchLoadCombine: in the case the MatchLoadCombine, the "or" instruction provides a way to link the stores together. In the case of two completely independant load/stores without anything in the middle as the two_i32 test, there is nothing linking the instructions can could provide an entry point to try to merge the instructions.

128 is not the smallest vector, because we can do partial load/stores

Essentially, yes.

Actually I don't think the current case can be handled in the same way as MatchLoadCombine: in the case the MatchLoadCombine, the "or" instruction provides a way to link the stores together.

We have code to do this sort of merging in DAGCombiner::MergeConsecutiveStores. But it misses cases like the ones in your patch because combiner-global-alias-analysis is off by default. (I don't remember the full history of that, but IIRC the compile-time penalty was too large.)

128 is not the smallest vector, because we can do partial load/stores

Essentially, yes.

Actually I don't think the current case can be handled in the same way as MatchLoadCombine: in the case the MatchLoadCombine, the "or" instruction provides a way to link the stores together.

We have code to do this sort of merging in DAGCombiner::MergeConsecutiveStores. But it misses cases like the ones in your patch because combiner-global-alias-analysis is off by default. (I don't remember the full history of that, but IIRC the compile-time penalty was too large.)

Hm actually I had a look at MergeConsecutiveStores and it can actually merge non-vector and/or heterogeneous-sized values (D52643). It won't handle my case though because it considers load/store in chain order and considers any store to be potentially aliasing the following loads:

This gets merged (the chain is load-load-store-store):

S PartialCopy(const S& s) {
  S result;
  const auto ta = s.a;
  const auto tb = s.b;
  result.a = ta;
  result.b = tb;
  return result;
}

But not this (the chain is load-store-load-store):

S PartialCopy(const S& s) {
  S result;
  result.a = s.a;
  result.b = s.b;
  return result;
}

Or did I miss something ?

Yes, like I said, your original testcase doesn't get merged by DAGCombine unless "-combiner-global-alias-analysis" is enabled (which it isn't, by default).

Yes, like I said, your original testcase doesn't get merged by DAGCombine unless "-combiner-global-alias-analysis" is enabled (which it isn't, by default).

Oh, I see, thanks. I missed the fact that using this flag will actually reorder the load/stores in the chain before entering CombineLoadStores (I was looking for analysis usage from CombineLoadStores).

courbet abandoned this revision.Oct 3 2018, 5:01 AM

The flag Eli pointer to is sufficient for my needs, so I'm going to abandon this revision for now.