This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Merge reorder and reuse shuffles.
ClosedPublic

Authored by ABataev on Dec 4 2020, 9:57 AM.

Details

Summary

It is possible to merge reuse and reorder shuffles and reduce the total
cost of the ivectorization tree/number of final instructions.

Diff Detail

Unit TestsFailed

Event Timeline

ABataev created this revision.Dec 4 2020, 9:57 AM
ABataev requested review of this revision.Dec 4 2020, 9:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 4 2020, 9:57 AM

The same work will be done by instcombine, so we can just zero redundant cost here to gain the same effect globally. But it makes sense to make shuffle merging at the vectorization stage as well, of course.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3510
3513

Here += looks confusing, we can just set DeadCost = ..., since DeadCost == 0.

4348

This doesn't change anything but it looks preferable for IsFinalized to be true after finalization even for empty Mask.

4634

Why do we ignore ReuseShuffleIndices here?

RKSimon added inline comments.Dec 4 2020, 2:40 PM
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
12

These align changes look like they should already be there - regenerate the base test file and then rebase?

ABataev added inline comments.Dec 7 2020, 6:27 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3510

I don't get it. What do you want to see here?

3513

No, DeadCost is not 0, it is ReuseShuffleCost

4634

Because stores do not have uses, so this is actually just a dead code.

ABataev updated this revision to Diff 309900.Dec 7 2020, 6:32 AM

Updated and rebased

anton-afanasyev added inline comments.Dec 7 2020, 6:37 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3510

I suggest to join code below (starting from if (!NeedToShuffleReuses && !E->ReorderIndices.empty())) to this if-block:

if (NeedToShuffleReuses) {
  ...
  DeadCost = ReuseShuffleCost;
} else if (!E->ReorderIndices.empty()) {
  DeadCost = TTI->getShuffleCost(TargetTransformInfo::SK_PermuteSingleSrc, VecTy);
}
3513

If NeedToShuffleReuses == false then ReuseShuffleCost == 0, so DeadCost == 0.

ABataev added inline comments.Dec 7 2020, 6:41 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3510

Ok, I see

ABataev updated this revision to Diff 309902.Dec 7 2020, 6:41 AM

Address comments

anton-afanasyev added inline comments.Dec 7 2020, 6:50 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3513

So, you still can set DeadCost = ... instead of +=. It's simpler and points that we can get starting cost either from shuffle reuse, or from reorder indices, not both.

ABataev updated this revision to Diff 309905.Dec 7 2020, 6:51 AM

Replaced += with just =

anton-afanasyev accepted this revision.EditedDec 7 2020, 7:00 AM

Looks good to me, just one style remark: we have newlines between function members for all classes throughout this module, so I'd prefer to see the same for ShuffleInstructionBuilder class.

This revision is now accepted and ready to land.Dec 7 2020, 7:00 AM
Harbormaster completed remote builds in B81289: Diff 309900.
This revision was landed with ongoing or failed builds.Dec 7 2020, 7:51 AM
This revision was automatically updated to reflect the committed changes.

I think I'm seeing a miscompile as a result of the patch. The issue is not the transformation itself, but a dependent instruction is reading the wrong value.

Before this patch, we have this:

  %171 = load <2 x i32>, <2 x i32>* %114, align 8
  %reorder_shuffle = shufflevector <2 x i32> %171, <2 x i32> undef, <2 x i32> <i32 1, i32 0>
  %shuffle = shufflevector <2 x i32> %reorder_shuffle, <2 x i32> undef, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
...
  %186 = add nsw <2 x i32> %reorder_shuffle, <i32 -1, i32 -1>

After this patch the same snippet looks like:

  %171 = load <2 x i32>, <2 x i32>* %114, align 8
  %shuffle = shufflevector <2 x i32> %171, <2 x i32> undef, <4 x i32> <i32 1, i32 0, i32 1, i32 0>
...
  %186 = add nsw <2 x i32> %171, <i32 -1, i32 -1>

IIUC, if we suppose %171 loads the values (a, b), then:

  • In the original version, %reorder_shuffle is (b, a), so %186 is (b - 1, a - 1)
  • In the modified version, %186 reads directly from %171, so the result is (a - 1, b - 1), which is swapped

Do you know what the issue might be? I'm trying to reduce the test case so I can share more details.

I think I'm seeing a miscompile as a result of the patch. The issue is not the transformation itself, but a dependent instruction is reading the wrong value.

Before this patch, we have this:

  %171 = load <2 x i32>, <2 x i32>* %114, align 8
  %reorder_shuffle = shufflevector <2 x i32> %171, <2 x i32> undef, <2 x i32> <i32 1, i32 0>
  %shuffle = shufflevector <2 x i32> %reorder_shuffle, <2 x i32> undef, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
...
  %186 = add nsw <2 x i32> %reorder_shuffle, <i32 -1, i32 -1>

After this patch the same snippet looks like:

  %171 = load <2 x i32>, <2 x i32>* %114, align 8
  %shuffle = shufflevector <2 x i32> %171, <2 x i32> undef, <4 x i32> <i32 1, i32 0, i32 1, i32 0>
...
  %186 = add nsw <2 x i32> %171, <i32 -1, i32 -1>

IIUC, if we suppose %171 loads the values (a, b), then:

  • In the original version, %reorder_shuffle is (b, a), so %186 is (b - 1, a - 1)
  • In the modified version, %186 reads directly from %171, so the result is (a - 1, b - 1), which is swapped

Do you know what the issue might be? I'm trying to reduce the test case so I can share more details.

Hi, thanks for report, looks like you're right. I'll fix it.

rupprecht added a comment.EditedDec 30 2020, 4:27 PM

Finally got a reasonably sized reduction. It's probably so long because I'm just using clang to build, and if it gets smaller than clang optimizes too much of it away; if I used opt to control the passes I could probably get it smaller. Anyway, hope this is good enough to demonstrate the issue:

; ModuleID = './repro.ll'
source_filename = "repro.ll"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

%struct.widget = type { %struct.baz }
%struct.baz = type { i32 (...)**, i32, i32, i32, i32, i8* }
%struct.snork = type { i8, i8, i8, i8, i8, i8, i8, i8, i8, i8, i8*, %struct.spam*, i8*, i8* }
%struct.spam = type { i8, i8, i8, i32, i32, i32, i32 }
%struct.zot = type { i8 }
%struct.wombat = type { i8 }

define void @wombat(%struct.widget* %arg) local_unnamed_addr align 32 personality i8* undef {
bb:
  %tmp = alloca float, align 4
  %tmp1 = alloca float, align 4
  %tmp2 = alloca float, align 4
  %tmp3 = alloca float, align 4
  %tmp4 = alloca %struct.snork, align 8
  %tmp7 = getelementptr inbounds %struct.widget, %struct.widget* %arg, i64 0, i32 0, i32 2
  %tmp8 = getelementptr inbounds %struct.widget, %struct.widget* %arg, i64 0, i32 0, i32 1
  %tmp9 = getelementptr inbounds %struct.snork, %struct.snork* %tmp4, i64 0, i32 11
  invoke void bitcast (void ()* @quux to void (%struct.snork*, %struct.zot*)*)(%struct.snork* %tmp4, %struct.zot* null)
          to label %bb11 unwind label %bb47

bb11:                                             ; preds = %bb
  %tmp12 = load i32, i32* %tmp7, align 4
  %tmp13 = load i32, i32* %tmp8, align 8
  invoke void bitcast (void ()* @wobble to void (%struct.wombat*, float*, float*, float*, float*)*)(%struct.wombat* undef, float* %tmp, float* %tmp1, float* %tmp2, float* %tmp3)
          to label %bb14 unwind label %bb47

bb14:                                             ; preds = %bb11
  %tmp15 = load float, float* %tmp, align 4
  %tmp16 = load float, float* %tmp1, align 4
  %tmp17 = load float, float* %tmp2, align 4
  %tmp18 = load float, float* %tmp3, align 4
  %tmp20 = load %struct.spam*, %struct.spam** %tmp9, align 8
  %tmp21 = add nsw i32 %tmp12, -1
  %tmp22 = fptosi float %tmp15 to i32
  %tmp23 = icmp slt i32 %tmp22, 0
  %tmp24 = icmp sgt i32 %tmp12, %tmp22
  %tmp25 = select i1 %tmp24, i32 %tmp22, i32 %tmp21
  %tmp26 = select i1 %tmp23, i32 0, i32 %tmp25
  %tmp27 = getelementptr inbounds %struct.spam, %struct.spam* %tmp20, i64 0, i32 3
  store i32 %tmp26, i32* %tmp27, align 8
  %tmp28 = add nsw i32 %tmp13, -1
  %tmp29 = fptosi float %tmp16 to i32
  %tmp30 = icmp slt i32 %tmp29, 0
  %tmp31 = icmp sgt i32 %tmp13, %tmp29
  %tmp32 = select i1 %tmp31, i32 %tmp29, i32 %tmp28
  %tmp33 = select i1 %tmp30, i32 0, i32 %tmp32
  %tmp34 = getelementptr inbounds %struct.spam, %struct.spam* %tmp20, i64 0, i32 4
  store i32 %tmp33, i32* %tmp34, align 4
  %tmp35 = fptosi float %tmp17 to i32
  %tmp36 = icmp slt i32 %tmp35, 0
  %tmp37 = icmp sgt i32 %tmp12, %tmp35
  %tmp38 = select i1 %tmp37, i32 %tmp35, i32 %tmp21
  %tmp39 = select i1 %tmp36, i32 0, i32 %tmp38
  %tmp40 = getelementptr inbounds %struct.spam, %struct.spam* %tmp20, i64 0, i32 5
  store i32 %tmp39, i32* %tmp40, align 8
  %tmp41 = fptosi float %tmp18 to i32
  %tmp42 = icmp slt i32 %tmp41, 0
  %tmp43 = icmp sgt i32 %tmp13, %tmp41
  %tmp44 = select i1 %tmp43, i32 %tmp41, i32 %tmp28
  %tmp45 = select i1 %tmp42, i32 0, i32 %tmp44
  %tmp46 = getelementptr inbounds %struct.spam, %struct.spam* %tmp20, i64 0, i32 6
  store i32 %tmp45, i32* %tmp46, align 4
  ret void

bb47:                                             ; preds = %bb11, %bb10
  %tmp48 = landingpad { i8*, i32 }
          cleanup
  ret void
}

declare void @quux() unnamed_addr

declare void @wobble() local_unnamed_addr

Building with clang -c repro.ll -O3 -o optimized.ll -fno-discard-value-names -S -emit-llvm, before this patch we had:

  %1 = load <2 x i32>, <2 x i32>* %0, align 8
  %reorder_shuffle = shufflevector <2 x i32> %1, <2 x i32> undef, <2 x i32> <i32 1, i32 0>
...
  %shuffle = shufflevector <2 x i32> %reorder_shuffle, <2 x i32> undef, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
...
  %2 = add nsw <2 x i32> %reorder_shuffle, <i32 -1, i32 -1>

And after, we have:

  %1 = load <2 x i32>, <2 x i32>* %0, align 8
...
  %shuffle = shufflevector <2 x i32> %1, <2 x i32> undef, <4 x i32> <i32 1, i32 0, i32 1, i32 0>
...
  %2 = add nsw <2 x i32> %1, <i32 -1, i32 -1>

As with the previous snippet, the dependency on %reorder_shuffle gets lost.