This is an archive of the discontinued LLVM Phabricator instance.

[LoadStoreVectorizer] Fix infinite loop in reorder.
Needs Review · Public

Authored by ebevhan on Nov 13 2018, 2:07 AM.

Details

Reviewers
rtereshin
volkan
Summary

When performing LSV, SCEV could tell us that two addresses
were consecutive even though their computations depended on
each other in the IR. This was due to folding opportunities
in the SCEV expressions for the address calculations.

Solve this by doing basic simplification when reordering
instructions after merging.

This solves https://bugs.llvm.org/show_bug.cgi?id=38517.
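
For context, a minimal sketch of the kind of change the summary describes, assuming the simplification is done by calling llvm::SimplifyInstruction on the operand instructions visited while reordering (the helper name, hook point, and this era's API spelling are illustrative, not the actual diff):

// Hypothetical sketch, not the actual patch: while walking the operand
// chain during reordering, try to fold each operand instruction away so
// that "false" data dependencies (e.g. urem %t, 1) disappear before the
// use/def order of the merged accesses is recomputed.
#include "llvm/Analysis/InstructionSimplify.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

static void simplifyOperandsInPlace(Instruction *I, const DataLayout &DL) {
  for (Use &U : I->operands()) {
    auto *OpInst = dyn_cast<Instruction>(U.get());
    if (!OpInst)
      continue;
    // If the operand folds to something simpler (often a constant), the
    // dependency of the address computation on a loaded value vanishes.
    if (Value *Simplified = SimplifyInstruction(OpInst, SimplifyQuery(DL))) {
      U.set(Simplified);
      if (OpInst->use_empty())
        OpInst->eraseFromParent();
    }
  }
}

Whether this needs to run on every instruction in the chain, or only on GEP operands, is exactly what the inline comments below ask.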

Diff Detail

Event Timeline

ebevhan created this revision. Nov 13 2018, 2:07 AM
Herald added a project: Restricted Project. Feb 11 2019, 11:42 PM
volkan added inline comments. Feb 12 2019, 10:52 AM
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
494

Could you add a comment explaining why we are doing this? Also, do we really need to simplify the operands for each instruction in the list? Should we check if it's a gep before trying to simplify?

test/Transforms/LoadStoreVectorizer/AArch64/reorder-infinite-loop.ll
15

Nit: You can run -instnamer on the test in order to assign names to these instructions.

rtereshin added inline comments. Feb 12 2019, 7:20 PM
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
497

Hi Bevin,

Thanks for looking into this; apologies it took me so long to notice there was a review waiting on me.

I don't think this is a reliable fix. It appears that it will only work if SimplifyInstruction is capable of simplifying away all the "false" data dependencies between memory operations that Vectorizer::isConsecutiveAccess is able to prove consecutive. I don't think anything guarantees such parity (nor should it).

To illustrate the problem, let me start with a minimized version of the test you're adding in this patch, for which the SimplifyInstruction-based fix does work:

target triple = "aarch64"

; Function Attrs: noinline nounwind
define i32 @test([2 x i32]* %array) #0 {
entry:
  %arrayidx = getelementptr inbounds [2 x i32], [2 x i32]* %array, i32 0, i32 1
  %t = load i32, i32* %arrayidx, align 4
  %rem = urem i32 %t, 1
  %arrayidx2 = getelementptr inbounds [2 x i32], [2 x i32]* %array, i32 %rem, i32 0
  %v = load i32, i32* %arrayidx2, align 4
  %r = add i32 %v, %t
  ret i32 %r
}

attributes #0 = { noinline nounwind }

The false dependency is created by %rem = urem i32 %t, 1, which is trivial enough for SimplifyInstruction to constant fold, eliminating the dependency.
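
As a standalone illustration of that point (not part of the patch; the setup code and names are illustrative), building the same urem against an opaque argument and handing it to SimplifyInstruction folds it to the constant 0 with no other context:

// Standalone demo: build "%rem = urem i32 %t, 1" against a function
// argument and ask InstructionSimplify to fold it. X urem 1 is always 0,
// which is why the dependency of %arrayidx2 on %t disappears here.
#include "llvm/Analysis/InstructionSimplify.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

int main() {
  LLVMContext Ctx;
  Module M("demo", Ctx);
  auto *I32 = Type::getInt32Ty(Ctx);
  auto *FTy = FunctionType::get(I32, {I32}, /*isVarArg=*/false);
  auto *F = Function::Create(FTy, Function::ExternalLinkage, "f", &M);
  auto *BB = BasicBlock::Create(Ctx, "entry", F);
  IRBuilder<> Builder(BB);

  Value *T = &*F->arg_begin();               // plays the role of %t
  Value *One = ConstantInt::get(I32, 1);
  auto *Rem = cast<Instruction>(Builder.CreateURem(T, One, "rem"));
  Builder.CreateRet(Rem);                    // keep the block well-formed

  if (Value *Folded = SimplifyInstruction(Rem, SimplifyQuery(M.getDataLayout())))
    Folded->print(errs());                   // prints "i32 0"
  return 0;
}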

Here is the same test with the expression made ever so slightly more complex:

target triple = "aarch64"

; Function Attrs: noinline nounwind
define i32 @test([2 x i32]* %array, i32 %idx) #0 {
entry:
  %ptr1 = getelementptr inbounds [2 x i32], [2 x i32]* %array, i32 %idx, i32 1
  %t = load i32, i32* %ptr1, align 4
  %t2 = mul i32 %t, 2
  %idx.p.2t = add i32 %idx, %t2
  %idx.p.t = sub i32 %idx.p.2t, %t
  %idx.another = sub i32 %idx.p.t, %t
  %ptr0 = getelementptr inbounds [2 x i32], [2 x i32]* %array, i32 %idx.another, i32 0
  %v = load i32, i32* %ptr0, align 4
  %r = add i32 %v, %t
  ret i32 %r
}

attributes #0 = { noinline nounwind }

The expression for i32 %idx.another is roughly (((2 * %t) + %idx) - %t) - %t and it's trivial enough for ScalarEvolution alone to see through:

%ptr1 = getelementptr inbounds [2 x i32], [2 x i32]* %array, i32 %idx, i32 1
-->  (4 + (8 * (sext i32 %idx to i64))<nsw> + %array) U: full-set S: full-set
%t = load i32, i32* %ptr1, align 4
-->  %t U: full-set S: full-set
%t2 = mul i32 %t, 2
-->  (2 * %t) U: [0,-1) S: [-2147483648,2147483647)
%idx.p.2t = add i32 %idx, %t2
-->  ((2 * %t) + %idx) U: full-set S: full-set
%idx.p.t = sub i32 %idx.p.2t, %t
-->  (%idx + %t) U: full-set S: full-set
%idx.another = sub i32 %idx.p.t, %t
-->  %idx U: full-set S: full-set
%ptr0 = getelementptr inbounds [2 x i32], [2 x i32]* %array, i32 %idx.another, i32 0
-->  ((8 * (sext i32 %idx to i64))<nsw> + %array)<nsw> U: full-set S: full-set

but it isn't trivial enough for SimplifyInstruction to eliminate the false dependency: LSV hangs on this example both with and without this patch.

If we pursue roughly the same approach, we could use the ScalarEvolution expander (SCEVExpander) instead: if we re-materialize the SCEV of the pointer being used when we issue the vectorized version of the load or store (and keep the reorder method as it currently is at ToT), we could probably avoid the problem more reliably, simply because Vectorizer::isConsecutiveAccess is largely based on ScalarEvolution, so there is more parity between what they can and cannot do. (Note the SCEV for %ptr0 above: the false dependency on %t is fully eliminated, so a pointer re-materialized from that SCEV would be just fine.)
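
Only as a rough sketch of that alternative (the helper name, the fallback handling, and the expander's header location, which differs between LLVM versions, are illustrative assumptions), it would look something like expanding the pointer's SCEV at the point where the vectorized access is emitted:

// Hypothetical sketch of the SCEVExpander-based alternative: instead of
// reusing the original pointer value, whose IR def-use chain may falsely
// depend on one of the loads being merged, re-materialize the address
// from its SCEV right before the vectorized access.
#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/ScalarEvolutionExpander.h"
#include "llvm/Analysis/ScalarEvolutionExpressions.h"
#include "llvm/IR/DataLayout.h"

using namespace llvm;

static Value *rematerializePointer(Value *Ptr, ScalarEvolution &SE,
                                   const DataLayout &DL,
                                   Instruction *InsertBefore) {
  const SCEV *PtrSCEV = SE.getSCEV(Ptr);
  // A bare SCEVUnknown means SE could not analyze the pointer at all;
  // fall back to the original value in that case.
  if (isa<SCEVUnknown>(PtrSCEV))
    return Ptr;
  SCEVExpander Expander(SE, DL, "lsv");
  // Emits IR that computes the same address, without the (false)
  // dependencies that SCEV has already simplified away.
  return Expander.expandCodeFor(PtrSCEV, Ptr->getType(), InsertBefore);
}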

However, that won't entirely do either. Vectorizer::isConsecutiveAccess does more than just apply SE: most notably, it can see through selects, even nested ones. Here's a test case, mostly derived from the previous one:

target triple = "aarch64"

; Function Attrs: noinline nounwind
define i32 @test(i32* %array, i32 %idx, i1 %cond, i1 %cond2) #0 {
entry:
  %idx.p.7 = add nsw i32 %idx, 7
  %idx.p.7.sext = sext i32 %idx.p.7 to i64
  %ptr1.dummy = getelementptr inbounds i32, i32* %array, i64 21
  %ptr1.true = getelementptr inbounds i32, i32* %array, i64 1
  %ptr1.false = getelementptr inbounds i32, i32* %array, i64 %idx.p.7.sext
  %ptr1.tmp = select i1 %cond, i32* %ptr1.true, i32* %ptr1.false
  %ptr1 = select i1 %cond2, i32* %ptr1.tmp, i32* %ptr1.dummy
  %t = load i32, i32* %ptr1, align 4
  %t2 = mul i32 %t, 2
  %idx.p.2t = add i32 %idx, %t2
  %idx.p.t = sub i32 %idx.p.2t, %t
  %idx.another = sub i32 %idx.p.t, %t
  %idx.p.6 = add i32 %idx.another, 6
  %idx.p.6.sext = sext i32 %idx.p.6 to i64
  %ptr0.dummy = getelementptr inbounds i32, i32* %array, i64 20
  %ptr0.true = getelementptr inbounds i32, i32* %array, i64 0
  %ptr0.false = getelementptr inbounds i32, i32* %array, i64 %idx.p.6.sext
  %ptr0.tmp = select i1 %cond, i32* %ptr0.true, i32* %ptr0.false
  %ptr0 = select i1 %cond2, i32* %ptr0.tmp, i32* %ptr0.dummy
  %v = load i32, i32* %ptr0, align 4
  %r = add i32 %v, %t
  ret i32 %r
}

attributes #0 = { noinline nounwind }

As before, it hangs LSV both with and without this patch applied. What's more, in this case the top-level SCEV of %ptr0 is nowhere near free of the false dependency on %t:

%idx.p.7 = add nsw i32 %idx, 7
-->  (7 + %idx) U: full-set S: full-set
%idx.p.7.sext = sext i32 %idx.p.7 to i64
-->  (sext i32 (7 + %idx) to i64) U: [-2147483648,2147483648) S: [-2147483648,2147483648)
%ptr1.dummy = getelementptr inbounds i32, i32* %array, i64 21
-->  (84 + %array)<nsw> U: full-set S: full-set
%ptr1.true = getelementptr inbounds i32, i32* %array, i64 1
-->  (4 + %array)<nsw> U: full-set S: full-set
%ptr1.false = getelementptr inbounds i32, i32* %array, i64 %idx.p.7.sext
-->  ((4 * (sext i32 (7 + %idx) to i64))<nsw> + %array)<nsw> U: full-set S: full-set
%ptr1.tmp = select i1 %cond, i32* %ptr1.true, i32* %ptr1.false
-->  %ptr1.tmp U: full-set S: full-set
%ptr1 = select i1 %cond2, i32* %ptr1.tmp, i32* %ptr1.dummy
-->  %ptr1 U: full-set S: full-set
%t = load i32, i32* %ptr1, align 4
-->  %t U: full-set S: full-set
%t2 = mul i32 %t, 2
-->  (2 * %t) U: [0,-1) S: [-2147483648,2147483647)
%idx.p.2t = add i32 %idx, %t2
-->  ((2 * %t) + %idx) U: full-set S: full-set
%idx.p.t = sub i32 %idx.p.2t, %t
-->  (%idx + %t) U: full-set S: full-set
%idx.another = sub i32 %idx.p.t, %t
-->  %idx U: full-set S: full-set
%idx.p.6 = add i32 %idx.another, 6
-->  (6 + %idx) U: full-set S: full-set
%idx.p.6.sext = sext i32 %idx.p.6 to i64
-->  (sext i32 (6 + %idx) to i64) U: [-2147483648,2147483648) S: [-2147483648,2147483648)
%ptr0.dummy = getelementptr inbounds i32, i32* %array, i64 20
-->  (80 + %array)<nsw> U: full-set S: full-set
%ptr0.true = getelementptr inbounds i32, i32* %array, i64 0
-->  %array U: full-set S: full-set
%ptr0.false = getelementptr inbounds i32, i32* %array, i64 %idx.p.6.sext
-->  ((4 * (sext i32 (6 + %idx) to i64))<nsw> + %array)<nsw> U: full-set S: full-set
%ptr0.tmp = select i1 %cond, i32* %ptr0.true, i32* %ptr0.false
-->  %ptr0.tmp U: full-set S: full-set
%ptr0 = select i1 %cond2, i32* %ptr0.tmp, i32* %ptr0.dummy
-->  %ptr0 U: full-set S: full-set

So rematerializing just the top-level SCEV with the Expander won't help here either.