This is an archive of the discontinued LLVM Phabricator instance.

[LV] Avoid adding into interleaved group in presence of WAW dependency
Abandoned · Public

Authored by anna on Jan 24 2019, 12:29 PM.


Details

Reviewers

hsaito
Ayal
mkazantsev
fhahn

Summary

Fixes a miscompile (PR40291) caused by ignoring store ordering when adding
stores to an interleaved store group.
The test case shows that a store is now added to the interleaved store group
only if we can prove there is no WAW dependency.
This is a cleaned-up version of the fix suggested by @hsaito.
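
To make the store-ordering problem concrete, here is a scalar sketch based on the loop described in this revision's test case (load x, store x + 1, then store 0 if x < 1). The function names and the "sunk store" schedule are assumptions made for illustration; they model the effect of moving the unconditional stores past the predicated ones and are not the vectorizer's actual output.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar semantics: store (x + 1), then conditionally overwrite the same
// location with 0. The later (predicated) store must win.
std::vector<int> scalarOrder(std::vector<int> Mem) {
  for (std::size_t I = 0; I < Mem.size(); ++I) {
    int X = Mem[I];
    Mem[I] = X + 1; // unconditional store
    if (X < 1)
      Mem[I] = 0;   // predicated store to the same address (WAW)
  }
  return Mem;
}

// Hypothetical bad schedule: the unconditional stores are collected into one
// wide store that executes after the predicated stores.
std::vector<int> sunkStoreOrder(std::vector<int> Mem) {
  std::vector<int> Pending(Mem.size());
  for (std::size_t I = 0; I < Mem.size(); ++I) {
    int X = Mem[I];
    Pending[I] = X + 1; // value destined for the wide store
    if (X < 1)
      Mem[I] = 0;       // predicated store now runs first
  }
  Mem = Pending;        // wide store overwrites the zeros
  return Mem;
}
```

For an input such as {0, 5, -3}, the two schedules disagree on every element where x < 1, which is the kind of divergence the WAW check is meant to rule out.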

Diff Detail

Repository

rL LLVM

Build Status

Buildable 27264
Build 27263: arc lint + arc unit

Event Timeline

anna created this revision. Jan 24 2019, 12:29 PM

Herald added a subscriber: rkruppe. Jan 24 2019, 12:29 PM

Harbormaster completed remote builds in B27264: Diff 183367. Jan 24 2019, 12:29 PM

ping

I plan on having a look later this week. I am a little worried that the checks in-line here are already quite complex and I would like to have a think if that could be improved in some way.

In D57180#1376068, @fhahn wrote:

I plan on having a look later this week. I am a little worried that the checks in-line here are already quite complex and I would like to have a think if that could be improved in some way.

I agree; The algorithm makes sure that we visit everything between B and A, including C, before we visit A; so we have a chance to identify the (potentially) interfering store C before we reach A; This is what allows the algorithm to only compare the pairs (A,B) without having each time to also scan everything in between.

So I think the bug is that when we visited C, and found that it could be inserted into B's group dependence-wise, but wasn't inserted due to other reasons, we should have either:

1) Invalidated the group (which is over aggressive but better than wrong code)
2) Recorded in B's Group the index where C could be inserted, to "burn" that index from allowing some other instruction A to become a group member at that index; so when we reach A we see its spot is taken. (I think this will have the same effect as the proposed patch but without the extra scan.)
3) Same as above but instead of bailing out on grouping A with B, make sure that C is also sunk down with that group (as I think Hideki mentioned in the PR) (maybe a future improvement).

Herald added a project: Restricted Project. Jan 31 2019, 11:19 PM

In D57180#1380093, @dorit wrote:

I agree; The algorithm makes sure that we visit everything between B and A, including C, before we visit A; so we have a chance to identify the (potentially) interfering store C before we reach A; This is what allows the algorithm to only compare the pairs (A,B) without having each time to also scan everything in between.

So I think the bug is that when we visited C, and found that it could be inserted into B's group dependence-wise, but wasn't inserted due to other reasons, we should have either:

1) Invalidated the group (which is over aggressive but better than wrong code)

2) Recorded in B's Group the index where C could be inserted, to "burn" that index from allowing some other instruction A to become a group member at that index; so when we reach A we see its spot is taken. (I think this will have the same effect as the proposed patch but without the extra scan.)

3) Same as above but instead of bailing out on grouping A with B, make sure that C is also sunk down with that group (as I think Hideki mentioned in the PR) (maybe a future improvement).

If you don't like the current approach, I agree 2) achieves the same thing, with extra bookkeeping (or extra state in the existing bookkeeping). I think 1) is too conservative. Even if C is next to B, B's group can still be extended to the other direction. 3) should be done separately from the bug fix. Anyway, do we ever deal with so many loads/stores for this efficiency to avoid extra scanning to actually matter? I'm just curious.

In D57180#1380108, @hsaito wrote:
Anyway, do we ever deal with so many loads/stores for this efficiency to avoid extra scanning to actually matter? I'm just curious.

I'd say usually the number would be quite small and there would be no significant compile time problem. But it is not impossible to hit a worst case scenario, given the wide range of frontends and source code that gets thrown at LLVM. Also, compile time is a key metric for some users and unnecessarily wasting a bit here and there leads to a death by a thousand paper cuts :)

(I am saying that after recently investigating a few cases of hour-long compile times on user code that were caused by unnecessary scanning)

In D57180#1380108, @hsaito wrote:

If you don't like the current approach, I agree 2) achieves the same thing, with extra bookkeeping (or extra state in the existing bookkeeping). I think 1) is too conservative. Even if C is next to B, B's group can still be extended to the other direction. 3) should be done separately from the bug fix. Anyway, do we ever deal with so many loads/stores for this efficiency to avoid extra scanning to actually matter? I'm just curious.

Rather than over aggressive or too conservative, 1) seems to match the current behavior which forbids store groups with gaps; extending in the other direction will also break the vector WAW dependence, right? 2) could potentially "burn" the index with minimal extra bookkeeping or state by inserting a nullptr in its place; in any case, it's worth doing only when/after introducing support for store groups with gaps.
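
The "burn the index with a nullptr" idea can be sketched with a toy container. Everything here (`Inst`, `ToyGroup`, `burn`) is a hypothetical stand-in invented for illustration; it is not the `InterleaveGroup` API, and it ignores alignment, factor limits, and the gap restrictions discussed in this thread.

```cpp
#include <cassert>
#include <map>

struct Inst { const char *Name; }; // hypothetical stand-in for an instruction

// Toy group: index -> member. A nullptr entry marks a burned slot.
class ToyGroup {
  std::map<int, const Inst *> Members;

public:
  explicit ToyGroup(const Inst *Leader) { Members[0] = Leader; }

  // Reserve Index so that no later instruction can claim it. An existing
  // member at Index is left untouched.
  void burn(int Index) { Members.emplace(Index, nullptr); }

  // Returns false if the slot is already occupied or burned.
  bool insertMember(const Inst *I, int Index) {
    return Members.emplace(Index, I).second;
  }

  // Burned slots read back as "no member".
  const Inst *getMember(int Index) const {
    auto It = Members.find(Index);
    return It == Members.end() ? nullptr : It->second;
  }
};
```

In this model, burning the slot that a skipped store C would have occupied makes a later attempt to insert A at the same index fail, without rescanning the instructions between A and B.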

In D57180#1380573, @fhahn wrote:

(I am saying that after recently investigating a few cases of hour-long compile times on user code that were caused by unnecessary scanning)

Fair enough. If the size based threshold is not in place in the interleaved access optimization, that would be a good defensive improvement. Can be separate from this bug fix, though.

In D57180#1380814, @Ayal wrote:

Rather than over aggressive or too conservative, 1) seems to match the current behavior which forbids store groups with gaps; extending in the other direction will also break the vector WAW dependence, right?

OK. Hitting the gap case, yes. Else, the other direction should also hit WAW dep.

in any case, it's worth doing only when/after introducing support for store groups with gaps.

Agree.

Also, regarding

Same as above but instead of bailing out on grouping A with B, make sure that C is also sunk down with that group (as I think Hideki mentioned in the PR) (maybe a future improvement).

consider eliminating such WAW dependencies by sinking the stores and folding them into one, producing a single interleave group (by a future, separate patch).

(The issue here is somewhat reminiscent of the WAW dependence caused by multiple invariant stores, as in https://reviews.llvm.org/D54538, but there the dependence is both inside and across loop iterations.)
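
As a scalar illustration of folding the two stores into one, assuming the store pattern from this revision's test case (store x + 1, then store 0 if x < 1), the predicated overwrite can be expressed as a single store of a selected value. The function names are hypothetical and this is only a source-level sketch, not a description of any existing transform.

```cpp
#include <cassert>
#include <initializer_list>

// Original shape: two stores to the same address, the second one predicated.
void twoStores(int *P) {
  int X = *P;
  *P = X + 1;
  if (X < 1)
    *P = 0;
}

// Folded shape: one store of a selected value, so no WAW dependence remains.
void foldedStore(int *P) {
  int X = *P;
  *P = (X < 1) ? 0 : X + 1;
}
```

With a single store per address, all three stores in the test loop could in principle join one interleave group.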

ebrevnov added a subscriber: ebrevnov. Jun 28 2019, 4:39 AM

ebrevnov added a reviewer: ebrevnov. Jun 28 2019, 5:02 AM

Here is an old draft along the lines discussed above; it probably deserves some updating or cleanup:

Index: include/llvm/Analysis/VectorUtils.h
===================================================================
--- include/llvm/Analysis/VectorUtils.h	(revision 352559)
+++ include/llvm/Analysis/VectorUtils.h	(working copy)
@@ -303,6 +303,35 @@
     return true;
   }
 
+  /// Check if a new member \p Instr can be inserted with index \p Index and
+  /// alignment \p NewAlign. The index is related to the leader and it could be
+  /// negative if it is the new leader.
+  ///
+  /// \returns false if the instruction doesn't belong to the group.
+  bool insertableMember(InstTy *Instr, int Index, unsigned NewAlign) const {
+    assert(NewAlign && "The new member's alignment should be non-zero");
+
+    int Key = Index + SmallestKey;
+
+    // Skip if there is already a member with the same index.
+    if (Members.find(Key) != Members.end())
+      return false;
+
+    if (Key > LargestKey) {
+      // The largest index is always less than the interleave factor.
+      if (Index >= static_cast<int>(Factor))
+        return false;
+
+    } else if (Key < SmallestKey) {
+      // The largest index is always less than the interleave factor.
+      if (LargestKey - Key >= static_cast<int>(Factor))
+        return false;
+
+    }
+
+    return true;
+  }
+
   /// Get the member with the given index \p Index
   ///
   /// \returns nullptr if contains no such member.
Index: lib/Analysis/VectorUtils.cpp
===================================================================
--- lib/Analysis/VectorUtils.cpp	(revision 352559)
+++ lib/Analysis/VectorUtils.cpp	(working copy)
@@ -936,8 +936,23 @@
       BasicBlock *BlockA = A->getParent();  
       BasicBlock *BlockB = B->getParent();  
       if ((isPredicated(BlockA) || isPredicated(BlockB)) &&
-          (!EnablePredicatedInterleavedMemAccesses || BlockA != BlockB))
+          (!EnablePredicatedInterleavedMemAccesses || BlockA != BlockB)) {
+        // If A could be inserted into B's group but is not, prevent a potential
+        // output dependent unpredicated store from taking its place.
+        // TODO: mark IndexA "taken", instead of breaking the group, when store
+        // groups with gaps are supported.
+        if (Group) {
+          int IndexA =
+              Group->getIndex(B) + DistanceToB / static_cast<int64_t>(DesB.Size);
+          if (Group->insertableMember(A, IndexA, DesA.Align)) {
+            LLVM_DEBUG(dbgs() << "LV: Detected candidate:" << *A << '\n'
+                              << "    preventing another for interleave group"
+                              << " with" << *B << '\n');
+            break;
+          }
+        }
         continue;
+      }
 
       // The index of A is the index of B plus A's distance to B in multiples
       // of the size.
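
The bounds checks in the draft's `insertableMember` can be exercised in isolation with a small model. `ToyInterleaveGroup` below is a hypothetical stand-in that keeps only the fields the draft reads (`Factor`, `SmallestKey`, `LargestKey`, `Members`) and drops the alignment parameter; it assumes a freshly created group whose leader sits at key 0.

```cpp
#include <cassert>
#include <map>

// Hypothetical, reduced model of the group bookkeeping used by the draft.
struct ToyInterleaveGroup {
  unsigned Factor;
  int SmallestKey = 0;
  int LargestKey = 0;
  std::map<int, const void *> Members; // key -> member

  ToyInterleaveGroup(unsigned F, const void *Leader) : Factor(F) {
    Members[0] = Leader;
  }

  // Mirrors the checks in the draft: the slot must be free, and the group
  // must still span fewer than Factor indices after the insertion.
  bool insertableMember(int Index) const {
    int Key = Index + SmallestKey;

    // Skip if there is already a member with the same index.
    if (Members.find(Key) != Members.end())
      return false;

    if (Key > LargestKey) {
      if (Index >= static_cast<int>(Factor))
        return false;
    } else if (Key < SmallestKey) {
      if (LargestKey - Key >= static_cast<int>(Factor))
        return false;
    }
    return true;
  }
};
```

With a factor of 3 and only the leader present, indices 1, 2, -1 and -2 are insertable, while 0 (occupied), 3 and -3 (outside the factor) are not.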

ebrevnov mentioned this in D63981: [LV] Avoid building interleaved group in presence of WAW dependency. Jun 30 2019, 11:51 PM

ebrevnov removed a reviewer: ebrevnov.

ebrevnov added a child revision: D63981: [LV] Avoid building interleaved group in presence of WAW dependency. Jun 30 2019, 11:56 PM

ebrevnov removed a child revision: D63981: [LV] Avoid building interleaved group in presence of WAW dependency.

Original miscompile is expected to be fixed by:

commit 71b98e0841b13cc9848327e69f531efd1e294592
Author: Hideki Saito <hideki.saito@intel.com>
Date: Fri Aug 2 06:31:50 2019 +0000

[LV] Avoid building interleaved group in presence of WAW dependency

@anna Abandon this now that D63981 has landed at rL367654?

I think this patch was superseded by D63981, which landed a while ago. Marking as requiring changes, to remove it from the review queue. @anna, it would be great if you could take a look and abandon the revision, unless it is still relevant.

This revision now requires changes to proceed. Nov 28 2019, 8:51 AM

hi, thanks guys. Sorry for the abnormally late response (was on mat leave). Yes, looks like this can be abandoned.

anna abandoned this revision. Dec 4 2019, 11:11 AM

Revision Contents

Path                                                                    Size
lib/Analysis/VectorUtils.cpp                                            30 lines
test/Transforms/LoopVectorize/interleaved-accesses-waw-dependency.ll    108 lines

Diff 183367

lib/Analysis/VectorUtils.cpp

In: for (auto AI = std::next(BI); AI != E; ++AI) {
           (!EnablePredicatedInterleavedMemAccesses || BlockA != BlockB))
         continue;
 
       // The index of A is the index of B plus A's distance to B in multiples
       // of the size.
       int IndexA =
           Group->getIndex(B) + DistanceToB / static_cast<int64_t>(DesB.Size);
 
+      // If there is a store Instruction C between A and B, WAW dependence
+      // needs to be properly observed. If C is not already part of
+      // the Interleave Group, A cannot be.
+      if (A->mayWriteToMemory()) {
+        bool Bailout = false;
+        for (auto CI = std::next(BI); CI != AI; ++CI) {
+          Instruction *C = CI->first;
+          // Not WAW dependency.
+          if (!C->mayWriteToMemory())
+            continue;
+          // Should be the same address space for the accesses to be dependent.
+          if (getLoadStoreAddressSpace(A) != getLoadStoreAddressSpace(B))
+            continue;
+          // If A and C are not dependent, then continue ahead.
+          if (canReorderMemAccessesForInterleavedGroups(&AI, &CI) &&
+              getLoadStorePointerOperand(A) != getLoadStorePointerOperand(C))
+            continue;
+          if (!isInterleaved(C) || getInterleaveGroup(C) != Group) {
+            Bailout = true;
+            break; // A and C are dependent but C is not in the same group as B
+          }
+        }
+        if (Bailout) {
+          LLVM_DEBUG(dbgs() << "LV: Cannot insert: " << *A
+                            << " into interleave group because of WAW dependency"
+                            << "\n");
+          continue; // A can't be grouped with B since C is not.
+        }
+      }
+
       // Try to insert A into B's group.
       if (Group->insertMember(A, IndexA, DesA.Align)) {
         LLVM_DEBUG(dbgs() << "LV: Inserted:" << *A << '\n'
                           << " into the interleave group with" << *B
                           << '\n');
         InterleaveGroupMap[A] = Group;
 
         // Set the first load in program order as the insert position.

test/Transforms/LoopVectorize/interleaved-accesses-waw-dependency.ll

This file was added.

				; RUN: opt < %s -loop-vectorize -force-vector-width=4 -force-vector-interleave=2 -debug-only=vectorutils -disable-output -enable-interleaved-mem-accesses=true 2>&1 | FileCheck %s
				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				; PR40291
				; The loop does the following operation 3 times:
				; 1. Load x from memory;
				; 2. Store (x + 1) to this memory;
				; 3. if (x < 1), store 0 to this memory.

				; When scalar version stores 0 in all locations, the vector version should do
				; the same thing. However, with interleaving it does not honour the WAW dependency between
				; store 0 and store (x + 1) to the same memory.
				; For now, we identify such unsafe dependency and disable adding the
				; store into the interleaved group.
				; In this test case, because we disable adding store into i32* %storeaddr12 and
				; storeaddr22, we create interleaved groups with gaps and
				; disable that interleaved group. So, we are only left with valid interleaved
				; groups.




				; CHECK: LV: Analyzing interleaved accesses...
				; CHECK: LV: Creating an interleave group with: store i32 %tmp34, i32* %storeaddr32, align 4
				; CHECK-NEXT: LV: Cannot insert: store i32 %tmp24, i32* %storeaddr22, align 4 into interleave group because of WAW dependency
				; CHECK-NEXT: LV: Cannot insert: store i32 %tmp14, i32* %storeaddr12, align 4 into interleave group because of WAW dependency

				; CHECK: LV: Creating an interleave group with: %tmp33 = load i32, i32* %storeaddr32, align 4
				; CHECK-NEXT: LV: Inserted: %tmp23 = load i32, i32* %storeaddr22, align 4
				; CHECK-NEXT: into the interleave group with %tmp33 = load i32, i32* %storeaddr32, align 4
				; CHECK-NEXT: LV: Inserted: %tmp13 = load i32, i32* %storeaddr12, align 4
				; CHECK-NEXT: into the interleave group with %tmp33 = load i32, i32* %storeaddr32, align 4
				; CHECK: LV: Creating an interleave group with: store i32 %tmp24, i32* %storeaddr22, align 4
				; CHECK-NEXT: LV: Cannot insert: store i32 %tmp14, i32* %storeaddr12, align 4 into interleave group because of WAW dependency
				define void @test(i8* nonnull align 8 dereferenceable_or_null(24) %arg) {
				bb:
				%tmp = getelementptr inbounds i8, i8* %arg, i64 16
				%tmp1 = bitcast i8* %tmp to i8**
				%tmp2 = load i8*, i8** %tmp1, align 8
				%tmp3 = getelementptr inbounds i8, i8* %arg, i64 8
				%tmp4 = bitcast i8* %tmp3 to i8**
				%tmp5 = load i8*, i8** %tmp4, align 8
				%tmp6 = getelementptr inbounds i8, i8* %tmp5, i64 12
				%tmp7 = bitcast i8* %tmp6 to i32*
				%tmp8 = getelementptr inbounds i8, i8* %tmp2, i64 12
				br label %header

				header: ; preds = %latch, %bb
				%tmp10 = phi i64 [ %tmp41, %latch ], [ 3, %bb ]
				%tmp11 = add nsw i64 %tmp10, -1
				%storeaddr12 = getelementptr inbounds i32, i32* %tmp7, i64 %tmp11
				%tmp13 = load i32, i32* %storeaddr12, align 4
				%tmp14 = add i32 %tmp13, 1
				store i32 %tmp14, i32* %storeaddr12, align 4
				%tmp15 = icmp slt i32 %tmp13, 1
				%tmp16 = xor i1 %tmp15, true
				%tmp17 = zext i1 %tmp16 to i8
				%tmp18 = getelementptr inbounds i8, i8* %tmp8, i64 %tmp10
				store i8 %tmp17, i8* %tmp18, align 1
				br i1 %tmp15, label %bb19, label %bb20

				bb19: ; preds = %header
				store i32 0, i32* %storeaddr12, align 4
				br label %bb20

				bb20: ; preds = %bb19, %header
				%tmp21 = add nuw nsw i64 %tmp10, 1
				%storeaddr22 = getelementptr inbounds i32, i32* %tmp7, i64 %tmp10
				%tmp23 = load i32, i32* %storeaddr22, align 4
				%tmp24 = add i32 %tmp23, 1
				store i32 %tmp24, i32* %storeaddr22, align 4
				%tmp25 = icmp slt i32 %tmp23, 1
				%tmp26 = xor i1 %tmp25, true
				%tmp27 = zext i1 %tmp26 to i8
				%tmp28 = getelementptr inbounds i8, i8* %tmp8, i64 %tmp21
				store i8 %tmp27, i8* %tmp28, align 1
				br i1 %tmp25, label %bb29, label %bb30

				bb29: ; preds = %bb20
				store i32 0, i32* %storeaddr22, align 4
				br label %bb30

				bb30: ; preds = %bb29, %bb20
				%tmp31 = add nuw nsw i64 %tmp10, 2
				%storeaddr32 = getelementptr inbounds i32, i32* %tmp7, i64 %tmp21
				%tmp33 = load i32, i32* %storeaddr32, align 4
				%tmp34 = add i32 %tmp33, 1
				store i32 %tmp34, i32* %storeaddr32, align 4
				%tmp35 = icmp slt i32 %tmp33, 1
				%tmp36 = xor i1 %tmp35, true
				%tmp37 = zext i1 %tmp36 to i8
				%tmp38 = getelementptr inbounds i8, i8* %tmp8, i64 %tmp31
				store i8 %tmp37, i8* %tmp38, align 1
				br i1 %tmp35, label %bb39, label %latch

				bb39: ; preds = %bb30
				store i32 0, i32* %storeaddr32, align 4
				br label %latch

				latch: ; preds = %bb39, %bb30
				%tmp41 = add nuw nsw i64 %tmp10, 3
				%tmp42 = icmp ugt i64 %tmp31, 67
				br i1 %tmp42, label %exit, label %header

				exit: ; preds = %latch
				ret void
				}