This is an archive of the discontinued LLVM Phabricator instance.

[LV] Avoid building interleaved group in presence of WAW dependency
ClosedPublic

Authored by ebrevnov on Jun 30 2019, 11:36 PM.

Details

Summary

This is another attempt to fix an issue from https://reviews.llvm.org/D57180. I have to create a new review since have no permissions to update diff in original review.

I don't think we actually need to do any additional book keeping (as suggested in D57180). Current algorithm already examines all pairs and in case of dependence will invalidate interleaving group. I think the real issue is WAW dependence (on the same iteration) is not collected by LoopAccessInfo in the first place and as a result canReorderMemAccessesForInterleavedGroups returns wrong answer. The fix is to explicitly check for WAW dependence in canReorderMemAccessesForInterleavedGroups.

Diff Detail

Event Timeline

ebrevnov created this revision.Jun 30 2019, 11:36 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 30 2019, 11:36 PM
ebrevnov edited the summary of this revision. (Show Details)Jun 30 2019, 11:51 PM
ebrevnov added reviewers: hsaito, Ayal, fhahn, anna, mkazantsev.
Ayal added a comment.Jul 10 2019, 5:14 PM

From the SUMMARY above:

I don't think we actually need to do any additional book keeping (as suggested in D57180). Current algorithm already examines all pairs and in case of dependence will invalidate interleaving group. I think the real issue is WAW dependence (on the same iteration) is not collected by LoopAccessInfo in the first place and as a result canReorderMemAccessesForInterleavedGroups returns wrong answer. The fix is to explicitly check for WAW dependence in canReorderMemAccessesForInterleavedGroups.

Nice catch! Agree about the real issue being a deficiency of LoopAccessInfo, and that it's better to treat all interleave-group-preventing dependencies the same, w/o dedicating special treatment for same-iteration WAW dependences. Suggest to fix this real issue at its core.

LoopAccessInfo does collect the 6 same-iteration WAR Forward dependences between each of the 3 loads and their 2 subsequent same-address stores, in the testcase below, but these do not interfere with interleave-grouping the stores. So it's perfectly capable of collecting same-iteration WAW Forward dependences as well. (LoopAccessInfo should also already collect any same-iteration RAW dependences, although such cases would hopefully be optimized away.)

The reason why LoopAccessInfo does not collect same-iteration WAW dependences is due to the way areDepsSafe() traverses pairs of DependentAccesses, held in EquivalenceClasses<MemAccessInfo> which clusters stores-to-same-address together (and load-to-same-address together, but separately from the stores). One way of fixing this may be:

diff --git a/llvm/lib/Analysis/LoopAccessAnalysis.cpp b/llvm/lib/Analysis/LoopAccessAnalysis.cpp
index 36bd9a8b7ea..a3f40826861 100644
--- a/llvm/lib/Analysis/LoopAccessAnalysis.cpp
+++ b/llvm/lib/Analysis/LoopAccessAnalysis.cpp
@@ -1641,13 +1641,21 @@ bool MemoryDepChecker::areDepsSafe(DepCandidates &AccessSets,
     // Check every access pair.
     while (AI != AE) {
       Visited.insert(*AI);
-      EquivalenceClasses<MemAccessInfo>::member_iterator OI = std::next(AI);
+      bool AIIsWrite = AI->getInt();
+      // Check loads only against next equivalent class, but stores also against
+      // other stores in the same equivalence class - to the same address.
+      EquivalenceClasses<MemAccessInfo>::member_iterator OI =
+          (AIIsWrite ? AI : std::next(AI));
       while (OI != AE) {
         // Check every accessing instruction pair in program order.
         for (std::vector<unsigned>::iterator I1 = Accesses[*AI].begin(),
              I1E = Accesses[*AI].end(); I1 != I1E; ++I1)
-          for (std::vector<unsigned>::iterator I2 = Accesses[*OI].begin(),
-               I2E = Accesses[*OI].end(); I2 != I2E; ++I2) {
+          // Scan all accesses of another equivalence class, but only the next
+          // accesses of the same equivalent class.
+          for (std::vector<unsigned>::iterator
+                   I2 = (OI == AI ? std::next(I1) : Accesses[*OI].begin()),
+                   I2E = (OI == AI ? I1E : Accesses[*OI].end());
+               I2 != I2E; ++I2) {
             auto A = std::make_pair(&*AI, *I1);
             auto B = std::make_pair(&*OI, *I2);

(could possibly save dependences by recording only those between pairs of adjacent stores to same address, relying on transitivity)

include/llvm/Analysis/VectorUtils.h
612

Src and Sink should only be StoreInst's here, right?

ebrevnov added inline comments.Jul 11 2019, 4:28 AM
include/llvm/Analysis/VectorUtils.h
612

On enter to canReorderMemAccessesForInterleavedGroups we can have load or store only. mayWriteToMemory treats some types of loads as writing to memory. Thus load is possible here as well.

From the SUMMARY above:

I don't think we actually need to do any additional book keeping (as suggested in D57180). Current algorithm already examines all pairs and in case of dependence will invalidate interleaving group. I think the real issue is WAW dependence (on the same iteration) is not collected by LoopAccessInfo in the first place and as a result canReorderMemAccessesForInterleavedGroups returns wrong answer. The fix is to explicitly check for WAW dependence in canReorderMemAccessesForInterleavedGroups.

Nice catch! Agree about the real issue being a deficiency of LoopAccessInfo, and that it's better to treat all interleave-group-preventing dependencies the same, w/o dedicating special treatment for same-iteration WAW dependences. Suggest to fix this real issue at its core.

LoopAccessInfo does collect the 6 same-iteration WAR Forward dependences between each of the 3 loads and their 2 subsequent same-address stores, in the testcase below, but these do not interfere with interleave-grouping the stores. So it's perfectly capable of collecting same-iteration WAW Forward dependences as well. (LoopAccessInfo should also already collect any same-iteration RAW dependences, although such cases would hopefully be optimized away.)

The reason why LoopAccessInfo does not collect same-iteration WAW dependences is due to the way areDepsSafe() traverses pairs of DependentAccesses, held in EquivalenceClasses<MemAccessInfo> which clusters stores-to-same-address together (and load-to-same-address together, but separately from the stores). One way of fixing this may be:

diff --git a/llvm/lib/Analysis/LoopAccessAnalysis.cpp b/llvm/lib/Analysis/LoopAccessAnalysis.cpp
index 36bd9a8b7ea..a3f40826861 100644
--- a/llvm/lib/Analysis/LoopAccessAnalysis.cpp
+++ b/llvm/lib/Analysis/LoopAccessAnalysis.cpp
@@ -1641,13 +1641,21 @@ bool MemoryDepChecker::areDepsSafe(DepCandidates &AccessSets,
     // Check every access pair.
     while (AI != AE) {
       Visited.insert(*AI);
-      EquivalenceClasses<MemAccessInfo>::member_iterator OI = std::next(AI);
+      bool AIIsWrite = AI->getInt();
+      // Check loads only against next equivalent class, but stores also against
+      // other stores in the same equivalence class - to the same address.
+      EquivalenceClasses<MemAccessInfo>::member_iterator OI =
+          (AIIsWrite ? AI : std::next(AI));
       while (OI != AE) {
         // Check every accessing instruction pair in program order.
         for (std::vector<unsigned>::iterator I1 = Accesses[*AI].begin(),
              I1E = Accesses[*AI].end(); I1 != I1E; ++I1)
-          for (std::vector<unsigned>::iterator I2 = Accesses[*OI].begin(),
-               I2E = Accesses[*OI].end(); I2 != I2E; ++I2) {
+          // Scan all accesses of another equivalence class, but only the next
+          // accesses of the same equivalent class.
+          for (std::vector<unsigned>::iterator
+                   I2 = (OI == AI ? std::next(I1) : Accesses[*OI].begin()),
+                   I2E = (OI == AI ? I1E : Accesses[*OI].end());
+               I2 != I2E; ++I2) {
             auto A = std::make_pair(&*AI, *I1);
             auto B = std::make_pair(&*OI, *I2);

(could possibly save dependences by recording only those between pairs of adjacent stores to same address, relying on transitivity)

Ayal, thanks for paying attention to this :-). Solution your are proposing looks more solid to me. But I'm a bit worry if reporting such WAW dependencies by LoopAccessInfo will affect many other places...and possibly compile time. What do you think?

Ayal added a comment.Jul 11 2019, 3:34 PM

[snip]
But I'm a bit worry if reporting such WAW dependencies by LoopAccessInfo will affect many other places...and possibly compile time. What do you think?

I think same-iteration WAW dependences should be reported just as (and especially given that) same-iteration WAR dependences are. There's already a MaxDependences threshold in place to bail out when there are too many dependences. Admittedly borderline cases may now bail out with these extra WAW dependencies. But how many stores to the same address do we expect... and as commented in D57180 - best optimize them into one. Also, as commented, it should suffice to record only a path of dependences between pairs of adjacent stores, e.g., by setting I2E to std::next(I2) instead of I1E, unless I2==I2E.

ebrevnov updated this revision to Diff 209800.Jul 15 2019, 4:42 AM

Updating patch to use implementation proposed by Ayal.

Hi @hsaito!

Would be nice if you find time to take a look.

Thanks
Evgeniy

Hi @hsaito!

Would be nice if you find time to take a look.

@ebrevnov, thanks for pinging. This fell into the stack of not yet processed e-mails accumulated during vacation. Will take a look today.

hsaito accepted this revision.Jul 31 2019, 10:11 AM

Indeed, this is a much cleaner fix. LGTM.

This revision is now accepted and ready to land.Jul 31 2019, 10:11 AM

Thanks for quick response @hsaito ! May I ask you to commit it as well since I don't have committer rights yet.

hsaito added a comment.Aug 1 2019, 5:28 PM

Thanks for quick response @hsaito ! May I ask you to commit it as well since I don't have committer rights yet.

Sure. Will do it tonight.

Closed by commit rL367654 (authored by hsaito). · Explain WhyAug 1 2019, 11:31 PM
This revision was automatically updated to reflect the committed changes.

r367654 | hsaito | 2019-08-01 23:31:50 -0700 (Thu, 01 Aug 2019) | 12 lines

[LV] Avoid building interleaved group in presence of WAW dependency

Reviewers: hsaito, Ayal, fhahn, anna, mkazantsev

Reviewed By: hsaito

Patch by evrevnov, thanks!

Differential Revision: https://reviews.llvm.org/D63981

LIT test has been moved to X86 subdir, in response to ps4-buildslave1 buildbot failure.

hsaito added a comment.Aug 2 2019, 4:16 PM

Reported buildbot fails for Evgueni to follow up as needed:

http://lab.llvm.org:8011/builders/clang-with-thin-lto-ubuntu/builds/18311 --- blamelist had only this commit. Resolved by moving the LIT test to X86.

http://lab.llvm.org:8011/builders/clang-cmake-thumbv7-full-sh/builds/2185 --- I think this is not relevant. "clang -emit-llvm -O0".

http://lab.llvm.org:8011/builders/ppc64le-lld-multistage-test/builds/5742 --- not relevant. Something to do with the test infrastructure.

Reported buildbot fails for Evgueni to follow up as needed:

http://lab.llvm.org:8011/builders/clang-with-thin-lto-ubuntu/builds/18311 --- blamelist had only this commit. Resolved by moving the LIT test to X86.

http://lab.llvm.org:8011/builders/clang-cmake-thumbv7-full-sh/builds/2185 --- I think this is not relevant. "clang -emit-llvm -O0".

Pretty sure that was caused by https://reviews.llvm.org/D65286. which introduces failing check.

http://lab.llvm.org:8011/builders/ppc64le-lld-multistage-test/builds/5742 --- not relevant. Something to do with the test infrastructure.