This is an archive of the discontinued LLVM Phabricator instance.

[Devirtualization] MemDep returns non-local !invariant.group dependencies
ClosedPublic

Authored by Prazek on Dec 28 2016, 7:45 AM.

Details

Summary

Memory Dependence Analysis was limited to return only local dependencies
for invariant.group handling. Now it returns NonLocal when it finds it
and then by asking getNonLocalPointerDependency we get found dep.

Thanks to this we are able to devirtualize loops!

void indirect(A &a, int n) {
  for (int i = 0 ; i < n; i++)
    a.foo();

}
void test(int n) {
  A a;
  indirect(a);
}

After inlining a.foo() will be changed to direct call, even if it is
external (but only if vtable definition will be available).

Diff Detail

Repository
rL LLVM

Event Timeline

Prazek updated this revision to Diff 82593.Dec 28 2016, 7:45 AM
Prazek retitled this revision from to [Devirtualization] MemDep returns non-local !invariant.group dependencies.
Prazek updated this object.
Prazek added subscribers: llvm-commits, davide, mehdi_amini.
davide added inline comments.Dec 28 2016, 7:59 AM
include/llvm/Analysis/MemoryDependenceAnalysis.h
305–307 ↗(On Diff #82593)

I have hard time parsing this comment, I think you can remove the so then part or remove it altogether.

449 ↗(On Diff #82593)

nitpicking, typos:
retreived -> retrieved
quered -> queried.

lib/Analysis/MemoryDependenceAnalysis.cpp
344 ↗(On Diff #82593)

please add an assert message.

371 ↗(On Diff #82593)

spurious newline.

911 ↗(On Diff #82593)

why do you need a different scope?

912 ↗(On Diff #82593)

The type of NonLocalDef is not entirely obvious so I'd prefer if you use the explicit one.

test/Transforms/NewGVN/invariant.group.ll
345–347 ↗(On Diff #82593)

Thanks for adding NewGVN tests (although they're still XFAIL'ed, alas).

Prazek added inline comments.Dec 28 2016, 8:12 AM
lib/Analysis/MemoryDependenceAnalysis.cpp
911 ↗(On Diff #82593)

This function is pretty large, so I prefer not to add new names with that huge scope.

test/Transforms/NewGVN/invariant.group.ll
345–347 ↗(On Diff #82593)

BTW is there any way we could merge them together keeping the Xfail semantics?

Prazek marked 4 inline comments as done.Dec 28 2016, 8:22 AM
Prazek added inline comments.
lib/Analysis/MemoryDependenceAnalysis.cpp
912 ↗(On Diff #82593)

Find returns iterator, so I would like to avoid this long name. Does adding "It" suffix to variable name works for you?

Prazek updated this revision to Diff 82596.Dec 28 2016, 8:25 AM

Small changes

davide added inline comments.Dec 28 2016, 8:48 AM
lib/Analysis/MemoryDependenceAnalysis.cpp
911 ↗(On Diff #82593)

I agree it's large, maybe in a separate commit you can try splitting into multiple functions?
Not sure how much is buying us as MemDep will go away soon'ish (6 months to 1 year) but it's worth a try.

912 ↗(On Diff #82593)

Fair enough.

test/Transforms/NewGVN/invariant.group.ll
345–347 ↗(On Diff #82593)

I wish there was (as it would avoid a lot of duplication) but I wasn't able to find it. Happy to do the conversion if you think of/implement a way to share.

Side note: I think this is not really high priority as the only time when we copy tests is during transitions (as in pass rewrites etc...) which are not common enough to justify the burden of implementing a conditional XFAIL on a per-pass basis. YMMV.

Prazek marked 8 inline comments as done.Dec 28 2016, 9:19 AM
Prazek added inline comments.
lib/Analysis/MemoryDependenceAnalysis.cpp
911 ↗(On Diff #82593)

I will see what I can do about it, but I guess it is more worth to implement !invariant.group in MemSSE

davide added inline comments.Dec 28 2016, 9:22 AM
lib/Analysis/MemoryDependenceAnalysis.cpp
911 ↗(On Diff #82593)

I wholeheartedly agree.

Prazek marked 3 inline comments as done.Dec 28 2016, 9:31 AM
Prazek updated this revision to Diff 82725.Dec 30 2016, 4:19 AM

small format

Prazek added a subscriber: vsk.Dec 30 2016, 7:23 AM
reames requested changes to this revision.Dec 30 2016, 10:02 AM
reames edited edge metadata.

The use of a side cache here appears unnecessary. Why can't you simply return the non-local result to the immediate caller and let it be kept in a local unless needed?

Also, you have a problem with the quality of your results. You're returning the first non-local found, not the best non-local found. This makes the results use list order dependent. We generally try not to have use list order sensitivities in the optimizer.

This revision now requires changes to proceed.Dec 30 2016, 10:02 AM

The use of a side cache here appears unnecessary. Why can't you simply return the non-local result to the immediate caller and let it be kept in a local unless needed?

What do you mean by "kept in local"? Do you mean storing one def as a member of MemDep?

Also, you have a problem with the quality of your results. You're returning the first non-local found, not the best non-local found. This makes the results use list order dependent. We generally try not to have use list order sensitivities in the optimizer.

But this doesn't matter to invariant.group. Let say that I have 2 loads with invariant group, one local and second non-local

  %0 load %a !invariant.group !0
  call void foo(%a)
  br b1

b1:
  %1 = load %a !invariant.group !0
  call void foo(%a)
  %2 = load %a !invariant.group !0

There is guarantee that %0, %1 and %2 will load the same value. If I will ask for dependency about %2, then whenever I will get
def(%1) or def(%0), they will be transformed into either %0 or %1, but assuming we will query %1, then it will be changed into %0,
which means that we will end up with %0 everywhere.
This is assuming that GVN is sane.

I can continue looping and either try hard to find local dependency or 'the best' non-local dependency (which probably won't be found because GVN will already remove it)
and return it, or even collect list of all non local dependencies.

I am also not sure if I understand what 'the best' non local dependency means.

The use of a side cache here appears unnecessary. Why can't you simply return the non-local result to the immediate caller and let it be kept in a local unless needed?

What do you mean by "kept in local"? Do you mean storing one def as a member of MemDep?

Also, you have a problem with the quality of your results. You're returning the first non-local found, not the best non-local found. This makes the results use list order dependent. We generally try not to have use list order sensitivities in the optimizer.

But this doesn't matter to invariant.group. Let say that I have 2 loads with invariant group, one local and second non-local

  %0 load %a !invariant.group !0
  call void foo(%a)
  br b1

b1:
  %1 = load %a !invariant.group !0
  call void foo(%a)
  %2 = load %a !invariant.group !0

There is guarantee that %0, %1 and %2 will load the same value. If I will ask for dependency about %2, then whenever I will get
def(%1) or def(%0), they will be transformed into either %0 or %1, but assuming we will query %1, then it will be changed into %0,
which means that we will end up with %0 everywhere.
This is assuming that GVN is sane.

I understand your argument that after optimizing to a fixed point, we should get the same result. The problem is that the *order* of iteration will differ. This can mean differences in naming, memory allocation patterns, or even output (I don't believe GVNPRE actually iterates to a fixed point.)

I can continue looping and either try hard to find local dependency or 'the best' non-local dependency (which probably won't be found because GVN will already remove it)
and return it, or even collect list of all non local dependencies.

The key point is that we should return a consistent result, regardless of which query path we see. Returning the most dominating non-local would be one reasonable scheme.

I am also not sure if I understand what 'the best' non local dependency means.

Don't get too caught up on "best" here. The ordering point above is the primary concern. A secondary concern is reducing the number of iterations required to reach the fixed point. Why do multiple full scans if we can bypass all but one with a small amount of extra work? Particularly work that we know only happens when we have found a useful result and are just trying to find a better one? (i.e. we're not burning time when we're not making progress.)

lib/Analysis/MemoryDependenceAnalysis.cpp
417 ↗(On Diff #82725)

Ok, my comment on the previous round was unclear and probably misleading. My primary concern is the use of a side structure to propagate results from one call to the next. This introduces a bunch of complexity (do we need cache invalidation?, are we leaking memory?) which I'd like to avoid if we can.

Can we structure this such that we split this function and call it once for local and once for non-local? I know this will do strictly more work, but I'd prefer that over the code complexity.

Hmm, shouldn't it be the least dominating instead? Then we will prefer local dependencies instead of non-local.
I never faced the problem with iterations (in GVN or anywhere else), so I will trust you that it will be valuable to fix it.
I am not sure if it is valuable to fix the second issue.

lib/Analysis/MemoryDependenceAnalysis.cpp
417 ↗(On Diff #82725)

Yes, I totally agree that this is hacky solution. I don't see a simple solution to solve it expect to do high refactor in MemDep. The main problem here is that all the users of memdep expect to get local results first, and then ask for nonlocal dependencies.

I firstly tried to change Dep into LocalDep and add NonLocalDep, but then I found that I would have to see all the usages of MemDep, understand all the algorithms to make sure everything is still valid.

Other way would be to run the same algorithm second time in nonlocal, but this would make even less sense than the solution we have today.

There is also strong argument imho to leave it as it - MemDep will is being deprecated, so there is plan to remove it in 6-12 months. Because NewGVN is still not there, I would like to make this small hack so clang-4.0 will be able to do much better devirtualization.

Then I am planning to implement handling of invariant.group in MSSE and assume in NewGVN, but this will probably take me while.

Prazek updated this revision to Diff 82870.Jan 3 2017, 5:11 AM
Prazek edited edge metadata.

Fiding the closest dependency

Prazek updated this revision to Diff 82871.Jan 3 2017, 5:13 AM
Prazek edited edge metadata.

Small format

davide added inline comments.Jan 3 2017, 6:14 AM
lib/Analysis/MemoryDependenceAnalysis.cpp
371–374 ↗(On Diff #82871)

if (Best == nullptr || Dt.dominates()) return Other;

422–425 ↗(On Diff #82871)

You probably don't need braces, right?

Prazek updated this revision to Diff 82877.Jan 3 2017, 6:22 AM
Prazek marked 2 inline comments as done.

Addresed Davide's comments

ping
I would like to ship it in LLVM 4.0

Prazek added a comment.Jan 8 2017, 3:23 AM

note that there was discussion with emails that reviewboard didn't catch.

Ping, I would really like to land it in 4.0

dberlin accepted this revision.Jan 11 2017, 7:58 AM
dberlin edited edge metadata.

The N^2 loop worries me a lot, because it means i can easily construct cases where you've made memdep N^3 or N^4 in the case of invariant group dependencies.

But i guess we'll see.

lib/Analysis/MemoryDependenceAnalysis.cpp
333 ↗(On Diff #83945)

Please try to separate out the renaming from the actual logic changes

340 ↗(On Diff #83945)

Err, doesn't it simply indicate there *may* be a non-local def because it hit the top of the block?
(that's what it indicates for everything else)
How does it indicate there *must* be one?

379 ↗(On Diff #83945)

This loop is N^2, because each of these dominates calls make take O(N) time.

429 ↗(On Diff #83945)

As long as you are willing to commit to doing this right, i'm okay with it.

The N^2 loop worries me a lot, because it means i can easily construct cases where you've made memdep N^3 or N^4 in the case of invariant group dependencies.

But i guess we'll see.

Yea, but at least for clang it won't matter because it deosn't use invariant.group as default. I will make sure MemSSA will support it with constant time before turing it default.

Good news. I asked Teresa Johnson to run few benchmarks for me and this is results she got

"
Here are the results (ran 3x on ivybridge, showing both aggregated and non-aggregated improvements so it is easier to filter out noise). Looks like a real win in povray. The omnetpp improvement looks like noise, ditto for xalancbmk. Not sure about namd.
Teresa

Reference: D28137_0 + D28137_1 + D28137_2
(1): D28137_strictvtableptrs_0 + D28137_strictvtableptrs_1 + D28137_strictvtableptrs_2

           Benchmark             Base:Reference   (1)  
-------------------------------------------------------
spec/2006/fp/C++/444.namd                 25.51  +0.86%
spec/2006/fp/C++/447.dealII               43.52  +0.44%
spec/2006/fp/C++/450.soplex               41.91  +0.13%
spec/2006/fp/C++/453.povray               35.35  +4.50%
spec/2006/int/C++/471.omnetpp             26.11  +2.71%
spec/2006/int/C++/473.astar               20.52  +0.21%
spec/2006/int/C++/483.xalancbmk           34.58  +1.18%

geometric mean                                   +1.42% 


Reference: D28137_0 + D28137_1 + D28137_2
(1): D28137_0
(2): D28137_1
(3): D28137_2
(4): D28137_strictvtableptrs_0
(5): D28137_strictvtableptrs_1
(6): D28137_strictvtableptrs_2

           Benchmark             Base:Reference   (1)     (2)     (3)     (4)     (5)      (6)  
------------------------------------------------------------------------------------------------
spec/2006/fp/C++/444.namd                 25.51  -0.55%  +0.08%  +0.47%  +0.94%  +1.14%   +0.51%
spec/2006/fp/C++/447.dealII               43.52  +0.15%  -0.58%  +0.43%  +0.70%  +0.68%   -0.08%
spec/2006/fp/C++/450.soplex               41.91  -1.46%  +1.36%  +0.10%  +0.36%  -0.84%   +0.86%
spec/2006/fp/C++/453.povray               35.35  -0.88%  +0.25%  +0.62%  +3.76%  +4.47%   +5.26%
spec/2006/int/C++/471.omnetpp             26.11  +1.33%  -0.82%  -0.51%  -1.09%  -1.16%  +10.37%
spec/2006/int/C++/473.astar               20.52  -0.36%  -0.31%  +0.67%  +0.13%  +0.28%   +0.23%
spec/2006/int/C++/483.xalancbmk           34.58  -0.99%  -0.36%  +1.35%  +0.80%  -0.82%   +3.55%

geometric mean                                   -0.40%  -0.06%  +0.44%  +0.79%  +0.52%   +2.90%

"

Before this patch and handling of gep 0, there was -2% regression in povray, so this is very good sign!

Prazek added inline comments.Jan 11 2017, 8:16 AM
lib/Analysis/MemoryDependenceAnalysis.cpp
340 ↗(On Diff #83945)

Because this is what getInvariantGroupPointerDependency is doing and what documentation says.

"

 Returns Unknown if it does not
/// find anything, and Def if it can be assumed that 2 instructions load or
/// store the same value and NonLocal which indicate that non-local Def was
/// found, which can be retrieved by calling getNonLocalPointerDependency
/// with the same queried instruction.

"

379 ↗(On Diff #83945)

ok, I will leave FIXME

429 ↗(On Diff #83945)

yep, implementing it in MSSA sounds like a real fun!

Prazek updated this revision to Diff 83984.Jan 11 2017, 8:36 AM
Prazek marked an inline comment as done.
Prazek edited edge metadata.

rebase typos

dberlin added inline comments.Jan 11 2017, 10:40 AM
lib/Analysis/MemoryDependenceAnalysis.cpp
340 ↗(On Diff #83945)

Right, i'm suggesting that this is different than all the other local pointer dependency getters, and thus , confusing

/// This marker indicates that the query has no dependency in the specified
/// block.
///
/// To find out more, the client should query other predecessor blocks.
NonLocal = 1,
Prazek updated this revision to Diff 84050.Jan 11 2017, 4:19 PM
Prazek marked 7 inline comments as done.

updated comments

This revision was automatically updated to reflect the committed changes.

Just for the record - Teresa runned benchmarks with trunk instead of trunk+patch. After applying patch there were some regressions, but it might be because of some other things going out in the middle, so there is no clear verdict.