This is an archive of the discontinued LLVM Phabricator instance.

Allow MemoryLocation to carry pre-existing knowledge to AA to elide expensive repeated checks
Needs ReviewPublic

Authored by dsanders on Oct 12 2018, 9:39 AM.

Details

Reviewers
hfinkel
sanjoy
Summary

DeadStoreElimination currently calls getModRefInfo() for each combination of
exit block and alloca (and similarly local allocs). Each one of these calls
is checking whether the given memory location is a non-escaping local. In
the case of one function I have, this is ~5,000 exit blocks * ~13,000 allocas =
~65 million calls to getModRefInfo(). As a result it spends ~57s on
DeadStoreElimination. Unfortunately, DeadStoreElimination finds that it's not
able to make any changes at all so getModRefInfo() is returning the same result
5,000 times for each allocation. While none of the calls to getModRefInfo() are
redundant, 99.98% of the checks inside it that the argument is a non-escaping
local (each of which involves an expensive call to PointerMayBeCaptured()) are
redundant, since that portion of getModRefInfo() depends on a property of the
Value being queried and not on the particular call site.

DeadStoreElimination knows that all of its queries are about locals
(or equivalent to a local such as a non-escaping heap alloc), it doesn't cause
non-escaping locals to escape since it's only removing dead stores, and it
knows when a change may cause an escaping local to stop escaping. Therefore
it has everything getModRefInfo() needs to cache the result of the expensive
PointerMayBeCaptured() call and provide it to future calls.

This patch introduces a means to do that by extending MemoryLocation with a
KnownFlags member which can record pre-existing knowledge which can be used by
its clients to elide particularly expensive checks. This patch currently applies
this caching very conservatively within DeadStoreElimination. Any change at all
to the IR flushes the whole cache. This is partly to keep the overhead of the
cache maintenance down and partly to keep it simple. In the worst case, we end
up doing all the PointerMayBeCaptured() calls anyway, with a small amount of
additional overhead to maintain the cache.
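
In rough outline, the interface looks something like this (simplified; the exact
flag and helper names in the patch may differ):

// Simplified sketch: MemoryLocation carries a small bitmask of facts the
// caller has already established, so AA can skip re-deriving them.
class MemoryLocation {
public:
  enum KnownFlag : unsigned {
    KnownLocalObject = 1u << 0, // alloca or non-escaping heap allocation
    KnownNonEscaping = 1u << 1, // PointerMayBeCaptured() already known false
  };

  bool isKnownToBeLocalObject() const { return KnownFlags & KnownLocalObject; }
  bool isKnownToBeNonEscaping() const { return KnownFlags & KnownNonEscaping; }
  void addKnownFlags(unsigned Flags) { KnownFlags |= Flags; }

private:
  unsigned KnownFlags = 0;
  // Existing Ptr/Size/AATags members unchanged.
};

A pass such as DSE sets these bits once per stack object, and BasicAA's
non-escaping-local check consults them before falling back to the
PointerMayBeCaptured() walk.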

CTMark isn't significantly affected by this patch. With 10x multisampling, I see
two regressions:

  • pairlocalalign: 0.10%
  • sqlite3-link: 0.66%

and several minor improvements, the top 3 are:

  • 7-zip-benchmark-link: -1.37%
  • bullet-link: -0.92%
  • kc-link: -0.82%

and overall the geomean has improved slightly (-0.21%). The resulting binaries
are unchanged. The more interesting result is the motivating function mentioned
above. Previously, DeadStoreElimination was taking ~57s on this function and now
takes ~20s (-65%).

Diff Detail

Event Timeline

dsanders created this revision. Oct 12 2018, 9:39 AM

Hi Sanjoy,

I don't seem to be having much luck getting reviewers for this and I noticed that you reviewed the last big change in getModRefInfo(). Would you be able to take a look at this patch?

Hi Hal,

I didn't spot you in CODE_OWNERS.txt at first because I was searching for 'Alias' instead of 'alias' :-). Would you be able to take a look at this patch too?

asbirlea added inline comments.
lib/Analysis/BasicAliasAnalysis.cpp
819

Add an assert before this to check:

assert((!Loc.isKnownToBeLocalObject() || !Loc.isKnownCouldBeNonLocalObject()) &&
       "Location cannot be both known local and known non-local");
assert((!Loc.isKnownToBeNonEscaping() || !Loc.isKnownMayEscape()) &&
       "Location cannot be both known non-escaping and known it may escape");
822

Update flags in Loc here if isNonEscapingLocalObject was called?
This will make Loc non-const, which is a larger change in itself, and there should be some comments added if we make getModRefInfo() have side effects.
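
Something along these lines (illustrative only; addKnownFlags and the flag names are placeholders, and as said this gives getModRefInfo() a side effect on Loc):

// After the capture walk has been done for this query, record the result on
// the location itself so later queries against the same MemoryLocation can
// skip PointerMayBeCaptured().
const Value *Object = GetUnderlyingObject(Loc.Ptr, DL);
if (isNonEscapingLocalObject(Object))
  Loc.addKnownFlags(MemoryLocation::KnownLocalObject |
                    MemoryLocation::KnownNonEscaping);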

lib/Transforms/Scalar/DeadStoreElimination.cpp
774

Can we avoid this call and leave the Escaping bits both not set?
Precisely because PointerMayBeCaptured() is expensive. Ideally this should be computed only on demand (in getModRefInfo()) and the info passed back when caching.

876

Check if Loc was updated by the ModRef call and update the cache?

dsanders marked an inline comment as done. Oct 31 2018, 5:36 PM

Thanks, Alina.

lib/Analysis/BasicAliasAnalysis.cpp
822

I like the idea of the Loc being updated with things that become known as a result of AA as I'm sure there's plenty more caching that can be done. However, I'm a bit worried about the effects of making the MemoryLocation non-const. My main concern is that callers will find that caching is happening unexpectedly and that passes will therefore cache automatically, modify the IR, but forget to clear the known flags because they weren't aware of the caching. The other concerns are mostly about unintended implicit copies caused by assigning const MemoryLocation to non-const MemoryLocation, and about the size of the patch as dropping the const will likely propagate outwards quite a long way.

I think it would probably be better to add an extra argument to make the caching something the callers have to knowingly opt into. I'm thinking something like:

ModRefInfo getModRefInfo(ImmutableCallSite CS, const MemoryLocation &Loc, MemoryLocationKnowledge &Knowledge);
ModRefInfo getModRefInfo(ImmutableCallSite CS, const MemoryLocation &Loc) {
  MemoryLocationKnowledge Knowledge;
  return getModRefInfo(CS, Loc, Knowledge);
}

That way, callers that aren't aware of it or don't want to preserve it can keep calling:

getModRefInfo(CS, Loc)

and discarding any cachable knowledge. Those that want full caching (e.g. analysis passes) can do something like:

getModRefInfo(CS, Loc, Loc.Knowledge)

meanwhile those that need to be more careful can do:

MemoryLocationKnowledge Knowledge;
getModRefInfo(CS, Loc, Knowledge)
... some transformation ...
Knowledge.forgetX();
Knowledge.forgetY();
Loc.Knowledge = Knowledge;

Does that sound like a good direction to go?

lib/Transforms/Scalar/DeadStoreElimination.cpp
774

We can, although for DSE at least we won't end up saving any further calls to it as a result. DSE calls getModRefInfo() for every local (or local equivalent) in DeadStackObjects, so we'll always call PointerMayBeCaptured() once for each item either way.
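
For reference, the pattern in question is roughly the following (a simplified sketch of the handleEndBlock loop, not the exact source; CS stands for the call site being checked):

// Each end-of-function call site queries every remaining candidate stack
// object individually; without cached knowledge, each query can re-run the
// capture analysis on the same Value.
DeadStackObjects.remove_if([&](Value *Obj) {
  // Drop Obj from the dead set if the call site may read from it.
  return isRefSet(AA->getModRefInfo(CS, MemoryLocation(Obj)));
});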

dsanders updated this revision to Diff 172065. Oct 31 2018, 5:37 PM

Add the sanity checks

Hi Hal,

I didn't spot you in CODE_OWNERS.txt at first because I was searching for 'Alias' instead of 'alias' :-). Would you be able to take a look at this patch too?

Thanks. I suppose that I should echo Chris's question about the proposed local-dominance cache with respect to this change too: If we switch this pass to using MemorySSA does this problem go away?

Hal,

I have mixed feelings here because I understand Chris's point, but there are differences in this case which make this patch worth pushing forward IMO.

MemorySSA in the current format will not solve this. It can be (and should be) extended to handle cases with *some* similarity (because we have other use cases for it). Trying to explain below.

I'm not that familiar with DSE but if I understand correctly, the sequence is:

  • Keep a list of instructions, known to be allocas (or alike?)
  • Call for all callsites *after* those allocas getModRefInfo
  • If CS can read from any of those allocas, all stores above the CS are live (and it's iterating BBs in reverse so marking all preceding stores live as soon as one call is found to Ref)

So the patch adds info about the list of allocas inside the MemoryLocation they modify and saves some of the work getModRefInfo does.

MemorySSA cannot currently provide info about isClobberedBy; it provides uses which are not yet optimized (this is a useful and expensive extension, and we should have a walker for it).
But even if we had this in MemorySSA now, the info cached in this case does not necessarily overlap with storing "isLocal" and "isNonEscaping" for a MemoryLocation, plus it is more expensive.
We're looking here for ModRef between two particular instructions that may have many other memory-accessing instructions in between, with which they MayAlias. A getModRefInfo call should be cheaper than a MemorySSA query here.
Perhaps I could argue that rewriting the checks entirely (to avoid this particular ModRef call) would make MemorySSA more viable, but that's a whole other discussion and it would involve rewriting the pass.

The cost is clearly the extra memory added to MemoryLocation (Chris's objection for BBs) which is a core data structure. Unlike BBs we create MemoryLocations often on the spot and drop them right away.
This may make things better (as far as memory cost throughout the compilation) or worse (as far as allocations).
IMO, as far as caching invalidation, it is clearly better than BBs, since MemoryLocations are not passed between Analyses or Transforms. Hence the KnownFlags set by a pass will be used only in that pass, and whatever caching the pass does is dropped for others.
It could be argued that this could be another extension to add to MemorySSA so we do pass cached info forward, but I think that's beyond the scope of MemorySSA right now.

Hope this makes sense, and please feel free to pull in Chris and the others to give feedback.

FWIW, long term DSE should ideally use MemorySSA instead of MemoryDependenceResults.
But this particular optimization has a high chance of being useful even then.

lib/Analysis/BasicAliasAnalysis.cpp
822

Yes, I think it's good to keep the MemoryLocation const in this patch.

lib/Transforms/Scalar/DeadStoreElimination.cpp
774

Got it.
I'm wondering what's the cause of the regressions you're seeing? Just from adding to the cache and clearing it?

Hi Hal,

I didn't spot you in CODE_OWNERS.txt at first because I was searching for 'Alias' instead of 'alias' :-). Would you be able to take a look at this patch too?

Thanks. I suppose that I should echo Chris's question about the proposed local-dominance cache with respect to this change too: If we switch this pass to using MemorySSA does this problem go away?

Do you mean the question at http://lists.llvm.org/pipermail/llvm-dev/2018-September/126375.html?

I'm not very familiar with the MemorySSA pass but based on a fairly quick skim of it ...
It looks like the MemoryDef/MemoryUse objects are attached to the BB rather than the instructions. That wouldn't give us the accuracy we need for DSE to eliminate the store in something like:

%p = alloca i32          ; any local object works here
%1 = load i32, i32* %p
store i32 %1, i32* %p    ; dead store: %p does not escape and is never read again
unreachable

but it would give a reasonable early test as to whether it's worth asking further questions about the instructions. If the block containing the call site lacks a MemoryUse for a given alloca's MemoryDef, then we could avoid asking whether the call site accesses the given alloca because we know that nothing in the BB accesses it. However, this is only helpful if MemorySSA tries to eliminate MemoryUses for allocas that don't escape or where the called function is inspected and confirmed to not use that particular alloca. If it's conservative (e.g. adds a MemoryUse for all allocas at every call site) then we can't cull the expensive 'is it a non-escaping local?' check, because MemorySSA only tells us that something in the BB used it, rather than that a particular instruction used it.

Looking at MemorySSA's code, I'm wondering if it may be hitting the same performance issue that this patch is targeting. I see that getModRefInfo() is called quite a bit, most notably for every instruction in buildMemorySSA() (via createNewAccess()). Every one of those calls potentially calls the expensive PointerMayBeCaptured() (subject to early exits) and doesn't cache the results of any of them even though it's a property of the MemoryLocation rather than the Instruction. It's also potentially called again inside instructionClobbersQuery() which also looks like it's called fairly often (in walkToPhiOrClobber() and optimizeUsesInBlock()).

Hal,

I have mixed feelings here because I understand Chris's point, but there are differences in this case which make this patch worth pushing forward IMO.

MemorySSA in the current format will not solve this. It can be (and should be) extended to handle cases with *some* similarity (because we have other use cases for it). Trying to explain below.

I'm not that familiar with DSE but if I understand correctly, the sequence is:

  • Keep a list of instructions, known to be allocas (or alike?)
  • Call for all callsites *after* those allocas getModRefInfo
  • If CS can read from any of those allocas, all stores above the CS are live (and it's iterating BBs in reverse so marking all preceding stores live as soon as one call is found to Ref)

That's right. The particular case that was causing performance problems for me was:

call void @__assert_rtn(i8* %0, i8* %1, i8* %2)
unreachable

I had 5,000 exit blocks like that and 13,000 allocas to consider. Every alloca was potentially dead at the unreachable, and DSE called getModRefInfo() for each one to find out whether __assert_rtn was able to access it. The answer was always no, because the allocas were all non-escaping locals, but finding that out was expensive and the check was repeated for every local alloca and every call site.

So the patch adds info about the list of allocas inside the MemoryLocation they modify and saves some of the work getModRefInfo does.

MemorySSA cannot currently provide info about isClobberedBy; it provides uses which are not yet optimized (this is a useful and expensive extension, and we should have a walker for it).
But even if we had this in MemorySSA now, the info cached in this case does not necessarily overlap with storing "isLocal" and "isNonEscaping" for a MemoryLocation, plus it is more expensive.
We're looking here for ModRef between two particular instructions that may have many other memory-accessing instructions in between, with which they MayAlias. A getModRefInfo call should be cheaper than a MemorySSA query here.
Perhaps I could argue that rewriting the checks entirely (to avoid this particular ModRef call) would make MemorySSA more viable, but that's a whole other discussion and it would involve rewriting the pass.

The cost is clearly the extra memory added to MemoryLocation (Chris's objection for BBs) which is a core data structure.

We can potentially eliminate that cost in MemoryLocation using the separate non-const MemoryLocationKnowledge object we were talking about as an input and output. Callers that don't provide it would still need to allocate memory for it, but it would be a default-constructed object in the callee's stack frame.
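
Sketching the shape I have in mind (names assumed from the earlier discussion, not the current patch):

// The cached facts live in a separate object rather than in MemoryLocation,
// so the core data structure does not grow.
struct MemoryLocationKnowledge {
  bool KnownLocalObject = false;
  bool KnownNonEscaping = false;
};

ModRefInfo getModRefInfo(ImmutableCallSite CS, const MemoryLocation &Loc,
                         MemoryLocationKnowledge &Knowledge);
ModRefInfo getModRefInfo(ImmutableCallSite CS, const MemoryLocation &Loc) {
  // Callers that don't opt in pay only for a default-constructed object in
  // this forwarding overload's frame, and the knowledge is discarded.
  MemoryLocationKnowledge Discarded;
  return getModRefInfo(CS, Loc, Discarded);
}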

Unlike BBs we create MemoryLocations often on the spot and drop them right away.
This may make things better (as far as memory cost throughout the compilation) or worse (as far as allocations).
IMO, as far as caching invalidation, it is clearly better than BBs, since MemoryLocations are not passed between Analyses or Transforms. Hence the KnownFlags set by a pass will be used only in that pass, and whatever caching the pass does is dropped for others.
It could be argued that this could be another extension to add to MemorySSA so we do pass cached info forward, but I think that's beyond the scope of MemorySSA right now.

Hope this makes sense, and please feel free to pull in Chris and the others to give feedback.

lib/Transforms/Scalar/DeadStoreElimination.cpp
774

I think so, but I haven't dug into it yet to confirm.

I'm not very familiar with the MemorySSA pass but based on a fairly quick skim of it ...
It looks like the MemoryDef/MemoryUse objects are attached to the BB rather than the instructions.

I'll write more later, but quickly, this is incorrect. MemorySSA has per-instruction granularity.

I'm not very familiar with the MemorySSA pass but based on a fairly quick skim of it ...
It looks like the MemoryDef/MemoryUse objects are attached to the BB rather than the instructions.

I'll write more later, but quickly, this is incorrect. MemorySSA has per-instruction granularity.

Thanks. I got that from some comments about attaching MemoryUse/MemoryDef to basic blocks (not the ones about phi's) but I'm having trouble finding the same comments this morning. Is it storing the additional data in the BB objects or something like that? That could explain how I reached the wrong conclusion.

In that case it will come down to how conservative it is about attaching MemoryUse. Looking at buildMemorySSA() again, createNewAccess() is calling getModRefInfo(const Instruction *, Optional<MemoryLocation>) and is always providing None as the second argument. Following that code path, this causes it to report the behaviour using getModRefBehavior(ImmutableCallSite), which is good news performance-wise since this path doesn't have the expensive is-this-a-non-escaping-local check. It's also good news for DSE when doesNotAccessMemory() is true, since that avoids the need for the expensive call for all allocas on the basis that none of them can be accessed if the call doesn't access memory.

However, it's not enough for DSE when doesNotAccessMemory() is false, since DSE wants to know whether specific allocas are accessed by the call site. We'd still have to iterate over all the possibly-dead allocas, checking each one with getModRefInfo(ImmutableCallSite, MemoryLocation), which (without something like this patch) would repeatedly perform the expensive check to see if the MemoryLocation is a non-escaping local.
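
In code, the two-tier check amounts to something like this (sketch only; keepStoresTo() is a stand-in for whatever DSE does when an object may be read):

// Cheap whole-call-site check first: if the callee doesn't access memory at
// all, no per-alloca queries are needed.
if (AA->doesNotAccessMemory(CS))
  return;
// Otherwise we still have to ask about each possibly-dead alloca, and each of
// these queries can hit the expensive non-escaping-local check.
for (Value *Obj : DeadStackObjects)
  if (isRefSet(AA->getModRefInfo(CS, MemoryLocation(Obj))))
    keepStoresTo(Obj); // hypothetical helper for illustration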

MemorySSA is an analysis; all info is kept "on the side". It's standalone, and no info is kept in BBs.
Some doc can be found here: http://releases.llvm.org/6.0.1/docs/MemorySSA.html

MemorySSA would be a net improvement vs MemoryDependenceAnalysis currently used in DSE.

asbirlea added inline comments. Nov 5 2018, 10:09 AM
lib/Transforms/Scalar/DeadStoreElimination.cpp
1382–1383

Could this go inside eliminateDeadStores(F, ...)?

1415–1416

Same as above.

sanjoy resigned from this revision. Jan 29 2022, 5:30 PM
Herald added a project: Restricted Project. Jan 29 2022, 5:30 PM