This is an archive of the discontinued LLVM Phabricator instance.

[SelectionDAG] Remove redundant stores more aggressively.
ClosedPublic

Authored by efriedma on Feb 10 2017, 1:13 PM.

Diff Detail

Repository
rL LLVM

Event Timeline

efriedma created this revision. Feb 10 2017, 1:13 PM
RKSimon edited edge metadata.

Adding reviewers, including ARM specialists, given that it's the only test case.

spatel edited edge metadata.

I don't know enough about TokenFactor or chains to review, so adding more reviewers still. :)
Note that the problem isn't limited to ARM: the first test manifests (and gets fixed with this patch) on x86, but not the second test; I didn't dig in to see why. On PPC, early-cse catches the first test, so the problem isn't ever visible in the DAG (why don't the other targets run that too?). Like x86, the second test isn't fixed on PPC.

lib/CodeGen/SelectionDAG/SelectionDAG.cpp
7161 (On Diff #88045)

Don't need "llvm::" here?

7165 (On Diff #88045)

It would be nice to convert this loop to (*this)->ops() as well for symmetry.
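
For readers following the API here: SDNode::ops() exposes the operand list as an iterable range, so an index-based operand loop can be written range-based. A minimal sketch of the conversion being suggested, using placeholder names N and Chain rather than the actual patch hunk:

// Index-based form:
for (unsigned i = 0, e = N->getNumOperands(); i != e; ++i)
  if (N->getOperand(i) != Chain)
    return false;

// Equivalent range-based form using ops():
for (const SDValue &Op : N->ops())
  if (Op != Chain)
    return false;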

I took a quick look at why test2 isn't getting optimized on other platforms.

On x86, we end up with a DAG that looks like store(trunc(zextload(x)), x), and the code that looks for redundant stores doesn't try to look through truncate nodes. I'm not sure why the no-op OR isn't getting simplified; maybe something isn't getting added to the worklist.

On PPC, the store isn't dead because it's a big-endian target. We probably need some dedicated optimization before legalization to catch this.

Ugh, studied this a bit more, this is wrong.

The problem is that TokenFactor isn't guaranteed to be the minimal set of chain values. Consider something like this:

chain1, val1 = load entrytoken...
chain2 = store chain1...
chain3 = tokenfactor chain1, chain2
chain4 = store chain3, val1, ...

The presence of chain1 in the tokenfactor is irrelevant; the only correct ordering is chain1, chain2, chain3, chain4, so chain2 might clobber the load.

I think I can come up with a more targeted fix which is correct.

efriedma updated this revision to Diff 88969. Feb 17 2017, 3:12 PM

Updated to address review comments, and to check that Dest has only one use so we don't accidentally allow some illegal reordering.

niravd edited edge metadata. Feb 17 2017, 7:56 PM

Hmm. This doesn't look like it is correct as it currently stands upstream, much less with this change. I'm not sure if we can get a problematic graph without useAA on, so perhaps this is why we never noticed before. The chain dependence subgraph does _not_ capture the full dependencies of any two nodes in the subgraph. It only captures straight control dependencies in frame. This is important for efficiency and allows stack-level memory operations to be elided if unused.

The underlying transformation in visitSTORE effectively tries to reduce the parallelism in the DAG by pulling the relevant memory operation in the parallel TokenFactor group to the end. This could be done with calls to predecessorHelper, though that's possibly too expensive. If this is encoded in the graph it should be safe to do, but as it currently stands, it looks like multiple applications of the dead-store elimination might be invalid. I haven't thought deeply enough about the extract-vector case, but there's a possibility of incorrect behavior with respect to multiple threads peeking at each other's stacks.
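
For reference, the deeper query alluded to above would be shaped roughly like this, using SDNode::hasPredecessorHelper; StoreNode and LoadNode are placeholder names for the sketch, not identifiers from the patch:

SmallPtrSet<const SDNode *, 32> Visited;
SmallVector<const SDNode *, 16> Worklist;
// Seed the walk at the store; hasPredecessorHelper then answers whether
// LoadNode is a (possibly transitive) operand predecessor of StoreNode.
Worklist.push_back(StoreNode);
bool LoadPrecedesStore =
    SDNode::hasPredecessorHelper(LoadNode, Visited, Worklist);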

> The chain dependence subgraph does _not_ capture the full dependencies of any two nodes in the subgraph. It only captures straight control dependencies in frame.

That's true, but we don't need the full dependencies to remove a redundant store, only the control dependencies. The control dependencies are what determine the range in which a loaded value is valid, and the range in which a store is allowed to modify memory; it doesn't matter whether we could actually schedule the DAG in that order. In other words, it's legal to duplicate a load or store. This is fundamental to the way we reason about memory in SelectionDAG: for example, legalization depends on this to allow splitting loads and stores.
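
For instance, schematically (in the same notation as the examples above), type legalization can split one wide load into two narrower loads, duplicating the memory operation and re-joining the chains:

x, ch1 = load.64 A, ch0

becomes

xlo, chA = load.32 A, ch0
xhi, chB = load.32 A+4, ch0
ch1 = tokenfactor chA, chB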

I'll add a comment to clarify this.

> I took a quick look at why test2 isn't getting optimized on other platforms.
>
> On x86, we end up with a DAG that looks like store(trunc(zextload(x)), x), and the code that looks for redundant stores doesn't try to look through truncate nodes. I'm not sure why the no-op OR isn't getting simplified; maybe something isn't getting added to the worklist.
>
> On PPC, the store isn't dead because it's a big-endian target. We probably need some dedicated optimization before legalization to catch this.

Are there tests that cover the cases in X86/PPC (arm-be?) already, or could you add them, please? I'd also recommend committing the tests with the current codegen so that this patch can show the diffs.

efriedma updated this revision to Diff 89605.Feb 23 2017, 7:30 PM

Added comment about side-effects and chains. Committed testcases separately to show code generation differences.

I've just relanded D14834, which turns alias analysis in the DAGCombiner on by default, and it will hopefully stick this time. This simplifies a number of the potential cases (though not all) that your patch would catch into something we already catch. As it stands now, all of your given examples are caught without your patch. Are there other cases that you've noticed?

Also, I'm still worried about the correctness of ordering. Consider the following graph:

x, ch1 = Load A, ch0
ch2 = Store volatile AliasA x+1 ch1
y, ch3 = Load A ch2
ch4 = TF ch1, ch3
ch5 = Store A x ch4

My understanding is that the patch could erase the last store, which isn't correct.

> Are there other cases that you've noticed?

I originally stumbled across this issue looking at code with a construct like i24_insert_bit... I don't know if this comes up in practice otherwise.
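
To make the construct concrete, it looks roughly like the following C — a hypothetical reconstruction of the pattern rather than the actual testcase: a 3-byte value is loaded, one bit is ORed in, and all three bytes are stored back, leaving two of the three stores redundant.

struct i24 { unsigned char b[3]; };

void insert_bit(struct i24 *p) {
  unsigned v = p->b[0] | (p->b[1] << 8) | (p->b[2] << 16);
  v |= 1u << 13;     /* set one bit in the middle byte */
  p->b[0] = v;       /* writes back the byte just loaded: redundant */
  p->b[1] = v >> 8;  /* the only byte that actually changes */
  p->b[2] = v >> 16; /* also writes back an unchanged byte: redundant */
}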

> x, ch1 = Load A, ch0
> ch2 = Store volatile AliasA x+1 ch1
> y, ch3 = Load A ch2
> ch4 = TF ch1, ch3
> ch5 = Store A x ch4

This is the reason I added the hasOneUse check.

On trunk, this leads to zero changes to regression tests, but it still has an effect when rebased on top of D30472. I'll hold off on rebasing this until that is merged.

efriedma updated this revision to Diff 91767. Mar 14 2017, 1:44 PM

Updated diff. I'm not sure why it only has an effect on that one function.

niravd accepted this revision. Mar 15 2017, 7:37 AM

I suspect this is triggering slightly less often because you're using the more-conservative-than-necessary hasOneUse() restriction; the cases that satisfy it have become slightly less common with AA on by default. Hard-coding the check to true reveals at least one more case that we could catch if we did a deeper analysis.

In any case, modulo my minor inline comment, adding this shallow check is an obvious win, and this LGTM.

lib/CodeGen/SelectionDAG/SelectionDAG.cpp
7309 (On Diff #91767)

IIUC, the shallow search's true set is not a subset of the deeper one, and we should have the union. So this should probably be:

if (Dest.hasOneUse())
  return true;
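
With that suggestion applied, the TokenFactor handling in SDValue::reachesChainWithoutSideEffects ends up shaped roughly as follows — a sketch reconstructed from the review discussion for illustration, not the verbatim commit:

// If this is a token factor, all inputs to the TF happen in parallel.
if (getOpcode() == ISD::TokenFactor) {
  // First, a shallow search: is Dest itself an operand of the TokenFactor?
  if (is_contained((*this)->ops(), Dest)) {
    // We reach Dest without side effects if the TokenFactor could be
    // serialized with Dest as the last operation; that holds trivially
    // when Dest has a single use, since nothing else orders against it.
    if (Dest.hasOneUse())
      return true;
  }
  // Otherwise, a deep search: every operand of the TokenFactor must
  // itself reach Dest without side effects.
  return llvm::all_of((*this)->ops(), [=](SDValue Op) {
    return Op.reachesChainWithoutSideEffects(Dest, Depth - 1);
  });
}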
This revision is now accepted and ready to land. Mar 15 2017, 7:37 AM
This revision was automatically updated to reflect the committed changes.