SDValue::reachesChainWithoutSideEffects is tricky code (see r114268), so please review carefully.
I don't know enough about TokenFactor or chains to review, so I'm adding still more reviewers. :)
Note that the problem isn't limited to ARM: the first test manifests (and gets fixed with this patch) on x86, but the second test doesn't; I didn't dig in to see why. On PPC, early-cse catches the first test, so the problem is never visible in the DAG (why don't the other targets run that too?). Like x86, the second test isn't fixed on PPC.
lib/CodeGen/SelectionDAG/SelectionDAG.cpp
7298–7299: It would be nice to convert this loop to (*this)->ops() as well for symmetry.
7301: Don't need "llvm::" here?
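For reference, a minimal sketch of what the suggested ops() conversion might look like in the TokenFactor case of reachesChainWithoutSideEffects, assuming the deep search keeps its existing "every operand must reach Dest" semantics (illustration only, not the literal diff):

// Deep search over the TokenFactor operands via ops() instead of an
// index-based loop; all operands must reach Dest without side effects.
for (const SDValue &Op : (*this)->ops())
  if (!Op.reachesChainWithoutSideEffects(Dest, Depth - 1))
    return false;
return true;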
I took a quick look at why test2 isn't getting optimized on other platforms.
On x86, we end up with a DAG that looks like store(trunc(zextload(x)), x), and the code that looks for redundant stores doesn't try to look through truncate nodes. I'm not sure why the no-op OR isn't getting simplified; maybe something isn't getting added to the worklist.
On PPC, the store isn't dead because it's a big-endian target. We probably need some dedicated optimization before legalization to catch this.
Ugh, I studied this a bit more; this is wrong.
The problem is that TokenFactor isn't guaranteed to be the minimal set of chain values. Consider something like this:
chain1, val1 = load entrytoken...
chain2 = store chain1...
chain3 = tokenfactor chain1, chain2
chain4 = store chain3, val1, ...
The presence of chain1 in the tokenfactor is irrelevant; the only correct ordering is chain1, chain2, chain3, chain4, so chain2 might clobber the load.
I think I can come up with a more targeted fix which is correct.
Updated to address review comments. Also updated to check that Dest has only one use, so we don't accidentally allow illegal reordering.
Hmm. This doesn't look correct as it currently stands upstream, much less with this change. I'm not sure we can get a problematic graph without useAA on, so perhaps that is why we never noticed before. The chain dependence subgraph does _not_ capture the full dependencies of any two nodes in the subgraph. It only captures straight control dependencies in frame. This is important for efficiency and allows stack-level memory operations to be elided if unused.
This underlying transformation in visitSTORE effectively tries to reduce the parallelism in the DAG by pulling the relevant memory operation in the parallel TokenFactor group to the end. This could be done with calls to predecessorHelper, though that is possibly too expensive. If this is encoded in the graph it should be safe to do, but as it currently stands it looks like multiple applications of the dead store elimination might be invalid. I haven't thought deeply enough about the extract vector case, but there's a possibility of incorrect behavior with respect to multiple threads peeking at each other's stacks.
The chain dependence subgraph does _not_ capture the full dependencies of any two nodes in the subgraph. It only captures straight control dependencies in frame.
That's true, but we don't need the full dependencies to remove a redundant store, only the control dependencies. The control dependencies are what determines the range in which a loaded value is valid, and the range in which a store is allowed to modify memory; it doesn't matter whether we could actually schedule the DAG in that order. In other words, it's legal to duplicate a load or store. This is fundamental to the way we reason about memory in SelectionDAG: for example, legalization depends on this to allow splitting loads and stores.
I'll add a comment to clarify this.
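For context, the visitSTORE fold this thread is debating has roughly the following shape (paraphrased from memory, so the exact guards may differ): the store is dropped when it writes back the value just loaded from the same address and its chain reaches the load's output chain without intervening side effects.

// Hedged sketch of the dead-store fold in DAGCombiner::visitSTORE.
if (auto *Ld = dyn_cast<LoadSDNode>(Value)) {
  if (Ld->getBasePtr() == Ptr && Ld->getMemoryVT() == ST->getMemoryVT() &&
      ST->isUnindexed() && !ST->isVolatile() &&
      // No side effects may be ordered between the load and the store.
      Chain.reachesChainWithoutSideEffects(SDValue(Ld, 1)))
    return Chain; // The store is redundant; fold it away to its input chain.
}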
Are there tests that cover the cases in X86/PPC (arm-be?) already or could you add them please? I'd also recommend committing the tests with the current codegen so that this patch can show the diffs.
Added comment about side-effects and chains. Committed testcases separately to show code generation differences.
I've just relanded D14834, which turns anti-aliasing on by default in the DAGCombiner, and it will hopefully stick this time. This simplifies a number of potential cases (but not all) your patch would catch into something we already catch. As it stands now, all of your given examples are caught without your patch. Are there other cases that you've noticed?
Also, I'm still worried about the correctness of ordering. Consider the following graph:
x, ch1 = Load A, ch0
ch2 = Store volatile AliasA x+1 ch1
y, ch3 = Load A ch2
ch4 = TF ch1, ch3
ch5 = Store A x ch4
My understanding is that the patch could erase the last store, which isn't correct.
Are there other cases that you've noticed?
I originally stumbled across this issue looking at code with a construct like i24_insert_bit... I don't know if this comes up in practice otherwise.
x, ch1 = Load A, ch0
ch2 = Store volatile AliasA x+1 ch1
y, ch3 = Load A ch2
ch4 = TF ch1, ch3
ch5 = Store A x ch4
This is the reason I added the hasOneUse check: in that graph, ch1 has a second use (the volatile store), so the shallow check rejects it.
On trunk, this leads to zero changes to regression tests, but it still has an effect when rebased on top of D30472. I'll hold off on rebasing this until that is merged.
I suspect this is triggering slightly less often because you're using the more conservative than necessary hasOneUse() restriction, which holds slightly less often with AA on by default. Turning it to true reveals at least one more case that we could catch if we did a deeper analysis to make sure we are
In any case, modulo my minor inline comment, adding this shallow check is an obvious win, and this LGTM.
lib/CodeGen/SelectionDAG/SelectionDAG.cpp
7309: IIUC the shallow search's true set is not a subset of the deeper one, and we should have the union: if (Dest.hasOneUse()) return true;
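To make that concrete, a sketch of the shape the comment seems to be asking for, keeping the existing all-operands deep search and falling through to it when the shallow check fails (an illustration, not the final diff):

if (getOpcode() == ISD::TokenFactor) {
  // Shallow search: Dest feeds this TokenFactor directly and has no other
  // users, so nothing can be ordered between Dest and the TokenFactor.
  if (Dest.hasOneUse() && is_contained((*this)->ops(), Dest))
    return true;
  // Deep search: every TokenFactor operand must reach Dest without side
  // effects. Falling through here gives the union of the two checks.
  for (const SDValue &Op : (*this)->ops())
    if (!Op.reachesChainWithoutSideEffects(Dest, Depth - 1))
      return false;
  return true;
}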