This is an archive of the discontinued LLVM Phabricator instance.

[SDAG] bail out of mergeTruncStores() if there's an unknown store in the chain
ClosedPublic

Authored by spatel on Nov 10 2022, 9:47 AM.

Details

Summary

This hopefully fixes the miscompile in issue #58883.

I'm not sure if the baseline test actually shows the expected miscompile, but the test diff does demonstrate that we gave up on store merging in that example.

This change should be strictly safe (just adds another clause to avoid the transform), and it does not prohibit any existing valid optimizations based on regression tests. I want to believe that it's also a sufficient fix (possibly overkill), but I'm not sure how to prove that.

Diff Detail

Event Timeline

spatel created this revision.Nov 10 2022, 9:47 AM
spatel requested review of this revision.Nov 10 2022, 9:47 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 10 2022, 9:47 AM
jonpa added a comment.Nov 12 2022, 6:01 AM

As I mentioned on the other thread, it would probably be better to derive a test case from main if possible, where the bug is exposed (64-bit overlapping store after store of 0).

Maybe stop after ISel? Should all the unused metadata and datalaoyut string be included? (I usually strip that away for readability personally...)

jonpa added a comment.Nov 12 2022, 6:43 AM

I can confirm that the patch fixes the unreduced C-smith test case also.

spatel updated this revision to Diff 475003.Nov 13 2022, 9:45 AM

Patch updated:
As discussed in #58883, I was looking at the wrong asm. The test is corrected/reduced from before, and I added a test comment to explain the miscompile. (I didn't push the baseline test update yet in case I my reading of the asm is incorrect.)

I can confirm that the patch fixes the unreduced C-smith test case also.

Does the updated test/description look right?

jonpa added a comment.Nov 16 2022, 8:41 AM

I can confirm that the patch fixes the unreduced C-smith test case also.

Does the updated test/description look right?

Yes, that looks like the failure as the stg (Store 64-bits) comes after and overrwrites what the sthrl (Store Halfword Relative Long) just stored.

To test that this behavior does *not* happen (with the patch), I wonder if it would maybe be better to have a .mir test that starts before isel and then check in the debug dump that DAGCombiner does not do anything wrong.

spatel updated this revision to Diff 475865.Nov 16 2022, 10:21 AM

Updated test to stop after isel and check MIR lines.

jonpa added a comment.Nov 17 2022, 8:08 AM

Updated test to stop after isel and check MIR lines.

The test looks ok, but I guess if you wanted to you could instead have a test which contains "CHECK-NOT: STG", to make it explicit that there should be no 64-bit store in the output...

The test looks ok, but I guess if you wanted to you could instead have a test which contains "CHECK-NOT: STG", to make it explicit that there should be no 64-bit store in the output...

I actually prefer to not use CHECK-NOT; we've seen cases where tests sporadically fail on those because someone's home dir or some other unexpected string pollutes the output and gets caught by FileCheck on just that one machine.

The complete CHECK-NEXT sequence from start of the function through return guarantees that we have the expected code and nothing else. And it doesn't require any special target knowledge to adjust the test in the future if it changes since the CHECKs were auto-generated - someone with SystemZ knowledge would still review any diffs of course. :)

jonpa added a comment.Nov 17 2022, 4:26 PM

The test looks ok, but I guess if you wanted to you could instead have a test which contains "CHECK-NOT: STG", to make it explicit that there should be no 64-bit store in the output...

I actually prefer to not use CHECK-NOT; we've seen cases where tests sporadically fail on those because someone's home dir or some other unexpected string pollutes the output and gets caught by FileCheck on just that one machine.

The complete CHECK-NEXT sequence from start of the function through return guarantees that we have the expected code and nothing else. And it doesn't require any special target knowledge to adjust the test in the future if it changes since the CHECKs were auto-generated - someone with SystemZ knowledge would still review any diffs of course. :)

ok... a regexp like STG .* (store s64) might be fairly safe, but I guess the auto-generated test also may be simpler to maintain. However, I see that this test is actually now passing without the patch. I guess the extra options used earlier would make it fail, but maybe it would be best to really test the DAGCombiner:

-debug-only=dagcombine
REQUIRES: asserts
CHECK: Combining: t16: ch = store<(store (s32) into %ir.f4)> t13, t9, t15, undef:i64
CHECK-NOT ... into: t30: ch = store<(store (s64) into %ir.t1, align 4)> t12, t4, t11, undef:i64
(better use some regexps)

That would also be more safe against any future changes to the scheduler options used originally, right?

jonpa added a subscriber: d.Nov 17 2022, 5:24 PM
jonpa added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
8162

I think this will also bail on stores that are actually not aliasing any of the stores, for instance the store to '@d' in the test case. Would it be possible to check aliasing against the full, merged store? (maybe in a separate loop further down after the offsets have been checked?)

Why are not the uses of N checked?

What about the uses of the uses... wouldn't we have to check any additional stores further down the DAG via the use edges?

Does this make a big difference? I am thinking that an alternative to bailing might be to add the correct chains from the new store to all the users..?

The test looks ok, but I guess if you wanted to you could instead have a test which contains "CHECK-NOT: STG", to make it explicit that there should be no 64-bit store in the output...

I actually prefer to not use CHECK-NOT; we've seen cases where tests sporadically fail on those because someone's home dir or some other unexpected string pollutes the output and gets caught by FileCheck on just that one machine.

The complete CHECK-NEXT sequence from start of the function through return guarantees that we have the expected code and nothing else. And it doesn't require any special target knowledge to adjust the test in the future if it changes since the CHECKs were auto-generated - someone with SystemZ knowledge would still review any diffs of course. :)

ok... a regexp like STG .* (store s64) might be fairly safe, but I guess the auto-generated test also may be simpler to maintain. However, I see that this test is actually now passing without the patch. I guess the extra options used earlier would make it fail, but maybe it would be best to really test the DAGCombiner:

-debug-only=dagcombine
REQUIRES: asserts
CHECK: Combining: t16: ch = store<(store (s32) into %ir.f4)> t13, t9, t15, undef:i64
CHECK-NOT ... into: t30: ch = store<(store (s64) into %ir.t1, align 4)> t12, t4, t11, undef:i64
(better use some regexps)

That would also be more safe against any future changes to the scheduler options used originally, right?

It's a trade-off in test fragility / specificity. As noted, we're not using any scheduler options on the RUN line now (so the test is more focused, but the actual miscompile is not visible).
I can add another RUN to check the debug spew from DAGCombiner, but I don't see how that makes the test intent any clearer than the MIR checks + test comment.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
8162

TBH, I'm not interested in preserving corner-case optimizations that are not based on the motivating code patterns (for which there are several regression tests across different targets).

I just want to avoid the proven miscompile; that's the most important thing. Optimizations can be added later if that matters.

I don't think it is necessary to walk uses of uses on this transform. IIUC, if there's some other problematic store that is included anywhere on this chain, then it must be a use of at least one of the stores to be merged, so we should find it eventually (and give up on the transform).

I don't have a complete understanding of chains and token factors, so I could be wrong, but I haven't come up with any code examples where this does not work.

spatel updated this revision to Diff 476499.Nov 18 2022, 8:59 AM

Added a RUN line and checks of debug output to make it clearer where this goes wrong.

jonpa added a comment.Nov 18 2022, 2:11 PM

I can add another RUN to check the debug spew from DAGCombiner, but I don't see how that makes the test intent any clearer than the MIR checks + test comment.

I guess I was just trying to make the test more resilient to any future changes which would require review of the test each time it would have to be re-generated. I also thought it would be more clear to show exactly which optimization we don't want to happen. Thinking more about it, the funny thing is that it actually would be legal to merge the stores as long as the i16 store to @e was chained to follow after. It may be sufficient to have just one of the runlines, and I guess it might not matter much which one...

I just want to avoid the proven miscompile; that's the most important thing. Optimizations can be added later if that matters.

That's true. I checked on SPEC/SystemZ and this did not change a single file, so bailing seems to be ideal.

I don't think it is necessary to walk uses of uses on this transform. IIUC, if there's some other problematic store that is included anywhere on this chain, then it must be a use of at least one of the stores to be merged, so we should find it eventually (and give up on the transform).

Yeah, I guess that's true.

Could there be a load which is chained after a store, which in turn has a store chained after it? Or any other chained operation besides a store whatever that might be?

And what about N, the first node that never gets its uses checked..?

I guess I was just trying to make the test more resilient to any future changes which would require review of the test each time it would have to be re-generated. I also thought it would be more clear to show exactly which optimization we don't want to happen. Thinking more about it, the funny thing is that it actually would be legal to merge the stores as long as the i16 store to @e was chained to follow after. It may be sufficient to have just one of the runlines, and I guess it might not matter much which one...

Yes, this example works fine unless the stores get re-ordered. That was the advantage of the first test draft that included the extra RUN params; it showed that the final asm code was truly a miscompile. I'm ok with any of the test variations so far, so let me know if you want this to change.

Could there be a load which is chained after a store, which in turn has a store chained after it? Or any other chained operation besides a store whatever that might be?

I'm not finding a way to show a bug from code like that. If there's a load after one of the stores in our merging set, then it must be independent of any subsequent stores. Therefore, any stores chained in parallel to that load must also be independent. The wide merged store is guaranteed to be chained before the load, so that guarantees that any bits loaded that actually do depend on previous stores are still being stored first.

And what about N, the first node that never gets its uses checked..?

We don't care about values that are chained after N - that's the last original store in time in the set. That's because the merged store from this transform happens strictly before N. Ie, the chain for the merged store is always updated to the first store in our merged set of stores, and that set has at least 2 values in it.

jonpa added a comment.EditedNov 22 2022, 2:37 PM

Yes, this example works fine unless the stores get re-ordered. That was the advantage of the first test draft that included the extra RUN params; it showed that the final asm code was truly a miscompile. I'm ok with any of the test variations so far, so let me know if you want this to change.

Yes, the first one gave a more obvious error, although it was dependent on the scheduler behavior, and therefore not ideal. Since maybe the DAGCombine optimization is not really wrong (if the chains are correctly updated), maybe that wasn't a good idea after all. Maybe your first test was better then, if also the comment said that the store of 0 should follow the two other stores (or a larger one) to @e, or in other words go after all other stores except the one to @d. Or something like that... :-)

Could there be a load which is chained after a store, which in turn has a store chained after it? Or any other chained operation besides a store whatever that might be?

I'm not finding a way to show a bug from code like that. If there's a load after one of the stores in our merging set, then it must be independent of any subsequent stores. Therefore, any stores chained in parallel to that load must also be independent. The wide merged store is guaranteed to be chained before the load, so that guarantees that any bits loaded that actually do depend on previous stores are still being stored first.

What if we have four stores to be merged:

N -> St1 -> St2 -> St3 -> Chain
         /
St -> Ld

Ld may be loading from one or more bytes overlapping St2 and/or St3. St could store to bytes in memory overlapping Ld, St2 and/or St3. Or am I missing something here?

(I guess the confusing thing here to me is that the stores to be merged are chained after one another, while they are actually expected to be non-overlapping by the algorithm...)

Personally I would feel safer if we did not allow any chains at all to get dropped - so we expect all the stores (except N) to only be chained before other stores in the group and nothing else. Ignoring and dropping chains seems best done as a separate optimization (like findBetterChain()), or at least deserves a comment when doing so, I'd say.

It says at the bottom of mergeTruncStores to rely on other DAGCombiner rules to remove the other individual stores. From what I can see in my (original) test case that actually does not happen - one of the i32 stores remain after isel. This seems problematic when considering the final ordering. The NewStore takes the place of the store sequence in the DAG, but if there remains one of them, it is likely then not correctly ordered...? If this is indeed the case that they may remain, then I suppose NewStore should be chained after N instead of Chain so that the ordering is correct. Somehow they should still get removed, though.

And what about N, the first node that never gets its uses checked..?

We don't care about values that are chained after N - that's the last original store in time in the set. That's because the merged store from this transform happens strictly before N. Ie, the chain for the merged store is always updated to the first store in our merged set of stores, and that set has at least 2 values in it.

I can see that now, as the chain of NewStore takes the place of N lastly in the method.

N -> St1 -> St2 -> St3 -> Chain
         /
St -> Ld

Ld may be loading from one or more bytes overlapping St2 and/or St3. St could store to bytes in memory overlapping Ld, St2 and/or St3. Or am I missing something here?

(I guess the confusing thing here to me is that the stores to be merged are chained after one another, while they are actually expected to be non-overlapping by the algorithm...)

Personally I would feel safer if we did not allow any chains at all to get dropped - so we expect all the stores (except N) to only be chained before other stores in the group and nothing else. Ignoring and dropping chains seems best done as a separate optimization (like findBetterChain()), or at least deserves a comment when doing so, I'd say.

Yes, let's just check that each store has one use only (as the chain operand of a subsequent store). That ensures that no other interfering operation could occur in the interval between the first store and N. That's the smallest patch, and it does not cause any missed optimizations based on regression tests.

spatel updated this revision to Diff 478680.Nov 29 2022, 12:32 PM

Patch updated:

  1. Check one-use of each store that occurs before 'N'.
  2. Revert back to the SystemZ asm test because that shows the miscompile.
jonpa accepted this revision.Dec 1 2022, 2:33 PM

LGTM - I think now we are on the safe side regarding the chains.

This revision is now accepted and ready to land.Dec 1 2022, 2:33 PM
This revision was landed with ongoing or failed builds.Dec 2 2022, 7:09 AM
This revision was automatically updated to reflect the committed changes.