This is an archive of the discontinued LLVM Phabricator instance.

[EarlyCSE] DSE of atomic unordered stores
ClosedPublic

Authored by reames on Dec 8 2015, 2:17 PM.

Details

Summary

The rules for removing trivially dead stores are a lot less complicated than loads. Since we know the later store post dominates the former, unless the former has side effects other than the actual store, we can remove it. One slightly surprising thing is that we can freely remove atomic stores, even if the later one isn't atomic. There's no guarantee the atomic one was every visible.

(I'd really appreciate a sanity check on the preceding argument. That somewhat surprised me when I realized it.)

For the moment, we don't handle DSE of ordered atomic stores. We could extend the same chain of reasoning to them, but the catch is we'd then have to model the ordering effect without a store instruction. Since our fences are a stronger than our operation orderings, simple using a fence isn't an obvious win. This arguable calls for a refinement in our fence specification, but that's (much) later work.

Diff Detail

Repository
rL LLVM

Event Timeline

reames updated this revision to Diff 42224.Dec 8 2015, 2:17 PM
reames retitled this revision from to [EarlyCSE] DSE of atomic unordered stores.
reames updated this object.
reames added a subscriber: llvm-commits.
majnemer accepted this revision.Dec 8 2015, 2:23 PM
majnemer edited edge metadata.

LGTM. You have a volatile-volatile test (test25). Does it make sense to add a simple-volatile test, unordered-volatile or a simple-unordered test?

This revision is now accepted and ready to land.Dec 8 2015, 2:23 PM
reames added a comment.Dec 8 2015, 2:40 PM

LGTM. You have a volatile-volatile test (test25). Does it make sense to add a simple-volatile test, unordered-volatile or a simple-unordered test?

simple-unordered is test22 (pre-existing)

The other two are probably undefined behaviour. Our specification is a bit unclear, but volatile (I believe) applies to a memory location, not an memory access. Thus, mixing volatile and non-volatile access to the same location is likely undefined behaviour.

dberlin edited edge metadata.Dec 8 2015, 2:52 PM
dberlin added a subscriber: dberlin.

I admit to being curious why you think volatile applies to a memory
location , mainly because the last time it was discussed at length, i
remember the consensus being that volatile is about the access, not the
location ;-)

The langref itself specifically talks about volatile memory accesses, and
volatile operations, and says things like:
"The optimizers *may* change the order of volatile operations relative to
non-volatile operations."
IE it goes to lengths to talk about the operations and not the locations
they refer to.

reames added a comment.Dec 8 2015, 3:04 PM

I admit to being curious why you think volatile applies to a memory
location , mainly because the last time it was discussed at length, i
remember the consensus being that volatile is about the access, not the
location ;-)

So, I will freely admit I haven't given this much thought. I'm trying to not change the volatile implementation at all with these changes. Having said that, let me explain...

In C++, I believe volatile applies to a lvalue (i.e. location). As such, all accesses to a particular location must be either volatile or non volatile. I know that you can use a const_cast to change the volatile qualifier on an access, but I thought it was still undefined behavior to access an location with the "wrong" qualifiers.

I'm actively trying *not* to exploit this here. If I changed behaviour for volatile loads or stores, that's unintentional and should be discussed.

jfb edited edge metadata.Dec 9 2015, 2:47 AM

I'm not sure this is good to go as-is.

lib/Transforms/Scalar/EarlyCSE.cpp
690 ↗(On Diff #42224)

Typo "removal".

test/Transforms/EarlyCSE/atomics.ll
131 ↗(On Diff #42224)

I think the store you keep still needs to be atomic, i.e. you're "sinking" the atomicity to all post-dominating stores. Let's ignore that this is unordered (silly example) and imagine it's a store release: in that case the writer could have been writing a non-zero value, then racing with itself and writing another non-zero value to the same location. The reader observes the transition to non-zero (yes it's racy *which* non-zero is seen, but one is seen). If you drop release then the algorithm becomes incorrect.

void writer(Payload p, atomic<int> *flag) {
  p->update();
  flag->write(1, release);
  // ...
  *reinterpret_cast<int*>(flag) = 2; // WTF?
}
void reader(Payload p, atomic<int> *flag) {
  while (!flag->load(acquire)) ;
  p->process();
}

In the above example p->process() isn't guaranteed to see the latest state from p. Yes the code is already UB, but we don't need to be silly.

Seen another way, replace reinterpret_cast with a flag->store(2, relaxed) and then we're well-defined, you can eliminate the store release, but you have to "sink" the release to the later store.

159 ↗(On Diff #42224)

Can you check that 3 is the value stored here? We should probably try to avoid unexpected things even when we're allowed to do the unexpected (and document that we're doing this).

170 ↗(On Diff #42224)

Why not just:

; CHECK-LABEL: @test27
; CHECK-NEXT: store atomic i32 3, i32* %p1 release
; CHECK-NEXT: ret

?

177 ↗(On Diff #42224)

This test isn't testing what the documentation says it is.

182 ↗(On Diff #42224)

Ditto on CHECK.

Responding to JF.

test/Transforms/EarlyCSE/atomics.ll
131 ↗(On Diff #42224)

Two problems with this:

  1. Your use of a release store is substantially different than what's being modified and tested here. In particular, a release store implies ordering where an unordered atomic does not. As a result, the basis for your example collapses. (If we were implementing ordered atomics here, it would be a problem.) -- Note I actually agree with your point that we'd have to keep the release store in the modified example. I believe I even said that in a comment in the code.
  2. The execution of the non-atomic store to a racy location is undefined. You are *not* guaranteed that it publishes a only non-zero value. Consider the constant value 0x01 being stored over the value 0x10. Pretend for a moment we see bit level tearing. It's perfectly legal for the memory to transition from 0x10 to 0x00 to 0x01.
159 ↗(On Diff #42224)

Ah, good catch. And actually, it's required to be 3. Storing 1 would not be valid. There could be some later well synchronized operation with a dependent read on another thread. It's only allowed to observe '3'.

170 ↗(On Diff #42224)

Would also work. I was trying to avoid testing irrelevant details so that if someone tweaked the test, it wouldn't spuriously fail. e.g. I could increase the alignment or change the values.

jfb added inline comments.Dec 10 2015, 12:54 PM
test/Transforms/EarlyCSE/atomics.ll
131 ↗(On Diff #42224)

Agreed. I think unordered is a silly example, but I'd like to make sure that if an enthusiastic person tries to optimize this code further then they know that they can't generalize the optimization as-is. I was suggesting that you keep the atomic ordering just because we don't really need to remove it, but your point about tearing leads me the change my mind (bad code is bad, no need to try too hard).

I'd be OK if you added a negative test as I illustrated (release store followed by unordered store) along with an explanation of which outcome is valid: you have to keep the release ordering, but which value gets stored is undefined since the unordered store can move above the release one.

170 ↗(On Diff #42224)

I'd like to make sure 3 is still the value stored by this.

reames updated this revision to Diff 42945.Dec 15 2015, 5:22 PM
reames edited edge metadata.

update per JF's comments

@JF - How does the new version look?

test/Transforms/EarlyCSE/atomics.ll
189 ↗(On Diff #42945)

done as @test29

228 ↗(On Diff #42945)

This turned out to be amusing. We hadn't actually implemented the optimization this was testing and the weaker check was spuriously passing. Thanks for pushing me on this.

The problem is that we have a release store marked as potentially reading memory which trips up:
if (Inst->mayReadFromMemory() &&

  !(MemInst.isValid() && !MemInst.mayReadFromMemory()))
LastStore = nullptr;

I'm going to mark these as missed opts for the moment, and will likely revisit this at some future point.

jfb added a comment.Dec 16 2015, 10:44 AM

Two comment edits, then lgtm.

lib/Transforms/Scalar/EarlyCSE.cpp
718 ↗(On Diff #42945)

I'd clarify: "there's no special requirement for unordered stores about removing atomic stores only".

753 ↗(On Diff #42945)

Could you also mention that we could "sink" the atomicity to all post-dominating stores? This is just documentation, I don't advocate that we do so at the moment since it's similar to @morisset's fence elimination proposal.

This revision was automatically updated to reflect the committed changes.