This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
lib/Transforms/Scalar/
-
Transforms/
-
Scalar/
7
EarlyCSE.cpp
-
test/Transforms/EarlyCSE/
-
Transforms/
-
EarlyCSE/
12
fence.ll

Differential D11434

Allow value forwarding past release fences in EarlyCSE
ClosedPublic

Authored by reames on Jul 22 2015, 3:22 PM.

Download Raw Diff

Details

Reviewers

dvyukov
• dberlin
nicholas
jfb
hfinkel
morisset

Commits

rGdfd890dd3a0f: Allow value forwarding past release fences in EarlyCSE
rL246134: Allow value forwarding past release fences in EarlyCSE

Summary

A release fence acts as a publication barrier for stores within the current thread to become visible to other threads which might observe the release fence. It does not require the current thread to observe stores performed on other threads. As a result, we can allow store-load and load-store forwarding across a release fence.

We do need to make sure that stores before the fence can be eliminated even if there's an otherwise store to the same location after the fence. In theory, we could reorder the second store above the fence and *then* eliminate the former, but we can't do this if the stores are on opposite sides of the fence.

I have a similar change, D11436, for MemoryDependenceAnalysis, but I consider this change much lower risk than that one.

p.s. The LangRef indicates only atomic loads and stores are effected by fences. This patch chooses to be far more conservative then that. I'm not even really sure the LangRef's definition is helpful.

Diff Detail

Event Timeline

reames updated this revision to Diff 30410.Jul 22 2015, 3:22 PM

reames retitled this revision from to Allow value forwarding past release fences in EarlyCSE.

reames updated this object.

reames added reviewers: jfb, morisset, hfinkel, nicholas, • dberlin.

reames added a subscriber: llvm-commits.

jfb added a reviewer: dvyukov.Jul 22 2015, 4:17 PM

jfb updated this object.

jfb added inline comments.Jul 22 2015, 4:47 PM

lib/Transforms/Scalar/EarlyCSE.cpp
635	I'm not sure I understand why this isn't firing when there's an intervening release fence (`test4` below). I think it could: there was a race anyways, so screw the code and keep one of the store (storing the last value, of course)? Another thread is guaranteed to observe a store to that memory location, but they're racing with each other so the first one isn't guaranteed to ever be observed. The second store is the more hilarious one to keep, but you could just store its value before the fence? e.g.: loc = 42; std::atomic_thread_fence(std::memory_order_release); loc = 0; Becomes: std::atomic_thread_fence(std::memory_order_release); loc = 0; Or: loc = 0; std::atomic_thread_fence(std::memory_order_release); ?
test/Transforms/EarlyCSE/fence.ll
43	You can't forward `%a` and eliminate `%a2`, but you could eliminate `%a`, and replace it by `%a2` if there are no intervening uses, no?

dvyukov added inline comments.Jul 23 2015, 6:30 AM

test/Transforms/EarlyCSE/fence.ll
43	In this particular case we can do whatever we want with accesses to %addr.i. No other thread can access it without a race.
57	Is it a full test? Or is it something that we don't handle yet? I am asking because in this particular case we can do whatever we want with accesses to %addr. Since we store to %addr both before and after the fence, we can conclude that this fence is not used to synchronize accesses to %addr, so we can eliminate/combine/etc accesses to %addr.

reames added inline comments.Jul 23 2015, 3:47 PM

lib/Transforms/Scalar/EarlyCSE.cpp
635	See my inline response for test4 below.
test/Transforms/EarlyCSE/fence.ll
43	@jfb - Yes, but that's not what the current optimization tries to do. Nor does EarlyCSE have a mechanism to do that in general even for unordered loads. @dvyukov - Yes, but we need to actually prove that and my change to EarlyCSE is no where near that aggressive. And frankly, I'm really not interested in exploiting the undefined behaviour part of this. Do you have suggestions for how to restructure the test in a way which still tests EarlyCSE but avoids disallowing the optimization you pointed out? Would simply removing the first load from the check sequence be sufficient?
57	In general, I don't believe it is legal to eliminate the first store. With all memory initialized to zero, consider these two threads: T1: b = 2; a = 1; release fence b = 4; T2: while (a == 0) {} if (b == 0) abort(); In this particular case I think you're reasoning is sound because we can tell the store to a doesn't exist. (Just to be clear, your concern here is an unnecessarily restricted optimization, not a correctness concern right?) Would splitting this test into two - one with a store to 'a' which is a negative test and one as is which is marked as a missed optimization - satisfy you?

dvyukov added inline comments.Jul 24 2015, 12:04 AM

test/Transforms/EarlyCSE/fence.ll
43	I can't answer your question. I don't understand what EarlyCSE does, change description talks only about release fences but this test contains an acquire fence. I am trying to figure out what this change does by reading the tests. And the tests says that we cannot do something that we actually can do...
57	I don't understand. Your example is a one large data race. Behavior is undefined.

jfb added inline comments.Jul 24 2015, 8:58 AM

test/Transforms/EarlyCSE/fence.ll
43	I think the main problem Dmitry and I have is the comment: it says the transformation can't happen but it in fact is allowed to happen and simply ins't implemented here. You're testing the implementation knowing what it does, but your comment hints to future contributors that doing this optimization isn't valid. I'm assuming that the test can't be restructured because it would require a loop, and that's something that the current implementation won't see through. You're testing a very minimal thing instead (an overly conservative on, since it's a race!), and ensuring that the current code can't optimize it.

jfb mentioned this in D11436: Allow value forwarding past release fences in GVN.Jul 24 2015, 9:26 AM

reames added inline comments.Jul 24 2015, 10:43 AM

test/Transforms/EarlyCSE/fence.ll
43	Guys, would simply adding a comment to the test explaining the alternate path by which this could be optimized satisfy you? The code in earlycse needs to be specific to release fences. I don't know of an easy way to exercise the code to form a negative test without something of similar form. p.s. I get that there is some optimization which can recognize this as a data race. However, simple load-load forwarding as implemented in EarlyCSE is not it. Nor is likely to ever be so.

jfb added inline comments.Jul 24 2015, 10:48 AM

test/Transforms/EarlyCSE/fence.ll
43	Yes.

Adjust test comments to indicate where the negative tests are testing implementation rather than the memory model per review comments.

jfb added inline comments.Aug 26 2015, 4:01 PM

lib/Transforms/Scalar/EarlyCSE.cpp
617	Getting re-acquainted with the code: it doesn't seem like this change exercises a test?
test/Transforms/EarlyCSE/fence.ll
59	principle

reames added inline comments.Aug 26 2015, 4:31 PM

lib/Transforms/Scalar/EarlyCSE.cpp
617	Huh? Not sure what you're asking. The first two tests in the included file cover cases this triggers.
test/Transforms/EarlyCSE/fence.ll
59	Will fix in next round.

jfb added inline comments.Aug 26 2015, 5:04 PM

lib/Transforms/Scalar/EarlyCSE.cpp
617	True, they check that store/fence/load and store/fence/store are handled properly with release. What I mean is: the code changes, but all of the tests are exactly the same. If code changes but tests remains the same, then what changed? :-)

reames added inline comments.Aug 26 2015, 5:28 PM

lib/Transforms/Scalar/EarlyCSE.cpp
617	I'm still not clear what you're trying to ask. We remove a load from both test1 and test2 which is not removed without the patch. (Well, at the time I wrote it. Haven't checked today, but I don't see any obvious new additions.) Can you clarify?

jfb added inline comments.Aug 26 2015, 5:38 PM

lib/Transforms/Scalar/EarlyCSE.cpp
617	Ah... The diff is wrong :-) fence.ll is new, but the current phabricator diff shows it as modified. It doesn't show modifications to test1 and test2, so it looks like they the same as current tip-of-tree. Could you fix the diff?

Diff should be fixed. Sorry for the confusion.

OK now this lgtm! I'd forgotten that fence.ll was new, so the bad diff was *very* confusing!

This revision is now accepted and ready to land.Aug 26 2015, 6:16 PM

Closed by commit rL246134: Allow value forwarding past release fences in EarlyCSE (authored by reames). · Explain WhyAug 26 2015, 6:33 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path

Size

lib/

Transforms/

Scalar/

EarlyCSE.cpp

11 lines

test/

Transforms/

EarlyCSE/

fence.ll

73 lines

Diff 30410

lib/Transforms/Scalar/EarlyCSE.cpp

Show First 20 Lines • Show All 607 Lines • ▼ Show 20 Lines	if (CallValue::canHandle(Inst)) {
}		}

// Otherwise, remember that we have this instruction.		// Otherwise, remember that we have this instruction.
AvailableCalls.insert(		AvailableCalls.insert(
Inst, std::pair<Value *, unsigned>(Inst, CurrentGeneration));		Inst, std::pair<Value *, unsigned>(Inst, CurrentGeneration));
continue;		continue;
}		}

		// A release fence requires that all stores complete before it, but does
		// not prevent the reordering of following loads 'before' the fence. As a
		jfbUnsubmitted Not Done Reply Inline Actions Getting re-acquainted with the code: it doesn't seem like this change exercises a test? jfb: Getting re-acquainted with the code: it doesn't seem like this change exercises a test?
		reamesAuthorUnsubmitted Not Done Reply Inline Actions Huh? Not sure what you're asking. The first two tests in the included file cover cases this triggers. reames: Huh? Not sure what you're asking. The first two tests in the included file cover cases this…
		jfbUnsubmitted Not Done Reply Inline Actions True, they check that store/fence/load and store/fence/store are handled properly with release. What I mean is: the code changes, but all of the tests are exactly the same. If code changes but tests remains the same, then what changed? :-) jfb: True, they check that store/fence/load and store/fence/store are handled properly with release.
		reamesAuthorUnsubmitted Not Done Reply Inline Actions I'm still not clear what you're trying to ask. We remove a load from both test1 and test2 which is not removed without the patch. (Well, at the time I wrote it. Haven't checked today, but I don't see any obvious new additions.) Can you clarify? reames: I'm still not clear what you're trying to ask. We remove a load from both test1 and test2…
		jfbUnsubmitted Not Done Reply Inline Actions Ah... The diff is wrong :-) fence.ll is new, but the current phabricator diff shows it as modified. It doesn't show modifications to test1 and test2, so it looks like they the same as current tip-of-tree. Could you fix the diff? jfb: Ah... The diff is wrong :-) fence.ll is new, but the current phabricator diff shows it as…
		// result, we don't need to consider it as writing to memory and don't need
		// to advance the generation. We do need to prevent DSE across the fence,
		// but that's handled above.
		if (FenceInst *FI = dyn_cast<FenceInst>(Inst))
		if (FI->getOrdering() == Release) {
		assert(Inst->mayReadFromMemory() && "relied on to prevent DSE above");
		continue;
		}

// Okay, this isn't something we can CSE at all. Check to see if it is		// Okay, this isn't something we can CSE at all. Check to see if it is
// something that could modify memory. If so, our available memory values		// something that could modify memory. If so, our available memory values
// cannot be used so bump the generation count.		// cannot be used so bump the generation count.
if (Inst->mayWriteToMemory()) {		if (Inst->mayWriteToMemory()) {
++CurrentGeneration;		++CurrentGeneration;

if (MemInst.isValid() && MemInst.isStore()) {		if (MemInst.isValid() && MemInst.isStore()) {
// We do a trivial form of DSE if there are two stores to the same		// We do a trivial form of DSE if there are two stores to the same
// location with no intervening loads. Delete the earlier store.		// location with no intervening loads. Delete the earlier store.
		jfbUnsubmitted Not Done Reply Inline Actions I'm not sure I understand why this isn't firing when there's an intervening release fence (`test4` below). I think it could: there was a race anyways, so screw the code and keep one of the store (storing the last value, of course)? Another thread is guaranteed to observe a store to that memory location, but they're racing with each other so the first one isn't guaranteed to ever be observed. The second store is the more hilarious one to keep, but you could just store its value before the fence? e.g.: loc = 42; std::atomic_thread_fence(std::memory_order_release); loc = 0; Becomes: std::atomic_thread_fence(std::memory_order_release); loc = 0; Or: loc = 0; std::atomic_thread_fence(std::memory_order_release); ? jfb: I'm not sure I understand why this isn't firing when there's an intervening release fence…
		reamesAuthorUnsubmitted Not Done Reply Inline Actions See my inline response for test4 below. reames: See my inline response for test4 below.
if (LastStore) {		if (LastStore) {
ParseMemoryInst LastStoreMemInst(LastStore, TTI);		ParseMemoryInst LastStoreMemInst(LastStore, TTI);
if (LastStoreMemInst.isMatchingMemLoc(MemInst)) {		if (LastStoreMemInst.isMatchingMemLoc(MemInst)) {
DEBUG(dbgs() << "EarlyCSE DEAD STORE: " << *LastStore		DEBUG(dbgs() << "EarlyCSE DEAD STORE: " << *LastStore
<< " due to: " << *Inst << '\n');		<< " due to: " << *Inst << '\n');
LastStore->eraseFromParent();		LastStore->eraseFromParent();
Changed = true;		Changed = true;
++NumDSE;		++NumDSE;
▲ Show 20 Lines • Show All 147 Lines • Show Last 20 Lines

test/Transforms/EarlyCSE/fence.ll

This file was added.

				; RUN: opt -S -early-cse < %s \| FileCheck %s

				; We can value forward across the fence since we can (semantically)
				; reorder the following load before the fence.
				define i32 @test(i32* %addr.i) {
				; CHECK-LABEL: @test
				; CHECK: store
				; CHECK: fence
				; CHECK-NOT: load
				; CHECK: ret
				store i32 5, i32* %addr.i, align 4
				fence release
				%a = load i32, i32* %addr.i, align 4
				ret i32 %a
				}

				; Same as above
				define i32 @test2(i32* noalias %addr.i, i32* noalias %otheraddr) {
				; CHECK-LABEL: @test2
				; CHECK: load
				; CHECK: fence
				; CHECK-NOT: load
				; CHECK: ret
				%a = load i32, i32* %addr.i, align 4
				fence release
				%a2 = load i32, i32* %addr.i, align 4
				%res = sub i32 %a, %a2
				ret i32 %a
				}

				; We can not value forward across an acquire barrier since we might
				; be syncronizing with another thread storing to the same variable
				; followed by a release fence. If this thread observed the release
				; had happened, we must present a consistent view of memory at the fence.
				define i32 @test3(i32* noalias %addr.i, i32* noalias %otheraddr) {
				; CHECK-LABEL: @test3
				; CHECK: load
				; CHECK: fence
				; CHECK: load
				; CHECK: ret
				%a = load i32, i32* %addr.i, align 4
				fence acquire
				%a2 = load i32, i32* %addr.i, align 4
				jfbUnsubmitted Not Done Reply Inline Actions You can't forward `%a` and eliminate `%a2`, but you could eliminate `%a`, and replace it by `%a2` if there are no intervening uses, no? jfb: You can't forward `%a` and eliminate `%a2`, but you could eliminate `%a`, and replace it by…
				dvyukovUnsubmitted Not Done Reply Inline Actions In this particular case we can do whatever we want with accesses to %addr.i. No other thread can access it without a race. dvyukov: In this particular case we can do whatever we want with accesses to %addr.i. No other thread…
				reamesAuthorUnsubmitted Not Done Reply Inline Actions @jfb - Yes, but that's not what the current optimization tries to do. Nor does EarlyCSE have a mechanism to do that in general even for unordered loads. @dvyukov - Yes, but we need to actually prove that and my change to EarlyCSE is no where near that aggressive. And frankly, I'm really not interested in exploiting the undefined behaviour part of this. Do you have suggestions for how to restructure the test in a way which still tests EarlyCSE but avoids disallowing the optimization you pointed out? Would simply removing the first load from the check sequence be sufficient? reames: @jfb - Yes, but that's not what the current optimization tries to do. Nor does EarlyCSE have a…
				dvyukovUnsubmitted Not Done Reply Inline Actions I can't answer your question. I don't understand what EarlyCSE does, change description talks only about release fences but this test contains an acquire fence. I am trying to figure out what this change does by reading the tests. And the tests says that we cannot do something that we actually can do... dvyukov: I can't answer your question. I don't understand what EarlyCSE does, change description talks…
				jfbUnsubmitted Not Done Reply Inline Actions I think the main problem Dmitry and I have is the comment: it says the transformation can't happen but it in fact is allowed to happen and simply ins't implemented here. You're testing the implementation knowing what it does, but your comment hints to future contributors that doing this optimization isn't valid. I'm assuming that the test can't be restructured because it would require a loop, and that's something that the current implementation won't see through. You're testing a very minimal thing instead (an overly conservative on, since it's a race!), and ensuring that the current code can't optimize it. jfb: I think the main problem Dmitry and I have is the comment: it says the transformation can't…
				reamesAuthorUnsubmitted Not Done Reply Inline Actions Guys, would simply adding a comment to the test explaining the alternate path by which this could be optimized satisfy you? The code in earlycse needs to be specific to release fences. I don't know of an easy way to exercise the code to form a negative test without something of similar form. p.s. I get that there is some optimization which can recognize this as a data race. However, simple load-load forwarding as implemented in EarlyCSE is not it. Nor is likely to ever be so. reames: Guys, would simply adding a comment to the test explaining the alternate path by which this…
				jfbUnsubmitted Not Done Reply Inline Actions Yes. jfb: Yes.
				%res = sub i32 %a, %a2
				ret i32 %a
				}

				; We can NOT dead store eliminate accross the fence
				define void @test4(i32* %addr.i) {
				; CHECK-LABEL: @test4
				; CHECK: store
				; CHECK: fence
				; CHECK: store
				; CHECK: ret
				store i32 5, i32* %addr.i, align 4
				fence release
				store i32 5, i32* %addr.i, align 4
				dvyukovUnsubmitted Not Done Reply Inline Actions Is it a full test? Or is it something that we don't handle yet? I am asking because in this particular case we can do whatever we want with accesses to %addr. Since we store to %addr both before and after the fence, we can conclude that this fence is not used to synchronize accesses to %addr, so we can eliminate/combine/etc accesses to %addr. dvyukov: Is it a full test? Or is it something that we don't handle yet? I am asking because in this…
				reamesAuthorUnsubmitted Not Done Reply Inline Actions In general, I don't believe it is legal to eliminate the first store. With all memory initialized to zero, consider these two threads: T1: b = 2; a = 1; release fence b = 4; T2: while (a == 0) {} if (b == 0) abort(); In this particular case I think you're reasoning is sound because we can tell the store to a doesn't exist. (Just to be clear, your concern here is an unnecessarily restricted optimization, not a correctness concern right?) Would splitting this test into two - one with a store to 'a' which is a negative test and one as is which is marked as a missed optimization - satisfy you? reames: In general, I don't believe it is legal to eliminate the first store. With all memory…
				dvyukovUnsubmitted Not Done Reply Inline Actions I don't understand. Your example is a one large data race. Behavior is undefined. dvyukov: I don't understand. Your example is a one large data race. Behavior is undefined.
				ret void
				}
				jfbUnsubmitted Not Done Reply Inline Actions principle jfb: principle
				reamesAuthorUnsubmitted Not Done Reply Inline Actions Will fix in next round. reames: Will fix in next round.

				; We could DSE across this fence, but don't. No other thread can
				; observe the order of the acquire fence and the store.
				define void @test5(i32* %addr.i) {
				; CHECK-LABEL: @test5
				; CHECK: store
				; CHECK: fence
				; CHECK: store
				; CHECK: ret
				store i32 5, i32* %addr.i, align 4
				fence acquire
				store i32 5, i32* %addr.i, align 4
				ret void
				}