Page MenuHomePhabricator

[RFC] Cross Block DSE
ClosedPublic

Authored by karthikthecool on Jul 13 2015, 6:55 AM.

Details

Summary

Hi All,
I would like to propose a RFC on Cross Block Dead Store Elimination.

Cross Block Dead Store elimination implemented in this patch works as follows-

  1. When a Store with non local memory dependency we call "handleNonLocalStoreDeletion".
  2. In handleNonLocalStoreDeletion we traverse the predecessor blocks of the store instruction till we encounter post domination frontier and try to find a candidate store for deletion.
  3. getStoreCandidates finds out if we have a candidate store for deletion which in any of the predecessor block.
  4. If we find a candidate store we then see that the store is not used by any instruction in any of the path from the current block till block were the overwriting store is present.
  5. We call DeleteDeadInstruction for the candidate store instruction if it is safe to remove the store.

No Regressions were found in make check-all and LLVM LNT test suites.

Please let me know your valuable option on the patch.

Thanks and Regards
Karthik Bhat

Diff Detail

Repository
rL LLVM

Event Timeline

karthikthecool retitled this revision from to [RFC] Cross Block DSE.
karthikthecool updated this object.
karthikthecool added a reviewer: hfinkel.
karthikthecool added a subscriber: llvm-commits.

Attaching LLVM LNT Test results.(15 Samples)
Execution time of 2 benchmarks improved post Cross block DSE patch. Surprisingly i didn't see any compile time regression.

eastig added a subscriber: eastig.Jul 13 2015, 7:51 AM
eastig added inline comments.Jul 13 2015, 7:57 AM
lib/Transforms/Scalar/DeadStoreElimination.cpp
655 ↗(On Diff #29560)

'nullptr' is returned when BB is not well formed. Maybe it's better to 'assert' here?

Karthik,

I don't know what was a reason to deprecate the DominanceFrontier class (see comments in DominanceFrontier.h). Your PostDominanceFrontier might be deprecated for the same reason.

Hi Evgeny,
Thanks for looking into the patch. I think the reason dom frontier is deprecared is that it is costly to calculate the same.. im not sure as well though.
Im currently using postdom frontier to get the basic block boundry till which a store might overwrite a memory location making it dead. Dead store elimation without having postdom seems to be a bit tough especially given the overhead it would require without postdom.
Thanks
Karthik

First, If you want to calculate post-dom-frontier, you should use the
linear-time IDF calculator and make it work for postdominance
frontiers, and just calculate the pieces you want.

Second, this looks like a *really* expensive approach to doing this.

A. Is there a reason you aren't just using a hash table storing each
store you see in the top-down walk, so that you don't have to go
looking for stores to eliminate?
In fact, you can do better than even this:
Walk the function once, finding stores.
Store the stores in a linked list ordered by DFS in/out numbers of the
postdom tree.

For your given block, get the DFS in/out of that node in the postdom tree.
The stores you may be able to eliminate are those in the list starting
at DFS in and ending at DFS out.

Calculating the post-dom frontier just to figure out where to stop
looking seems remarkably expensive as well

(you could also just use memoryssa, once i finish my managery tasks
that have kept me busy the past month or two, and get back to llvm
work)

B. If you want complete store elimination, this is normally formulated
as a PRE-like problem on the reverse flowgraph.

Hi Daniel,
Thanks for the input. Yes this approach is definitely very expensive. I was surprised when i didn't see any compile time regressions on LLVM LNT.
Your approach of using postdom and dfs in/out seems very interesting I will try out the same and update shortly.
Thanks for your valuable inputs. Will get back with the updated patch shortly.

Thanks and Regards
Karthik Bhat

Hi Daniel,
Please find the updated patch as per your inputs. I have removed the dependency from PDT Frontier and using the DFS in/out on PDT to find the candidate store.

There are no regressions in make check-all and LLVM LNT.

I also compiled llvm with modified/unmodified clang binary to measure compile time overhead. Not much compile time overhead is observed. The rough figures are as shown below-

With Change-

real   35m42.705s
user  136m54.137s
sys    3m19.238s

Without Change-

real    35m38.747s
user   136m58.076s
sys     3m19.637s

In LLVM LNT we observe improvement in 2 benchmarks mentioned in the previous posts.

Please let me know your valuable inputs on the same.

Thanks and Regards
Karthik Bhat

hfinkel added inline comments.Jul 25 2015, 6:22 PM
lib/Transforms/Scalar/DeadStoreElimination.cpp
61 ↗(On Diff #29877)

Use a DenseSet (I assume you're not depending on any ordering from std::map because it is storing pointers).

71 ↗(On Diff #29877)

give -> given

74 ↗(On Diff #29877)

You can use a range-based for loop here.

for (const auto &I : BB) {
  ...
}
85 ↗(On Diff #29877)

Range-based for.

946 ↗(On Diff #29877)

Use SmallVector<BasicBlock *, 16> (or some number you feel is appropriate).

947 ↗(On Diff #29877)

Use DenseSet, or SmallPtrSet if the size will normally be small.

954 ↗(On Diff #29877)

You don't need to do a cast here, you can compare a StoreInst* to an Instruction*, and they'll be equal only if they point to the same object (one is a base class of the other).

991 ↗(On Diff #29877)

When you use a SmallVector instead, you can use pop_back_val().

1018 ↗(On Diff #29877)

Add space before 'Conservatively'

1032 ↗(On Diff #29877)

Maybe you should store these in a SetVector instead to make this more efficient.

1040 ↗(On Diff #29877)

finding out candidate stroes -> finding candidate stores

1054 ↗(On Diff #29877)

Range-based for?

karthikthecool marked 12 inline comments as done.

Hi Hal,
Thanks for the review. Please find the updated patch addressing review comments.
Please let me know your opinion on the same.
Thanks and Regards
Karthik Bhat

Hi All,
Update the patch to fix a failure in LLVM-LNT testcase on 32bit ubuntu. Unfortunetly i was not able to catch it on my 64bit machine.
This failure occured as we incorrectly deleted a store that was followed by a store that partially aliased it.
When seraching if a candidate store is safe we conservately mark a store as unsafe if we found a store that may/partially alias the candidate store.
This fixes failure found in LLVM LNT on 32bit ubuntu.
We still observe some 30%+ execution time improvement in MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1.

Thanks and Regards
Karthik Bhat

hfinkel edited edge metadata.Aug 3 2015, 3:16 PM

Danny, are you okay with the algorithm at this point?

lib/Transforms/Scalar/DeadStoreElimination.cpp
1001 ↗(On Diff #30807)

If I'm reading this correctly, this call to AA->alias is going to essentially re-do all of the same analysis done in the AA->getModRefInfo call above. Could we check whether I is a StoreInst first and only to one query on store instructions instead of two?

Hi Hal,
Thanks for looking into the patch. Please find my comments inline.
Regards
Karthik Bhat

lib/Transforms/Scalar/DeadStoreElimination.cpp
1001 ↗(On Diff #30807)

Hi Hal,
The first check ensures that we see all instructions that may modify/ref a pointer (this is the pointer the store to which we are planning to delete). In case we find an instruction that refers to the pointer e.g. a load we then return false indicating that the store cannot be removed.

If the instruction is a store we can safely remove the 1st store if and only if both the store and the candidate MustAlias. To detect this i had to check the alias again.
Thanks and Regards
Karthik

hfinkel added inline comments.Aug 4 2015, 8:34 AM
lib/Transforms/Scalar/DeadStoreElimination.cpp
1001 ↗(On Diff #30807)

I understand that the interface here is less than ideal, but could you not skip the getModRefInfo call entirely if I is a StoreInst, and just use the result from AA->alias instead? In other words, hoist the dyn_cast<StoreInst>-based check above the ModRef logic?

LGTM with these comments

lib/Transforms/Scalar/DeadStoreElimination.cpp
101 ↗(On Diff #30807)

Do you have a testcase for where PDT->getRootNode() is null?

because that seems broken :)

108 ↗(On Diff #30807)

So, if we only have 1 block, why are you doing the rest of the work here, instead of returning false and exiting?
:)

515 ↗(On Diff #30807)

Much like the one above, i don't understand why this PDT->getRootNode() check is here.

The PDT should always have a root node, fake or not.
If you've got cases it doesn't, it seems like we should fix PDT

1001 ↗(On Diff #30807)

As Karthik says, i'm not sure there is a way around this.
You can get better time bounds with PRE-like algorithms, but the underlying work here has to be done.

The getModRefInfo is like finding your data dependences. For stores they may not be mustalias. If they are mod/ref, they block removal because they may read or write the memory from the first store before the second store does the same thing, so you can't remove the first store.

IE given
Store of 1
Load of 1
Store of 1

You can't remove the first store.

(You could actually see if it's legal to sink everything below the second store to enable you to delete the first, but that is effectively global code motion. not hard, just a different algorithm with it's own issues. You could also see if it's possible to delete the second store instead of the first, but this is hard in the simple cross-block DSE we are doing here)

Same with
Store of 1
may-store of 1
Load of 1
Store of 1

Now, interestingly, given:

Store of 1
may-store of 1
Store of 1

You can still eliminate the first store.
In fact, i'm struggling to come up with a case where Mod without ref matters if you have a later mustalias store.
It's early, but i think it's safe to eliminate store over store with only a may-mod store in the way.

Karthik, do you have a counter-example?

If not, i'd expect the check on line 960 above to take this into account (IE be okay with nomodref and mod, but not ref or modref)

But as you can see, even if there is nothing "in the way" between the first and second store, you still have to check whether you *mustalias* the second store to know that you can remove the first one.

SSA based store-PRE algorithms basically do this "not on the fly". They build memory use chains so they don't have to walk blocks (sound familiar? :P). They still have to do the mod-ref checks and mustalias checks, however. They are just structured in a way that it is not N^2 to figure out placement.

Sigh, my review got drafted before, but submitted after you wrote
this, so it looks like a reply to this part.

You are right that you could hoist this check, IFF alias returns
must/mayalias in cases where getmodrefinfo returns mod.

Is this really guaranteed to be the case?

They are different properties, though modrefinfo is often implemented
with help from alias. In our world of synchronization instructions,
isn't it possible for ordering to cause something to not be aliased
with a store (which is about the abstract memory locations) but still
be modified by it (due to it being a release or whatever) ?

Or do we consider this aliasing in the AA interface to try to avoid
user confusion? :)

I haven't looked deeply enough to know if we truly have consistency
here. I know getModRefInfo performs checks on ordering, etc, before it
calls alias(), and i assumed those checks were actually necessary.
But, you know, not a lot of unit tests here :)

wengxt added a subscriber: wengxt.Aug 6 2015, 1:02 PM
jingyue added a subscriber: jingyue.Aug 6 2015, 1:03 PM
karthikthecool marked 2 inline comments as done.Aug 10 2015, 3:37 AM

Hi Daniel,Hal,
Thanks for the review. Please find my comments inline. I will try to upload the updated patch shortly.
Thanks and Regards
Karthik

lib/Transforms/Scalar/DeadStoreElimination.cpp
101 ↗(On Diff #30807)

Hi Daniel,
I found in the below case we were getting getRootNode as NULL so added the above check.

void foo() {
  while(1);
}
108 ↗(On Diff #30807)

Updated code to executed FindFunctionBackedges and BackEdgesMap population only if Count > 1. I think we cannot return false here because we still need to do single block dse.

karthikthecool edited edge metadata.
karthikthecool marked 2 inline comments as done.

Hi All,
Rebased patch and implemented few review comments.
Please let me know your inputs on the same.
Thanks and Regards
Karthik Bhat

I think we should just fix the PDT in a separate patch, and remove this check.

The documentation for getRootNode says:
" /// getRootNode - This returns the entry node for the CFG of the
function. If

/// this tree represents the post-dominance relations for a

function, however,

/// this root may be a node with the block == NULL.  This is the case when
/// there are multiple exit nodes from a particular function.  Consumers of
/// post-dominance information must be capable of dealing with this
/// possibility."

Note: This may be a node with *block* == NULL, not that the root node
may be null.

The problem here is in PDT building.

GenericDomTreeConstruction does this:
if (TraitsTy::child_begin(I) == TraitsTy::child_end(I))

addRoot(I);

This is wrong, obviously, since in the case of an infinite loop like
you've shown, it will have a self-loop, but it is a root because there
is no exit.

This whole mechanism of trying to detect the roots is broken, in fact.
It's essentially trying to determine reachability without a DFS, which
is not possible in general.

The *only* correct way (i'm aware of) to determine which things need
to be children of artificial exit/a root is to DFS from all clear exit
blocks, and then, for anything still unvisited, DFS from each of them,
marking them as children of the artificial exit
(you could also reparent anything you reach that was marked as a
child of artificial exit , if we want to prefer anything over the
artificial exit as a parent. But this should not be necessary for
correct answers).

Please file a bug with your testcase and cc me, and i'l update the bug
with details.

(I'll see if i can find time to prepare a patch)

Hi Daniel,
Thanks for the comments.. Filed a bug as
https://llvm.org/bugs/show_bug.cgi?id=24415 .
I too will try to have look at this issue seperatly.. Any other comments on this patch Daniel?
Thanks for your valuable time.
Regards
Karthik

Hi Karthik,
Daniel pointed me to your patch, because I'm also working on something in DeadStoreElimination: http://reviews.llvm.org/D11854

I think your patch is not directly related to what I'm doing. Nevertheless I looked at your patch and posted some comments.

Erik

lib/Transforms/Scalar/DeadStoreElimination.cpp
949 ↗(On Diff #31649)

As I understand, this loop is just to get the iterator for I.
You can get it simply by:
BasicBlock::iterator BBI(SI);

959 ↗(On Diff #31649)

I think using PtrSize here is wrong. It should be the size of the memory access. This can give you a wrong NoModRef, e.g. if I is a load of a global which has a smaller size than the pointer size.

Maybe you should just get the MemoryLocation of SI instead of Pointer+PtrSize and use AA->getModRefInfo(I, Location).

1011 ↗(On Diff #31649)

Why is a path with a backedge not safe?

karthikthecool marked 2 inline comments as done.Aug 10 2015, 10:03 PM

Hi Erik,
Thanks for looking into the patch. Please find my comments inline. Will post the updated patch shortly.
Thanks and Regards
Karthik Bhat

lib/Transforms/Scalar/DeadStoreElimination.cpp
949 ↗(On Diff #31649)

Updated the code.

959 ↗(On Diff #31649)

Thanks updated the code.

1011 ↗(On Diff #31649)

Hi Erik,
I came across the following code from MultiSource/Benchmarks/MiBench/consumer-typeset benchmark because of which i conservately added this check. The below .ll snippet is from file z36.c TrieRead function-

for.cond:                                         ; preds = %for.inc, %if.else.618
  %indvars.iv1644 = phi i64 [ %indvars.iv.next1645, %for.inc ], [ 0, %if.else.618 ]
  %j.0 = phi i32 [ %j.1, %for.inc ], [ 1, %if.else.618 ]
  %prev.0 = phi i32 [ %prev.1, %for.inc ], [ 56, %if.else.618 ]
  %arrayidx626 = getelementptr inbounds [512 x i8], [512 x i8]* %str, i64 0, i64 %indvars.iv1644
  %42 = load i8, i8* %arrayidx626, align 1, !tbaa !18
  switch i8 %42, label %if.else.636 [
    i8 0, label %for.end
    i8 45, label %for.inc
  ]

if.else.636:                                      ; preds = %for.cond
  %idxprom639 = sext i32 %j.0 to i64
  %arrayidx640 = getelementptr inbounds [512 x i8], [512 x i8]* %key, i64 0, i64 %idxprom639
  store i8 %42, i8* %arrayidx640, align 1, !tbaa !18
  %conv641 = trunc i32 %prev.0 to i8
  %inc642 = add nsw i32 %j.0, 1
  %arrayidx644 = getelementptr inbounds [512 x i8], [512 x i8]* %value, i64 0, i64 %idxprom639
  store i8 %conv641, i8* %arrayidx644, align 1, !tbaa !18
  br label %for.inc

for.inc:                                          ; preds = %for.cond, %if.else.636
  %j.1 = phi i32 [ %inc642, %if.else.636 ], [ %j.0, %for.cond ]
  %prev.1 = phi i32 [ 56, %if.else.636 ], [ 57, %for.cond ]
  %indvars.iv.next1645 = add nuw nsw i64 %indvars.iv1644, 1
  br label %for.cond

for.end:                                          ; preds = %for.cond
  %prev.0.lcssa = phi i32 [ %prev.0, %for.cond ]
  %j.0.lcssa = phi i32 [ %j.0, %for.cond ]
  %idxprom647 = sext i32 %j.0.lcssa to i64
  %arrayidx648 = getelementptr inbounds [512 x i8], [512 x i8]* %key, i64 0, i64 %idxprom647
  store i8 46, i8* %arrayidx648, align 1, !tbaa !18
  %conv649 = trunc i32 %prev.0.lcssa to i8
  %inc650 = add nsw i32 %j.0.lcssa, 1
  %arrayidx652 = getelementptr inbounds [512 x i8], [512 x i8]* %value, i64 0, i64 %idxprom647
  store i8 %conv649, i8* %arrayidx652, align 1, !tbaa !18
  %idxprom653 = sext i32 %inc650 to i64
  %arrayidx654 = getelementptr inbounds [512 x i8], [512 x i8]* %key, i64 0, i64 %idxprom653
  store i8 0, i8* %arrayidx654, align 1, !tbaa !18
  %arrayidx657 = getelementptr inbounds [512 x i8], [512 x i8]* %value, i64 0, i64 %idxprom653
  store i8 56, i8* %arrayidx657, align 1, !tbaa !18
  %add658 = add nsw i32 %j.0.lcssa, 2
  %idxprom659 = sext i32 %add658 to i64
  %arrayidx660 = getelementptr inbounds [512 x i8], [512 x i8]* %value, i64 0, i64 %idxprom659
  store i8 0, i8* %arrayidx660, align 1, !tbaa !18
  %call666 = call fastcc i32 @TrieInsert(i8* %5, i8* %6, %struct.trie_rec* %11, i8* %arraydecay, i32 %inc)
  %tobool667 = icmp eq i32 %call666, 0
  br i1 %tobool667, label %if.then.668, label %while.cond.224.backedge

Here block for.end post dominates if.else.636 and for.inc and alias analysis returns MustAlias for the pair-

store i8 %42, i8* %arrayidx640, align 1, !tbaa !18
and
store i8 46, i8* %arrayidx648, align 1, !tbaa !18

and

store i8 %conv649, i8* %arrayidx652, align 1, !tbaa !18
and
store i8 %conv641, i8* %arrayidx644, align 1, !tbaa !18

When we do a DFS from block if.else.636 to see if it is safe to delete the store we take a backedge and reach for.end were we find a MustAlias Store and hence delete the store in the block if.else.636. But this is incorrect as the store in if.else.636 is inside a loop and stores to various memory location based on the iterator which is updated in for.cond but the store in for.end: stores to a single location after the loop is executed.
I came across this issue during testing hence conservately added a check not to take backedges during DFS.
I'm not sure if there is any other way to fix this. Please let me know if you have any inputs on this. It would be of great help.
Thanks

karthikthecool marked 2 inline comments as done.

Updated patch as per Erik's comments.

hfinkel added inline comments.Aug 11 2015, 12:23 AM
lib/Transforms/Scalar/DeadStoreElimination.cpp
111 ↗(On Diff #31771)

Range-based for

946 ↗(On Diff #31771)

Line too long.

959 ↗(On Diff #31771)

Line too long.

990 ↗(On Diff #31771)

Line too long.

1017 ↗(On Diff #31771)

Remove this function and instead call:

DeadStores.count(SI)
1032 ↗(On Diff #31771)

Line is too long.

1049 ↗(On Diff #31771)

Line too long.

Sigh, my review got drafted before, but submitted after you wrote
this, so it looks like a reply to this part.

You are right that you could hoist this check, IFF alias returns
must/mayalias in cases where getmodrefinfo returns mod.

Is this really guaranteed to be the case?

I think that it should be. I'm not the best with the history here, but there's nothing in the comments on the interface that suggests that there should be a difference w.r.t. ordering/volatile of the instructions, and other clients, MemoryDependenceAnalysis for example, layer on their own checks for ordering/volatile.

They are different properties, though modrefinfo is often implemented
with help from alias. In our world of synchronization instructions,
isn't it possible for ordering to cause something to not be aliased
with a store (which is about the abstract memory locations) but still
be modified by it (due to it being a release or whatever) ?

There are certainly ordering constraints to preserve, but I don't believe that AA is the right place to enforce them. Clients can certainly bail on atomics, and should if they might disturb the ordering and can't do better, but optimizations on atomics are still possible and we don't want AA clients that can handle some optimizations on some atomics to need to work around the AA interface because it forces a conservative behavior.

So this is the wrong thread to fix this problem ;) -- but I don't think adding any new dependence on this behavior is a good thing.

Or do we consider this aliasing in the AA interface to try to avoid
user confusion? :)

I haven't looked deeply enough to know if we truly have consistency
here. I know getModRefInfo performs checks on ordering, etc, before it
calls alias(), and i assumed those checks were actually necessary.
But, you know, not a lot of unit tests here :)

Indeed I know :)

Run clang-format on updated patch and implement Hal's review comments.
I agree with Hal that we may hoist dyn_cast<StoreInst> above ModRef check. In case of store we can rely on Alias Analysis's alias to check if the stores mustalias. For other cases we can use getModRefInfo to make make sure that this instruction which is not a store does not Mod/Ref the pointer. Updated and tested the code after modification.
Please let me know your inputs on the same.

Thanks once again to all reviewers for spending their valuable time on this patch.
Regards
Karthik Bhat

Sorry guys for multiple updates on patch..:( forgot to move dyn_cast<StoreInst> above getModRefInfo in one of the place in the previous patch.
Regards
Karthik Bhat

eeckstein added inline comments.Aug 11 2015, 11:29 AM
lib/Transforms/Scalar/DeadStoreElimination.cpp
958 ↗(On Diff #31788)

This is still not correct. You are passing the size of the pointer and not of the referenced memory location.

1010 ↗(On Diff #31788)

Thanks, I see the problem.

Updated patch as per Erik's comments.
Thanks and Regards
Karthik Bhat

Rebase patch and resolve confilts.
LNT Statistics:
Number of Cross Block Stores eliminated in LLVM-LNT: 849
Number of Benchmarks improved: 2

MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1 : 38.86%
MultiSource/Benchmarks/mafft/pairlocalalign : 5.36%

No Regressions observed.

Hi Hal,Erik,Daniel,
Please let me know if you have any additional comments or if i can go ahead with committing this patch.
Thanks and Regards
Karthik Bhat

hfinkel accepted this revision.Aug 13 2015, 10:13 AM
hfinkel edited edge metadata.

LGTM.

lib/Transforms/Scalar/DeadStoreElimination.cpp
1002 ↗(On Diff #31900)

Missing space after period.

This revision is now accepted and ready to land.Aug 13 2015, 10:13 AM
This revision was automatically updated to reflect the committed changes.