
Allow DeadStoreElimination to track combinations of partial later writes
ClosedPublic

Authored by hfinkel on Mar 29 2016, 5:41 PM.

Details

Summary

DeadStoreElimination can currently remove a small store rendered unnecessary by a later larger one, but cannot remove a larger store rendered unnecessary by a series of later smaller ones. This patch aims to rectify that.

It works by keeping an IntervalMap for each store later overwritten only partially, and filling in that interval map as more such stores are discovered. No additional walking or aliasing queries are added. If the IntervalMap forms an interval covering the entire earlier store, then it is dead and can be removed.
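As a rough model of the idea (a sketch only, with illustrative names; the patch's actual code in DeadStoreElimination.cpp differs in detail), later partial writes are merged into non-overlapping half-open byte intervals, and the earlier store is dead once a single merged interval spans it:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>

// Sketch: track which bytes of an earlier store have been overwritten by
// later, smaller stores. Intervals are half-open [start, end) and kept
// non-overlapping by merging on insert.
struct OverwriteTracker {
  std::map<int64_t, int64_t> Intervals; // start -> end

  void addLaterWrite(int64_t Start, int64_t End) {
    // Pull in any existing interval that overlaps or abuts [Start, End).
    auto It = Intervals.lower_bound(Start);
    if (It != Intervals.begin() && std::prev(It)->second >= Start)
      --It;
    while (It != Intervals.end() && It->first <= End) {
      Start = std::min(Start, It->first);
      End = std::max(End, It->second);
      It = Intervals.erase(It);
    }
    Intervals[Start] = End;
  }

  // The earlier store [EarlierStart, EarlierEnd) is dead iff one merged
  // interval covers it entirely.
  bool coversEarlier(int64_t EarlierStart, int64_t EarlierEnd) const {
    auto It = Intervals.upper_bound(EarlierStart);
    if (It == Intervals.begin())
      return false;
    --It;
    return It->first <= EarlierStart && It->second >= EarlierEnd;
  }
};
```

For example, an 8-byte store at offset 0 is covered once later 4-byte stores have filled [0, 4) and [4, 8).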

I discovered this problem when investigating a performance issue with code like this on PowerPC:

#include <complex>
using namespace std;

complex<float> bar(complex<float> C);
complex<float> foo(complex<float> C) {
  return bar(C)*C;
}

which produces this:

define void @_Z4testSt7complexIfE(%"struct.std::complex"* noalias nocapture sret %agg.result, i64 %c.coerce) {
entry:
  %ref.tmp = alloca i64, align 8
  %tmpcast = bitcast i64* %ref.tmp to %"struct.std::complex"*
  %c.sroa.0.0.extract.shift = lshr i64 %c.coerce, 32
  %c.sroa.0.0.extract.trunc = trunc i64 %c.sroa.0.0.extract.shift to i32
  %0 = bitcast i32 %c.sroa.0.0.extract.trunc to float
  %c.sroa.2.0.extract.trunc = trunc i64 %c.coerce to i32
  %1 = bitcast i32 %c.sroa.2.0.extract.trunc to float
  call void @_Z3barSt7complexIfE(%"struct.std::complex"* nonnull sret %tmpcast, i64 %c.coerce)
  %2 = bitcast %"struct.std::complex"* %agg.result to i64*
  %3 = load i64, i64* %ref.tmp, align 8
  store i64 %3, i64* %2, align 4 ; <--- ***** THIS SHOULD NOT BE HERE *****
  %_M_value.realp.i.i = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %agg.result, i64 0, i32 0, i32 0
  %4 = lshr i64 %3, 32
  %5 = trunc i64 %4 to i32
  %6 = bitcast i32 %5 to float
  %_M_value.imagp.i.i = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %agg.result, i64 0, i32 0, i32 1
  %7 = trunc i64 %3 to i32
  %8 = bitcast i32 %7 to float
  %mul_ad.i.i = fmul fast float %6, %1
  %mul_bc.i.i = fmul fast float %8, %0
  %mul_i.i.i = fadd fast float %mul_ad.i.i, %mul_bc.i.i
  %mul_ac.i.i = fmul fast float %6, %0
  %mul_bd.i.i = fmul fast float %8, %1
  %mul_r.i.i = fsub fast float %mul_ac.i.i, %mul_bd.i.i
  store float %mul_r.i.i, float* %_M_value.realp.i.i, align 4
  store float %mul_i.i.i, float* %_M_value.imagp.i.i, align 4
  ret void
}

The problem here is not just that the i64 store is unnecessary, but also that it blocks further backend optimization of the other uses of that i64 value.

For the interval map's value type, I'm currently using std::tuple<> (because it is an empty type).

Diff Detail

Repository
rL LLVM

Event Timeline

hfinkel updated this revision to Diff 52008.Mar 29 2016, 5:41 PM
hfinkel retitled this revision from to Allow DeadStoreElimination to track combinations of partial later writes.
hfinkel updated this object.
hfinkel added a subscriber: llvm-commits.

This is great! This actually fixes a problem case I ran into some months ago in the wild, where a memset wasn't being wiped away despite being fully overwritten by the stores following it. Example shown below:

%struct.foostruct = type {
  i32 (i8*, i8**, i32, i8, i8*)*,
  i32 (i8*, i8**, i32, i8, i8*)*,
  i32 (i8*, i8**, i32, i8, i8*)*,
  i32 (i8*, i8**, i32, i8, i8*)*,
  void (i8*, i32, i32)*
}
declare void @llvm.memset.p0i8.i64(i8* nocapture, i8, i64, i32, i1)
declare void @goFunc(%struct.foostruct*)

define void @func()  {
  %bang = alloca %struct.foostruct, align 8
  %1 = bitcast %struct.foostruct* %bang to i8*
  call void @llvm.memset.p0i8.i64(i8* %1, i8 0, i64 40, i32 8, i1 false)  <------    This is now removed
  %2 = getelementptr inbounds %struct.foostruct, %struct.foostruct* %bang, i64 0, i32 0
  store i32 (i8*, i8**, i32, i8, i8*)* @fa, i32 (i8*, i8**, i32, i8, i8*)** %2, align 8
  %3 = getelementptr inbounds %struct.foostruct, %struct.foostruct* %bang, i64 0, i32 1
  store i32 (i8*, i8**, i32, i8, i8*)* @fa, i32 (i8*, i8**, i32, i8, i8*)** %3, align 8
  %4 = getelementptr inbounds %struct.foostruct, %struct.foostruct* %bang, i64 0, i32 2
  store i32 (i8*, i8**, i32, i8, i8*)* @fa, i32 (i8*, i8**, i32, i8, i8*)** %4, align 8
  %5 = getelementptr inbounds %struct.foostruct, %struct.foostruct* %bang, i64 0, i32 3
  store i32 (i8*, i8**, i32, i8, i8*)* @fa, i32 (i8*, i8**, i32, i8, i8*)** %5, align 8
  %6 = getelementptr inbounds %struct.foostruct, %struct.foostruct* %bang, i64 0, i32 4
  store void (i8*, i32, i32)* null, void (i8*, i32, i32)** %6, align 8
  call void @goFunc(%struct.foostruct* %bang)
  ret void
}
mcrosier edited edge metadata.Mar 31 2016, 6:13 AM

Looks reasonable to me, but you may want to wait for additional reviews. Comments/questions inline.

lib/Transforms/Scalar/DeadStoreElimination.cpp
440 ↗(On Diff #52008)

not even -> never

442 ↗(On Diff #52008)

Should this be <= rather than <?

hfinkel added inline comments.Mar 31 2016, 9:43 AM
lib/Transforms/Scalar/DeadStoreElimination.cpp
442 ↗(On Diff #52008)

I don't think so. If LaterOff == EarlierOff + Earlier.Size, then the later store starts on the byte just after the earlier one ends.
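Put differently, the earlier store covers the half-open byte range [EarlierOff, EarlierOff + Size), so a later write whose offset equals that end touches none of the earlier store's bytes. A minimal illustration (function name made up for the example):

```cpp
#include <cassert>
#include <cstdint>

// The earlier store occupies the half-open range
// [EarlierOff, EarlierOff + EarlierSize); a later write starting at
// LaterOff lands inside it only when LaterOff < EarlierOff + EarlierSize,
// which is why < rather than <= is the right comparison.
bool laterStartsInsideEarlier(int64_t EarlierOff, int64_t EarlierSize,
                              int64_t LaterOff) {
  return LaterOff >= EarlierOff && LaterOff < EarlierOff + EarlierSize;
}
```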

dberlin edited edge metadata.Mar 31 2016, 9:55 AM
dberlin added a subscriber: dberlin.

This looks reasonable, assuming creating a bunch of IntervalMaps is not that expensive.

eeckstein added inline comments.Mar 31 2016, 11:13 AM
lib/Transforms/Scalar/DeadStoreElimination.cpp
457 ↗(On Diff #52008)

Shouldn't this be
if (ILI != IM.end() && ILI.stop() < LaterIntEnd) {

If yes, then what if the existing interval completely covers the Later-interval? I.e. if ILI.stop() >= LaterIntEnd

hfinkel added inline comments.Mar 31 2016, 12:47 PM
lib/Transforms/Scalar/DeadStoreElimination.cpp
457 ↗(On Diff #52008)

We want to know if we have an interval that overlaps with the current one. IM.find(LaterIntStart) will return the first interval that ends at or after the current one's start. Next we need to check if the interval starts before the current one's end (to see whether it overlaps the current one, or is purely after it). Checking whether it stops before the current one ends is insufficient because it will miss the case where the interval crosses the current one's end point. I'll improve the comments.
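A rough model of that check (a sorted vector of non-overlapping half-open intervals stands in for IntervalMap here; this is a sketch, not the patch's code):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using Interval = std::pair<int64_t, int64_t>; // half-open [start, stop)

// Model of IntervalMap::find(X) as described above: among sorted,
// non-overlapping intervals, return the first whose stop is at or after X,
// i.e. the first candidate to overlap a range starting at X.
const Interval *findFirstEndingAtOrAfter(const std::vector<Interval> &Sorted,
                                         int64_t X) {
  for (const Interval &I : Sorted)
    if (I.second >= X)
      return &I;
  return nullptr;
}

// The candidate reaches into [Start, End) iff it begins before End. Note
// that comparing the candidate's stop against End instead would miss an
// interval that completely contains [Start, End). Abutting intervals also
// count here, which is fine for DSE since adjacent writes get merged anyway.
bool overlapsOrAbuts(const std::vector<Interval> &Sorted, int64_t Start,
                     int64_t End) {
  const Interval *I = findFirstEndingAtOrAfter(Sorted, Start);
  return I && I->first < End;
}
```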

eeckstein accepted this revision.Mar 31 2016, 1:04 PM
eeckstein edited edge metadata.

LGTM (but please wait for other reviewers)

lib/Transforms/Scalar/DeadStoreElimination.cpp
457 ↗(On Diff #52008)

Oh, I see. I misunderstood the find() function.

This revision is now accepted and ready to land.Mar 31 2016, 1:04 PM
chandlerc edited edge metadata.Mar 31 2016, 6:56 PM

Minor comments below about the data structure, but I really wonder whether IntervalMap is the right tool for the job here. It seems awfully heavyweight, although it is clearly very efficient... And it seems to be fighting you on several fronts:

  1. You want a set, not a map.
  2. You can't insert overlapping ranges.
  3. The sparseness isn't likely to be useful as you're mostly interested in fairly small memory regions.

What about using a DenseMap<Instruction *, BitVector> and setting one bit per byte? My thoughts:

  • No complexity around overlaps.
  • No extra allocation for 64-byte and smaller regions. Pretty minimal allocation even for larger regions, up to 512 bytes no problem.
  • Super simple implementation, probably much faster for small regions to just set bits and test them.

The major downside I see is that it doesn't scale gracefully to *large* memory regions like a memset might hit...

What do you think? Maybe set an upper bound on size (512 bytes seems a good bound, keeps the BitVector on a cache line) and use BitVector? Is it worth the complexity of using a BitVector when small and an IntervalMap when large?
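The BitVector alternative being proposed might look roughly like this (a sketch with std::vector<bool> standing in for LLVM's BitVector, per-store plumbing omitted, and illustrative names throughout):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One bit per byte of the earlier store; each later write sets the bits
// for the bytes it overwrites. The earlier store is dead once every bit
// is set. In-tree this would hang off a DenseMap<Instruction *, BitVector>.
struct ByteCoverage {
  std::vector<bool> Covered;

  explicit ByteCoverage(size_t EarlierSize) : Covered(EarlierSize, false) {}

  // Mark bytes [Off, Off + Len) as overwritten, clamped to the store size.
  void markOverwritten(size_t Off, size_t Len) {
    for (size_t I = Off; I < Off + Len && I < Covered.size(); ++I)
      Covered[I] = true;
  }

  bool fullyOverwritten() const {
    for (bool B : Covered)
      if (!B)
        return false;
    return true;
  }
};
```

Setting and testing bits is linear in the region size, which is where the concern about large memset regions comes from.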

lib/Transforms/Scalar/DeadStoreElimination.cpp
348–349 ↗(On Diff #52008)

Ugh, so you really want an interval *set*. Is it too hard to build one?

444–448 ↗(On Diff #52008)

Find first, and assign the iterator on insert? It'd be particularly nice to add a try_emplace to DenseMap so that you can do this with a single lookup. =/

----- Original Message -----

From: "Chandler Carruth" <chandlerc@gmail.com>
To: hfinkel@anl.gov, echristo@gmail.com, igor@azulsystems.com, mcrosier@codeaurora.org, dberlin@dberlin.org,
eeckstein@apple.com
Cc: jvanadrighem@gmail.com, junbuml@codeaurora.org, llvm-commits@lists.llvm.org
Sent: Thursday, March 31, 2016 8:56:37 PM
Subject: Re: [PATCH] D18586: Allow DeadStoreElimination to track combinations of partial later writes

chandlerc added a comment.

Minor comments below about the data structure, but I really wonder
whether IntervalMap is the right tool for the job here. It seems
awfully heavyweight, although it is clearly very efficient... And it
seems to be fighting you on several fronts:

  1. You want a set, not a map.
  2. You can't insert overlapping ranges.
  3. The sparseness isn't likely to be useful as you're mostly interested in fairly small memory regions.

What about using a DenseMap<Instruction *, BitVector> and setting one bit per byte? My thoughts:

  • No complexity around overlaps.
  • No extra allocation for 64-byte and smaller regions. Pretty minimal allocation even for larger regions, up to 512 bytes no problem.
  • Super simple implementation, probably much faster for small regions to just set bits and test them.

The major downside I see is that it doesn't scale gracefully to *large* memory regions like a memset might hit...

What if I used a SparseBitVector? Perhaps that's the best of both worlds.

-Hal

What do you think? Maybe set an upper bound on size (512 bytes seems a good bound, keeps the BitVector on a cache line) and use BitVector? Is it worth the complexity of using a BitVector when small and an IntervalMap when large?

================
Comment at: lib/Transforms/Scalar/DeadStoreElimination.cpp:348-349
@@ -337,1 +347,4 @@

+typedef IntervalMap<int64_t, std::tuple<>, 4,
+                    IntervalMapHalfOpenInfo<int64_t>> OverlapIntervalsTy;
+typedef DenseMap<Instruction *, OverlapIntervalsTy> InstOverlapIntervalsTy;


Ugh, so you really want an interval *set*. Is it too hard to build one?

================
Comment at: lib/Transforms/Scalar/DeadStoreElimination.cpp:444-448
@@ +443,7 @@
+               int64_t(LaterOff + Later.Size) >= EarlierOff) {
+    if (!IOL.count(DepWrite))
+      IOL.insert(std::make_pair(DepWrite, OverlapIntervalsTy(OLAlloc)));
+
+    // Insert our part of the overlap into the map.
+    auto &IM = IOL.find(DepWrite)->second;
+    DEBUG(dbgs() << "DSE: Partial overwrite: Earlier [" << EarlierOff << ", " <<


Find first, and assign the iterator on insert? It'd be particularly nice to add a try_emplace to DenseMap so that you can do this with a single lookup. =/

http://reviews.llvm.org/D18586

----- Original Message -----

From: "Chandler Carruth" <chandlerc@gmail.com>
The major downside I see is that it doesn't scale gracefully to *large* memory regions like a memset might hit...

What if I used a SparseBitVector? Perhaps that's the best of both worlds.

I think not sadly, at least if I'm thinking about this right...

SparseBitVector makes setting a large range, or testing whether all bits are set, significantly more expensive than a normal BitVector when there are many bits. With either bit vector the cost is linear in the number of bits set, whereas with IntervalMap it is constant in the size of the interval, with a very nice logarithmic factor in the number of disjoint intervals.

In the worst case for the IntervalMap, we would do N inserts for N stores, and if all were disjoint, each insert would cost log(N), so the total would be N*log(N).

In the worst case for either BitVector, we would do N inserts of the same M - 1 bits for an M-byte object, which would be N * M. So when M becomes large, this will become quite slow. Being sparse doesn't really help here either, as SparseBitVector is sparse in the zero bits, not the set bits.

mcrosier resigned from this revision.May 26 2016, 6:49 AM
mcrosier removed a reviewer: mcrosier.
mcrosier removed a subscriber: mcrosier.
hfinkel updated this revision to Diff 59990.Jun 7 2016, 8:53 PM
hfinkel edited edge metadata.

Rebased, and replaced the use of IntervalMap with std::map. The reviewers are certainly right: we gain little by using an IntervalMap here (because IntervalMap does not handle overlapping inserts, so we end up doing the merging ourselves anyway).

In addition to the reviews here, I also discussed this with folks offline, and the conclusion was to go with a map or some general data structure at first. If we see a compile-time hit from this related to small stores, we can always add code to use a bitmap for small stores.

hfinkel updated this revision to Diff 59992.Jun 7 2016, 9:05 PM

Add a test case involving memset, provided in the review by JakeVanAdrighem.

Ping (Chandler, are you okay with this now?)

junbuml added inline comments.Jun 16 2016, 11:01 AM
lib/Transforms/Scalar/DeadStoreElimination.cpp
406 ↗(On Diff #59992)

Don't you need to do:

LaterIntEnd = std::max(LaterIntEnd, ILI->first);

before erasing ILI, in case ILI->first is larger than LaterIntEnd?

413 ↗(On Diff #59992)

I think there is no need to assign to ILI here.

hfinkel added inline comments.Jun 16 2016, 11:29 AM
lib/Transforms/Scalar/DeadStoreElimination.cpp
406 ↗(On Diff #59992)

Thanks for catching that.

413 ↗(On Diff #59992)

Indeed.

chandlerc accepted this revision.Jun 16 2016, 2:13 PM
chandlerc edited edge metadata.

Rebased, and replaced the use of IntervalMap with std::map. The reviewers are certainly right: we gain little by using an IntervalMap here (because IntervalMap does not handle overlapping inserts, so we end up doing the merging ourselves anyway).

In addition to the reviews here, I also discussed this with folks offline, and the conclusion was to go with a map or some general data structure at first. If we see a compile-time hit from this related to small stores, we can always add code to use a bitmap for small stores.

I think I'm OK with this for now. I like the idea of using a simpler data structure at first. But please add a comment to the data structure that explains the semantics (especially the surprising {end, start} representation; I had to read a fair amount of code to even see this) and captures a lot of the discussion we've had about the tradeoffs between different data structures, in case you or I aren't the ones to try to improve this later.
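For reference, the {end, start} scheme can be sketched as follows (a simplified reconstruction from this discussion, not the committed code): keying the map by interval end means lower_bound(Start) lands on the first existing interval whose end reaches the new interval's start, i.e. the first merge candidate.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>

// Intervals stored as end -> start (half-open [start, end)), per the
// {end, start} representation discussed in the review.
using OverlapIntervals = std::map<int64_t, int64_t>;

void insertInterval(OverlapIntervals &IM, int64_t Start, int64_t End) {
  auto ILI = IM.lower_bound(Start);
  // Absorb every existing interval that overlaps or abuts [Start, End),
  // widening both ends as needed before re-inserting the union.
  while (ILI != IM.end() && ILI->second <= End) {
    Start = std::min(Start, ILI->second);
    End = std::max(End, ILI->first);
    ILI = IM.erase(ILI);
  }
  IM[End] = Start;
}

// The earlier store [EStart, EEnd) is dead iff a single interval covers it.
bool covers(const OverlapIntervals &IM, int64_t EStart, int64_t EEnd) {
  auto ILI = IM.lower_bound(EEnd);
  return ILI != IM.end() && ILI->second <= EStart && ILI->first >= EEnd;
}
```

Widening End via std::max(End, ILI->first) is exactly the case junbuml's inline comment caught: an absorbed interval may extend past the new write's end.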

Everything below is further thoughts that might be distilled into comments, not anything I'm asking for right now:

My suspicion is that a sorted vector would work better than a map here. Among other things, it is very easy to have the intervals be represented in a natural way within the vector, and merely use a custom comparison to achieve the sorting behavior you want. Additionally, the overhead for the std::map will be roughly the same as the value size which is always an unfortunate tradeoff.

But as I said I'm happy for you to look at this as a follow-up patch. It shouldn't block the review as the code will be largely the same at this point, and the current version is almost certainly the simplest and most basic approach so it seems a good starting position.

From the discussion, I suspect the eventual end state will be to use a BitVector for "small" regions and a sorted vector for "large" regions. But that will definitely be quite complex, so I wouldn't want to go there until we have some useful test cases that show serious compile-time hits here.

Thanks again,
-Chandler

This revision was automatically updated to reflect the committed changes.