This is an archive of the discontinued LLVM Phabricator instance.

[DSE] Remove stores in the same loop iteration
ClosedPublic

Authored by dmgreen on Apr 14 2021, 4:15 AM.

Details

Summary

DSE will currently only remove stores in the same block unless they can be guaranteed to be loop invariant. This expands that to any stores that are in the same Loop, at the same loop level. I believe this should still account for where AA/MSSA will not handle aliasing between loops, but allow the dead stores to be removed where they overlap in the same loop iteration. It requires adding loop info to DSE, but that looks fairly harmless.

The test case this helps is from code like this, which can come up in certain matrix operations:

  for(i=..)
    dst[i] = 0;
    for(j=..)
	  dst[i] += src[i*n+j];

After LICM, this becomes:

for(i=..)
  dst[i] = 0;
  sum = 0;
  for(j=..)
    sum += src[i*n+j];
  dst[i] = sum;

The first store is dead, but is not currently removed.

Diff Detail

Event Timeline

dmgreen created this revision.Apr 14 2021, 4:15 AM
dmgreen requested review of this revision.Apr 14 2021, 4:15 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 14 2021, 4:15 AM
fhahn added a comment.Apr 14 2021, 7:56 AM

Thanks for the patch! Given that this only shifts the invocation of LI, there shouldn't be any problems in terms of compile time.

I'm not sure how well the loop cases are covered in multiblock-loops.ll, but I think it might be worth to add a few more interesting cases that we can and can't optimize, especially with stores to different indices.

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1308

Can LI just be part of DSEState?

2048

Should also preserve LoopAnalysis?

nikic added inline comments.Apr 14 2021, 12:44 PM
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1424–1425

I'm not sure this logic is correct. The problem I see is that LoopInfo only tracks natural loops. This means that even though Current and KillingDef might be in the same natural loop (say the nullptr loop), Current might still be part of a (non-natural) cycle, in which case KillingDef may not kill it.

I'm not particularly familiar with LoopInfo, but that's my understanding of how it works.

fhahn added inline comments.Apr 15 2021, 2:09 AM
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1424–1425

I'm not sure this logic is correct. The problem I see is that LoopInfo only tracks natural loops. This means that even though Current and KillingDef might be in the same natural loop (say the nullptr loop), Current might still be part of a (non-natural) cycle, in which case KillingDef may not kill it.

I'm not particularly familiar with LoopInfo, but that's my understanding of how it works.

Yes I think that's correct. There might be cycles that LoopInfo does not detect. So we should not rely on comparing the loops on its own.

If there is no loop for a given block, they still might be in a cycle that has not been detected by LoopInfo due to irreducible control flow; if there's a loop for a given block, it could be in a cycle that's completely contained in the loop we found. If LI finds a loop for a block, I think it is guaranteed that the loop is only entered through the header though.

I think this should allow us to treat the phis in the header as being in the same iteration or invariant in any undetected cycles in the loop.

dmgreen updated this revision to Diff 338014.Apr 16 2021, 12:53 AM
dmgreen marked 2 inline comments as done.

Add a ContainsIrreducibleLoops flag, Move LoopInfo and add some irreducible tests.

Thanks for the patch! Given that this only shifts the invocation of LI, there shouldn't be any problems in terms of compile time.

It should iterate the same amount too, so I'm hoping it's OK. (But I have no evidence one way of the other).

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1424–1425

Ah, excellent point. I had not considered irreducible loops and they are not the kind of thing that naturally comes up in testing. (Unfortunately I feel like all the testing I could try would already have been run by Florian before DSE was committed, and it didn't show anything then just as it would not show any problems now. csmith didn't feel like much help). I was originally more worried about opening the door to any number of unrelated issues coming up from allowing DSE from different blocks where both LI's return nullptr.

For the moment, I have added a check using mayContainIrreducibleControl on the function and bailing if there are any. What do you think? Too much of a blunt hammer, or an OK way to handle this?

nikic added a comment.Apr 16 2021, 9:37 AM

Could you please rebase this? The patch doesn't apply cleanly to main. (Insert rant here about Phabricator not providing 3way patches, making it impossible to resolve conflicts when applying.)

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1424–1425

I think that's a good way to handle it. Dealing with irreducible control flow in full generality is unlikely to be worthwhile.

Might want to extract this condition into a separate function, as it's getting a bit hard to read...

dmgreen updated this revision to Diff 338166.Apr 16 2021, 11:02 AM

Rebase and move the condition logic into the start of IsGuaranteedLoopInvariant.

Compile-time: http://llvm-compile-time-tracker.com/compare.php?from=da627258742ae638b813d0341f069d5b4a6bd9ae&to=6c119f84a7fa3d6ea3b82e89d482c31836d5126f&stat=instructions

There is some impact, with the largest regression being sqlite3 with ThinLTO. I expect the reason isn't any of the added infrastructure, but rather the fact that we will now perform a longer walk in many cases.

nikic requested changes to this revision.Apr 16 2021, 1:24 PM

I believe this is still not quite correct. Consider the following variation:

@x = global [10 x i16] zeroinitializer, align 1

define i16 @test(i1 %cond) {
entry: 
  br label %do.body

do.body:                             
  %i.0 = phi i16 [ 0, %entry ], [ %inc, %if.end2 ]
  %arrayidx2 = getelementptr inbounds [10 x i16], [10 x i16]* @x, i16 0, i16 %i.0 
  store i16 2, i16* %arrayidx2, align 1 ;;; store is removed
  %exitcond = icmp eq i16 %i.0, 4
  br i1 %exitcond, label %exit, label %if.end

if.end:                   
  br i1 %cond, label %do.store, label %if.end2 
   
do.store:
  store i16 3, i16* %arrayidx2, align 1
  br label %if.end2 
   
if.end2:
  %inc = add nuw nsw i16 %i.0, 1
  br label %do.body 

exit:
  store i16 1, i16* %arrayidx2, align 1
  ret i16 0    
}

The current implementation will eliminate store store i16 2.

The reason is that DSE determines eliminable stores in two phases: First, a dominating access for the killing store is found. And second, all uses of the dominating access are checked. When store i16 1 is the killing access, we will bail out due to the loop invariance check. When store i16 3 is the killing store, we will not, because they are in the same loop. The store i16 1 will be examined as a use and skipped as a fully overwriting store.

This revision now requires changes to proceed.Apr 16 2021, 1:24 PM

Sigh. I should probably have found that problem. I hadn't considered multiple stores overriding happening like that.

dmgreen updated this revision to Diff 345077.May 13 2021, 2:54 AM

Hello. OK. I'm back

This makes two changes. isOverwrite now checks IsGuaranteedLoopInvariant, returning OW_Unknown if they are not loop invariant. Otherwise we cannot trust the MustAlias that AA provides.
It also limits the IsGuaranteedLoopInvariant check from including global scope, only checking inside the same loop (or same block). This should hopefully limit the iterating on blocks that are very unlikely to be helpful.

Together these can mean we keep removing stores in loops, but does mean that certain partial alias overlaps that were removed before may no longer be. I don't believe those will come up very often though.

This removes some extra stores in the llvm test suite:

Metric: dse.NumRemainingStores
test-suite...nchmarkGame/spectral-norm.test     9.00     7.00   -22.2%
test-suite...abench/jpeg/jpeg-6a/cjpeg.test   6029.00  5709.00  -5.3%
test-suite...nsumer-jpeg/consumer-jpeg.test   6171.00  5851.00  -5.2%
test-suite...nchmarks/Misc/ReedSolomon.test   248.00   239.00   -3.6%
test-suite...s/ASC_Sequoia/AMGmk/AMGmk.test   220.00   216.00   -1.8%
test-suite...marks/Ptrdist/yacr2/yacr2.test   314.00   310.00   -1.3%
test-suite...nchmarks/McCat/18-imp/imp.test   143.00   142.00   -0.7%
test-suite...ocBench/espresso/espresso.test   1776.00  1772.00  -0.2%
test-suite...pplications/oggenc/oggenc.test   4174.00  4165.00  -0.2%
test-suite.../Benchmarks/nbench/nbench.test   1375.00  1373.00  -0.1%
test-suite...lications/ClamAV/clamscan.test   10107.00 10099.00 -0.1%
test-suite...arks/mafft/pairlocalalign.test   5967.00  5963.00  -0.1%
test-suite...nsumer-lame/consumer-lame.test   4615.00  4612.00  -0.1%
test-suite...lications/sqlite3/sqlite3.test   16485.00 16478.00 -0.0%
test-suite...marks/7zip/7zip-benchmark.test   32005.00 31993.00 -0.0%
test-suite...ications/JM/lencod/lencod.test   13650.00 13645.00 -0.0%
test-suite...ications/JM/ldecod/ldecod.test   6124.00  6122.00  -0.0%
test-suite...rks/tramp3d-v4/tramp3d-v4.test   49150.00 49144.00 -0.0%
test-suite...-typeset/consumer-typeset.test   22146.00 22144.00 -0.0%
test-suite.../Benchmarks/Bullet/bullet.test   26355.00 26354.00 -0.0%

The number of defchecks and walks generally goes up a little on average, whilst the number of override checks goes down. But different cases can act very differently. sqlite is still doing more work.

WDYT? This is hopefully conservatively correct. Let me know if you think of ways it would not be.

nikic added a comment.May 13 2021, 3:05 AM

Could you please pre-commit the move of isOverwrite? Hard to see what changed otherwise.

dmgreen updated this revision to Diff 345126.May 13 2021, 7:21 AM

I've rebased over the movement of isOverwrite into DSEState (but not committed that yet, will do soon). I've also rebased over the tests, to show the ones the improve/get worse (and that most don't change).

dmgreen updated this revision to Diff 347330.May 24 2021, 2:33 AM

Rebase onto main. Any thoughts or comments on this now?

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1050

While it ultimately shouldn't make a difference, I think this should be passing Earlier as the memory location.

1271

Doc comment is outdated now -- and the function should probably also get renamed as well.

1273

const MemoryLocation &

dmgreen updated this revision to Diff 347783.May 25 2021, 2:46 PM

Changed IsGuaranteedLoopInvariant to isGuaranteedLoopInvariant (if this wasn't what you meant, let me know, and do you have suggestions for a better name?)
Reworded comments.
Fixed Later->Earlier memref, which then needed a rejig of one of the test to move the initial memloc out of the entry block.

nikic accepted this revision.May 27 2021, 2:18 PM

LGTM

Changed IsGuaranteedLoopInvariant to isGuaranteedLoopInvariant (if this wasn't what you meant, let me know, and do you have suggestions for a better name?)

That was not what I had in mind, but after your comment update, I think this framing works :)

This revision is now accepted and ready to land.May 27 2021, 2:18 PM
fhahn accepted this revision.May 30 2021, 4:37 AM

LGTM, thanks! I think it would be good to update some of the naming/comments slightly (as per the comments) before committing.

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1048

nit: 'the same location' may be a bit confusing here. I guess it's meant as the same MemoryLocation value, but we can store to different concrete locations for locations that depend on a loop induction for example.

1279

Referring to candidates & elimination here seems a bit out-of-place given the name & doc-comment of the function. It would be good to update the function name (doesn't check for invariance any longer). It now more accurately tries to rule out cross-iteration dependencies?

dmgreen closed this revision.May 31 2021, 2:26 AM

I did some rewording, but I'm not sure it's really better than it was before. Feel free to update further if you see fit.

rG222aeb4d51a4

clin1 added a subscriber: clin1.Jun 8 2021, 12:04 PM

@dmgreen, we found that DSE is removing a non dead store in a loop with multiple backedges. File "dse-double-loop.ll" attached. The 2nd store postdominates the 1st one, but it doesn't run on every iteration, so the 1st one is still needed. We worked around it by checking numBackEdges > 1 in addition to the irreducibility check. Would you mind taking a look? I hope bugpoint did not over-reduce the test case; it's derived from "real" code in spec2000 186.crafty.

Oh so it does. Thanks for the report. I thought we had a test case for that...

I'll revert this for the time being.

clin1 added a comment.Jun 8 2021, 6:11 PM

Thank you for the quick response! Much appreciated.

dmgreen updated this revision to Diff 352395.Jun 16 2021, 5:11 AM

Reopening with adjusted code (and more test cases). The issues were related to the checks for all paths leading to the exit collectively postdominating the earlier dead store. But:

  • A store that walked back to itself was ignored, effectively killing itself. This was this issue that came up from the reproducer. It is only true if the access is loop invariant, so it now uses an extra isGuaranteedLoopInvariant() check.
  • Stores in blocks with later PO numbers were again effectively ignored. With loops they need to be handled, in this case by returning None.

The number of stores remove in the llvm test suite was almost the same as with the previous version of this patch. The total remaining stores was 3 more than the last set of numbers.

dmgreen reopened this revision.Jun 16 2021, 5:12 AM
This revision is now accepted and ready to land.Jun 16 2021, 5:12 AM
dmgreen requested review of this revision.Jun 16 2021, 5:12 AM
clin1 accepted this revision.Jun 16 2021, 11:19 AM

spec2000/186 passes now -- thanks.

This revision is now accepted and ready to land.Jun 16 2021, 11:19 AM

spec2000/186 passes now -- thanks.

Thanks for checking! We didn't see any failures here in our running of SPEC, I don't believe, so it's good to see its fixed for you with the latest version. There must be something different about the way they get run, perhaps because of the differences in architecture.

cfang added a subscriber: cfang.Nov 16 2021, 9:59 PM

We see a big performance regression caused by this patch due to register pressure increase.
We found that remove stores in the same iteration could increase
the register pressure dramatically. In the following piece of code, group 1 and group 2 stores
are eliminated and the register usage increased from 70 to 171. This is from our
critical application, I am not sure whether this DSE could estimate RP?

// group 1 stores
sw[0][j][i] = r0 * du0;
sw[1][j][i] = r0 * du1;
sw[2][j][i] = r0 * du2;

.....

#pragma unroll 8

for (int m = 0; m < 8; m++) {
  const double vs =sv[m][j];
  du0 += vs * sq[0][m][i];
  du1 += vs * sq[1][m][i];
  du2 += vs * sq[2][m][i];
}
// group 2 stores
sw[0][j][i] += r1 * du0;
sw[1][j][i] += r1 * du1;
sw[2][j][i] += r1 * du2;

......

#pragma unroll 8

for (int m = 0; m < 8; m++) {
  const double xt = st[m][k];
  dx0 += xt * ru[0][m];
  dx1 += xt * ru[1][m];
  dx2 += xt * ru[2][m];
}
// group 3 stores
sw[0][j][i] += r2 * dx0;
sw[1][j][i] += r2 * dx1;
sw[2][j][i] += r2 * dx2;

Any suggestions are welcome to resolve this issue. Thanks.

Was that code being vectorized before this patch?

Any suggestions are welcome to resolve this issue. Thanks.

Which pass is actually causing the increase in register pressure? I don't think DSE would be directly responsible, but rather some other transformation will be enabled and may go crazy. Figuring out which one and why would be a good start.

fhahn added a comment.Nov 17 2021, 1:00 AM

@dmgreen this review is still open, but the patch got recommitted again, right? Can it be closed?

Hello. Do you have a more full example where this is causing the regressions? Is it something like this https://godbolt.org/z/fW79sMhWP? Or is there more to it than that?

Like the others have said - it's likely a good thing to remove dead stores and other passes that should be handling the increase in register pressure more elegantly. Without a more complete example it's hard to say what or where though.

I guess there must be more to it than the example above - if all the loops are unrolled then all the instructions end up in the same block and DSE can remove all the stores. even before this patch.

Right. If we manually LICM'ed the load/stores in group 1 and group2 (with something like sum_i),
we will see the higher register pressure. Further, if we intentionally insert dead stores to store
the intermediate sum_is, the register pressure can be lowered. Some later passes should be responsible.

dmgreen closed this revision.Feb 15 2022, 1:04 AM

@dmgreen this review is still open, but the patch got recommitted again, right? Can it be closed?

Yep, this went in a long time ago now.