Page MenuHomePhabricator

[DSE] Remove stores in the same loop iteration
Needs ReviewPublic

Authored by dmgreen on Wed, Apr 14, 4:15 AM.

Details

Summary

DSE will currently only remove stores in the same block unless they can be guaranteed to be loop invariant. This expands that to any stores that are in the same Loop, at the same loop level. I believe this should still account for where AA/MSSA will not handle aliasing between loops, but allow the dead stores to be removed where they overlap in the same loop iteration. It requires adding loop info to DSE, but that looks fairly harmless.

The test case this helps is from code like this, which can come up in certain matrix operations:

  for(i=..)
    dst[i] = 0;
    for(j=..)
	  dst[i] += src[i*n+j];

After LICM, this becomes:

for(i=..)
  dst[i] = 0;
  sum = 0;
  for(j=..)
    sum += src[i*n+j];
  dst[i] = sum;

The first store is dead, but is not currently removed.

Diff Detail

Event Timeline

dmgreen created this revision.Wed, Apr 14, 4:15 AM
dmgreen requested review of this revision.Wed, Apr 14, 4:15 AM
Herald added a project: Restricted Project. · View Herald TranscriptWed, Apr 14, 4:15 AM
fhahn added a comment.Wed, Apr 14, 7:56 AM

Thanks for the patch! Given that this only shifts the invocation of LI, there shouldn't be any problems in terms of compile time.

I'm not sure how well the loop cases are covered in multiblock-loops.ll, but I think it might be worth to add a few more interesting cases that we can and can't optimize, especially with stores to different indices.

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1324

Can LI just be part of DSEState?

2063

Should also preserve LoopAnalysis?

nikic added inline comments.Wed, Apr 14, 12:44 PM
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1432–1434

I'm not sure this logic is correct. The problem I see is that LoopInfo only tracks natural loops. This means that even though Current and KillingDef might be in the same natural loop (say the nullptr loop), Current might still be part of a (non-natural) cycle, in which case KillingDef may not kill it.

I'm not particularly familiar with LoopInfo, but that's my understanding of how it works.

fhahn added inline comments.Thu, Apr 15, 2:09 AM
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1432–1434

I'm not sure this logic is correct. The problem I see is that LoopInfo only tracks natural loops. This means that even though Current and KillingDef might be in the same natural loop (say the nullptr loop), Current might still be part of a (non-natural) cycle, in which case KillingDef may not kill it.

I'm not particularly familiar with LoopInfo, but that's my understanding of how it works.

Yes I think that's correct. There might be cycles that LoopInfo does not detect. So we should not rely on comparing the loops on its own.

If there is no loop for a given block, they still might be in a cycle that has not been detected by LoopInfo due to irreducible control flow; if there's a loop for a given block, it could be in a cycle that's completely contained in the loop we found. If LI finds a loop for a block, I think it is guaranteed that the loop is only entered through the header though.

I think this should allow us to treat the phis in the header as being in the same iteration or invariant in any undetected cycles in the loop.

dmgreen updated this revision to Diff 338014.Fri, Apr 16, 12:53 AM
dmgreen marked 2 inline comments as done.

Add a ContainsIrreducibleLoops flag, Move LoopInfo and add some irreducible tests.

Thanks for the patch! Given that this only shifts the invocation of LI, there shouldn't be any problems in terms of compile time.

It should iterate the same amount too, so I'm hoping it's OK. (But I have no evidence one way of the other).

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1432–1434

Ah, excellent point. I had not considered irreducible loops and they are not the kind of thing that naturally comes up in testing. (Unfortunately I feel like all the testing I could try would already have been run by Florian before DSE was committed, and it didn't show anything then just as it would not show any problems now. csmith didn't feel like much help). I was originally more worried about opening the door to any number of unrelated issues coming up from allowing DSE from different blocks where both LI's return nullptr.

For the moment, I have added a check using mayContainIrreducibleControl on the function and bailing if there are any. What do you think? Too much of a blunt hammer, or an OK way to handle this?

nikic added a comment.Fri, Apr 16, 9:37 AM

Could you please rebase this? The patch doesn't apply cleanly to main. (Insert rant here about Phabricator not providing 3way patches, making it impossible to resolve conflicts when applying.)

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1432–1434

I think that's a good way to handle it. Dealing with irreducible control flow in full generality is unlikely to be worthwhile.

Might want to extract this condition into a separate function, as it's getting a bit hard to read...

dmgreen updated this revision to Diff 338166.Fri, Apr 16, 11:02 AM

Rebase and move the condition logic into the start of IsGuaranteedLoopInvariant.

Compile-time: http://llvm-compile-time-tracker.com/compare.php?from=da627258742ae638b813d0341f069d5b4a6bd9ae&to=6c119f84a7fa3d6ea3b82e89d482c31836d5126f&stat=instructions

There is some impact, with the largest regression being sqlite3 with ThinLTO. I expect the reason isn't any of the added infrastructure, but rather the fact that we will now perform a longer walk in many cases.

nikic requested changes to this revision.Fri, Apr 16, 1:24 PM

I believe this is still not quite correct. Consider the following variation:

@x = global [10 x i16] zeroinitializer, align 1

define i16 @test(i1 %cond) {
entry: 
  br label %do.body

do.body:                             
  %i.0 = phi i16 [ 0, %entry ], [ %inc, %if.end2 ]
  %arrayidx2 = getelementptr inbounds [10 x i16], [10 x i16]* @x, i16 0, i16 %i.0 
  store i16 2, i16* %arrayidx2, align 1 ;;; store is removed
  %exitcond = icmp eq i16 %i.0, 4
  br i1 %exitcond, label %exit, label %if.end

if.end:                   
  br i1 %cond, label %do.store, label %if.end2 
   
do.store:
  store i16 3, i16* %arrayidx2, align 1
  br label %if.end2 
   
if.end2:
  %inc = add nuw nsw i16 %i.0, 1
  br label %do.body 

exit:
  store i16 1, i16* %arrayidx2, align 1
  ret i16 0    
}

The current implementation will eliminate store store i16 2.

The reason is that DSE determines eliminable stores in two phases: First, a dominating access for the killing store is found. And second, all uses of the dominating access are checked. When store i16 1 is the killing access, we will bail out due to the loop invariance check. When store i16 3 is the killing store, we will not, because they are in the same loop. The store i16 1 will be examined as a use and skipped as a fully overwriting store.

This revision now requires changes to proceed.Fri, Apr 16, 1:24 PM

Sigh. I should probably have found that problem. I hadn't considered multiple stores overriding happening like that.

dmgreen updated this revision to Diff 345077.Thu, May 13, 2:54 AM

Hello. OK. I'm back

This makes two changes. isOverwrite now checks IsGuaranteedLoopInvariant, returning OW_Unknown if they are not loop invariant. Otherwise we cannot trust the MustAlias that AA provides.
It also limits the IsGuaranteedLoopInvariant check from including global scope, only checking inside the same loop (or same block). This should hopefully limit the iterating on blocks that are very unlikely to be helpful.

Together these can mean we keep removing stores in loops, but does mean that certain partial alias overlaps that were removed before may no longer be. I don't believe those will come up very often though.

This removes some extra stores in the llvm test suite:

Metric: dse.NumRemainingStores
test-suite...nchmarkGame/spectral-norm.test     9.00     7.00   -22.2%
test-suite...abench/jpeg/jpeg-6a/cjpeg.test   6029.00  5709.00  -5.3%
test-suite...nsumer-jpeg/consumer-jpeg.test   6171.00  5851.00  -5.2%
test-suite...nchmarks/Misc/ReedSolomon.test   248.00   239.00   -3.6%
test-suite...s/ASC_Sequoia/AMGmk/AMGmk.test   220.00   216.00   -1.8%
test-suite...marks/Ptrdist/yacr2/yacr2.test   314.00   310.00   -1.3%
test-suite...nchmarks/McCat/18-imp/imp.test   143.00   142.00   -0.7%
test-suite...ocBench/espresso/espresso.test   1776.00  1772.00  -0.2%
test-suite...pplications/oggenc/oggenc.test   4174.00  4165.00  -0.2%
test-suite.../Benchmarks/nbench/nbench.test   1375.00  1373.00  -0.1%
test-suite...lications/ClamAV/clamscan.test   10107.00 10099.00 -0.1%
test-suite...arks/mafft/pairlocalalign.test   5967.00  5963.00  -0.1%
test-suite...nsumer-lame/consumer-lame.test   4615.00  4612.00  -0.1%
test-suite...lications/sqlite3/sqlite3.test   16485.00 16478.00 -0.0%
test-suite...marks/7zip/7zip-benchmark.test   32005.00 31993.00 -0.0%
test-suite...ications/JM/lencod/lencod.test   13650.00 13645.00 -0.0%
test-suite...ications/JM/ldecod/ldecod.test   6124.00  6122.00  -0.0%
test-suite...rks/tramp3d-v4/tramp3d-v4.test   49150.00 49144.00 -0.0%
test-suite...-typeset/consumer-typeset.test   22146.00 22144.00 -0.0%
test-suite.../Benchmarks/Bullet/bullet.test   26355.00 26354.00 -0.0%

The number of defchecks and walks generally goes up a little on average, whilst the number of override checks goes down. But different cases can act very differently. sqlite is still doing more work.

WDYT? This is hopefully conservatively correct. Let me know if you think of ways it would not be.

nikic added a comment.Thu, May 13, 3:05 AM

Could you please pre-commit the move of isOverwrite? Hard to see what changed otherwise.

dmgreen updated this revision to Diff 345126.Thu, May 13, 7:21 AM

I've rebased over the movement of isOverwrite into DSEState (but not committed that yet, will do soon). I've also rebased over the tests, to show the ones the improve/get worse (and that most don't change).