This is an archive of the discontinued LLVM Phabricator instance.

[WIP][DSE] Remove memset that is overwritten by a store in a loop
Needs ReviewPublic

Authored by vdsered on Sep 4 2021, 1:45 PM.
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None
Summary
NOTE: This patch is still in progress

LLVM cannot optimize out some excessive store in loop/memset when another store in loop later writes to the same memory region

Practically, it might happen in cases like Fail to eliminate deadstore from vector resize when an array/vector is prefilled with some constant and for-loop writes something else there

These use cases basically look like this

for (int i = 0; i < N; ++i)
   P[i] = 1;
for (int i = 0; i < N; ++i)
   P[i] = 1;

or

memset(P, 0, N * sizeof(int))
for (int i = 0; i < N; ++i)
   P[i] = 1;

We have already partially solved this problem by LoopIdiomRecognize that transforms loop into memset which DSE can successfully remove like in this example

This patch should only generalize this to cases when LIR fails to transform loop with single basic block like in this example, so we cannot remove a memset because alias analysis gives up here

There are going to be subsequent patches that solve this for more cases so it is incrementally fixed

Diff Detail

Event Timeline

vdsered created this revision.Sep 4 2021, 1:45 PM
vdsered requested review of this revision.Sep 4 2021, 1:45 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 4 2021, 1:45 PM
vdsered updated this revision to Diff 371432.Sep 8 2021, 1:38 PM
  1. Added one test for a loop with several latches where we don't always overwrite the region where memory intrinsic writes to
  2. Fixed formatting
  3. Replaced typed pointers with opaque pointers (--force-opaque-pointers is used in tests)
  4. Fixed crash caused by an assertion for multiplication of operands with different types
  5. Optimized ContinuousMemoryRegion and removed one field from there
  6. Added a feature flag EnableMemIntrinsicEliminationByStores to enable/disable this feature at runtime (probably not the best naming)
vdsered updated this revision to Diff 371495.Sep 8 2021, 8:20 PM

Rebase + fixed opt-pipeline.ll tests for AMDGPU

vdsered updated this revision to Diff 372746.Sep 15 2021, 11:09 AM
vdsered edited the summary of this revision. (Show Details)
  1. Simplified code by removing unnecessary if-statements
  2. Turned AddMemRange lambda to a method
  3. memset can be eliminated when memset has constant length and loop has constant iteration count + added a test for this
vdsered updated this revision to Diff 377258.Oct 5 2021, 8:15 AM
vdsered edited the summary of this revision. (Show Details)
  • Transform loops into loop rotated simplify form
  • Use guards instead of directly finding icmps
  • Implemented optimization for loop + loop
  • Decreased the number of times when we compute range for memory intrinsics
  • Running DSE loop twice, 1 as it was before and 2 for optimization across loops

All your 3 examples - should just loopidiomrecognize just catch it and produce intrinsics? (Not sure about ordering of DSE and LoopIdiom)

Any other motivating cases/benchmarks?

Currently I see no reason to increase complexivity of DSE as LoopIdiom + DSE should work fine together to handle them.

vdsered added a comment.EditedOct 5 2021, 8:25 AM

Results for current patch for SingleSource/MultiSource are in the table below. Other don't show any different. Metric is dse.RemainingNumStores

namebaselineexperimentdiff
security-blowfish159.00160.000.6%
ReedSolomon209.00210.000.5%
paq8p2041.002044.000.1%
kc51975.0051974.00-0.0%
CLAMR20194.0020192.00-0.0%
consumer-typeset22144.0022139.00-0.0%
oggenc4229.004228.00-0.0%
make_dparser3033.003032.00-0.0%
ldecod6044.006042.00-0.0%
sim329.00328.00-0.3%
espresso1772.001766.00-0.3%
anagram120.00119.00-0.8%

I'm not sure why it removes less stores in the three tests.

vdsered added a comment.EditedOct 5 2021, 8:42 AM

Hi, @xbolva00. Are you sure that we can use loop idiom and transform arbitrary loop with a store into let's a memset?

Regardning memset, it accepts i8 type as a value. I know how it can transform a loop with stores like store i32 0, i32* %9 or e.g. store i32 286331153, i32* %9, but it'd fail on an example where we'd write store i32 1, i32* %9

xbolva00 added a subscriber: nikic.Oct 5 2021, 9:05 AM

Hi, @xbolva00. Are you sure that we can use loop idiom and transform arbitrary loop with a store into let's a memset?

Regardning memset, it accepts i8 type as a value. I know how it can transform a loop with stores like store i32 0, i32* %9 or e.g. store i32 286331153, i32* %9, but it'd fail on an example where we'd write store i32 1, i32* %9

Ah, right.

I think this would have nontrivial impact on compile time (@nikic ?) and the results from testsuite do not look so promising - I would expect more “hits” to justify (small if possible) compile time regression.

nikic added a comment.Oct 5 2021, 11:42 AM

I think this would have nontrivial impact on compile time (@nikic ?) and the results from testsuite do not look so promising - I would expect more “hits” to justify (small if possible) compile time regression.

Yes, this has a large negative effect: https://llvm-compile-time-tracker.com/compare.php?from=64eaffb613d0cb7fa7542fa48281a2e617ad8ee9&to=6e7452757e16c4260fa9a5862761a68ed778dbf9&stat=instructions

fhahn added a subscriber: fhahn.Feb 2 2022, 11:34 AM

@vdsered are you still planning on pushing this forward?

I think this would have nontrivial impact on compile time (@nikic ?) and the results from testsuite do not look so promising - I would expect more “hits” to justify (small if possible) compile time regression.

Yes, this has a large negative effect: https://llvm-compile-time-tracker.com/compare.php?from=64eaffb613d0cb7fa7542fa48281a2e617ad8ee9&to=6e7452757e16c4260fa9a5862761a68ed778dbf9&stat=instructions

Agreed that the compile-time impact looks very large, but I *think* it may not be as bad as it looks at the moment. There seems to be a bug in the code that means we effectively run the main DSE loop twice if EnableOptimizationAcrossLoops = false (as in this patch).

Thee patch unconditionally runs the whole main DSE loop again, as below. prepareStateForAcrossLoopOptimization has en early exit if !EnableOptimizationAcrossLoops. Otherwise clears MemDeps and roughly only adds MemDeps with loop ranges. So as a consequence, if !EnableOptimizationAcrossLoops we process *all* MemoryDeps again.

MadeChange |= State.prepareStateForAcrossLoopOptimization();
MadeChange |= runDSEOptimizationLoop(State);

I tried to rebase the patches and collect compile-time numbers with that issue fixed: https://llvm-compile-time-tracker.com/compare.php?from=f622c7b7d33b211517d8fe4f725d1028d786fc08&to=00a8810b123e606c19d9926d11183318323a8752&stat=instructions

NewPM-O3: +0.24%
NewPM-ReleaseThinLTO: +0.39%
NewPM-ReleaseLTO-g: +0.26%

We should still see if we can get those down further, but I think they look more encouraging to start with.

@vdsered are you still planning on pushing this forward?

I think this would have nontrivial impact on compile time (@nikic ?) and the results from testsuite do not look so promising - I would expect more “hits” to justify (small if possible) compile time regression.

Yes, this has a large negative effect: https://llvm-compile-time-tracker.com/compare.php?from=64eaffb613d0cb7fa7542fa48281a2e617ad8ee9&to=6e7452757e16c4260fa9a5862761a68ed778dbf9&stat=instructions

Agreed that the compile-time impact looks very large, but I *think* it may not be as bad as it looks at the moment. There seems to be a bug in the code that means we effectively run the main DSE loop twice if EnableOptimizationAcrossLoops = false (as in this patch).

Thee patch unconditionally runs the whole main DSE loop again, as below. prepareStateForAcrossLoopOptimization has en early exit if !EnableOptimizationAcrossLoops. Otherwise clears MemDeps and roughly only adds MemDeps with loop ranges. So as a consequence, if !EnableOptimizationAcrossLoops we process *all* MemoryDeps again.

MadeChange |= State.prepareStateForAcrossLoopOptimization();
MadeChange |= runDSEOptimizationLoop(State);

I tried to rebase the patches and collect compile-time numbers with that issue fixed: https://llvm-compile-time-tracker.com/compare.php?from=f622c7b7d33b211517d8fe4f725d1028d786fc08&to=00a8810b123e606c19d9926d11183318323a8752&stat=instructions

NewPM-O3: +0.24%
NewPM-ReleaseThinLTO: +0.39%
NewPM-ReleaseLTO-g: +0.26%

We should still see if we can get those down further, but I think they look more encouraging to start with.

@fhahn Yes, I do

Thank you for this analysis.

I think there are potential enhancements. For example, loop transformations shouldn't probably be done in this pass. It'd be probably better move this whole optimization into a specialized pass and run it closer to other loop passes in the default pipeline because loops'd be in the right form for free or loop deletion would remove loops that become empty after DSE and so on. However, I'm still not sure if this is a better idea than implementing it right here in DSE.

Plus, it'd be good to see how this behaves on larger projects (LLVM itself and so on).

fhahn added a comment.Feb 3 2022, 12:25 AM

snip

I think there are potential enhancements. For example, loop transformations shouldn't probably be done in this pass. It'd be probably better move this whole optimization into a specialized pass and run it closer to other loop passes in the default pipeline because loops'd be in the right form for free or loop deletion would remove loops that become empty after DSE and so on. However, I'm still not sure if this is a better idea than implementing it right here in DSE.

DSE shouldn't rotate loops, yes! Not sure if moving it to a separate pass is necessary to start with. The first set of loop optimizations should already be completed before DSE runs.

Plus, it'd be good to see how this behaves on larger projects (LLVM itself and so on).

Sure. Another interesting data point would be the impact with variable auto-init enabled. Also, another motivating case was raised as issue https://github.com/llvm/llvm-project/issues/53473

vdsered updated this revision to Diff 406195.EditedFeb 5 2022, 10:35 AM
  1. Rebased
  2. If EnableOptimizationAcrossLoops is false, then DSE does not run twice
  3. Added two more negative tests cases
  4. Removed loop transformations from this pass
  5. Skip killing store which is not guaranteed to execute
vdsered updated this revision to Diff 450613.Aug 7 2022, 3:16 AM
vdsered retitled this revision from [WIP][DSE] Memory intrinsics like memset, memcpy, memmove are removed if they are overwritten by a store in a loop to [WIP][DSE] Remove memset that is overwritten by a store in a loop.
vdsered edited the summary of this revision. (Show Details)
  • Rebase
  • Simplified this patch
Herald added a project: Restricted Project. · View Herald TranscriptAug 7 2022, 3:16 AM