Page MenuHomePhabricator

[WIP][DSE] Memory intrinsics like memset, memcpy, memmove are removed if they are overwritten by a store in a loop
Needs ReviewPublic

Authored by vdsered on Sep 4 2021, 1:45 PM.
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None
Summary
NOTE: This patch is still in progress

The patch would allow DSE to remove more instructions that overwrite a specific region of memory. The examples below demonstrate some patterns that this patch is intended for

  1. memory intrinsic + loop
memset(Ptr, 0, N * sizeof(type of element));
for (int i = 0; i < N; ++i)
     Ptr[i] = 1;

memset is obviously overwritten and can be eliminated

  1. loop + loop
for (int i = 0; i < N; ++i)
     Ptr[i] = 1;
for (int i = 0; i < N; ++i)
     Ptr[i] = 1;

First for-loop is redundant because it is write to the same memory region as the second for-loop

  1. loop + memory intrinsic
for (int i = 0; i < N; ++i)
     Ptr[i] = 1;
memset(Ptr, 0, N * sizeof(type of element));

In all of examples there is a region of memory that we write to twice.

There are other ways to optimize this cases:

  1. If we'd fuse two loops for the second case, then LLVM removed one of the stores, but loop fusion has its own limitation like loops being adjacent and this seems to be too limiting + it doesn't help us when we work with the first case + it is not included in the default pipeline yet
  1. DSE can actually remove instructions like memset when they all are in the same basic block, but it seems to be lack of ability to remove it when they are in different basic blocks and even if it could optimize it, it would not help us with a case loop + loop or memory intrinsic + loop because Alias Analysis does not account for loops. Plus, there is LoopIdiomRecognize that can transform a loop -> memory intrinsic, but it is not always possible to convert a loop to a memory

Implementation description:

At first, we run the default DSE's loop to optimize as many stores as it can because it will be useful at the step when we will remove stores across loop. For example,
for (..)

P[i] = 1;
...
P[i] = 2;

...
for (...)

P[i] = 3;

DSE can remove P[i] = 1 because it is in the same iteration P[i] = 2 overwrites it (see %loop.1 on godbolt). We could probably consider both stores in first loop as candidates to be removed by P[i] = 3, but it is better to consider only P[i] = 2 so let DSE remove stores in the same loop iteration at first.

At second, we collect information about stores in loops. Particularly, we compute memory range [Ptr, Ptr + N * sizeof(type of element)] to which it writes. Plus, we check if each store fits some requirements:

  1. It is not atomic or volatile
  2. It may not throw
  3. Store size == stride in this loop
  4. It is write an integer or float

Some of them may be relaxed

For each loops where we find such a store we do the following things:

  1. Check if any block of this loop may throw
  2. Transform it to rotated simplify form which is required to find guard for now, but we could use it to unswitch loop in the future if it has any advantages and so on

Next, we run the default DSE's loop over the list of the stores trying to eliminate them. However, two things are different

  1. We don't use AA here rather comparing memory ranges we computed earlier
  2. We add guard (if there is one) as a killing block in getDomMemoryDef instead of the block of where the store is

Consider this code where %guard block points to %preheader2 for second loop and %exit. If we go to %preheader2 then store in %loop.2 overwrites store in %loop.1. On the other hand, if we go to %exit then that would mean that first loop is never executed. This seems to be reasonable to consider %guard as a killing block even when there is no actually store in it.

define void @foo(ptr %ptr, i32 %n) {
entry:
  %cond = icmp sgt i32 %n, 0
  br i1 %cond, label %preheader1, label %exit

preheader1:                                       ; preds = %entry
  %n.zext1 = zext i32 %n to i64
  br label %loop.1

guard:                                            ; preds = %loop.1
  br i1 %cond, label %preheader2, label %exit

preheader2:                                       ; preds = %guard
  %n.zext2 = zext i32 %n to i64
  br label %loop.2

loop.1:                                           ; preds = %loop.1, %preheader1
; 5 = MemoryPhi({preheader1,liveOnEntry},{loop.1,1})
  %loop.1.iv = phi i64 [ 0, %preheader1 ], [ %loop.1.iv.1, %loop.1 ]
  %loop.1.idx = getelementptr inbounds i32, ptr %ptr, i64 %loop.1.iv
; 1 = MemoryDef(5)
  store i32 2, ptr %loop.1.idx, align 4
  %loop.1.iv.1 = add nuw nsw i64 %loop.1.iv, 1
  %loop.1.cond = icmp eq i64 %loop.1.iv.1, %n.zext1
  br i1 %loop.1.cond, label %guard, label %loop.1

exit:                                             ; preds = %loop.2, %guard, %entry
; 3 = MemoryPhi({entry,liveOnEntry},{guard,1},{loop.2,2})
  ret void

loop.2:                                           ; preds = %loop.2, %preheader2
; 4 = MemoryPhi({preheader2,1},{loop.2,2})
  %loop.2.iv = phi i64 [ 0, %preheader2 ], [ %loop.2.iv.1, %loop.2 ]
  %loop.2.idx = getelementptr inbounds i32, ptr %ptr, i64 %loop.2.iv
; 2 = MemoryDef(4)
  store i32 3, ptr %loop.2.idx, align 4
  %loop.2.iv.1 = add nuw nsw i64 %loop.2.iv, 1
  %loop.2.cond = icmp eq i64 %loop.2.iv.1, %n.zext2
  br i1 %loop.2.cond, label %exit, label %loop.2
}

Diff Detail

Unit TestsFailed

TimeTest
170 msx64 debian > LLVM.CodeGen/AMDGPU::opt-pipeline.ll
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/opt -O0 -mtriple=amdgcn--amdhsa -disable-output -disable-verify -debug-pass=Structure -enable-new-pm=0 /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AMDGPU/opt-pipeline.ll 2>&1 | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck -check-prefix=GCN-O0 /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AMDGPU/opt-pipeline.ll

Event Timeline

vdsered created this revision.Sep 4 2021, 1:45 PM
vdsered requested review of this revision.Sep 4 2021, 1:45 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 4 2021, 1:45 PM
vdsered updated this revision to Diff 371432.Sep 8 2021, 1:38 PM
  1. Added one test for a loop with several latches where we don't always overwrite the region where memory intrinsic writes to
  2. Fixed formatting
  3. Replaced typed pointers with opaque pointers (--force-opaque-pointers is used in tests)
  4. Fixed crash caused by an assertion for multiplication of operands with different types
  5. Optimized ContinuousMemoryRegion and removed one field from there
  6. Added a feature flag EnableMemIntrinsicEliminationByStores to enable/disable this feature at runtime (probably not the best naming)
vdsered updated this revision to Diff 371495.Sep 8 2021, 8:20 PM

Rebase + fixed opt-pipeline.ll tests for AMDGPU

vdsered updated this revision to Diff 372746.Sep 15 2021, 11:09 AM
vdsered edited the summary of this revision. (Show Details)
  1. Simplified code by removing unnecessary if-statements
  2. Turned AddMemRange lambda to a method
  3. memset can be eliminated when memset has constant length and loop has constant iteration count + added a test for this
vdsered updated this revision to Diff 377258.Oct 5 2021, 8:15 AM
vdsered edited the summary of this revision. (Show Details)
  • Transform loops into loop rotated simplify form
  • Use guards instead of directly finding icmps
  • Implemented optimization for loop + loop
  • Decreased the number of times when we compute range for memory intrinsics
  • Running DSE loop twice, 1 as it was before and 2 for optimization across loops

All your 3 examples - should just loopidiomrecognize just catch it and produce intrinsics? (Not sure about ordering of DSE and LoopIdiom)

Any other motivating cases/benchmarks?

Currently I see no reason to increase complexivity of DSE as LoopIdiom + DSE should work fine together to handle them.

vdsered added a comment.EditedOct 5 2021, 8:25 AM

Results for current patch for SingleSource/MultiSource are in the table below. Other don't show any different. Metric is dse.RemainingNumStores

namebaselineexperimentdiff
security-blowfish159.00160.000.6%
ReedSolomon209.00210.000.5%
paq8p2041.002044.000.1%
kc51975.0051974.00-0.0%
CLAMR20194.0020192.00-0.0%
consumer-typeset22144.0022139.00-0.0%
oggenc4229.004228.00-0.0%
make_dparser3033.003032.00-0.0%
ldecod6044.006042.00-0.0%
sim329.00328.00-0.3%
espresso1772.001766.00-0.3%
anagram120.00119.00-0.8%

I'm not sure why it removes less stores in the three tests.

vdsered added a comment.EditedOct 5 2021, 8:42 AM

Hi, @xbolva00. Are you sure that we can use loop idiom and transform arbitrary loop with a store into let's a memset?

Regardning memset, it accepts i8 type as a value. I know how it can transform a loop with stores like store i32 0, i32* %9 or e.g. store i32 286331153, i32* %9, but it'd fail on an example where we'd write store i32 1, i32* %9

xbolva00 added a subscriber: nikic.Oct 5 2021, 9:05 AM

Hi, @xbolva00. Are you sure that we can use loop idiom and transform arbitrary loop with a store into let's a memset?

Regardning memset, it accepts i8 type as a value. I know how it can transform a loop with stores like store i32 0, i32* %9 or e.g. store i32 286331153, i32* %9, but it'd fail on an example where we'd write store i32 1, i32* %9

Ah, right.

I think this would have nontrivial impact on compile time (@nikic ?) and the results from testsuite do not look so promising - I would expect more “hits” to justify (small if possible) compile time regression.

nikic added a comment.Oct 5 2021, 11:42 AM

I think this would have nontrivial impact on compile time (@nikic ?) and the results from testsuite do not look so promising - I would expect more “hits” to justify (small if possible) compile time regression.

Yes, this has a large negative effect: https://llvm-compile-time-tracker.com/compare.php?from=64eaffb613d0cb7fa7542fa48281a2e617ad8ee9&to=6e7452757e16c4260fa9a5862761a68ed778dbf9&stat=instructions