This is an archive of the discontinued LLVM Phabricator instance.

[InlineCost] Find identical loads in the callee
ClosedPublic

Authored by haicheng on Jun 6 2017, 9:40 AM.

Details

Summary

The LLVM inliner encourages inlining of single-block callees by giving them a higher threshold. However, some large single-block callees still cannot be inlined even though they contain many redundant instructions that could be removed after inlining.

The motivating example is a fully unrolled 3x3 matrix multiplication. It loads every element of matrices a and b three times because of the Stores between the loads. The SROA analysis can figure out that these Stores can be simplified, and then the redundant Loads become free as well. Thus, running sroa and gvn after inlining the callee can remove 54% of the instructions.

define void @outer(%struct.matrix* %a, %struct.matrix* %b) {
  %c = alloca %struct.matrix
  call void @matrix_multiply(%struct.matrix* %a, %struct.matrix* %b, %struct.matrix* %c)
  ret void
}
define void @matrix_multiply(%struct.matrix* %a, %struct.matrix* %b, %struct.matrix* %c) {

This simple patch tries to find repeated Loads in the callee. It stops looking for Loads once it sees a Store that cannot be simplified or a function call in the callee. This restriction can be relaxed with more expensive analysis to find more CSE opportunities.
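To make the mechanism concrete, here is a rough sketch of the repeated-load bookkeeping inside CallAnalyzer in lib/Analysis/InlineCost.cpp. The member names (LoadAddrSet, LoadEliminationCost, EnableLoadElimination) follow the review discussion below; treat this as an illustration, not the verbatim patch:

// Sketch: count a second load from an already-seen address as a saving
// instead of a cost, as long as nothing has clobbered memory so far.
bool CallAnalyzer::visitLoad(LoadInst &I) {
  // ... existing SROA handling elided ...

  // If the data is already loaded from this address and hasn't been
  // clobbered by any stores or calls, this load is likely redundant and
  // can be eliminated after inlining.
  if (EnableLoadElimination &&
      !LoadAddrSet.insert(I.getPointerOperand()).second && I.isUnordered()) {
    LoadEliminationCost += InlineConstants::InstrCost;
    return true;
  }
  return false;
}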

I tested the patch with SPEC2006/SPEC2017 and llvm-test-suite using O3+LTO/O3/O2/Os on x86 and AArch64. Only spec2006/milc is impacted, when using O3+LTO. On x86, performance improves by 3.6% and code size grows by 0.07%. On AArch64, performance improves by 4.8% and code size is unchanged.

Diff Detail

Repository
rL LLVM

Event Timeline

haicheng created this revision.Jun 6 2017, 9:40 AM
haicheng edited the summary of this revision. (Show Details)Jun 6 2017, 1:00 PM
chandlerc requested changes to this revision.Jun 13 2017, 12:50 PM

Really cool!

After reading all of this and writing the comments below, I'd switch the term 'CSE' everywhere to 'Elimination'. I know it's longer, but I think it is easier for people to read without getting confused about "but what is the subexpression?". This is all really just load elimination.

lib/Analysis/InlineCost.cpp
146–149

I wouldn't describe this (or limit it) in terms of SROA.

It should just model the CSE-ing of repeated loads that is expected to happen whenever we simplify away the stores that would otherwise force them to remain loads.

I wouldn't disable it when we don't have SROA candidates, as I can see it being valuable in plenty of other cases. As a trivial example, if a predicate causes all of the stores to memory to be dead, we won't traverse those blocks, and we should still credit the CSE simplification for any redundant loads. It'd be good to have a test case for this as well. As a concrete example of where I think this might well come up:

int foo(bool predicate) {
  if (some_global_variable && predicate) {
    function_that_writes_to_global();
    // lots more code
  }
  return some_global_variable;
}
161

I would call this disableLoadCSE, as I think that reads better.

268

Similarly, I would call this LoadCSECostSavings.

867

clang-format please.

868

A comment about what is going on here would be good, since this isn't a function call with docs.

892–902

I would add a comment about this potentially clobbering the load.

I would also add a FIXME about making this more powerful. I see at least two possible ways:

  1. We can probably keep an initial set of CSE-able loads subtracted from the cost even when we finally see a store. We just need to disable *further* accumulation of CSE savings. This isn't trivial if you follow my suggestion below (which I think is more important) to support multiple basic blocks. In that case, you'll need to use a bit more care to only accumulate the safe CSE savings. But it still seems worth doing even with that complexity, in a follow-up patch, so I'd document it with a FIXME here.
  2. We should probably at some point thread MemorySSA for the callee into this and then use that to actually compute *really* precise savings. But that's even more complex than #1, so definitely for a future patch. =]
1494–1495

Why?

One thing the inline cost analysis really tries to do is "straighten" CFGs that will end up folded when inlined, so it would be really good to allow simple CFGs to survive this, much like my example above...

Especially with the initial policy of completely disabling the savings if we see a store *anywhere*.

test/Transforms/Inline/redundant-loads.ll
17–25

While motivating C code is nice, please use a much simpler test case.

Notably, you can set the thresholds artificially low and use a very small, focused test for it.

Also, you should include various *negative* tests for the ways in which this cost savings can be disabled. These tests *should* cause inlining if you accumulate the load elimination cost savings, but *shouldn't* because of a store, call, or other thing that blocks the savings.

This revision now requires changes to proceed.Jun 13 2017, 12:50 PM

> Really cool!
>
> After reading all of this and writing the comments below, I'd switch the term 'CSE' everywhere to 'Elimination'. I know it's longer, but I think it is easier for people to read without getting confused about "but what is the subexpression?". This is all really just load elimination.

Thank you very much for looking at this patch. I agree that this change does not have to depend on SROA. Please see the inlined reply about enabling this change for multi-block functions.

lib/Analysis/InlineCost.cpp
1494–1495

I observed some regressions when lifting this check. When two repeated loads are far away from each other, LLVM may not be able to reduce the number of loads. For example, LoadPRE can add one load when it removes one.
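A hypothetical example (not from the thread) of the partially redundant shape where LoadPRE trades one load for another:

// Hypothetical illustration: the second load of *g is only partially
// redundant. LoadPRE removes it by reusing the value from the 'if' path,
// but must insert a compensating load on the path that skipped the 'if',
// so the total number of loads does not shrink.
int pre_tradeoff(bool p, int *g) {
  int s = 0;
  if (p)
    s = *g;      // load available only on this path
  return s + *g; // partially redundant load
}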

I think it is safer to support arbitrary CFGs with the help of the DominatorTree: only repeated loads dominated by an identical load can be simplified. I think we need the DominatorTree anyway if we try to address FIXME #1 you listed above, because we need to iterate the blocks in reverse post order. So, I am going to add support for multi-block functions as another item in the FIXME.

Your example above is still handled once the SROA requirement is gone, because InlineCost can figure out that the callee has a single block after inlining.

haicheng updated this revision to Diff 103114.Jun 19 2017, 3:17 PM
haicheng edited edge metadata.Jun 19 2017, 3:17 PM

Lift the SROA requirement. Update the test cases. Rewrite the comments. Add support for multi-block functions to the FIXME.

haicheng marked 6 inline comments as done.Jun 19 2017, 3:19 PM
haicheng updated this revision to Diff 103940.Jun 26 2017, 7:00 AM
haicheng marked an inline comment as done.
haicheng retitled this revision from [InlineCost] Find identical loads in single block function to [InlineCost] Find identical loads in the callee.
haicheng edited the summary of this revision. (Show Details)

I enabled support for multi-block functions. The benchmark that regressed before was spec2017/xalancbmk. I looked into it, and no hot function (>0.5% of execution time) changed.

Please take a look. Thank you.

Haicheng

The description says "The motivating example is a fully unrolled 3x3 matrix multiplication. It loads every element of matrices a and b three times because of the Stores between the loads", but from reading the code I don't see how this case is handled, since it seems that you bail as soon as you see a store (or a call)? What did I miss? Is the description still up to date?

lib/Analysis/InlineCost.cpp
268

Why do we have both LoadEliminationCostSavings and LoadEliminationCost ?

haicheng added a comment.EditedJun 28 2017, 7:08 AM

> The description says "The motivating example is a fully unrolled 3x3 matrix multiplication. It loads every element of matrices a and b three times because of the Stores between the loads", but from reading the code I don't see how this case is handled, since it seems that you bail as soon as you see a store (or a call)? What did I miss? Is the description still up to date?

Hi Mehdi,

Sorry for the confusion. The matrix multiplication was the test case, but I replaced it with some simpler ones when I updated the patch.

Here is the IR of the matrix multiplication example. Every input value is loaded three times (e.g., %arrayidx8). The addresses of the Stores in the callee are based on an alloca in the caller. The inline cost model can figure out that these Stores can be eliminated if the callee is inlined (see visitStore()), and then the repeated Loads clobbered by those Stores are safe to eliminate too. Does that make sense?

If the callee has Stores that cannot be eliminated, I just bail out for now (a rough sketch of this interplay follows the IR below).

%struct.matrix = type { [3 x [3 x double]] }

define void @outer(%struct.matrix* %a, %struct.matrix* %b) {
  %c = alloca %struct.matrix
  call void @matrix_multiply(%struct.matrix* %a, %struct.matrix* %b, %struct.matrix* %c)
  ret void
}

; Every data in matrix a and b are loaded three times.
; typedef struct { double m[3][3]; } matrix;
; void matrix_multiply( matrix *a, matrix *b, matrix *c ){
;   int i,j,k;
;     for(i=0; i<3; i++)
;       for(j=0; j<3; j++)
;          for(k=0; k<3; k++)
;            c->m[i][j] += a->m[i][k] * b->m[k][j];
; }

define void @matrix_multiply(%struct.matrix* %a, %struct.matrix* %b, %struct.matrix* %c) {
entry:
  %arrayidx8 = getelementptr inbounds %struct.matrix, %struct.matrix* %a, i64 0, i32 0, i64 0, i64 0
  %arrayidx18 = getelementptr inbounds %struct.matrix, %struct.matrix* %c, i64 0, i32 0, i64 0, i64 0
  %0 = load double, double* %arrayidx8
  %arrayidx13 = getelementptr inbounds %struct.matrix, %struct.matrix* %b, i64 0, i32 0, i64 0, i64 0
  %1 = load double, double* %arrayidx13
  %mul = fmul double %0, %1
  %2 = load double, double* %arrayidx18
  %add = fadd double %2, %mul
  store double %add, double* %arrayidx18
  %arrayidx8.1 = getelementptr inbounds %struct.matrix, %struct.matrix* %a, i64 0, i32 0, i64 0, i64 1
  %3 = load double, double* %arrayidx8.1
  %arrayidx13.1 = getelementptr inbounds %struct.matrix, %struct.matrix* %b, i64 0, i32 0, i64 1, i64 0
  %4 = load double, double* %arrayidx13.1
  %mul.1 = fmul double %3, %4
  %add.1 = fadd double %add, %mul.1
  store double %add.1, double* %arrayidx18
  %arrayidx8.2 = getelementptr inbounds %struct.matrix, %struct.matrix* %a, i64 0, i32 0, i64 0, i64 2
  %5 = load double, double* %arrayidx8.2
  %arrayidx13.2 = getelementptr inbounds %struct.matrix, %struct.matrix* %b, i64 0, i32 0, i64 2, i64 0
  %6 = load double, double* %arrayidx13.2
  %mul.2 = fmul double %5, %6
  %add.2 = fadd double %add.1, %mul.2
  store double %add.2, double* %arrayidx18
  %arrayidx18.1 = getelementptr inbounds %struct.matrix, %struct.matrix* %c, i64 0, i32 0, i64 0, i64 1
  %7 = load double, double* %arrayidx8
  %arrayidx13.140 = getelementptr inbounds %struct.matrix, %struct.matrix* %b, i64 0, i32 0, i64 0, i64 1
  %8 = load double, double* %arrayidx13.140
  %mul.141 = fmul double %7, %8
  %9 = load double, double* %arrayidx18.1
  %add.142 = fadd double %9, %mul.141
  store double %add.142, double* %arrayidx18.1
  %10 = load double, double* %arrayidx8.1
  %arrayidx13.1.1 = getelementptr inbounds %struct.matrix, %struct.matrix* %b, i64 0, i32 0, i64 1, i64 1
  %11 = load double, double* %arrayidx13.1.1
  %mul.1.1 = fmul double %10, %11
  %add.1.1 = fadd double %add.142, %mul.1.1
  store double %add.1.1, double* %arrayidx18.1
  %12 = load double, double* %arrayidx8.2
  %arrayidx13.2.1 = getelementptr inbounds %struct.matrix, %struct.matrix* %b, i64 0, i32 0, i64 2, i64 1
  %13 = load double, double* %arrayidx13.2.1
  %mul.2.1 = fmul double %12, %13
  %add.2.1 = fadd double %add.1.1, %mul.2.1
  store double %add.2.1, double* %arrayidx18.1
  %arrayidx18.2 = getelementptr inbounds %struct.matrix, %struct.matrix* %c, i64 0, i32 0, i64 0, i64 2
  %14 = load double, double* %arrayidx8
  %arrayidx13.243 = getelementptr inbounds %struct.matrix, %struct.matrix* %b, i64 0, i32 0, i64 0, i64 2
  %15 = load double, double* %arrayidx13.243
  %mul.244 = fmul double %14, %15
  %16 = load double, double* %arrayidx18.2
  %add.245 = fadd double %16, %mul.244
  store double %add.245, double* %arrayidx18.2
  %17 = load double, double* %arrayidx8.1
  %arrayidx13.1.2 = getelementptr inbounds %struct.matrix, %struct.matrix* %b, i64 0, i32 0, i64 1, i64 2
  %18 = load double, double* %arrayidx13.1.2
  %mul.1.2 = fmul double %17, %18
  %add.1.2 = fadd double %add.245, %mul.1.2
  store double %add.1.2, double* %arrayidx18.2
  %19 = load double, double* %arrayidx8.2
  %arrayidx13.2.2 = getelementptr inbounds %struct.matrix, %struct.matrix* %b, i64 0, i32 0, i64 2, i64 2
  %20 = load double, double* %arrayidx13.2.2
  %mul.2.2 = fmul double %19, %20
  %add.2.2 = fadd double %add.1.2, %mul.2.2
  store double %add.2.2, double* %arrayidx18.2
  %arrayidx8.146 = getelementptr inbounds %struct.matrix, %struct.matrix* %a, i64 0, i32 0, i64 1, i64 0
  %arrayidx18.147 = getelementptr inbounds %struct.matrix, %struct.matrix* %c, i64 0, i32 0, i64 1, i64 0
  %21 = load double, double* %arrayidx8.146
  %22 = load double, double* %arrayidx13
  %mul.149 = fmul double %21, %22
  %23 = load double, double* %arrayidx18.147
  %add.150 = fadd double %23, %mul.149
  store double %add.150, double* %arrayidx18.147
  %arrayidx8.1.151 = getelementptr inbounds %struct.matrix, %struct.matrix* %a, i64 0, i32 0, i64 1, i64 1
  %24 = load double, double* %arrayidx8.1.151
  %25 = load double, double* %arrayidx13.1
  %mul.1.153 = fmul double %24, %25
  %add.1.154 = fadd double %add.150, %mul.1.153
  store double %add.1.154, double* %arrayidx18.147
  %arrayidx8.2.155 = getelementptr inbounds %struct.matrix, %struct.matrix* %a, i64 0, i32 0, i64 1, i64 2
  %26 = load double, double* %arrayidx8.2.155
  %27 = load double, double* %arrayidx13.2
  %mul.2.157 = fmul double %26, %27
  %add.2.158 = fadd double %add.1.154, %mul.2.157
  store double %add.2.158, double* %arrayidx18.147
  %arrayidx18.1.1 = getelementptr inbounds %struct.matrix, %struct.matrix* %c, i64 0, i32 0, i64 1, i64 1
  %28 = load double, double* %arrayidx8.146
  %29 = load double, double* %arrayidx13.140
  %mul.141.1 = fmul double %28, %29
  %30 = load double, double* %arrayidx18.1.1
  %add.142.1 = fadd double %30, %mul.141.1
  store double %add.142.1, double* %arrayidx18.1.1
  %31 = load double, double* %arrayidx8.1.151
  %32 = load double, double* %arrayidx13.1.1
  %mul.1.1.1 = fmul double %31, %32
  %add.1.1.1 = fadd double %add.142.1, %mul.1.1.1
  store double %add.1.1.1, double* %arrayidx18.1.1
  %33 = load double, double* %arrayidx8.2.155
  %34 = load double, double* %arrayidx13.2.1
  %mul.2.1.1 = fmul double %33, %34
  %add.2.1.1 = fadd double %add.1.1.1, %mul.2.1.1
  store double %add.2.1.1, double* %arrayidx18.1.1
  %arrayidx18.2.1 = getelementptr inbounds %struct.matrix, %struct.matrix* %c, i64 0, i32 0, i64 1, i64 2
  %35 = load double, double* %arrayidx8.146
  %36 = load double, double* %arrayidx13.243
  %mul.244.1 = fmul double %35, %36
  %37 = load double, double* %arrayidx18.2.1
  %add.245.1 = fadd double %37, %mul.244.1
  store double %add.245.1, double* %arrayidx18.2.1
  %38 = load double, double* %arrayidx8.1.151
  %39 = load double, double* %arrayidx13.1.2
  %mul.1.2.1 = fmul double %38, %39
  %add.1.2.1 = fadd double %add.245.1, %mul.1.2.1
  store double %add.1.2.1, double* %arrayidx18.2.1
  %40 = load double, double* %arrayidx8.2.155
  %41 = load double, double* %arrayidx13.2.2
  %mul.2.2.1 = fmul double %40, %41
  %add.2.2.1 = fadd double %add.1.2.1, %mul.2.2.1
  store double %add.2.2.1, double* %arrayidx18.2.1
  %arrayidx8.260 = getelementptr inbounds %struct.matrix, %struct.matrix* %a, i64 0, i32 0, i64 2, i64 0
  %arrayidx18.261 = getelementptr inbounds %struct.matrix, %struct.matrix* %c, i64 0, i32 0, i64 2, i64 0
  %42 = load double, double* %arrayidx8.260
  %43 = load double, double* %arrayidx13
  %mul.263 = fmul double %42, %43
  %44 = load double, double* %arrayidx18.261
  %add.264 = fadd double %44, %mul.263
  store double %add.264, double* %arrayidx18.261
  %arrayidx8.1.265 = getelementptr inbounds %struct.matrix, %struct.matrix* %a, i64 0, i32 0, i64 2, i64 1
  %45 = load double, double* %arrayidx8.1.265
  %46 = load double, double* %arrayidx13.1
  %mul.1.267 = fmul double %45, %46
  %add.1.268 = fadd double %add.264, %mul.1.267
  store double %add.1.268, double* %arrayidx18.261
  %arrayidx8.2.269 = getelementptr inbounds %struct.matrix, %struct.matrix* %a, i64 0, i32 0, i64 2, i64 2
  %47 = load double, double* %arrayidx8.2.269
  %48 = load double, double* %arrayidx13.2
  %mul.2.271 = fmul double %47, %48
  %add.2.272 = fadd double %add.1.268, %mul.2.271
  store double %add.2.272, double* %arrayidx18.261
  %arrayidx18.1.2 = getelementptr inbounds %struct.matrix, %struct.matrix* %c, i64 0, i32 0, i64 2, i64 1
  %49 = load double, double* %arrayidx8.260
  %50 = load double, double* %arrayidx13.140
  %mul.141.2 = fmul double %49, %50
  %51 = load double, double* %arrayidx18.1.2
  %add.142.2 = fadd double %51, %mul.141.2
  store double %add.142.2, double* %arrayidx18.1.2
  %52 = load double, double* %arrayidx8.1.265
  %53 = load double, double* %arrayidx13.1.1
  %mul.1.1.2 = fmul double %52, %53
  %add.1.1.2 = fadd double %add.142.2, %mul.1.1.2
  store double %add.1.1.2, double* %arrayidx18.1.2
  %54 = load double, double* %arrayidx8.2.269
  %55 = load double, double* %arrayidx13.2.1
  %mul.2.1.2 = fmul double %54, %55
  %add.2.1.2 = fadd double %add.1.1.2, %mul.2.1.2
  store double %add.2.1.2, double* %arrayidx18.1.2
  %arrayidx18.2.2 = getelementptr inbounds %struct.matrix, %struct.matrix* %c, i64 0, i32 0, i64 2, i64 2
  %56 = load double, double* %arrayidx8.260
  %57 = load double, double* %arrayidx13.243
  %mul.244.2 = fmul double %56, %57
  %58 = load double, double* %arrayidx18.2.2
  %add.245.2 = fadd double %58, %mul.244.2
  store double %add.245.2, double* %arrayidx18.2.2
  %59 = load double, double* %arrayidx8.1.265
  %60 = load double, double* %arrayidx13.1.2
  %mul.1.2.2 = fmul double %59, %60
  %add.1.2.2 = fadd double %add.245.2, %mul.1.2.2
  store double %add.1.2.2, double* %arrayidx18.2.2
  %61 = load double, double* %arrayidx8.2.269
  %62 = load double, double* %arrayidx13.2.2
  %mul.2.2.2 = fmul double %61, %62
  %add.2.2.2 = fadd double %add.1.2.2, %mul.2.2.2
  store double %add.2.2.2, double* %arrayidx18.2.2
  ret void
}
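A rough sketch of the visitStore() interplay described above, assuming the helper names already used by the SROA code in InlineCost.cpp (lookupSROAArgAndCost, accumulateSROACost, disableSROA); not the verbatim patch:

bool CallAnalyzer::visitStore(StoreInst &I) {
  Value *SROAArg;
  DenseMap<Value *, int>::iterator CostIt;
  if (lookupSROAArgAndCost(I.getPointerOperand(), SROAArg, CostIt)) {
    if (I.isSimple()) {
      // The store targets an alloca that SROA is expected to remove, so
      // it does not clobber anything: repeated loads stay eliminable.
      accumulateSROACost(CostIt, InlineConstants::InstrCost);
      return true;
    }
    disableSROA(CostIt);
  }
  // Any other store may clobber loaded values, so stop accumulating
  // load-elimination savings.
  disableLoadElimination();
  return false;
}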

Haicheng

lib/Analysis/InlineCost.cpp
268

LoadEliminationCost is used in the cost calculation, and LoadEliminationCostSavings is used for stats printing. This is the same pattern as SROACostSavings and SROAArgCosts.

Kindly ping.

> Kindly ping.

Will try to get to this patch this week; sorry for the delay, and thanks for the update.

eraman added inline comments.Jul 7 2017, 4:22 PM
lib/Analysis/InlineCost.cpp
867

Nit: as written, IMO this is not very clear. I would be more specific: "... there are no reachable stores/calls, or all reachable stores can be SROA'd ..."

965

If simplifyCallSite returns true below, disabling this is not needed. Even that is a bit conservative: I think you could call canConstantFoldCallTo directly and, if it returns true, not disable load elimination.

haicheng updated this revision to Diff 105850.Jul 10 2017, 7:40 AM
haicheng marked 2 inline comments as done.

Check canConstantFoldCallTo() in visitCallSite(). Thank you, Easwaran.

mehdi_amini added inline comments.Jul 12 2017, 8:14 PM
lib/Analysis/InlineCost.cpp
268

I'm still not making sense of the difference between the two: as far as I can tell, they are zero-initialized in the same ctor and get incremented the same way?

327

It seems to me that it is LoadEliminationCost that needs to be set to 0, instead of LoadEliminationCostSavings?

Hi Mehdi,

I can remove LoadEliminationCostSavings if you think it is better.

Haicheng

lib/Analysis/InlineCost.cpp
268

Your understanding is correct. If load elimination is not disabled, they have the same value. LoadEliminationCostSavings is just used for stats printing.

Eventually, every load address needs its own LoadEliminationCost, as written in FIXME #1. At that point, LoadEliminationCostSavings will be the sum of all the per-address LoadEliminationCosts.
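For illustration only, a hypothetical shape of that future design (the map name is made up):

// Hypothetical per-address bookkeeping for FIXME #1: a clobbering store
// would then cancel only the buckets it may alias, and the printed
// LoadEliminationCostSavings would be the sum over all buckets.
DenseMap<Value *, int> PerAddressLoadEliminationCost;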

327

When load elimination is disabled, LoadEliminationCostSavings needs to be set to 0 so that we print the correct value. LoadEliminationCost is not used anymore after load elimination is disabled.

mehdi_amini added inline comments.Jul 13 2017, 9:07 AM
lib/Analysis/InlineCost.cpp
327

But calling disableLoadElimination multiple times will increase the Cost, which does not seem right?

haicheng updated this revision to Diff 106645.Jul 14 2017, 8:33 AM

Add a check of EnableLoadElimination before calling disableLoadElimination() in disableSROA().

haicheng added inline comments.Jul 14 2017, 8:35 AM
lib/Analysis/InlineCost.cpp
327

disableLoadElimination() should be called only once; it is guarded by the flag EnableLoadElimination. I missed one check in disableSROA() before calling disableLoadElimination(), and now I have fixed it. Thank you.

mehdi_amini added inline comments.Jul 14 2017, 9:25 AM
lib/Analysis/InlineCost.cpp
327

It seems fragile: I'd expect the API itself to be idempotent. I'm fine with guarding it on EnableLoadElimination, but the guard should be in the method itself.

haicheng updated this revision to Diff 106693.Jul 14 2017, 12:39 PM

Move the check of EnableLoadElimination inside disableLoadElimination().
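Roughly, the resulting method is idempotent (sketch, not the verbatim patch):

void CallAnalyzer::disableLoadElimination() {
  if (EnableLoadElimination) {
    // Fold the accumulated savings back into the cost exactly once;
    // further calls are no-ops.
    Cost += LoadEliminationCost;
    LoadEliminationCost = 0;
    EnableLoadElimination = false;
  }
}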

haicheng marked 2 inline comments as done.Jul 14 2017, 12:43 PM

Kindly Ping.

Kindly Ping (#2).

Kindly Ping (#3).

Kindly Ping (#4)

Would anyone please take a look? I believe I have made all the changes that were requested.

Kindly Ping (#5)

This patch can reduce the inline cost of the questionable function in PR34173.

eraman edited edge metadata.Aug 22 2017, 2:25 PM

LGTM with the comments addressed, but please check with Chandler if he has more to add.

lib/Analysis/InlineCost.cpp
150

Nit: SmallPtrSet rounds the size up to a power of 2, so this will get rounded up to 16. I prefer using a power of 2 explicitly rather than letting the implementation round it up.
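For example (illustrative only):

// SmallPtrSet rounds its inline capacity up to a power of two, so these
// two declarations end up with the same storage; the second says so
// explicitly.
SmallPtrSet<Value *, 12> Addrs;       // silently becomes 16
SmallPtrSet<Value *, 16> LoadAddrSet; // what the code actually gets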

268

I understand that once you improve this you might mimic what SROA does and need separate variables for both. But you don't need this now, so I agree with Mehdi's comments.

test/Transforms/Inline/redundant-loads.ll
67

Even if this load is eliminated, this callee will not be inlined into the caller (cost and threshold are both 0), so you're not really testing that the call prevents the second load from being eliminated. One fix is to increase the threshold to 5 and adjust the other cases slightly if needed.

haicheng updated this revision to Diff 112414.Aug 23 2017, 12:00 PM
haicheng marked 6 inline comments as done.

Address Easwaran's comments. Thank you very much.

@chandlerc, would you please take another look?

Haicheng

mcrosier added inline comments.Sep 1 2017, 11:06 AM
lib/Analysis/InlineCost.cpp
867

"there is there is"

haicheng updated this revision to Diff 113577.Sep 1 2017, 12:35 PM

Fix a typo found by Chad. Thank you.

dberlin edited edge metadata.Sep 1 2017, 12:37 PM

(I suspect most folks are waiting for chandler to say whether he feels this is good enough at this point or not)

Kindly Ping

> (I suspect most folks are waiting for chandler to say whether he feels this is good enough at this point or not)

I agree.

chandlerc requested changes to this revision.Sep 19 2017, 6:01 PM

Sorry for the awful delay; I got pulled into other stuff. I should be able to stay with the (small) amount left in the review.

lib/Analysis/InlineCost.cpp
867

'data are' -> 'data is'.

868

I would say something more like:

'If the data is already loaded from this address and hasn't been clobbered by any stores or calls, this load ...'

870–871

No need for two ifs; short-circuiting is sufficient.

893

'clobber the repeated loads and prevent them from elimination' -> 'clobber loads and prevent repeated loads from being eliminated'

966–967

I don't think we should use canConstantFoldCallTo here, because simplifyCallSite will already do that below, along with a bunch of other things.

I think you really want to sink this down and handle it in a more detailed way. As a specific example: you should handle debug info intrinsics correctly (ignore them), and you should probably make some effort to handle lifetime markers, because it seems like those can be handled trivially.

1494–1495

Yeah, this makes sense. Especially the fact that the multi-block check handles the trivial cases for us makes it easy. Thanks for the explanation!

test/Transforms/Inline/redundant-loads.ll
9

I really dislike basing tests on debug prints; it makes them quite brittle. Typically for the inliner, we craft a function that is too expensive to inline without the specific threshold applying, and then check that it does or doesn't get inlined in each case.

This revision now requires changes to proceed.Sep 19 2017, 6:01 PM
haicheng updated this revision to Diff 116206.Sep 21 2017, 9:25 AM
haicheng edited edge metadata.
haicheng marked 7 inline comments as done.

Thank you, Chandler.

I rewrote the test cases, fixed the comments, and made the patch less conservative when checking a callsite.

haicheng updated this revision to Diff 116384.Sep 22 2017, 12:05 PM

Fix a typo in the test case. Please take a look, thank you.

haicheng updated this revision to Diff 117194.Sep 29 2017, 12:03 PM

Whitelist lifetime and debug info intrinsics so that they do not disable load elimination.

Please take a look. Thank you.

mcrosier added inline comments.Sep 29 2017, 1:01 PM
lib/Analysis/InlineCost.cpp
989

Does the assume intrinsic fall into this category as well (or is it handled properly by the default case)?

hfinkel added inline comments.
lib/Analysis/InlineCost.cpp
993

llvm.ptr.annotation seems like it might belong in this list too (maybe this is the same list as isAssumeLikeIntrinsic in ValueTracking.cpp?).

haicheng added inline comments.Sep 29 2017, 1:33 PM
lib/Analysis/InlineCost.cpp
993

I think all free intrinsics can be put here, including the following (see getIntrinsicCost() in TargetTransformInfoImpl.h):

Intrinsic::annotation
Intrinsic::assume
Intrinsic::dbg_declare
Intrinsic::dbg_value
Intrinsic::invariant_start
Intrinsic::invariant_end
Intrinsic::lifetime_start
Intrinsic::lifetime_end
Intrinsic::objectsize
Intrinsic::ptr_annotation
Intrinsic::var_annotation
Intrinsic::experimental_gc_result
Intrinsic::experimental_gc_relocate
Intrinsic::coro_alloc
Intrinsic::coro_begin
Intrinsic::coro_free
Intrinsic::coro_end
Intrinsic::coro_frame
Intrinsic::coro_size
Intrinsic::coro_suspend
Intrinsic::coro_param
Intrinsic::coro_subfn_addr

What do you think?

hfinkel added inline comments.Sep 29 2017, 2:08 PM
lib/Analysis/InlineCost.cpp
993

That seems dangerous. I don't think it's a good idea to derive semantic information from the cost model.

haicheng updated this revision to Diff 117297.Oct 1 2017, 6:52 PM

Use ValueTracking::isAssumeLikeIntrinsic() to find intrinsics that do not clobber loads.
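Roughly, the call-site handling then becomes (sketch, not the verbatim patch):

bool CallAnalyzer::visitCallSite(CallSite CS) {
  // ... existing simplification and cost logic elided ...

  // Calls that may write to memory can clobber loaded values, but
  // assume-like intrinsics (llvm.assume, debug info, lifetime markers,
  // annotations) are known not to, per isAssumeLikeIntrinsic() in
  // ValueTracking.
  if (!CS.onlyReadsMemory() && !isAssumeLikeIntrinsic(CS.getInstruction()))
    disableLoadElimination();
  return Base::visitCallSite(CS);
}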

Kindly ping

Kindly Ping (#2)

Kindly Ping (#3)

Kindly Ping (#4)

Kindly Ping (#5)

Kindly Ping (#6)

Kindly Ping (#7)

Kindly Ping (#8)

I ran the spec2006/2017 performance tests again on top of ToT. Here are the numbers:

Benchmark            Performance (+ is better)   Code Size (+ is larger)
spec2006/milc        +6.14%                      -0.04%
spec2017/deepsjeng   +1.06%                      -0.32%
spec2017/xalancbmk   +0.74%                      0%
spec2006/povray      +0.14%                      0%
spec2006/h264ref     +0.02%                      +0.13%
spec2017/imagick     0%                          +0.01%
spec2017/gcc         -0.02%                      0%
spec2017/povray      -0.12%                      0%
spec2017/blender     -0.52%                      0.02%
spec2006/xalancbmk   -0.71%                      0%
junbuml edited edge metadata.Dec 5 2017, 11:32 AM

I took a closer look at this, and it LGTM.
I tried to think of other possible negative test cases, but I can't think of anything other than outer2 and outer3.

haicheng updated this revision to Diff 126283.Dec 9 2017, 1:55 PM

Thank you, Jun.

I added a test case for an indirect call.

mcrosier accepted this revision.Dec 11 2017, 6:44 AM

All of the review feedback has been addressed and I have no additional comments. LGTM. Thanks, Haicheng.

Thank you, Chad. I will wait a couple of days before committing this patch, in case there are new comments.

This revision was not accepted when it landed; it landed in state Needs Review.Dec 15 2017, 6:35 AM
This revision was automatically updated to reflect the committed changes.