This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/trunk/
-
trunk/
-
include/llvm/Analysis/
-
llvm/
-
Analysis/
-
ValueTracking.h
-
lib/Analysis/
-
Analysis/
-
InlineCost.cpp
-
ValueTracking.cpp
-
test/Transforms/Inline/
-
Transforms/
-
Inline/
-
redundant-loads.ll

Differential D33946

[InlineCost] Find identical loads in the callee
ClosedPublic

Authored by haicheng on Jun 6 2017, 9:40 AM.

Download Raw Diff

Details

Reviewers

chandlerc
eraman
junbuml
• dberlin
mcrosier

Commits

rGa44615155282: [InlineCost] Find repeated loads in the callee
rL320814: [InlineCost] Find repeated loads in the callee

Summary

LLVM Inliner encourages to inline single block callee by giving it higher threshold. However, some large single block callees still cannot be inlined although they have many redundant instructions that can be removed if they are inlined.

The motivation example is a fully unrolled 3x3 matrix multiplication. It loads every data in matrix a and b three times because of the Stores between them. The SROA analysis can figure out that these Stores can be simplified and then these redundant loads should also be free. Thus, running sroa and gvn after inlining the callee can remove 54% of the instructions.

define void @outer(%struct.matrix* %a, %struct.matrix* %b) {
  %c = alloca %struct.matrix
  call void @matrix_multiply(%struct.matrix* %a, %struct.matrix* %b, %struct.matrix* %c)
  ret void
}
define void @matrix_multiply(%struct.matrix* %a, %struct.matrix* %b, %struct.matrix* %c) {

This simple patch tries to find repeated loads in the callee. It stops finding Loads if there are Stores that cannot be simplified or there are function calls in the callee. The above restriction can be relaxed to find more CSE opportunities with more expensive analysis.

I tested the patch with SPEC20xx and llvm-test-suite using O3+LTO/O3/O2/Os on x86 and AArch64. Only spec2006/milc gets impacted when using O3+LTO. On X86, the performance is improved by +3.6% and the code size is increased by 0.07%. On AArch64, the performance is improved by +4.8% and code size has no change.

Diff Detail

Repository: rL LLVM

Event Timeline

haicheng created this revision.Jun 6 2017, 9:40 AM

Herald added subscribers: kristof.beyls, mehdi_amini, mcrosier, aemerson. · View Herald TranscriptJun 6 2017, 9:40 AM

haicheng edited the summary of this revision. (Show Details)Jun 6 2017, 1:00 PM

Kindly ping.

Really cool!

After reading all of this and writing the comments below, I'd switch the term 'CSE' everywhere for 'Elimination'. I know its longer, but I think it is easier for people to read and not get confused about "but what is the subexpression?". This is all really just load elimination.

lib/Analysis/InlineCost.cpp
145–148 ↗	(On Diff #101570)	I wouldn't describe this (or limit it) to anything to do with SROA. It should just model the CSE-ing of repeated loads that is expected to happen whenever we simplify away the stores that would otherwise cause them to be loads. I wouldn't disable it when we don't have SROA-candidates as I can see it being valuable in plenty of other cases. As a trivial example, if a predicate causes all of the stores to memory to be dead, we won't traverse those blocks, and we should track the CSE simplification to any redundant loads. It'd be good to have a test case for this as well. As a concrete example of where I think this might well come up: int foo(bool predicate) { if (some_global_variable && predicate) { function_that_writes_to_global(); // lots more code } return some_global_variable; }
161 ↗	(On Diff #101570)	I would call this `disableLoadCSE` as that reads better I think.
265 ↗	(On Diff #101570)	Similarly, I would call this `LoadCSECostSavings`.
842 ↗	(On Diff #101570)	clang-format please.
843 ↗	(On Diff #101570)	Some comment about what is going on here would be good as this isn't a function call with docs.
864–865 ↗	(On Diff #101570)	I would add a comment about this potentially clobbering the load. I would also add a FIXME about making this more powerful. I see at least two possible ways: We can probably keep an initial set of CSE-able loads substracted from the cost even when we finally see a store. We just need to disable further accumulation of CSE savings. This isn't trivial if you follow my suggestion below (which I think is more important) to support multiple basic blocks. In that case, you'll need to use a bit more care to only accumulate the safe CSE savings. But it still seems worth doing even with that complexity in a follow-up patch so I'd document it with a FIXME here. We should probably at some point thread MemorySSA for the callee into this and then use that to actually compute really precise savings. But that's even more complex than #1 so definitely for a future patch. =]
1469–1470 ↗	(On Diff #101570)	Why? One thing the inline cost analysis really tries to do is "straighten" CFGs that will end up folded when inlined, so it would be really good to allow simple CFGs to survive this much like my example above... Especially with the initial policy of completely disabling the savings if we see a store anywhere.
test/Transforms/Inline/redundant-loads.ll
16–24 ↗	(On Diff #101570)	While motivating C code is nice, please use a much simpler test case. Notably, you can set the thresholds artificially low and use a very small, focused test for it. Also, you should include various negative tests for the ways in which this costs savings can be disabled. These tests should cause inlining if you accumulate the load elimination cost savings, but shouldn't because of a store, call, or other thing that blocks it.

This revision now requires changes to proceed.Jun 13 2017, 12:50 PM

In D33946#779305, @chandlerc wrote:

Really cool!

After reading all of this and writing the comments below, I'd switch the term 'CSE' everywhere for 'Elimination'. I know its longer, but I think it is easier for people to read and not get confused about "but what is the subexpression?". This is all really just load elimination.

Thank you very much for looking at this patch. I agree that this change does not have to depend on SROA. Please see the inlined reply about enabling this change for multi-block function.

lib/Analysis/InlineCost.cpp
1469–1470 ↗	(On Diff #101570)	I observed some regressions when lifting this check. When two repeated loads are far away from each other, LLVM may not be able to reduce the number of loads. For example, LoadPre can add one load when it removes one. I think it is more safe to support arbitrary CFG with the help of DominatorTree. Only the repeated loads dominated by other identical loads can be simplified. I think we need DominatorTree anyway if we try to address FIXME #1 you listed above because we need to iterate blocks in the reverse post order. So, I am going to add the support of multi-block function as another item in FIXME. Your example above is still supported as long as SROA requirement is gone because InlineCost can figure out that the callee has single block after inlining.

haicheng updated this revision to Diff 103114.Jun 19 2017, 3:17 PM

Lift the requirement of SROA. Update the test cases. Rewrite the comments. Add the support of multi-block function to the FIXME.

Herald added a reviewer: • dberlin. · View Herald TranscriptJun 19 2017, 3:17 PM

haicheng marked 6 inline comments as done.Jun 19 2017, 3:19 PM

I enabled the support of multi-block function. The benchmark that got regressed before was spec2017/xalancbmk. I looked into it and no hot function (> 0.5% execution time) got changed.

Please take a look. Thank you.

Haicheng

The description says "The motivation example is a fully unrolled 3x3 matrix multiplication. It loads every data in matrix a and b three times because of the Stores between them", but from reading the code I don't see how this case is handled since it seems that you bail as soon as you see a store (or a call)? What did I miss? Is the description still up-to-date?

lib/Analysis/InlineCost.cpp
265 ↗	(On Diff #101570)	Why do we have both `LoadEliminationCostSavings` and `LoadEliminationCost` ?

In D33946#791658, @mehdi_amini wrote:

The description says "The motivation example is a fully unrolled 3x3 matrix multiplication. It loads every data in matrix a and b three times because of the Stores between them", but from reading the code I don't see how this case is handled since it seems that you bail as soon as you see a store (or a call)? What did I miss? Is the description still up-to-date?

Hi Mehdi,

Sorry for the confusion. The matrix multiplication was the test case, but I replaced it with some simple ones when I updated the patch.

Here is the IR of matrix multiplication example. Every input data are loaded three times (e.g. %arrayidx8). The address of Stores in the callee are based on an alloca in the caller. Inline cost model can figure out that these stores can be eliminated if the callee is inlined (see visitStore()), then the repeated Loads clobbered by the Stores are safe to be eliminated too. Does that make sense?

If the callee has Stores that cannot be eliminated, I just bail out for now.

%struct.matrix = type { [3 x [3 x double]] }

define void @outer(%struct.matrix* %a, %struct.matrix* %b) {
  %c = alloca %struct.matrix
  call void @matrix_multiply(%struct.matrix* %a, %struct.matrix* %b, %struct.matrix* %c)
  ret void
}

; Every data in matrix a and b are loaded three times.
; typedef struct { double m[3][3]; } matrix;
; void matrix_multiply( matrix *a, matrix *b, matrix *c ){
;   int i,j,k;
;     for(i=0; i<3; i++)
;       for(j=0; j<3; j++)
;          for(k=0; k<3; k++)
;            c->m[i][j] += a->m[i][k] * b->m[k][j];
; }

define void @matrix_multiply(%struct.matrix* %a, %struct.matrix* %b, %struct.matrix* %c) {
entry:
  %arrayidx8 = getelementptr inbounds %struct.matrix, %struct.matrix* %a, i64 0, i32 0, i64 0, i64 0
  %arrayidx18 = getelementptr inbounds %struct.matrix, %struct.matrix* %c, i64 0, i32 0, i64 0, i64 0
  %0 = load double, double* %arrayidx8
  %arrayidx13 = getelementptr inbounds %struct.matrix, %struct.matrix* %b, i64 0, i32 0, i64 0, i64 0
  %1 = load double, double* %arrayidx13
  %mul = fmul double %0, %1
  %2 = load double, double* %arrayidx18
  %add = fadd double %2, %mul
  store double %add, double* %arrayidx18
  %arrayidx8.1 = getelementptr inbounds %struct.matrix, %struct.matrix* %a, i64 0, i32 0, i64 0, i64 1
  %3 = load double, double* %arrayidx8.1
  %arrayidx13.1 = getelementptr inbounds %struct.matrix, %struct.matrix* %b, i64 0, i32 0, i64 1, i64 0
  %4 = load double, double* %arrayidx13.1
  %mul.1 = fmul double %3, %4
  %add.1 = fadd double %add, %mul.1
  store double %add.1, double* %arrayidx18
  %arrayidx8.2 = getelementptr inbounds %struct.matrix, %struct.matrix* %a, i64 0, i32 0, i64 0, i64 2
  %5 = load double, double* %arrayidx8.2
  %arrayidx13.2 = getelementptr inbounds %struct.matrix, %struct.matrix* %b, i64 0, i32 0, i64 2, i64 0
  %6 = load double, double* %arrayidx13.2
  %mul.2 = fmul double %5, %6
  %add.2 = fadd double %add.1, %mul.2
  store double %add.2, double* %arrayidx18
  %arrayidx18.1 = getelementptr inbounds %struct.matrix, %struct.matrix* %c, i64 0, i32 0, i64 0, i64 1
  %7 = load double, double* %arrayidx8
  %arrayidx13.140 = getelementptr inbounds %struct.matrix, %struct.matrix* %b, i64 0, i32 0, i64 0, i64 1
  %8 = load double, double* %arrayidx13.140
  %mul.141 = fmul double %7, %8
  %9 = load double, double* %arrayidx18.1
  %add.142 = fadd double %9, %mul.141
  store double %add.142, double* %arrayidx18.1
  %10 = load double, double* %arrayidx8.1
  %arrayidx13.1.1 = getelementptr inbounds %struct.matrix, %struct.matrix* %b, i64 0, i32 0, i64 1, i64 1
  %11 = load double, double* %arrayidx13.1.1
  %mul.1.1 = fmul double %10, %11
  %add.1.1 = fadd double %add.142, %mul.1.1
  store double %add.1.1, double* %arrayidx18.1
  %12 = load double, double* %arrayidx8.2
  %arrayidx13.2.1 = getelementptr inbounds %struct.matrix, %struct.matrix* %b, i64 0, i32 0, i64 2, i64 1
  %13 = load double, double* %arrayidx13.2.1
  %mul.2.1 = fmul double %12, %13
  %add.2.1 = fadd double %add.1.1, %mul.2.1
  store double %add.2.1, double* %arrayidx18.1
  %arrayidx18.2 = getelementptr inbounds %struct.matrix, %struct.matrix* %c, i64 0, i32 0, i64 0, i64 2
  %14 = load double, double* %arrayidx8
  %arrayidx13.243 = getelementptr inbounds %struct.matrix, %struct.matrix* %b, i64 0, i32 0, i64 0, i64 2
  %15 = load double, double* %arrayidx13.243
  %mul.244 = fmul double %14, %15
  %16 = load double, double* %arrayidx18.2
  %add.245 = fadd double %16, %mul.244
  store double %add.245, double* %arrayidx18.2
  %17 = load double, double* %arrayidx8.1
  %arrayidx13.1.2 = getelementptr inbounds %struct.matrix, %struct.matrix* %b, i64 0, i32 0, i64 1, i64 2
  %18 = load double, double* %arrayidx13.1.2
  %mul.1.2 = fmul double %17, %18
  %add.1.2 = fadd double %add.245, %mul.1.2
  store double %add.1.2, double* %arrayidx18.2
  %19 = load double, double* %arrayidx8.2
  %arrayidx13.2.2 = getelementptr inbounds %struct.matrix, %struct.matrix* %b, i64 0, i32 0, i64 2, i64 2
  %20 = load double, double* %arrayidx13.2.2
  %mul.2.2 = fmul double %19, %20
  %add.2.2 = fadd double %add.1.2, %mul.2.2
  store double %add.2.2, double* %arrayidx18.2
  %arrayidx8.146 = getelementptr inbounds %struct.matrix, %struct.matrix* %a, i64 0, i32 0, i64 1, i64 0
  %arrayidx18.147 = getelementptr inbounds %struct.matrix, %struct.matrix* %c, i64 0, i32 0, i64 1, i64 0
  %21 = load double, double* %arrayidx8.146
  %22 = load double, double* %arrayidx13
  %mul.149 = fmul double %21, %22
  %23 = load double, double* %arrayidx18.147
  %add.150 = fadd double %23, %mul.149
  store double %add.150, double* %arrayidx18.147
  %arrayidx8.1.151 = getelementptr inbounds %struct.matrix, %struct.matrix* %a, i64 0, i32 0, i64 1, i64 1
  %24 = load double, double* %arrayidx8.1.151
  %25 = load double, double* %arrayidx13.1
  %mul.1.153 = fmul double %24, %25
  %add.1.154 = fadd double %add.150, %mul.1.153
  store double %add.1.154, double* %arrayidx18.147
  %arrayidx8.2.155 = getelementptr inbounds %struct.matrix, %struct.matrix* %a, i64 0, i32 0, i64 1, i64 2
  %26 = load double, double* %arrayidx8.2.155
  %27 = load double, double* %arrayidx13.2
  %mul.2.157 = fmul double %26, %27
  %add.2.158 = fadd double %add.1.154, %mul.2.157
  store double %add.2.158, double* %arrayidx18.147
  %arrayidx18.1.1 = getelementptr inbounds %struct.matrix, %struct.matrix* %c, i64 0, i32 0, i64 1, i64 1
  %28 = load double, double* %arrayidx8.146
  %29 = load double, double* %arrayidx13.140
  %mul.141.1 = fmul double %28, %29
  %30 = load double, double* %arrayidx18.1.1
  %add.142.1 = fadd double %30, %mul.141.1
  store double %add.142.1, double* %arrayidx18.1.1
  %31 = load double, double* %arrayidx8.1.151
  %32 = load double, double* %arrayidx13.1.1
  %mul.1.1.1 = fmul double %31, %32
  %add.1.1.1 = fadd double %add.142.1, %mul.1.1.1
  store double %add.1.1.1, double* %arrayidx18.1.1
  %33 = load double, double* %arrayidx8.2.155
  %34 = load double, double* %arrayidx13.2.1
  %mul.2.1.1 = fmul double %33, %34
  %add.2.1.1 = fadd double %add.1.1.1, %mul.2.1.1
  store double %add.2.1.1, double* %arrayidx18.1.1
  %arrayidx18.2.1 = getelementptr inbounds %struct.matrix, %struct.matrix* %c, i64 0, i32 0, i64 1, i64 2
  %35 = load double, double* %arrayidx8.146
  %36 = load double, double* %arrayidx13.243
  %mul.244.1 = fmul double %35, %36
  %37 = load double, double* %arrayidx18.2.1
  %add.245.1 = fadd double %37, %mul.244.1
  store double %add.245.1, double* %arrayidx18.2.1
  %38 = load double, double* %arrayidx8.1.151
  %39 = load double, double* %arrayidx13.1.2
  %mul.1.2.1 = fmul double %38, %39
  %add.1.2.1 = fadd double %add.245.1, %mul.1.2.1
  store double %add.1.2.1, double* %arrayidx18.2.1
  %40 = load double, double* %arrayidx8.2.155
  %41 = load double, double* %arrayidx13.2.2
  %mul.2.2.1 = fmul double %40, %41
  %add.2.2.1 = fadd double %add.1.2.1, %mul.2.2.1
  store double %add.2.2.1, double* %arrayidx18.2.1
  %arrayidx8.260 = getelementptr inbounds %struct.matrix, %struct.matrix* %a, i64 0, i32 0, i64 2, i64 0
  %arrayidx18.261 = getelementptr inbounds %struct.matrix, %struct.matrix* %c, i64 0, i32 0, i64 2, i64 0
  %42 = load double, double* %arrayidx8.260
  %43 = load double, double* %arrayidx13
  %mul.263 = fmul double %42, %43
  %44 = load double, double* %arrayidx18.261
  %add.264 = fadd double %44, %mul.263
  store double %add.264, double* %arrayidx18.261
  %arrayidx8.1.265 = getelementptr inbounds %struct.matrix, %struct.matrix* %a, i64 0, i32 0, i64 2, i64 1
  %45 = load double, double* %arrayidx8.1.265
  %46 = load double, double* %arrayidx13.1
  %mul.1.267 = fmul double %45, %46
  %add.1.268 = fadd double %add.264, %mul.1.267
  store double %add.1.268, double* %arrayidx18.261
  %arrayidx8.2.269 = getelementptr inbounds %struct.matrix, %struct.matrix* %a, i64 0, i32 0, i64 2, i64 2
  %47 = load double, double* %arrayidx8.2.269
  %48 = load double, double* %arrayidx13.2
  %mul.2.271 = fmul double %47, %48
  %add.2.272 = fadd double %add.1.268, %mul.2.271
  store double %add.2.272, double* %arrayidx18.261
  %arrayidx18.1.2 = getelementptr inbounds %struct.matrix, %struct.matrix* %c, i64 0, i32 0, i64 2, i64 1
  %49 = load double, double* %arrayidx8.260
  %50 = load double, double* %arrayidx13.140
  %mul.141.2 = fmul double %49, %50
  %51 = load double, double* %arrayidx18.1.2
  %add.142.2 = fadd double %51, %mul.141.2
  store double %add.142.2, double* %arrayidx18.1.2
  %52 = load double, double* %arrayidx8.1.265
  %53 = load double, double* %arrayidx13.1.1
  %mul.1.1.2 = fmul double %52, %53
  %add.1.1.2 = fadd double %add.142.2, %mul.1.1.2
  store double %add.1.1.2, double* %arrayidx18.1.2
  %54 = load double, double* %arrayidx8.2.269
  %55 = load double, double* %arrayidx13.2.1
  %mul.2.1.2 = fmul double %54, %55
  %add.2.1.2 = fadd double %add.1.1.2, %mul.2.1.2
  store double %add.2.1.2, double* %arrayidx18.1.2
  %arrayidx18.2.2 = getelementptr inbounds %struct.matrix, %struct.matrix* %c, i64 0, i32 0, i64 2, i64 2
  %56 = load double, double* %arrayidx8.260
  %57 = load double, double* %arrayidx13.243
  %mul.244.2 = fmul double %56, %57
  %58 = load double, double* %arrayidx18.2.2
  %add.245.2 = fadd double %58, %mul.244.2
  store double %add.245.2, double* %arrayidx18.2.2
  %59 = load double, double* %arrayidx8.1.265
  %60 = load double, double* %arrayidx13.1.2
  %mul.1.2.2 = fmul double %59, %60
  %add.1.2.2 = fadd double %add.245.2, %mul.1.2.2
  store double %add.1.2.2, double* %arrayidx18.2.2
  %61 = load double, double* %arrayidx8.2.269
  %62 = load double, double* %arrayidx13.2.2
  %mul.2.2.2 = fmul double %61, %62
  %add.2.2.2 = fadd double %add.1.2.2, %mul.2.2.2
  store double %add.2.2.2, double* %arrayidx18.2.2
  ret void
}

Haicheng

lib/Analysis/InlineCost.cpp
265 ↗	(On Diff #101570)	`LoadEliminationCost` is used in the cost calculation, and `LoadEliminationCostSavings` is used for stats printing. This is the same as `SROACostSavings` and `SROAArgCosts`.

Kindly ping.

In D33946#799730, @haicheng wrote:

Kindly ping.

Will try to get to this patch this week, sorry for delay and thanks for the update.

eraman added inline comments.Jul 7 2017, 4:22 PM

lib/Analysis/InlineCost.cpp
836 ↗	(On Diff #103940)	Nit: As written, IMO this is not very clear. I would be more specific - "... there is no reachable stores/calls or all reachable stores can be SROA'd..."
935 ↗	(On Diff #103940)	If simplifyCallSite returns true below, disabling this is not needed. Even this is a bit conservative - I think you could call canConstantFoldCallTo directly and if it returns true do not disable load elimination.

Check canConstantFoldCallTo() in visitCallSite(). Thank you, Easwaran.

mehdi_amini added inline comments.Jul 12 2017, 8:14 PM

lib/Analysis/InlineCost.cpp
327 ↗	(On Diff #105850)	It seems to me that it is `LoadEliminationCost` that needs to be set to 0 instead `LoadEliminationCostSavings`?
265 ↗	(On Diff #101570)	I'm still not making sense of the difference between the two: as far as I can tell they are zero-initialized in the same ctor and gets incremented the same way?

Hi Mehdi,

I can remove LoadEliminationCostSavings if you think it is better.

Haicheng

lib/Analysis/InlineCost.cpp
327 ↗	(On Diff #105850)	When load elimination is disabled, LoadEliminationCostSavings needs to be set to 0 so that we can print the correct value. LoadEliminationCost is not used any more after LoadElimination is disabled.
265 ↗	(On Diff #101570)	Your understanding is correct. If load elimination is not disabled, they should have the same value. LoadEliminationCostSavings is just used in stats printing. Eventually, every load address needs to have its own LoadEliminationCost as written in the FIXME no.1. At that time, LoadEliminationCostSavings is the sum of all LoadEliminationCost.

mehdi_amini added inline comments.Jul 13 2017, 9:07 AM

lib/Analysis/InlineCost.cpp
327 ↗	(On Diff #105850)	But calling multiple times disableLoadElimination will increase the Cost which does not seem right?

Add a check of EnableLoadElimination before calling disableLoadElimination() in disableSROA().

haicheng added inline comments.Jul 14 2017, 8:35 AM

lib/Analysis/InlineCost.cpp
327 ↗	(On Diff #105850)	disableLoadElimination() should be called only once. It is guarded by the flag EnableLoadElimination. I missed one check in disableSROA() before calling disableLoadElimination() and now I fix it. Thank you.

mehdi_amini added inline comments.Jul 14 2017, 9:25 AM

lib/Analysis/InlineCost.cpp
327 ↗	(On Diff #105850)	It seems fragile: I'd expect the API itself to be idempotent. I'm fine with guarding it on EnableLoadElimination but the guard should be in the method itself.

Move the check of EnableLoadElimination inside disableLoadElimination().

haicheng marked 2 inline comments as done.Jul 14 2017, 12:43 PM

Kindly Ping.

Kindly Ping (#2).

Kindly Ping (#3).

Kindly Ping (#4)

Would anyone please take a look? I believe I made all the changes that are requested.

Kindly Ping (#5)

This patch can reduce the inline cost of the questionable function in PR34173.

LGTM with the comments addressed, but please check with Chandler if he has more to add.

lib/Analysis/InlineCost.cpp
150 ↗	(On Diff #106693)	Nit: SmallPtrSet rounds up the size to a power of 2 and so this will get rounded up to 16. I prefer using a power of 2 explicitly rather than just letting the implementation round it up.
265 ↗	(On Diff #101570)	I understand that once you improve this you might mimic what SROA does and need to have a separate variable for both. But you don' need this now and so I agree with Mehdi's comments.
test/Transforms/Inline/redundant-loads.ll
67 ↗	(On Diff #106693)	Even if this load is eliminated, this callee will not be inlined into the caller (cost and threshold are both 0) and so you're really not testing that the call is preventing the second load from being eliminated. One fix is to increase the threshold to 5 and adjust the other cases slightly if needed.

Address Easwaran's comments. Thank you very much.

@chandlerc , would you please take another look?

Haicheng

Kindly Ping

mcrosier added inline comments.Sep 1 2017, 11:06 AM

lib/Analysis/InlineCost.cpp
1008 ↗	(On Diff #112414)	"there is there is"

Fix a typo found by Chad. Thank you.

(I suspect most folks are waiting for chandler to say whether he feels this is good enough at this point or not)

Kindly Ping

In D33946#859104, @dberlin wrote:

(I suspect most folks are waiting for chandler to say whether he feels this is good enough at this point or not)

I agree.

Sorry for the awful delay when I got pulled into stuff, should be able to stay with the (small) stuff left in the review.

lib/Analysis/InlineCost.cpp
1008 ↗	(On Diff #113577)	'data are' -> 'data is'.
1009 ↗	(On Diff #113577)	I would say something more like: 'If the data is already leaded from this address and hasn't be clobbered by any stores or calls, this load ...'
1011–1012 ↗	(On Diff #113577)	No need for two `if`s, short circuit is sufficient.
1033 ↗	(On Diff #113577)	'clobber the repeated loads and prevent them from elimination' -> 'clobber loads and prevent repeated loads from being eliminated'
1106–1107 ↗	(On Diff #113577)	I don't think we should use `canConstantFoldCallTo` here, because `simplifyCallSite` will already do that below along with a bunch of other things. I think you really want to sink this down and handle it in a more detailed way. As a specific example: you should handle debug info intrinsics correctly (ignore them) and you should probably make some effort to handle lifetime markers because it seems like those can be handled trivially.
1469–1470 ↗	(On Diff #101570)	Yeah, this makes sense. Especially the fact that the multiple block check handles the trivial cases for us makes it easy. THanks for the explanation!
test/Transforms/Inline/redundant-loads.ll
8 ↗	(On Diff #113577)	I really dislike basing tests on debug prints. It makes them quite brittle. Typically for the inliner we craft a function that is too expensive to inline w/o the specific threshold applying, and then check that it does or doesn't get inlined in each case.

This revision now requires changes to proceed.Sep 19 2017, 6:01 PM

Thank you, Chandler.

I rewrote the test cases, fixed the comments, and made the patch less conservative when checking a callsite.

Fix a typo in the test case. Please take a look, thank you.

Whitelist lifetime and debug info intrinsics as not needed to disable load elimination.

Please take a look. Thank you.

mcrosier added inline comments.Sep 29 2017, 1:01 PM

lib/Analysis/InlineCost.cpp
1129 ↗	(On Diff #117194)	Does the assume intrinsic fall into the category as well (or is it properly handled by the default case)?

hfinkel added a subscriber: hfinkel.Sep 29 2017, 1:21 PM

hfinkel added inline comments.

lib/Analysis/InlineCost.cpp
1133 ↗	(On Diff #117194)	llvm.ptr.annotation seems like it might belong in this list too (maybe this is the same list as isAssumeLikeIntrinsic in ValueTracking.cpp?).

haicheng added inline comments.Sep 29 2017, 1:33 PM

lib/Analysis/InlineCost.cpp

1133 ↗

(On Diff #117194)

I think all free intrinsics can be put here including (see TargetTransformInfoImpl.h getIntrinsicCost()):

Intrinsic::annotation:
Intrinsic::assume:
Intrinsic::dbg_declare:
Intrinsic::dbg_value:
Intrinsic::invariant_start:
Intrinsic::invariant_end:
Intrinsic::lifetime_start:
Intrinsic::lifetime_end:
Intrinsic::objectsize:
Intrinsic::ptr_annotation:
Intrinsic::var_annotation:
Intrinsic::experimental_gc_result:
Intrinsic::experimental_gc_relocate:
Intrinsic::coro_alloc:
Intrinsic::coro_begin:
Intrinsic::coro_free:
Intrinsic::coro_end:
Intrinsic::coro_frame:
Intrinsic::coro_size:
Intrinsic::coro_suspend:
Intrinsic::coro_param:
Intrinsic::coro_subfn_addr:

What do you think?

hfinkel added inline comments.Sep 29 2017, 2:08 PM

lib/Analysis/InlineCost.cpp
1133 ↗	(On Diff #117194)	That seems dangerous. I don't think it's a good idea to derive semantic information from the cost model.

Use ValueTracking::isAssumeLikeIntrinsic() to find intrinsics that do not clobber load.

Kindly ping

Kindly Ping (#2)

Kindly Ping (#3)

Kindly Ping (#4)

Kindly Ping (#5)

Kindly Ping (#6)

Kindly Ping (#7)

Kindly Ping (#8)

I ran the spec2006/17 performance again on top of the ToT. Here is the number:

Benchmark	Performance (+ is better)	Code Size (+ is larger)
spec2006/milc	+6.14%	-0.04%
spec2017/deepsjeng	+1.06%	-0.32%
spec2017/xalancbmk	+0.74%	0%
spec2006/povray	+0.14%	0%
spec2006/h264ref	+0.02%	+0.13%
spec2017/imagick	0%	+0.01%
spec2017/gcc	-0.02%	0%
spec2017/povray	-0.12%	0%
spec2017/blender	-0.52%	0.02%
spec2006/xalancbmk	-0.71%	0%

I took a closer look at this and LGTM.
I tried to think about other possible negative test cases , but nothing other than outer2 and outer3 I can think of.

Thank you, Jun.

I added a test case about indirect call.

All of the review feedback has been addressed and I have no additional comments. LGTM. Thanks, Haicheng.

Thank you, Chad. I will wait for a couple of days before I commit this patch in case I will get some new comments.

This revision was not accepted when it landed; it landed in state Needs Review.Dec 15 2017, 6:35 AM

Closed by commit rL320814: [InlineCost] Find repeated loads in the callee (authored by haicheng). · Explain Why

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path

Size

llvm/

trunk/

include/

llvm/

Analysis/

ValueTracking.h

4 lines

lib/

Analysis/

InlineCost.cpp

56 lines

ValueTracking.cpp

2 lines

test/

Transforms/

Inline/

redundant-loads.ll

186 lines

Diff 127121

llvm/trunk/include/llvm/Analysis/ValueTracking.h

Show First 20 Lines • Show All 360 Lines • ▼ Show 20 Lines	class Value;
/// Memory dependence arises for example if the instruction reads from		/// Memory dependence arises for example if the instruction reads from
/// memory or may produce effects or undefined behaviour. Memory dependent		/// memory or may produce effects or undefined behaviour. Memory dependent
/// instructions generally cannot be reorderd with respect to other memory		/// instructions generally cannot be reorderd with respect to other memory
/// dependent instructions or moved into non-dominated basic blocks.		/// dependent instructions or moved into non-dominated basic blocks.
/// Instructions which just compute a value based on the values of their		/// Instructions which just compute a value based on the values of their
/// operands are not memory dependent.		/// operands are not memory dependent.
bool mayBeMemoryDependent(const Instruction &I);		bool mayBeMemoryDependent(const Instruction &I);

		/// Return true if it is an intrinsic that cannot be speculated but also
		/// cannot trap.
		bool isAssumeLikeIntrinsic(const Instruction *I);

/// Return true if it is valid to use the assumptions provided by an		/// Return true if it is valid to use the assumptions provided by an
/// assume intrinsic, I, at the point in the control-flow identified by the		/// assume intrinsic, I, at the point in the control-flow identified by the
/// context instruction, CxtI.		/// context instruction, CxtI.
bool isValidAssumeForContext(const Instruction I, const Instruction CxtI,		bool isValidAssumeForContext(const Instruction I, const Instruction CxtI,
const DominatorTree *DT = nullptr);		const DominatorTree *DT = nullptr);

enum class OverflowResult { AlwaysOverflows, MayOverflow, NeverOverflows };		enum class OverflowResult { AlwaysOverflows, MayOverflow, NeverOverflows };

▲ Show 20 Lines • Show All 158 Lines • Show Last 20 Lines

llvm/trunk/lib/Analysis/InlineCost.cpp

Show All 19 Lines
#include "llvm/Analysis/AssumptionCache.h"		#include "llvm/Analysis/AssumptionCache.h"
#include "llvm/Analysis/BlockFrequencyInfo.h"		#include "llvm/Analysis/BlockFrequencyInfo.h"
#include "llvm/Analysis/CodeMetrics.h"		#include "llvm/Analysis/CodeMetrics.h"
#include "llvm/Analysis/ConstantFolding.h"		#include "llvm/Analysis/ConstantFolding.h"
#include "llvm/Analysis/CFG.h"		#include "llvm/Analysis/CFG.h"
#include "llvm/Analysis/InstructionSimplify.h"		#include "llvm/Analysis/InstructionSimplify.h"
#include "llvm/Analysis/ProfileSummaryInfo.h"		#include "llvm/Analysis/ProfileSummaryInfo.h"
#include "llvm/Analysis/TargetTransformInfo.h"		#include "llvm/Analysis/TargetTransformInfo.h"
		#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/CallSite.h"		#include "llvm/IR/CallSite.h"
#include "llvm/IR/CallingConv.h"		#include "llvm/IR/CallingConv.h"
#include "llvm/IR/DataLayout.h"		#include "llvm/IR/DataLayout.h"
#include "llvm/IR/GetElementPtrTypeIterator.h"		#include "llvm/IR/GetElementPtrTypeIterator.h"
#include "llvm/IR/GlobalAlias.h"		#include "llvm/IR/GlobalAlias.h"
#include "llvm/IR/InstVisitor.h"		#include "llvm/IR/InstVisitor.h"
#include "llvm/IR/IntrinsicInst.h"		#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/Operator.h"		#include "llvm/IR/Operator.h"
▲ Show 20 Lines • Show All 130 Lines • ▼ Show 20 Lines	class CallAnalyzer : public InstVisitor<CallAnalyzer, bool> {

/// Keep track of dead blocks due to the constant arguments.		/// Keep track of dead blocks due to the constant arguments.
SetVector<BasicBlock *> DeadBlocks;		SetVector<BasicBlock *> DeadBlocks;

/// The mapping of the blocks to their known unique successors due to the		/// The mapping of the blocks to their known unique successors due to the
/// constant arguments.		/// constant arguments.
DenseMap<BasicBlock , BasicBlock > KnownSuccessors;		DenseMap<BasicBlock , BasicBlock > KnownSuccessors;

		/// Model the elimination of repeated loads that is expected to happen
		/// whenever we simplify away the stores that would otherwise cause them to be
		/// loads.
		bool EnableLoadElimination;
		SmallPtrSet<Value *, 16> LoadAddrSet;
		int LoadEliminationCost;

// Custom simplification helper routines.		// Custom simplification helper routines.
bool isAllocaDerivedArg(Value *V);		bool isAllocaDerivedArg(Value *V);
bool lookupSROAArgAndCost(Value V, Value &Arg,		bool lookupSROAArgAndCost(Value V, Value &Arg,
DenseMap<Value *, int>::iterator &CostIt);		DenseMap<Value *, int>::iterator &CostIt);
void disableSROA(DenseMap<Value *, int>::iterator CostIt);		void disableSROA(DenseMap<Value *, int>::iterator CostIt);
void disableSROA(Value *V);		void disableSROA(Value *V);
void findDeadBlocks(BasicBlock CurrBB, BasicBlock NextBB);		void findDeadBlocks(BasicBlock CurrBB, BasicBlock NextBB);
void accumulateSROACost(DenseMap<Value *, int>::iterator CostIt,		void accumulateSROACost(DenseMap<Value *, int>::iterator CostIt,
int InstructionCost);		int InstructionCost);
		void disableLoadElimination();
bool isGEPFree(GetElementPtrInst &GEP);		bool isGEPFree(GetElementPtrInst &GEP);
bool canFoldInboundsGEP(GetElementPtrInst &I);		bool canFoldInboundsGEP(GetElementPtrInst &I);
bool accumulateGEPOffset(GEPOperator &GEP, APInt &Offset);		bool accumulateGEPOffset(GEPOperator &GEP, APInt &Offset);
bool simplifyCallSite(Function *F, CallSite CS);		bool simplifyCallSite(Function *F, CallSite CS);
template <typename Callable>		template <typename Callable>
bool simplifyInstruction(Instruction &I, Callable Evaluate);		bool simplifyInstruction(Instruction &I, Callable Evaluate);
ConstantInt stripAndComputeInBoundsConstantOffsets(Value &V);		ConstantInt stripAndComputeInBoundsConstantOffsets(Value &V);

▲ Show 20 Lines • Show All 79 Lines • ▼ Show 20 Lines	CallAnalyzer(const TargetTransformInfo &TTI,
CandidateCS(CSArg), Params(Params), Threshold(Params.DefaultThreshold),		CandidateCS(CSArg), Params(Params), Threshold(Params.DefaultThreshold),
Cost(0), ComputeFullInlineCost(OptComputeFullInlineCost \|\|		Cost(0), ComputeFullInlineCost(OptComputeFullInlineCost \|\|
Params.ComputeFullInlineCost \|\| ORE),		Params.ComputeFullInlineCost \|\| ORE),
IsCallerRecursive(false), IsRecursiveCall(false),		IsCallerRecursive(false), IsRecursiveCall(false),
ExposesReturnsTwice(false), HasDynamicAlloca(false),		ExposesReturnsTwice(false), HasDynamicAlloca(false),
ContainsNoDuplicateCall(false), HasReturn(false), HasIndirectBr(false),		ContainsNoDuplicateCall(false), HasReturn(false), HasIndirectBr(false),
HasFrameEscape(false), AllocatedSize(0), NumInstructions(0),		HasFrameEscape(false), AllocatedSize(0), NumInstructions(0),
NumVectorInstructions(0), VectorBonus(0), SingleBBBonus(0),		NumVectorInstructions(0), VectorBonus(0), SingleBBBonus(0),
NumConstantArgs(0), NumConstantOffsetPtrArgs(0), NumAllocaArgs(0),		EnableLoadElimination(true), LoadEliminationCost(0), NumConstantArgs(0),
NumConstantPtrCmps(0), NumConstantPtrDiffs(0),		NumConstantOffsetPtrArgs(0), NumAllocaArgs(0), NumConstantPtrCmps(0),
NumInstructionsSimplified(0), SROACostSavings(0),		NumConstantPtrDiffs(0), NumInstructionsSimplified(0),
SROACostSavingsLost(0) {}		SROACostSavings(0), SROACostSavingsLost(0) {}

bool analyzeCall(CallSite CS);		bool analyzeCall(CallSite CS);

int getThreshold() { return Threshold; }		int getThreshold() { return Threshold; }
int getCost() { return Cost; }		int getCost() { return Cost; }

// Keep a bunch of stats about the cost savings found so we can print them		// Keep a bunch of stats about the cost savings found so we can print them
// out when debugging.		// out when debugging.
Show All 38 Lines
/// savings associated with it back into the inline cost measurement.		/// savings associated with it back into the inline cost measurement.
void CallAnalyzer::disableSROA(DenseMap<Value *, int>::iterator CostIt) {		void CallAnalyzer::disableSROA(DenseMap<Value *, int>::iterator CostIt) {
// If we're no longer able to perform SROA we need to undo its cost savings		// If we're no longer able to perform SROA we need to undo its cost savings
// and prevent subsequent analysis.		// and prevent subsequent analysis.
Cost += CostIt->second;		Cost += CostIt->second;
SROACostSavings -= CostIt->second;		SROACostSavings -= CostIt->second;
SROACostSavingsLost += CostIt->second;		SROACostSavingsLost += CostIt->second;
SROAArgCosts.erase(CostIt);		SROAArgCosts.erase(CostIt);
		disableLoadElimination();
}		}

/// \brief If 'V' maps to a SROA candidate, disable SROA for it.		/// \brief If 'V' maps to a SROA candidate, disable SROA for it.
void CallAnalyzer::disableSROA(Value *V) {		void CallAnalyzer::disableSROA(Value *V) {
Value *SROAArg;		Value *SROAArg;
DenseMap<Value *, int>::iterator CostIt;		DenseMap<Value *, int>::iterator CostIt;
if (lookupSROAArgAndCost(V, SROAArg, CostIt))		if (lookupSROAArgAndCost(V, SROAArg, CostIt))
disableSROA(CostIt);		disableSROA(CostIt);
}		}

/// \brief Accumulate the given cost for a particular SROA candidate.		/// \brief Accumulate the given cost for a particular SROA candidate.
void CallAnalyzer::accumulateSROACost(DenseMap<Value *, int>::iterator CostIt,		void CallAnalyzer::accumulateSROACost(DenseMap<Value *, int>::iterator CostIt,
int InstructionCost) {		int InstructionCost) {
CostIt->second += InstructionCost;		CostIt->second += InstructionCost;
SROACostSavings += InstructionCost;		SROACostSavings += InstructionCost;
}		}

		void CallAnalyzer::disableLoadElimination() {
		if (EnableLoadElimination) {
		Cost += LoadEliminationCost;
		EnableLoadElimination = false;
		}
		}

/// \brief Accumulate a constant GEP offset into an APInt if possible.		/// \brief Accumulate a constant GEP offset into an APInt if possible.
///		///
/// Returns false if unable to compute the offset for any reason. Respects any		/// Returns false if unable to compute the offset for any reason. Respects any
/// simplified values known during the analysis of this callsite.		/// simplified values known during the analysis of this callsite.
bool CallAnalyzer::accumulateGEPOffset(GEPOperator &GEP, APInt &Offset) {		bool CallAnalyzer::accumulateGEPOffset(GEPOperator &GEP, APInt &Offset) {
unsigned IntPtrWidth = DL.getPointerSizeInBits();		unsigned IntPtrWidth = DL.getPointerSizeInBits();
assert(IntPtrWidth == Offset.getBitWidth());		assert(IntPtrWidth == Offset.getBitWidth());

▲ Show 20 Lines • Show All 710 Lines • ▼ Show 20 Lines	if (lookupSROAArgAndCost(I.getPointerOperand(), SROAArg, CostIt)) {
if (I.isSimple()) {		if (I.isSimple()) {
accumulateSROACost(CostIt, InlineConstants::InstrCost);		accumulateSROACost(CostIt, InlineConstants::InstrCost);
return true;		return true;
}		}

disableSROA(CostIt);		disableSROA(CostIt);
}		}

		// If the data is already loaded from this address and hasn't been clobbered
		// by any stores or calls, this load is likely to be redundant and can be
		// eliminated.
		if (EnableLoadElimination &&
		!LoadAddrSet.insert(I.getPointerOperand()).second) {
		LoadEliminationCost += InlineConstants::InstrCost;
		return true;
		}

return false;		return false;
}		}

bool CallAnalyzer::visitStore(StoreInst &I) {		bool CallAnalyzer::visitStore(StoreInst &I) {
Value *SROAArg;		Value *SROAArg;
DenseMap<Value *, int>::iterator CostIt;		DenseMap<Value *, int>::iterator CostIt;
if (lookupSROAArgAndCost(I.getPointerOperand(), SROAArg, CostIt)) {		if (lookupSROAArgAndCost(I.getPointerOperand(), SROAArg, CostIt)) {
if (I.isSimple()) {		if (I.isSimple()) {
accumulateSROACost(CostIt, InlineConstants::InstrCost);		accumulateSROACost(CostIt, InlineConstants::InstrCost);
return true;		return true;
}		}

disableSROA(CostIt);		disableSROA(CostIt);
}		}

		// The store can potentially clobber loads and prevent repeated loads from
		// being eliminated.
		// FIXME:
		// 1. We can probably keep an initial set of eliminatable loads substracted
		// from the cost even when we finally see a store. We just need to disable
		// further accumulation of elimination savings.
		// 2. We should probably at some point thread MemorySSA for the callee into
		// this and then use that to actually compute really precise savings.
		disableLoadElimination();
return false;		return false;
}		}

bool CallAnalyzer::visitExtractValue(ExtractValueInst &I) {		bool CallAnalyzer::visitExtractValue(ExtractValueInst &I) {
// Constant folding for extract value is trivial.		// Constant folding for extract value is trivial.
if (simplifyInstruction(I, [&](SmallVectorImpl<Constant *> &COps) {		if (simplifyInstruction(I, [&](SmallVectorImpl<Constant *> &COps) {
return ConstantExpr::getExtractValue(COps[0], I.getIndices());		return ConstantExpr::getExtractValue(COps[0], I.getIndices());
}))		}))
▲ Show 20 Lines • Show All 66 Lines • ▼ Show 20 Lines	if (Function *F = CS.getCalledFunction()) {
if (simplifyCallSite(F, CS))		if (simplifyCallSite(F, CS))
return true;		return true;

// Next check if it is an intrinsic we know about.		// Next check if it is an intrinsic we know about.
// FIXME: Lift this into part of the InstVisitor.		// FIXME: Lift this into part of the InstVisitor.
if (IntrinsicInst *II = dyn_cast<IntrinsicInst>(CS.getInstruction())) {		if (IntrinsicInst *II = dyn_cast<IntrinsicInst>(CS.getInstruction())) {
switch (II->getIntrinsicID()) {		switch (II->getIntrinsicID()) {
default:		default:
		if (!CS.onlyReadsMemory() && !isAssumeLikeIntrinsic(II))
		disableLoadElimination();
return Base::visitCallSite(CS);		return Base::visitCallSite(CS);

case Intrinsic::load_relative:		case Intrinsic::load_relative:
// This is normally lowered to 4 LLVM instructions.		// This is normally lowered to 4 LLVM instructions.
Cost += 3 * InlineConstants::InstrCost;		Cost += 3 * InlineConstants::InstrCost;
return false;		return false;

case Intrinsic::memset:		case Intrinsic::memset:
case Intrinsic::memcpy:		case Intrinsic::memcpy:
case Intrinsic::memmove:		case Intrinsic::memmove:
		disableLoadElimination();
// SROA can usually chew through these intrinsics, but they aren't free.		// SROA can usually chew through these intrinsics, but they aren't free.
return false;		return false;
case Intrinsic::localescape:		case Intrinsic::localescape:
HasFrameEscape = true;		HasFrameEscape = true;
return false;		return false;
}		}
}		}

Show All 10 Lines	if (TTI.isLoweredToCall(F)) {
Cost += CS.arg_size() * InlineConstants::InstrCost;		Cost += CS.arg_size() * InlineConstants::InstrCost;

// Everything other than inline ASM will also have a significant cost		// Everything other than inline ASM will also have a significant cost
// merely from making the call.		// merely from making the call.
if (!isa<InlineAsm>(CS.getCalledValue()))		if (!isa<InlineAsm>(CS.getCalledValue()))
Cost += InlineConstants::CallPenalty;		Cost += InlineConstants::CallPenalty;
}		}

		if (!CS.onlyReadsMemory())
		disableLoadElimination();
return Base::visitCallSite(CS);		return Base::visitCallSite(CS);
}		}

// Otherwise we're in a very special case -- an indirect function call. See		// Otherwise we're in a very special case -- an indirect function call. See
// if we can be particularly clever about this.		// if we can be particularly clever about this.
Value *Callee = CS.getCalledValue();		Value *Callee = CS.getCalledValue();

// First, pay the price of the argument setup. We account for the average		// First, pay the price of the argument setup. We account for the average
// 1 instruction per call argument setup here.		// 1 instruction per call argument setup here.
Cost += CS.arg_size() * InlineConstants::InstrCost;		Cost += CS.arg_size() * InlineConstants::InstrCost;

// Next, check if this happens to be an indirect function call to a known		// Next, check if this happens to be an indirect function call to a known
// function in this inline context. If not, we've done all we can.		// function in this inline context. If not, we've done all we can.
Function *F = dyn_cast_or_null<Function>(SimplifiedValues.lookup(Callee));		Function *F = dyn_cast_or_null<Function>(SimplifiedValues.lookup(Callee));
if (!F)		if (!F) {
		if (!CS.onlyReadsMemory())
		disableLoadElimination();
return Base::visitCallSite(CS);		return Base::visitCallSite(CS);
		}

// If we have a constant that we are calling as a function, we can peer		// If we have a constant that we are calling as a function, we can peer
// through it and see the function target. This happens not infrequently		// through it and see the function target. This happens not infrequently
// during devirtualization and so we want to give it a hefty bonus for		// during devirtualization and so we want to give it a hefty bonus for
// inlining, but cap that bonus in the event that inlining wouldn't pan		// inlining, but cap that bonus in the event that inlining wouldn't pan
// out. Pretend to inline the function, with a custom threshold.		// out. Pretend to inline the function, with a custom threshold.
auto IndirectCallParams = Params;		auto IndirectCallParams = Params;
IndirectCallParams.DefaultThreshold = InlineConstants::IndirectCallThreshold;		IndirectCallParams.DefaultThreshold = InlineConstants::IndirectCallThreshold;
CallAnalyzer CA(TTI, GetAssumptionCache, GetBFI, PSI, ORE, *F, CS,		CallAnalyzer CA(TTI, GetAssumptionCache, GetBFI, PSI, ORE, *F, CS,
IndirectCallParams);		IndirectCallParams);
if (CA.analyzeCall(CS)) {		if (CA.analyzeCall(CS)) {
// We were able to inline the indirect call! Subtract the cost from the		// We were able to inline the indirect call! Subtract the cost from the
// threshold to get the bonus we want to apply, but don't go below zero.		// threshold to get the bonus we want to apply, but don't go below zero.
Cost -= std::max(0, CA.getThreshold() - CA.getCost());		Cost -= std::max(0, CA.getThreshold() - CA.getCost());
}		}

		if (!F->onlyReadsMemory())
		disableLoadElimination();
return Base::visitCallSite(CS);		return Base::visitCallSite(CS);
}		}

bool CallAnalyzer::visitReturnInst(ReturnInst &RI) {		bool CallAnalyzer::visitReturnInst(ReturnInst &RI) {
// At least one return instruction will be free after inlining.		// At least one return instruction will be free after inlining.
bool Free = !HasReturn;		bool Free = !HasReturn;
HasReturn = true;		HasReturn = true;
return Free;		return Free;
▲ Show 20 Lines • Show All 586 Lines • ▼ Show 20 Lines	#define DEBUG_PRINT_STAT(x) dbgs() << " " #x ": " << x << "\n"
DEBUG_PRINT_STAT(NumConstantOffsetPtrArgs);		DEBUG_PRINT_STAT(NumConstantOffsetPtrArgs);
DEBUG_PRINT_STAT(NumAllocaArgs);		DEBUG_PRINT_STAT(NumAllocaArgs);
DEBUG_PRINT_STAT(NumConstantPtrCmps);		DEBUG_PRINT_STAT(NumConstantPtrCmps);
DEBUG_PRINT_STAT(NumConstantPtrDiffs);		DEBUG_PRINT_STAT(NumConstantPtrDiffs);
DEBUG_PRINT_STAT(NumInstructionsSimplified);		DEBUG_PRINT_STAT(NumInstructionsSimplified);
DEBUG_PRINT_STAT(NumInstructions);		DEBUG_PRINT_STAT(NumInstructions);
DEBUG_PRINT_STAT(SROACostSavings);		DEBUG_PRINT_STAT(SROACostSavings);
DEBUG_PRINT_STAT(SROACostSavingsLost);		DEBUG_PRINT_STAT(SROACostSavingsLost);
		DEBUG_PRINT_STAT(LoadEliminationCost);
DEBUG_PRINT_STAT(ContainsNoDuplicateCall);		DEBUG_PRINT_STAT(ContainsNoDuplicateCall);
DEBUG_PRINT_STAT(Cost);		DEBUG_PRINT_STAT(Cost);
DEBUG_PRINT_STAT(Threshold);		DEBUG_PRINT_STAT(Threshold);
#undef DEBUG_PRINT_STAT		#undef DEBUG_PRINT_STAT
}		}
#endif		#endif

/// \brief Test that there are no attribute conflicts between Caller and Callee		/// \brief Test that there are no attribute conflicts between Caller and Callee
▲ Show 20 Lines • Show All 221 Lines • Show Last 20 Lines

llvm/trunk/lib/Analysis/ValueTracking.cpp

Show First 20 Lines • Show All 477 Lines • ▼ Show 20 Lines	if (llvm::all_of(V->users(), [&](const User *U) {
}		}
}		}
}		}

return false;		return false;
}		}

// Is this an intrinsic that cannot be speculated but also cannot trap?		// Is this an intrinsic that cannot be speculated but also cannot trap?
static bool isAssumeLikeIntrinsic(const Instruction *I) {		bool llvm::isAssumeLikeIntrinsic(const Instruction *I) {
if (const CallInst *CI = dyn_cast<CallInst>(I))		if (const CallInst *CI = dyn_cast<CallInst>(I))
if (Function *F = CI->getCalledFunction())		if (Function *F = CI->getCalledFunction())
switch (F->getIntrinsicID()) {		switch (F->getIntrinsicID()) {
default: break;		default: break;
// FIXME: This list is repeated from NoTTI::getIntrinsicCost.		// FIXME: This list is repeated from NoTTI::getIntrinsicCost.
case Intrinsic::assume:		case Intrinsic::assume:
case Intrinsic::sideeffect:		case Intrinsic::sideeffect:
case Intrinsic::dbg_declare:		case Intrinsic::dbg_declare:
▲ Show 20 Lines • Show All 4,273 Lines • Show Last 20 Lines

llvm/trunk/test/Transforms/Inline/redundant-loads.ll

				; RUN: opt -inline < %s -S -o - -inline-threshold=3 \| FileCheck %s

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				declare void @pad() readnone

				define void @outer1(i32* %a) {
				; CHECK-LABEL: @outer1(
				; CHECK-NOT: call void @inner1
				%b = alloca i32
				call void @inner1(i32* %a, i32* %b)
				ret void
				}

				define void @inner1(i32* %a, i32* %b) {
				%1 = load i32, i32* %a
				store i32 %1, i32 * %b ; This store does not clobber the first load.
				%2 = load i32, i32* %a
				call void @pad()
				%3 = load i32, i32* %a
				ret void
				}


				define void @outer2(i32* %a, i32* %b) {
				; CHECK-LABEL: @outer2(
				; CHECK: call void @inner2
				call void @inner2(i32* %a, i32* %b)
				ret void
				}

				define void @inner2(i32* %a, i32* %b) {
				%1 = load i32, i32* %a
				store i32 %1, i32 * %b ; This store clobbers the first load.
				%2 = load i32, i32* %a
				call void @pad()
				ret void
				}


				define void @outer3(i32* %a) {
				; CHECK-LABEL: @outer3(
				; CHECK: call void @inner3
				call void @inner3(i32* %a)
				ret void
				}

				declare void @ext()

				define void @inner3(i32* %a) {
				%1 = load i32, i32* %a
				call void @ext() ; This call clobbers the first load.
				%2 = load i32, i32* %a
				ret void
				}


				define void @outer4(i32* %a, i32* %b, i32* %c) {
				; CHECK-LABEL: @outer4(
				; CHECK-NOT: call void @inner4
				call void @inner4(i32* %a, i32* %b, i1 false)
				ret void
				}

				define void @inner4(i32* %a, i32* %b, i1 %pred) {
				%1 = load i32, i32* %a
				br i1 %pred, label %cond_true, label %cond_false

				cond_true:
				store i32 %1, i32 * %b ; This store does not clobber the first load.
				br label %cond_false

				cond_false:
				%2 = load i32, i32* %a
				call void @pad()
				%3 = load i32, i32* %a
				%4 = load i32, i32* %a
				ret void
				}


				define void @outer5(i32* %a, double %b) {
				; CHECK-LABEL: @outer5(
				; CHECK-NOT: call void @inner5
				call void @inner5(i32* %a, double %b)
				ret void
				}

				declare double @llvm.fabs.f64(double) nounwind readnone

				define void @inner5(i32* %a, double %b) {
				%1 = load i32, i32* %a
				%2 = call double @llvm.fabs.f64(double %b) ; This intrinsic does not clobber the first load.
				%3 = load i32, i32* %a
				call void @pad()
				ret void
				}

				define void @outer6(i32* %a, i8* %ptr) {
				; CHECK-LABEL: @outer6(
				; CHECK-NOT: call void @inner6
				call void @inner6(i32* %a, i8* %ptr)
				ret void
				}

				declare void @llvm.lifetime.start.p0i8(i64, i8* nocapture) argmemonly nounwind

				define void @inner6(i32* %a, i8* %ptr) {
				%1 = load i32, i32* %a
				call void @llvm.lifetime.start.p0i8(i64 32, i8* %ptr) ; This intrinsic does not clobber the first load.
				%2 = load i32, i32* %a
				call void @pad()
				%3 = load i32, i32* %a
				ret void
				}

				define void @outer7(i32* %a) {
				; CHECK-LABEL: @outer7(
				; CHECK-NOT: call void @inner7
				call void @inner7(i32* %a)
				ret void
				}

				declare void @ext2() readnone

				define void @inner7(i32* %a) {
				%1 = load i32, i32* %a
				call void @ext2() ; This call does not clobber the first load.
				%2 = load i32, i32* %a
				ret void
				}


				define void @outer8(i32* %a) {
				; CHECK-LABEL: @outer8(
				; CHECK-NOT: call void @inner8
				call void @inner8(i32* %a, void ()* @ext2)
				ret void
				}

				define void @inner8(i32* %a, void ()* %f) {
				%1 = load i32, i32* %a
				call void %f() ; This indirect call does not clobber the first load.
				%2 = load i32, i32* %a
				call void @pad()
				call void @pad()
				call void @pad()
				call void @pad()
				call void @pad()
				call void @pad()
				call void @pad()
				call void @pad()
				call void @pad()
				call void @pad()
				call void @pad()
				call void @pad()
				ret void
				}


				define void @outer9(i32* %a) {
				; CHECK-LABEL: @outer9(
				; CHECK: call void @inner9
				call void @inner9(i32* %a, void ()* @ext)
				ret void
				}

				define void @inner9(i32* %a, void ()* %f) {
				%1 = load i32, i32* %a
				call void %f() ; This indirect call clobbers the first load.
				%2 = load i32, i32* %a
				call void @pad()
				call void @pad()
				call void @pad()
				call void @pad()
				call void @pad()
				call void @pad()
				call void @pad()
				call void @pad()
				call void @pad()
				call void @pad()
				call void @pad()
				call void @pad()
				ret void
				}

This is an archive of the discontinued LLVM Phabricator instance.

[InlineCost] Find identical loads in the calleeClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 127121

llvm/trunk/include/llvm/Analysis/ValueTracking.h

llvm/trunk/lib/Analysis/InlineCost.cpp

llvm/trunk/lib/Analysis/ValueTracking.cpp

llvm/trunk/test/Transforms/Inline/redundant-loads.ll

[InlineCost] Find identical loads in the callee
ClosedPublic