The heuristics used in LSR cannot always guarantee LSR's final solution is the best.
Based the existing LSR's cost model, this change calculate the cost of input of LSR.
If LSR's final solution is more expensive than just using input IR, then we don't
use LSR's solution.
Details
Diff Detail
Event Timeline
I like this change, thanks for implementing it!
It is useful for those kernels where the programmer knows how
to get a better set of induction variables than the suboptimal
IVs selected by the compiler heuristic.
I have wanted something like this! Thanks for doing it.
test/Transforms/LoopStrengthReduce/AArch64/skip-lsr-solution.ll | ||
---|---|---|
2 | This needs "REQUIRES: asserts". |
Here's a testcase where LSR generates code that it worse than the original (on Hexagon):
#include <stdint.h> typedef struct { int32_t X, Y; int32_t *Ptr; } S; int foo(S* Data, int32_t Off, int32_t Idx, int32_t *Out) { int32_t Col = Idx; if (Off >= Data->X) return 5; while (Col >= Data->Y) Col -= Data->Y; *Out = *(Data->Ptr + Col + Data->Y * Off); return 0; }
Compile it with clang -target hexagon -O3. The code you added eventually punts (FormInputLSRUseAndFormula returns false), and LSR proceeds to do its thing. I did some analysis, and the problem is with %add.ptr = getelementptr inbounds i32, i32* %2, i32 %Col.0. This is not an "address use", since it goes into another GEP. It exists in the original source, but LSR never looks at it. Your code does and that makes it exit early. Maybe you should restrict the uses you look at to the same ones that LSR starts with?
Here's the IR immediately before LSR:
define dso_local i32 @foo(%struct.S* nocapture readonly %Data, i32 %Off, i32 %Idx, i32* nocapture %Out) local_unnamed_addr #0 { entry: %X = getelementptr inbounds %struct.S, %struct.S* %Data, i32 0, i32 0 %0 = load i32, i32* %X, align 4, !tbaa !2 %cmp = icmp sgt i32 %0, %Off br i1 %cmp, label %while.cond.preheader, label %cleanup while.cond.preheader: ; preds = %entry %Y = getelementptr inbounds %struct.S, %struct.S* %Data, i32 0, i32 1 %1 = load i32, i32* %Y, align 4, !tbaa !8 br label %while.cond while.cond: ; preds = %while.cond, %while.cond.preheader %Col.0 = phi i32 [ %sub, %while.cond ], [ %Idx, %while.cond.preheader ] %cmp1 = icmp slt i32 %Col.0, %1 %sub = sub nsw i32 %Col.0, %1 br i1 %cmp1, label %while.end, label %while.cond while.end: ; preds = %while.cond %Ptr = getelementptr inbounds %struct.S, %struct.S* %Data, i32 0, i32 2 %2 = load i32*, i32** %Ptr, align 4, !tbaa !9 %add.ptr = getelementptr inbounds i32, i32* %2, i32 %Col.0 %mul = mul nsw i32 %1, %Off %add.ptr4 = getelementptr inbounds i32, i32* %add.ptr, i32 %mul %3 = load i32, i32* %add.ptr4, align 4, !tbaa !10 store i32 %3, i32* %Out, align 4, !tbaa !10 br label %cleanup cleanup: ; preds = %entry, %while.end %retval.0 = phi i32 [ 0, %while.end ], [ 5, %entry ] ret i32 %retval.0 }
lib/Transforms/Scalar/LoopStrengthReduce.cpp | ||
---|---|---|
5727 | Do you have some statistics on how many times this currently happens on a benchmark of your choice? | |
test/Transforms/LoopStrengthReduce/AArch64/skip-lsr-solution.ll | ||
2 | I don't see any CHECK statement depending on -debug-only, so instead of requiring asserts, let's just remove that flag. Also please remove the other flag: -lsr-insns-cost=true as I see that its default value is true: "lsr-insns-cost", cl::Hidden, cl::init(true), |
test/Transforms/LoopStrengthReduce/AArch64/skip-lsr-solution.ll | ||
---|---|---|
2 | It's there, the first CHECK line: CHECK: Skip using LSR's solution. |
This patch is causing the following lit-test fails:
Builtins-i386-linux :: divsc3_test.c LLVM :: CodeGen/X86/2006-05-11-InstrSched.ll LLVM :: CodeGen/X86/MergeConsecutiveStores.ll LLVM :: CodeGen/X86/atom-fixup-lea3.ll LLVM :: CodeGen/X86/conditional-tailcall.ll LLVM :: CodeGen/X86/loop-strength-reduce8.ll LLVM :: CodeGen/X86/lsr-interesting-step.ll LLVM :: CodeGen/X86/merge_store.ll LLVM :: CodeGen/X86/misched-matrix.ll LLVM :: CodeGen/X86/multiple-loop-post-inc.ll LLVM :: CodeGen/X86/ragreedy-hoist-spill.ll LLVM :: CodeGen/X86/regalloc-reconcile-broken-hints.ll LLVM :: DebugInfo/COFF/fpo-shrink-wrap.ll LLVM :: Transforms/LoopStrengthReduce/2013-01-14-ReuseCast.ll LLVM :: Transforms/LoopStrengthReduce/X86/ivchain-X86.ll LLVM :: Transforms/LoopStrengthReduce/X86/lsr-filtering-scaledreg.ll LLVM :: Transforms/LoopStrengthReduce/X86/nested-loop.ll LLVM :: Transforms/LoopStrengthReduce/funclet.ll LLVM :: Transforms/LoopStrengthReduce/pr27056.ll
One of the reasons for these fails is that the current GetOrCreateLSRUse function is not taking the reference of "S" as a parameter (explained in the inline comment). Though adding "&" reduces the number of fails, followings are still failing:
Builtins-i386-linux :: divsc3_test.c LLVM :: CodeGen/X86/conditional-tailcall.ll LLVM :: CodeGen/X86/regalloc-reconcile-broken-hints.ll LLVM :: DebugInfo/COFF/fpo-shrink-wrap.ll LLVM :: Transforms/LoopStrengthReduce/2013-01-14-ReuseCast.ll LLVM :: Transforms/LoopStrengthReduce/funclet.ll LLVM :: Transforms/LoopStrengthReduce/pr27056.ll
There might be some other problems in the cost modelling algorithm which needs further investigation.
lib/Transforms/Scalar/LoopStrengthReduce.cpp | ||
---|---|---|
3357 | The "S" parameter is missing the "&" and it must be added. This function calls getUse() with "S", and getUse modifies the parameter. Without the "&", the modification by getUse will not be seen by the caller of GetOrCreateLSRUse. |
I tried this patch on exynos-m3 and there are several benchmarks improving by about 5%.
Among those benchmarks are spec2000 188.ammp and 256.bzip2 that improve by 3%.
All performance degradations are within noise level.
lib/Transforms/Scalar/LoopStrengthReduce.cpp | ||
---|---|---|
3603 | s/on/one/ | |
3620 | s/getGEPExptr/getGEPExpr/ |
Sorry for the extremely long delay on this change. Now I just updated the patch. Please take a look and let me know any comment.
Compile it with clang -target hexagon -O3. The code you added eventually punts (FormInputLSRUseAndFormula returns false), and LSR proceeds to do its thing. I did some analysis, and the problem is with %add.ptr = getelementptr inbounds i32, i32* %2, i32 %Col.0. This is not an "address use", since it goes into another GEP. It exists in the original source, but LSR never looks at it. Your code does and that makes it exit early. Maybe you should restrict the uses you look at to the same ones that LSR starts with?
We should consider costs of all SCEVable instructions between phi and leaf IV users. When creating the initial formula, those SCEVable instructions in the middle must be folded as a part of formula, but to find the input cost without LSR transformation we should count costs of all those instructions.
lib/Transforms/Scalar/LoopStrengthReduce.cpp | ||
---|---|---|
3357 | Thanks for catching ! | |
5727 | For now I want to be conservative on skipping LSR even when the input cost is shown to be cheap, so I applied some weight on the input cost before comparing with the selected solution. Because of this weight, it doesn't seem to happen widely. In my test for spec2000, it impact only on 6 loops. | |
test/Transforms/LoopStrengthReduce/AArch64/skip-lsr-solution.ll | ||
2 | In order to force to use #instruction in the cost model for this test, we need to have -lsr-insns-cost=true specifically in the command-line because the occurrence of lsr-insns-cost is checked in Cost::isLess(). |
The "S" parameter is missing the "&" and it must be added. This function calls getUse() with "S", and getUse modifies the parameter. Without the "&", the modification by getUse will not be seen by the caller of GetOrCreateLSRUse.