This is an archive of the discontinued LLVM Phabricator instance.

[X86] Improvement in CodeGen instruction selection for LEAs.
ClosedPublic

Authored by jbhateja on Jul 5 2017, 8:40 AM.

Details

Summary

1/ Operand folding during complex pattern matching for LEAs has been extended, such that it promotes Scale to

 accommodate similar operand appearing in the DAG  e.g.
             T1 = A + B
             T2 = T1 + 10
             T3 = T2 + A
For above DAG rooted at T3, X86AddressMode will now look like
            Base = B , Index = A , Scale = 2 , Disp = 10

2/ During OptimizeLEAPass down the pipeline factorization is now performed over LEAs so that if there is an opportunity

then complex LEAs (having 3 operands) could be factored out  e.g.
            leal 1(%rax,%rcx,1), %rdx
            leal 1(%rax,%rcx,2), %rcx
will be factored as following
            leal 1(%rax,%rcx,1), %rdx
            leal (%rdx,%rcx)   , %edx

3/ Aggressive operand folding for AM based selection for LEAs is sensitive to loops, thus avoiding creation of any complex LEAs within a loop.

4/ Simplify LEA converts (lea (BASE,1,INDEX,0) --> add (BASE, INDEX) which offers better through put.

PR32755 will be taken care of by this pathc.

Previous patch revisions : r313343 , r314886

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
lsaba edited edge metadata.Jul 6 2017, 1:42 AM

Please add test cases for the optimization added in OptimizeLEAPass

lsaba added inline comments.Jul 6 2017, 2:06 AM
lib/Target/X86/X86OptimizeLEAs.cpp
86

This comment is only valid for the else statement, please change it to explain the different cases between Identical and Similar Disp

292

Please add a comment explaining what this function does

jbhateja updated this revision to Diff 105755.Jul 9 2017, 1:44 AM

Changes for review comments for patch.

jbhateja marked 11 inline comments as done.Jul 9 2017, 1:52 AM

Reviewers,

Have posted an RFC for wider fix at following link

Kindly review the same and add your comments

https://groups.google.com/forum/#!topic/llvm-dev/x2LDXpON500

Thanks.

Missed one change to be submitted, build is running will upload the change post lit. Thanks

spatel added a subscriber: spatel.Jul 9 2017, 8:21 AM
jbhateja updated this revision to Diff 105790.Jul 9 2017, 12:41 PM

Review comments changes cont..

lsaba added inline comments.Jul 10 2017, 6:20 AM
lib/Target/X86/X86OptimizeLEAs.cpp
737

should it also be erased from the LEAs list?

jbhateja added inline comments.Jul 10 2017, 8:24 AM
lib/Target/X86/X86OptimizeLEAs.cpp
737

Why do you think so ?

LEAs is a Map where

Key = F ( BASE , INDEX , DISP , SEGMENT)
Value = Vector of MI (LEA Instr).
This MAP is populated per BasicBlock basis.

Outer Loop traverse over Map entries

Sort Vector in decresing order of Scale.
Inner Loop traverses over Sorted vector of LEA for a given Key
LI1 insturction will be traversed only once.

Map will be delted once we leave this function.

Machine CSE which is value number based is already run before this pass so if there are multiple identical LEAs (i.e same BASE/INDEX/SCALE/DISP/SEGMENT) in a BasicBlock they will be factored out before we land up here..

lsaba added inline comments.Jul 11 2017, 5:02 AM
lib/Target/X86/X86OptimizeLEAs.cpp
737

just making sure:)
by the way, can't this algorithm work cross a function's basic blocks?

jbhateja updated this revision to Diff 106796.Jul 16 2017, 2:37 AM

Performing LEA factorization/CSE across bacis blocks.

jbhateja marked an inline comment as done.Jul 17 2017, 8:05 AM
jbhateja marked 2 inline comments as done and an inline comment as not done.Jul 17 2017, 8:07 AM

ping @reviewers

lsaba added inline comments.Jul 19 2017, 8:03 AM
lib/Target/X86/X86OptimizeLEAs.cpp
341

it is unclear what this function does, can you explain?

jbhateja added inline comments.Jul 19 2017, 9:54 AM
lib/Target/X86/X86OptimizeLEAs.cpp
341

In a nutshell we are implementing a scoped hash map. Which is LEAs.
Every time we enter a new scope and encounter an LEA we first record the length of list of MIs corresponding to MemOpKey of new LEA. After that we insert the new LEA in the beginning of the list which is a value field of the hash map.

When we leave a scope we remove the LEA instructions from the LEAs hash map. Since we recorded the original length of list of MIs when we entered the scope at exit we keep on removing elements from the beginning of list till the size becomes same as what was recorded at the entry.

jbhateja marked an inline comment as done.Jul 19 2017, 9:55 AM
lsaba added inline comments.Jul 24 2017, 5:52 AM
lib/Target/X86/X86OptimizeLEAs.cpp
342

already initialized at the beginning of the function

test/CodeGen/X86/lea-opt-csebb.ll
1 ↗(On Diff #106796)

can you please add a test case that covers scale >1 cases

test/CodeGen/X86/umul-with-overflow.ll
6 ↗(On Diff #106796)

why did this test change?

jbhateja added inline comments.Jul 24 2017, 8:18 AM
test/CodeGen/X86/lea-opt-csebb.ll
1 ↗(On Diff #106796)

This commit if you see has two parts
1/ pattern matching based on addressing mode (which is limited currently).
2/ factoring of LEAs which is generic.

Checking in incremental changes should be fine I guess.

Generic pattern will need to be brought out of addessing mode based selection as I described in following link
https://groups.google.com/forum/#!topic/llvm-dev/x2LDXpON500

Please comment in the thread.

test/CodeGen/X86/umul-with-overflow.ll
6 ↗(On Diff #106796)

Beecause I generated its output with script utils/update_llc_test_checks.py which adds
an assertion for each instruction. I think it sould be fine.

jbhateja added inline comments.Jul 24 2017, 8:20 AM
lib/Target/X86/X86OptimizeLEAs.cpp
342

Yes.

ping @ reviewers. can we do an incremental checkin for this.

Thanks

lsaba added inline comments.Jul 26 2017, 4:21 AM
test/CodeGen/X86/lea-opt-csebb.ll
1 ↗(On Diff #106796)

I am not sure i understand what you mean by "Generic pattern will need to be brought out of addessing mode" , as far as i understand, for the following C code:

int foo(int a, int b) {

int x = a + 2*b + 4; 
int y = a + 4*b + 4; 
int c = x*y ;
return c;

}

the currently generated IR:
define i32 @foo(i32 %a, i32 %b) local_unnamed_addr #0 {
entry:

%mul = shl i32 %b, 1
%add = add i32 %a, 4
%add1 = add i32 %add, %mul
%mul2 = shl i32 %b, 2
%add4 = add i32 %add, %mul2
%mul5 = mul nsw i32 %add1, %add4
ret i32 %mul5

}

the currently generated asm:

leal 4(%rdi,%rsi,2), %ecx
leal 4(%rdi,%rsi,4), %eax
imull %ecx, %eax
retq

this will be refactored by this optimization in this current commit (not a future commit) to:
leal 4(%rdi,%rsi,2), %ecx
leal (%ecx,%rsi,2), %eax
imull %ecx, %eax
retq

please correct me if im wrong

test/CodeGen/X86/lea-opt-cst.ll
3

please generate the test with the original checks before your changes and commit it first in a separate commit

test/CodeGen/X86/umul-with-overflow.ll
6 ↗(On Diff #106796)

This needs to be in a separate pre commit. please commit and rebase

RKSimon added inline comments.Jul 26 2017, 4:24 AM
lib/Target/X86/X86OptimizeLEAs.cpp
189

Can we avoid the static?

240

Comment describing the purpose of the class

263

(style) cleanup the positions of the * - check what clang-format does

263

comment

278

comment

298

(style) remove braces

test/CodeGen/X86/lea-opt-csebb.ll
1 ↗(On Diff #106796)

Please can you commit this test file to trunk with current codegen and update the patch to show the diff

test/CodeGen/X86/lea-opt-cst.ll
3

Please can you commit this test file to trunk with current codegen and update the patch to show the diff

test/CodeGen/X86/umul-with-overflow.ll
6 ↗(On Diff #106796)

I regenerated this recently - please rebase

utils/TableGen/DAGISelMatcherGen.cpp
308

This is still here

jbhateja added inline comments.Jul 26 2017, 8:33 AM
test/CodeGen/X86/lea-opt-csebb.ll
1 ↗(On Diff #106796)

Hi Lama,

By generic patten handling I meant LEA folding into complex LEAs which is currently restrictive.

Consider following case

%struct.SA = type { i32 , i32 , i32 , i32 , i32};

define void @foo(%struct.SA* nocapture %ctx, i32 %n) local_unnamed_addr #0 {
entry:

%h0 = getelementptr inbounds %struct.SA, %struct.SA* %ctx, i64 0, i32 0
%0 = load i32, i32* %h0, align 8
%h3 = getelementptr inbounds %struct.SA, %struct.SA* %ctx, i64 0, i32 3
%h4 = getelementptr inbounds %struct.SA, %struct.SA* %ctx, i64 0, i32 4
%1 = load i32, i32* %h4, align 8
%add = add i32 %0, 1
%add4 = add i32 %add, %1
%add5 = add i32 %add4, %1
store i32 %add5, i32* %h3, align 4
%add10 = add i32 %add5, %1
%add29 = add i32 %add10, %1
store i32 %add29, i32* %h4, align 8
ret void

}

ASM :

foo: # @foo
.cfi_startproc

BB#0: # %entry

movl (%rdi), %eax
movl 16(%rdi), %ecx
leal (%rax,%rcx,2), %edx
leal 1(%rax,%rcx,2), %eax
movl %eax, 12(%rdi)
leal 1(%rdx,%rcx,2), %eax
movl %eax, 16(%rdi)

It could be further optimized to following:

movl (%rdi), %eax
movl 16(%rdi), %ecx
leal 1(%rax,%rcx,2), %edx
movl %eax, 12(%rdi)
leal (%rdx,%rcx,2), %eax
movl %eax, 16(%rdi)

Folding is currently being done as a part of addressing mode matcher, I feel that efficient
folding can only be done as a separate MI pass, that is what I explained in the proposal (http://lists.llvm.org/pipermail/llvm-dev/2017-July/115182.html).

Thanks for your example I will add it to the test cases , it demonstrates generic ness of Factorization.

lsaba added inline comments.Jul 27 2017, 12:48 AM
test/CodeGen/X86/lea-opt-csebb.ll
1 ↗(On Diff #106796)

Hi,

Thanks, I understand the need for a more generic pattern matching and I agree.
This is unrelated to my comment which refers solely to the Factorize LEA optimization which needs more testing, for example covering different Scale values (like the example i provided) and testing factorizing LEAs cross Basic Blocks.

lsaba added inline comments.Jul 31 2017, 4:50 AM
lib/Target/X86/X86OptimizeLEAs.cpp
724

This could end up in an assertion failure if LI1 is at the beginning of the BB, need to handle it separately, for example in this reproducer :

; ModuleID = 'bugpoint-reduced-simplified.bc'
source_filename = "bugpoint-output-2ef2e5d.bc"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

; Function Attrs: norecurse nounwind readnone uwtable
define i32 @foo(i32 %a, i32 %b, i32 %d, i32 %y, i32 %x) local_unnamed_addr #0 {
entry:

%mul1 = shl i32 %b, 1
%add2 = add i32 %a, 4
%add3 = add i32 %add2, %mul1
%mul4 = shl i32 %b, 2
%add6 = add i32 %add2, %mul4
br label %for.body

for.cond.cleanup: ; preds = %for.body

ret i32 %add

for.body: ; preds = %for.body, %entry

%x.addr.015 = phi i32 [ %x, %entry ], [ %add3, %for.body ]
%y.addr.014 = phi i32 [ %y, %entry ], [ %add6, %for.body ]
%mul = mul nsw i32 %x.addr.015, %y.addr.014
%add = add nsw i32 0, %mul
%exitcond = icmp eq i32 undef, %d
br i1 %exitcond, label %for.cond.cleanup, label %for.body, !llvm.loop !1

}

attributes #0 = { norecurse nounwind readnone uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp- math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+fxsr,+ mmx,+sse,+sse2,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.ident = !{!0}

!0 = !{!"clang version 6.0.0 (cfe/trunk 309511)"}
!1 = distinct !{!1, !2}
!2 = !{!"llvm.loop.unroll.disable"}

RKSimon added inline comments.Jul 31 2017, 5:58 AM
lib/Target/X86/X86OptimizeLEAs.cpp
29

(style) Insert the include in alphabetical order (so before MachineFunctionPass.h)

jbhateja updated this revision to Diff 109119.Aug 1 2017, 7:48 AM

1/ Changes to cover review comments.
2/ Handling for patterns involving SUBREG_TO_REG as LEA operands.
3/ Formatting changes.

RKSimon added inline comments.Aug 1 2017, 9:14 AM
lib/Target/X86/X86OptimizeLEAs.cpp
131

(style)

for (unsigned i = 1, e = MI->getNumOperands(); i <e ; i++)
216

(style)

for (unsigned i = 1, e = MI1->getNumOperands(); i < e; ++i)
test/CodeGen/X86/umul-with-overflow.ll
6 ↗(On Diff #106796)

Still needs to be rebased - you've lost the x86_64 tests

jbhateja updated this revision to Diff 109147.Aug 1 2017, 10:03 AM

Pinging reviewers. Kindly pour your comments.
Thanks

lsaba added a comment.Aug 6 2017, 1:33 AM

Pinging reviewers. Kindly pour your comments.
Thanks

Please address my last comment (Line 920)

Pinging reviewers. Kindly pour your comments.
Thanks

Please address my last comment (Line 920)

The test case you provided is giving correct results with currently checked in changes.

lsaba added inline comments.Aug 7 2017, 5:17 AM
lib/Target/X86/X86OptimizeLEAs.cpp
209

need to check MO2.isReg()

jbhateja added inline comments.Aug 8 2017, 7:03 AM
lib/Target/X86/X86OptimizeLEAs.cpp
209

Yes, I shall take care of this.

Kindly let me know if there are any other comments apart from this.

It shall save iterations.

@ reviewers , kindly let me know if there are any more comments apart from last comment from lsaba.
Thanks.

lsaba added a comment.Aug 8 2017, 9:54 AM

@ reviewers , kindly let me know if there are any more comments apart from last comment from lsaba.
Thanks.

Hi,
I ran the patch on several benchmarks to check performance, overall the changes look good, but there is a regression in one of the benchmarks (EEMBC/coremark-pro) caused by creating an undesired lea instruction instead of the previously created add instruction, I am working on creating a simple reproducer for the problem and would appreciate your patience.

Thanks

lsaba added a comment.Aug 9 2017, 8:19 AM

@ reviewers , kindly let me know if there are any more comments apart from last comment from lsaba.
Thanks.

Hi,
I ran the patch on several benchmarks to check performance, overall the changes look good, but there is a regression in one of the benchmarks (EEMBC/coremark-pro) caused by creating an undesired lea instruction instead of the previously created add instruction, I am working on creating a simple reproducer for the problem and would appreciate your patience.

Thanks

The change in X86DAGToDAGISel::matchAddressBase is good when it allows us to git rid of extra lea/add instructions, or replace slow lea with fast lea, but in some cases it only replaces an add instruction with a lea instruction and since the throughput of add instruction is higher, we would prefer to keep the add instruction, for example, for the following IR:

target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
; Function Attrs: norecurse nounwind uwtable
define void @foo() local_unnamed_addr #0 {
entry:
  br i1 undef, label %BB2, label %BB1
BB1:                                  ; preds = %entry
  %rem.us.1 = srem i32 undef, 65536
  br label %BB2
BB2:      ; preds = %BB1, %entry
  %s = phi i32 [ undef, %entry ], [ %rem.us.1, %BB1 ]
  %a = phi i32 [ 1, %entry ], [ 0, %BB1 ]
  %mul1 = mul nsw i32 %s, %a
  %rem1 = srem i32 %mul1, 65536
  %add1 = add nsw i32 %rem1, %a
  %conv1 = trunc i32 %add1 to i16
  store i16 %conv1, i16* undef, align 2, !tbaa !1
  %add2 = add i32 %add1, %a
  %0 = trunc i32 %add2 to i16
  %conv2 = and i16 %0, 255
  store i16 %conv2, i16* undef, align 2, !tbaa !1
  ret void
}
attributes #0 = { norecurse nounwind uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="true" "no-jump-tables"="false" "no-nans-fp-math"="true" "no-signed-zeros-fp-
math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="core-avx2" "target-
features"="+aes,+avx,+avx2,+bmi,+bmi2,+cx16,+f16c,+fma,+fsgsbase,+fxsr,+lzcnt,+mmx,+movbe,+pclmul,+popcnt,+rdrnd,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87,+xsave,+xsaveopt" "unsafe-fp-math"="true" "use-soft-float"="false" }
!llvm.ident = !{!0}
!0 = !{!"clang version 6.0.0 (cfe/trunk 310239)"}
!1 = !{!2, !2, i64 0}
!2 = !{!"short", !3, i64 0}
!3 = !{!"omnipotent char", !4, i64 0}
!4 = !{!"Simple C/C++ TBAA"}

the originally generated code was:

.LBB0_2:                                # %for.cond11.for.inc35_crit_edge.us.unr-lcssa
   movl	%eax, %ecx
   imull	%eax, %ecx
   movl	%ecx, %edx
   sarl	$31, %edx
   shrl	$16, %edx
   addl	%ecx, %edx
   andl	$-65536, %edx           # imm = 0xFFFF0000
   subl	%edx, %ecx
   addl	%eax, %ecx
   movw	%cx, (%rax)
   addl	%eax, %ecx
   movzbl	%cl, %eax
   movw	%ax, (%rax)
   retq

while the generated code now is

movl	%eax, %ecx
imull	%eax, %ecx
movl	%ecx, %edx
sarl	$31, %edx
shrl	$16, %edx
addl	%ecx, %edx
andl	$-65536, %edx           # imm = 0xFFFF0000
subl	%edx, %ecx
leal	(%rcx,%rax), %edx
movw	%dx, (%rax)
leal	(%rcx,%rax,2), %eax
movzbl	%al, %eax
movw	%ax, (%rax)
retq

Need to refine this optimization further to avoid such cases since the impact can be substantial if the code is in a hot loop for example

qcolombet added inline comments.Aug 9 2017, 10:58 AM
include/llvm/CodeGen/MachineInstr.h
1291 ↗(On Diff #109147)

Genuine question.
MRI is usually accessibly via other more efficient means.
Do we really need to rely on this one?

jbhateja updated this revision to Diff 111035.EditedAug 14 2017, 11:08 AM
  • Changes to avoid creating costly complex LEAs having scale less than equal to 2 in loops.
  • Strength reduction for simple LEAs with unit scale for better throughput.
  • Pattern matching for DAG folding has been improved to make it generic.
  • Incorporating other review comments.
jbhateja added inline comments.Aug 14 2017, 11:25 AM
include/llvm/CodeGen/MachineInstr.h
1291 ↗(On Diff #109147)

It seems making it public will be useful as one can directly use the function which internally
calls getParent() twice to get to MachineFunction which contains Reg Info.

ping @ Reviewers, I guess I have addressed all comments.

It seems like there are still correctness issues in the patch, I ran the llvm-test-suite and got a couple of runfails :
multisource_applications_alac_encode_alacconvert_encode
multisource_applications_jm_lencod_lencod

please debug those failures.

In general, please consider running execution tests on the patches to discover runtime failures.

jbhateja updated this revision to Diff 112328.Aug 23 2017, 3:37 AM
  • Limiting the scope of DAG operands folding while AM based instruction selection to LEAs.
  • Formatting changes , rebase and lnt failure fix.

Ping @ reviewers. I think all the comments have been resolved.
Do let me know if any other comments.

jbhateja updated this revision to Diff 113356.Aug 30 2017, 10:51 PM

@lamas, @reviewers, comments have been taken care. Let me know if anything else.

lsaba added a comment.EditedSep 4 2017, 12:24 AM

@lamas, @reviewers, comments have been taken care. Let me know if anything else.

There are still functionality issues with the pass, please allow some time to create a reproducer

lsaba added a comment.Sep 4 2017, 8:19 AM

@lamas, @reviewers, comments have been taken care. Let me know if anything else.

The following ll code fails in CodeGen selection, please debug the issue:

target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
%struct.S1 = type { i32, i32 }
; Function Attrs: nounwind uwtable
define fastcc void @func(i32 %end) unnamed_addr #0 {
entry:
 br label %while.body
while.body:                                       ; preds = %if.end, %entry
 %a = phi i32 [ %end, %entry ], [ undef, %if.end ]
 br i1 undef, label %if.then, label %if.else
if.then:                                          ; preds = %while.body
 %dec = add nsw i32 %a, -1
 %idx1 = sext i32 %dec to i64
 %idx2 = getelementptr inbounds %struct.S1, %struct.S1* null, i64 %idx1
 %0 = bitcast %struct.S1* %idx2 to i64*
 store i64 0, i64* %0, align 4
 %1 = load [3 x float]*, [3 x float]** undef, align 8 
 %idx3 = getelementptr inbounds [3 x float], [3 x float]* %1, i64 %idx1, i64 0
 %2 = bitcast float* %idx3 to i32*
 %3 = load i32, i32* %2, align 4
 store i32 %3, i32* undef, align 4
 br label %if.end
if.else:                                          ; preds = %while.body
 br label %if.end
if.end:                                           ; preds = %if.else, %if.then
 br label %while.body
}
attributes #0 = { nounwind uwtable "target-cpu"="x86-64" "target-features"="+fxsr,+mmx,+sse,+sse2,+x87" }
!llvm.module.flags = !{!0}
!0 = !{i32 1, !"wchar_size", i32 4}
RKSimon edited edge metadata.Sep 4 2017, 8:59 AM

Some style comments, but @lsaba 's comments need to be dealt with first.

include/llvm/CodeGen/MachineInstr.h
1296 ↗(On Diff #113356)

(style) newline before private

include/llvm/CodeGen/SelectionDAG.h
305 ↗(On Diff #113356)

(style) newline before private

lib/Target/X86/X86ISelDAGToDAG.cpp
165

(style) newline

1115

clang-format? If so, commit it as an NFC change

1124

Two hard coded depths like this is weird - better to have a getMaxOperandFoldingDepth() helper?

lib/Target/X86/X86OptimizeLEAs.cpp
174

NFC change - just commit it if you want, but don't pollute a patch with it

316

NFC change

357

clang-format? If so, commit it as an NFC change

650

clang-format? If so, commit it as an NFC change

764

(style) Remove braces

jbhateja updated this revision to Diff 113802.Sep 4 2017, 9:54 PM
  • Fine tuning pattern matching condition.
  • Formatting changes.
lsaba added a comment.Sep 6 2017, 12:13 AM

3-Ops LEA are costly starting target SandyBridge , is there a limitation in the code for the targets this transformation works on? If not I think there should be.
you can check the Slow3OpsLEA feature for the full list of targets.

3-Ops LEA are costly starting target SandyBridge , is there a limitation in the code for the targets this transformation works on? If not I think there should be.
you can check the Slow3OpsLEA feature for the full list of targets.

Yes, this check could be added in pattern matching where i'm avoiding creation of complex LEA with scale less than equal to 2.

Please let me know if anything else you see to save iteration.
Thanks

jbhateja updated this revision to Diff 114326.Sep 8 2017, 2:44 AM
  • Rebasing again.
  • Adding a check for subtarget feature Slow3OpLEA in pattern matching.

ping @ reviewers.

@lsaba, @reviewers , waiting for your LGTM or any remaining comments on this.
Thanks

lsaba accepted this revision.Sep 13 2017, 7:13 AM
This revision is now accepted and ready to land.Sep 13 2017, 7:13 AM
jbhateja updated this revision to Diff 115359.Sep 14 2017, 10:18 PM
  • Few synthetic changes.
This revision was automatically updated to reflect the committed changes.
RKSimon reopened this revision.Sep 16 2017, 2:39 AM

Reverted in rL313376 due to PR34629 and PR34634

This revision is now accepted and ready to land.Sep 16 2017, 2:39 AM
RKSimon requested changes to this revision.Sep 16 2017, 2:42 AM

PR34629 and PR34634 need to be addressed

This revision now requires changes to proceed.Sep 16 2017, 2:42 AM
jbhateja updated this revision to Diff 115561.Sep 17 2017, 12:24 AM
jbhateja edited edge metadata.
  • Undefining result operand of factored statement to preserve SSA nature of Machine IR.
  • This fixes reperted PR 34634 and PR 34629 and build-bot failures reported.
jbhateja added a comment.EditedSep 17 2017, 12:38 AM

@RKSimon, @Reviewers, revision was in accepted state earlier and fix to counter reported issues post commit to trunk has been fixed. Please do let me know if another acceptance is needed to land this again.

jbhateja updated this revision to Diff 115583.Sep 17 2017, 11:36 AM
  • Updating tests for reported PRs for initial patch.
jmolloy requested changes to this revision.Sep 18 2017, 5:04 AM
jmolloy added a subscriber: jmolloy.
jmolloy added inline comments.
lib/Target/X86/X86OptimizeLEAs.cpp
823

This can cause recursion deep enough to cause stack overflows. Please could you refactor this to not use direct recursion? The domtree may be hundreds of nodes deep in degenerate cases.

This revision now requires changes to proceed.Sep 18 2017, 5:04 AM
jbhateja updated this revision to Diff 116144.Sep 21 2017, 12:15 AM
jbhateja edited edge metadata.

@reviewers, required revision change are through, let me know if this can land back.

jbhateja added a comment.EditedSep 27 2017, 4:38 AM

@jmolloy , @RKSimon , this patch has been reviewd and due to regression was opened again for review, required changes have
been made, can this land now in trunk if there are no more observations from any reviewers.

Thanks

jbhateja accepted this revision.EditedOct 3 2017, 1:46 AM

@reviewers, if no more comment I shall be landing this into trunk since required revision changes post acceptance are through.

jbhateja updated this revision to Diff 118948.Oct 13 2017, 11:37 AM
  • Operands of factored LEA must belong to same register class as per Intel's Architecture Manual.
  • Some code reorganization + rebase.
jbhateja updated this revision to Diff 119010.Oct 14 2017, 1:06 AM
  • D35014 : Review comments resolution
  • Removing 2 tests, pulled their latest renamed versions from trunk.
  • [X86] : Factorize LEA, handling for patterns involing SUBREG_TO_REG as LEA operands.
  • Few more changes for LEA factorization.
  • Updating test lea-opt-cse3.ll
  • Formatting changes.
  • Formatting changes
  • Changes to avoid creating costly LEAs in loops, strength reduction for simple LEAs with unit scale
  • Updating test.
  • [X86] Limiting the scope of DAG operands folding while AM based instruction selection to LEAs.
  • Merge from trunk.
  • Extending aggressive AM based folding for LEAs to cover more cases.
  • Updating test post rebase.
  • Formatting changes + fine tuning pattern matching condition.
  • Adding a check for subtarget feature Slow3OpLEA in pattern matching.
  • Few synthetic changes.
  • Undefining result operand of factored statement to preserve SSA nature of Machine IR.
  • Merge branch 'master' of https://github.com/llvm-mirror/llvm
  • Merge branch 'master' of https://github.com/llvm-mirror/llvm
  • Updating tests for reported PRs for initial patch.
  • Merge branch 'master' of https://github.com/llvm-mirror/llvm
  • Pull from trunk.
  • Operands of LEAs must be of same register class.
  • Revert "Operands of LEAs must be of same register class."
jbhateja updated this revision to Diff 119011.Oct 14 2017, 1:17 AM
  • Operands of LEA must be of same register class, this constraint is as per Intel's architecture manual.
  • Remove map entry from LEAs map if value list becomes empty.
  • Rebase.

Patch has been regressed through chrome test sweet.
No issues reported. Thanks to Hans Wennborg (hans@chromium.org) for validating it.

RKSimon added inline comments.Oct 29 2017, 6:12 AM
lib/Target/X86/X86ISelDAGToDAG.cpp
100
bool isLegalScale() const {
187

Is a default argument for a setter a good idea? Especially one that is the inverse of what the setter says it is.

1124

My comment still stands - try to avoid hard coded values embedded in the source - add a getMaxOperandFoldingDepth() helper.

1422

These AM.Scale increments are scary - better to set it with AM.Scale = 2?

lib/Target/X86/X86OptimizeLEAs.cpp
29

Include ordering still broken

238

Do you mean:

bool isInstrErased = !(Opr.isReg() && Opr.getParent()->getParent());
724

Has @lsaba test been added to the patch? I couldn't see it.

770

Really don't like this - write a helper instead like you did in X86ISelDAGToDAG.cpp

auto IsLegalScale = [](int S) {
  return S == 1 || S == 2 || S == 4 || S == 8;
};
776
return Arg1->getOperand(2).getImm() >= Arg2->getOperand(2).getImm();
831

DL is only used here - just use LI1.getDebugLoc() directly?

test/CodeGen/X86/lea-opt-cse2.ll
3 ↗(On Diff #119011)

Why have you changed these tests?

test/CodeGen/X86/lea-opt-cse4.ll
3 ↗(On Diff #119011)

Why have you changed these tests?

jbhateja retitled this revision from [X86] PR32755 : Improvement in CodeGen instruction selection for LEAs. to [X86] Improvement in CodeGen instruction selection for LEAs..Oct 29 2017, 8:36 AM
jbhateja edited the summary of this revision. (Show Details)
jbhateja added inline comments.Oct 29 2017, 9:55 AM
lib/Target/X86/X86ISelDAGToDAG.cpp
100

Fixed.

187

Fixed. We do not need a default argument here, both the calls to this routines is passing an explicit argument.

1124

Helper added.

1422

Increments are triggered only in aggressive folding mode and can fold upto 8 operands (which is a max legal scale). This was intentionally done, initial change was only working for AM.scale = 2 and was very restrictive. Aggressive operand folding is done only for LEAs currently and is enabled instantiating and RAII object of X86AggressiveOperandFolding class.

lib/Target/X86/X86OptimizeLEAs.cpp
238

fixed.

724

We have a similare test case for loop lea-opt-cse2.ll. We are not doing any factorization inside loops, only simplifyLEA can kick in.

724

We have a test case for loops lea-opt-cse2.ll, so not added this.
We are not doing any factorization inside loops, only simplifyLEA can kick in.

770

Fixed

776

Fixed

test/CodeGen/X86/lea-opt-cse4.ll
3 ↗(On Diff #119011)

FixupLEAPass down the pipeline transforms some complex LEA ptterns to simple with add. Optimization, with changes in the patch we will have following

leal 1(%rax,%rcx,4), %eax

which after FixupLEAPass will get converted to

leal (%rax,%rcx,4), %eax
addl $1, %eax

jbhateja updated this revision to Diff 120753.Oct 29 2017, 10:12 AM
  • Rebasing
  • Review comments resolution.

@RKSimon, requested revision changes have been made as per your comments. Can you please validate.

jbhateja updated this revision to Diff 121450.Nov 3 2017, 3:11 AM

1/ Making the factorization alog iterative. This was earlier commited with

Diff : https://reviews.llvm.org/D35014?id=116144
but some how got removed in successive commits.

2/ Rebasing again. All comments are resolved.

@RKSimon, @lsaba , @jmolly , all your comments have been addressed. Kindly verify so that I can land this into trunk.

A few minor comments @lsaba @craig.topper any final comments?

lib/Target/X86/X86ISelDAGToDAG.cpp
191

Make this a const method

1124

I meant make this a class method, but if you don't want to you can leave it here as lambda

lib/Target/X86/X86OptimizeLEAs.cpp
85

CopyLike

189

Again, can we avoid the static?

@RKSimon No more comments from my side

craig.topper edited edge metadata.Nov 26 2017, 12:05 AM

I haven't been following this much so I have no comments either.

RKSimon accepted this revision.Nov 26 2017, 1:21 AM

LGTM

LGTM - with those final few minors I mentioned

jbhateja updated this revision to Diff 124698.Nov 28 2017, 11:02 PM
  • Reivew comment resolution.
  • Rebasing patch.
jbhateja updated this revision to Diff 124699.Nov 28 2017, 11:22 PM
  • Rebasing to resolve incorrect overrideing of register names in kill statements.
jbhateja added inline comments.Nov 28 2017, 11:25 PM
lib/Target/X86/X86OptimizeLEAs.cpp
189

Its used bacause we want MemOpKey for LEA factorization to be indipendent of Scale, keeping it as static avoids recreation of dummy scale.

jbhateja updated this revision to Diff 124757.Nov 29 2017, 8:17 AM
  • Register name case changes due to rebase.
This revision was automatically updated to reflect the committed changes.
RKSimon reopened this revision.Dec 7 2017, 7:19 AM

rL319543 was reverted at rL319591 due to asan bot breakage

Please rebase. Thanks.

This revision was not accepted when it landed; it landed in state Needs Review.Oct 7 2019, 5:06 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptOct 7 2019, 5:06 AM
Herald added a subscriber: hiraditya. · View Herald Transcript