
Details

Reviewers

qcolombet
bruno
nadav
rnk

Commits

rG45b22a4aff47: [X86] Enable RRL part of the LEA optimization pass for -O2.
rL270036: [X86] Enable RRL part of the LEA optimization pass for -O2.

Summary

Enable "Remove Redundant LEAs" part of the LEA optimization pass for -O2.

This gives a 6.4% performance improvement on Broadwell on the nnet benchmark from CoreMark-Pro. There is no significant effect on other benchmarks.

Diff Detail

Repository: rL LLVM

Event Timeline

aturetsk updated this revision to Diff 55411. Apr 28 2016, 7:11 AM

aturetsk retitled this revision from to [X86] Enable RRL part of the LEA optimization pass for -O2.

aturetsk updated this object.

aturetsk added reviewers: nadav, qcolombet, rnk.

aturetsk added subscribers: reames, zinovy.nis, llvm-commits.

aturetsk updated this object. Apr 28 2016, 7:17 AM

Hi Andrey,

What are the other benchmarks you tested this on?

Hi Bruno,
I tried Geekbench, Coremark-Pro, Spec2000 and Spec2006.

Hi Andrey,

IIUC, this patch also makes -Os run removeRedundantLEAs, which it previously didn't. Did you check the effect on compile time at -Os after this change?

The RRL part of the LEA pass takes a sane amount of compile time.
Here are the measurements.

-Os, the LEA pass is completely disabled:

real    0m57.797s
user    0m57.448s
sys     0m0.337s

-Os, only the RRL part of the LEA pass is enabled:

real    1m3.238s
user    1m2.868s
sys     0m0.352s

-Os, the LEA pass is fully enabled:

real    1m12.568s
user    1m12.193s
sys     0m0.354s

The test was generated by the script:

$ python gen.py 5000 > test.c
$ cat gen.py

import sys

def foo(n):
  print('struct { int a, b, c; } arr[1000000];')
  print('')
  print('int foo(int x) {')
  print('  int r = 0;')
  for i in range(n):
    print('  r += arr[x + %d].a + arr[x + %d].b + arr[x + %d].c;' % (i, i, i))
  print('  switch (r) {')
  print('  case 1:')
  for i in range(n):
    print('    arr[x + %d].b = 111;' % i)
    print('    arr[x + %d].c = 111;' % i)
  print('    break;')
  print('  case 2:')
  for i in range(n):
    print('    arr[x + %d].b = 222;' % i)
    print('    arr[x + %d].c = 222;' % i)
  print('    break;')
  print('  default:')
  for i in range(n):
    # Make the LEAs irreplaceable, so that no LEAs would be removed by the LEA
    # pass and thus there would be no compile-time improvement because of the
    # reduced number of instructions which need to be processed by the
    # compiler in other passes
    print('    arr[x + %d].b = (int) &arr[x + %d].b;' % (i, i))
    print('    arr[x + %d].c = (int) &arr[x + %d].c;' % (i, i))
  print('    break;')
  print('  }')
  print('  return r;')
  print('}')

if __name__ == '__main__':
  foo(int(sys.argv[1]))

The run command:

time ./bin/clang -Os -S test.c

Note that the generated test is really LEA-specific: the majority of machine instructions get modified by the pass. That's why enabling the LEA pass adds ~25% to the total compile time in this test.
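As a quick sanity check on these numbers, the relative cost of each configuration can be derived from the three `real` timings. The sketch below is illustrative only: the helper names `to_seconds` and `overhead_percent` are made up here, and the only inputs are the timings quoted above. On those inputs, RRL alone comes out at roughly 9.4% over the fully disabled baseline and the full pass at roughly 25.6%.

```python
def to_seconds(stamp):
    """Convert a `time`-style stamp such as '1m3.238s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# "real" times quoted in the measurements above (clang -Os -S test.c).
REAL = {
    "disabled": to_seconds("0m57.797s"),  # LEA pass completely disabled
    "rrl_only": to_seconds("1m3.238s"),   # only the RRL part enabled
    "full": to_seconds("1m12.568s"),      # LEA pass fully enabled
}


def overhead_percent(config, baseline="disabled"):
    """Extra wall-clock time of `config` relative to `baseline`, in percent."""
    return 100.0 * (REAL[config] - REAL[baseline]) / REAL[baseline]


if __name__ == "__main__":
    for name in ("rrl_only", "full"):
        print("%s: +%.1f%% over the disabled baseline" % (name, overhead_percent(name)))
```

Keep in mind that the generated input is deliberately a worst case for this pass, so these percentages are an upper-end estimate rather than a typical cost.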

Hi Andrey,

What changed in the algorithm such that we now think it is also beneficial for performance, whereas it was not previously?

Is it “just” because we did not benchmark it so far?

Cheers,
-Quentin

Hi Quentin,

Yes. When I was implementing the pass, my primary target was code size, so I didn't check the performance impact extensively (I only checked that there was no significant degradation on a couple of benchmarks). The obvious concern was that the RRL part of the pass might increase register pressure, which would hurt performance. The code size impact was really small, so it was easier and safer to enable it only for -Oz.

qcolombet added inline comments. May 16 2016, 6:39 PM

test/CodeGen/X86/lea-opt.ll
Line 1 (On Diff #55411): Could you check both with and without the optimization running?
Line 112 (On Diff #55411): I don’t get what the problem is with that specific part of the test. If we want to test something else, writing a new test should be fine instead of fixing that one, shouldn’t it?

Add a test for the case when the LEA pass is disabled.

Add a test for the case when the LEA pass is disabled. (Take #2).

aturetsk added inline comments. May 17 2016, 2:05 AM

test/CodeGen/X86/lea-opt.ll
Lines 1–2 (On Diff #57445): Done.
Line 137 (On Diff #57445): There are two LEA instructions in this test. Before this patch, the RRL part of the pass was disabled for 'optsize', so both LEAs were preserved and the RemoveRedundantAddressCalculation (RRAC) part of the pass had to choose a LEA for substitution in the load instructions. It should choose not the closest one, but the one that makes the resulting displacement fit in 1 byte (that's exactly what the test checks). After this patch, the RRL part of the pass is enabled for 'optsize'. Without the changes I made to the test, it would remove one of the LEAs, so the RRAC part of the pass wouldn't need to choose a LEA for substitution (because only one would remain), and the test would become useless.
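The selection rule described in this comment can be illustrated with a small model. This is a simplified sketch, not the code in X86OptimizeLEAs.cpp: the function names, the tuple layout, and the instruction positions are hypothetical, and the real pass applies further conditions that are not discussed in this review. It only shows the stated preference: a LEA whose resulting displacement fits in one signed byte wins over a closer LEA whose displacement does not.

```python
def fits_int8(value):
    """True if `value` can be encoded as a 1-byte signed displacement."""
    return -128 <= value <= 127


def choose_lea(candidates, mem_disp, use_pos):
    """Pick a LEA to substitute into a memory operand.

    candidates: list of (position, displacement) pairs for LEAs that compute
                the same address as the memory operand except for displacement.
    mem_disp:   displacement of the memory operand being rewritten.
    use_pos:    position of the instruction that owns the memory operand.
    """
    def rank(lea):
        pos, disp = lea
        shift = mem_disp - disp  # displacement left over after substitution
        # First prefer a shift that fits in one byte, then the closest LEA.
        return (not fits_int8(shift), abs(use_pos - pos))

    return min(candidates, key=rank)


# Numbers modelled on test3: REG2 = arr2+132 (defined first), REG3 = arr2.
REG2 = (1, 132)
REG3 = (2, 0)

# A load from arr2+132 at position 3: REG3 is closer, but would need a
# displacement of 132, which does not fit in one byte, so REG2 is chosen.
assert choose_lea([REG2, REG3], mem_disp=132, use_pos=3) == REG2
```

With both LEAs present the choice is non-trivial, which is why the test has to keep the second LEA alive once RRL runs for 'optsize'.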

qcolombet added inline comments. May 17 2016, 9:36 AM

test/CodeGen/X86/lea-opt.ll
Lines 1–2 (On Diff #57445): Could you factor out the common pattern under a CHECK prefix and move the rest into two different prefixes, ENABLED/DISABLED? That way it would be easier to see the effects of the pass. (I also suspect that DISABLED will be a superset of CHECK and ENABLED will be empty.) (Note: you can use several check prefixes on the command line with additional --check-prefix options.)
Line 137 (On Diff #57445): Thanks for the clarification.
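To make the multiple-prefix suggestion concrete, here is a toy model of how a set of active prefixes selects directives from a test file. It is only an illustration and not how FileCheck is implemented: the function name `active_patterns` is hypothetical, it recognises only plain and `-LABEL` directives, and it ignores the other suffixes FileCheck supports. The sample lines are taken from the final version of test1 in this revision.

```python
import re

# Matches lines such as "; CHECK: ..." or "; CHECK-LABEL: ...".
DIRECTIVE = re.compile(r"^;\s*([A-Z]+)(-LABEL)?:\s*(.*)$")


def active_patterns(test_lines, prefixes):
    """Return (directive, pattern) pairs whose prefix is active, in file order."""
    selected = []
    for line in test_lines:
        match = DIRECTIVE.match(line.strip())
        if match and match.group(1) in prefixes:
            directive = match.group(1) + (match.group(2) or "")
            selected.append((directive, match.group(3)))
    return selected


LINES = [
    "; CHECK-LABEL: test1:",
    "; DISABLED: leaq arr1+8([[REG1]],[[REG1]],2), [[REG3:%[a-z]+]]",
    "; CHECK: movl ${{[1-4]+}}, ([[REG2]])",
    "; ENABLED: movl ${{[1-4]+}}, 4([[REG2]])",
    "; DISABLED: movl ${{[1-4]+}}, ([[REG3]])",
]

if __name__ == "__main__":
    for run in ({"CHECK", "ENABLED"}, {"CHECK", "DISABLED"}):
        print(sorted(run))
        for directive, pattern in active_patterns(LINES, run):
            print("  %s: %s" % (directive, pattern))
```

Each RUN line then sees the shared CHECK directives plus only the directives for its own configuration, which is what makes the effect of the pass visible side by side.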

Improve the test.

test/CodeGen/X86/lea-opt.ll
Lines 1–3 (On Diff #57592): Done.

qcolombet accepted this revision. May 18 2016, 11:07 AM

qcolombet edited edge metadata.

This revision is now accepted and ready to land. May 18 2016, 11:07 AM

Closed by commit rL270036: [X86] Enable RRL part of the LEA optimization pass for -O2. (authored by aturetsk). May 19 2016, 3:24 AM

This revision was automatically updated to reflect the committed changes.

Diff 57759

llvm/trunk/lib/Target/X86/X86OptimizeLEAs.cpp

 //===-- X86OptimizeLEAs.cpp - optimize usage of LEA instructions ----------===//
 //
 // The LLVM Compiler Infrastructure
 //
 // This file is distributed under the University of Illinois Open Source
 // License. See LICENSE.TXT for details.
 //
 //===----------------------------------------------------------------------===//
 //
 // This file defines the pass that performs some optimizations with LEA
-// instructions in order to improve code size.
+// instructions in order to improve performance and code size.
 // Currently, it does two things:
 // 1) If there are two LEA instructions calculating addresses which only differ
 // by displacement inside a basic block, one of them is removed.
 // 2) Address calculations in load and store instructions are replaced by
 // existing LEA def registers where possible.
 //
 //===----------------------------------------------------------------------===//

[589 lines not shown; context: bool OptimizeLEAPass::removeRedundantLEAs(MemOpMap &LEAs) {]
   }

   return Changed;
 }

 bool OptimizeLEAPass::runOnMachineFunction(MachineFunction &MF) {
   bool Changed = false;

-  // Perform this optimization only if we care about code size.
-  if (DisableX86LEAOpt || skipFunction(*MF.getFunction()) ||
-      !MF.getFunction()->optForSize())
+  if (DisableX86LEAOpt || skipFunction(*MF.getFunction()))
     return false;

   MRI = &MF.getRegInfo();
   TII = MF.getSubtarget<X86Subtarget>().getInstrInfo();
   TRI = MF.getSubtarget<X86Subtarget>().getRegisterInfo();

   // Process all basic blocks.
   for (auto &MBB : MF) {
     MemOpMap LEAs;
     InstrPos.clear();

     // Find all LEA instructions in basic block.
     findLEAs(MBB, LEAs);

     // If current basic block has no LEAs, move on to the next one.
     if (LEAs.empty())
       continue;

-    // Remove redundant LEA instructions. The optimization may have a negative
-    // effect on performance, so do it only for -Oz.
-    if (MF.getFunction()->optForMinSize())
-      Changed |= removeRedundantLEAs(LEAs);
+    // Remove redundant LEA instructions.
+    Changed |= removeRedundantLEAs(LEAs);

-    // Remove redundant address calculations.
-    Changed |= removeRedundantAddrCalc(LEAs);
+    // Remove redundant address calculations. Do it only for -Os/-Oz since only
+    // a code size gain is expected from this part of the pass.
+    if (MF.getFunction()->optForSize())
+      Changed |= removeRedundantAddrCalc(LEAs);
   }

   return Changed;
 }
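The net effect of this change on which parts of the pass run can be summarised with a small truth-table model. This is a sketch for illustration, not LLVM code: `enabled_parts` and its parameters are hypothetical names, "RRL" and "RRAC" stand for removeRedundantLEAs and removeRedundantAddrCalc, and it assumes that a function optimized for minimum size also counts as optimized for size.

```python
def enabled_parts(opt_for_size, opt_for_min_size, patched,
                  disabled=False, skipped=False):
    """Return the set of LEA-pass parts that run for one function.

    patched=False models the left-hand side of the diff above,
    patched=True the right-hand side.
    """
    if disabled or skipped:
        return set()
    # Assumption: minsize implies optsize.
    opt_for_size = opt_for_size or opt_for_min_size

    parts = set()
    if patched:
        parts.add("RRL")           # now unconditional
        if opt_for_size:
            parts.add("RRAC")      # still a size-only optimization
    else:
        if not opt_for_size:
            return set()           # whole pass was size-only
        if opt_for_min_size:
            parts.add("RRL")       # previously -Oz only
        parts.add("RRAC")
    return parts


if __name__ == "__main__":
    levels = {"-O2": (False, False), "-Os": (True, False), "-Oz": (True, True)}
    for name, (size, min_size) in levels.items():
        before = sorted(enabled_parts(size, min_size, patched=False))
        after = sorted(enabled_parts(size, min_size, patched=True))
        print("%s: before=%s after=%s" % (name, before, after))
```

Under this model, -O2 gains RRL, -Os gains RRL on top of RRAC, and -Oz is unchanged, which matches the compile-time question raised for -Os earlier in the review.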

llvm/trunk/test/CodeGen/X86/lea-opt.ll

-; RUN: llc < %s -mtriple=x86_64-linux | FileCheck %s
+; RUN: llc < %s -mtriple=x86_64-linux | FileCheck %s -check-prefix=CHECK -check-prefix=ENABLED
+; RUN: llc --disable-x86-lea-opt < %s -mtriple=x86_64-linux | FileCheck %s -check-prefix=CHECK -check-prefix=DISABLED

 %struct.anon1 = type { i32, i32, i32 }
 %struct.anon2 = type { i32, [32 x i32], i32 }

 @arr1 = external global [65 x %struct.anon1], align 16
 @arr2 = external global [65 x %struct.anon2], align 16

 define void @test1(i64 %x) nounwind {
[23 lines not shown]

 sw.epilog: ; preds = %sw.bb.2, %sw.bb.1, %entry
   ret void
 ; CHECK-LABEL: test1:
 ; CHECK: shlq $2, [[REG1:%[a-z]+]]
 ; CHECK: movl arr1([[REG1]],[[REG1]],2), {{.*}}
 ; CHECK: leaq arr1+4([[REG1]],[[REG1]],2), [[REG2:%[a-z]+]]
 ; CHECK: subl arr1+4([[REG1]],[[REG1]],2), {{.*}}
-; CHECK: leaq arr1+8([[REG1]],[[REG1]],2), [[REG3:%[a-z]+]]
+; DISABLED: leaq arr1+8([[REG1]],[[REG1]],2), [[REG3:%[a-z]+]]
 ; CHECK: addl arr1+8([[REG1]],[[REG1]],2), {{.*}}
 ; CHECK: movl ${{[1-4]+}}, ([[REG2]])
-; CHECK: movl ${{[1-4]+}}, ([[REG3]])
+; ENABLED: movl ${{[1-4]+}}, 4([[REG2]])
+; DISABLED: movl ${{[1-4]+}}, ([[REG3]])
 ; CHECK: movl ${{[1-4]+}}, ([[REG2]])
-; CHECK: movl ${{[1-4]+}}, ([[REG3]])
+; ENABLED: movl ${{[1-4]+}}, 4([[REG2]])
+; DISABLED: movl ${{[1-4]+}}, ([[REG3]])
 }

 define void @test2(i64 %x) nounwind optsize {
 entry:
   %a = getelementptr inbounds [65 x %struct.anon1], [65 x %struct.anon1]* @arr1, i64 0, i64 %x, i32 0
   %tmp = load i32, i32* %a, align 4
   %b = getelementptr inbounds [65 x %struct.anon1], [65 x %struct.anon1]* @arr1, i64 0, i64 %x, i32 1
   %tmp1 = load i32, i32* %b, align 4
[15 lines not shown; context: sw.bb.2: ; preds = %entry]
   store i32 333, i32* %b, align 4
   store i32 444, i32* %c, align 4
   br label %sw.epilog

 sw.epilog: ; preds = %sw.bb.2, %sw.bb.1, %entry
   ret void
 ; CHECK-LABEL: test2:
 ; CHECK: shlq $2, [[REG1:%[a-z]+]]
+; DISABLED: movl arr1([[REG1]],[[REG1]],2), {{.*}}
 ; CHECK: leaq arr1+4([[REG1]],[[REG1]],2), [[REG2:%[a-z]+]]
-; CHECK: movl -4([[REG2]]), {{.*}}
-; CHECK: subl ([[REG2]]), {{.*}}
-; CHECK: leaq arr1+8([[REG1]],[[REG1]],2), [[REG3:%[a-z]+]]
-; CHECK: addl ([[REG3]]), {{.*}}
+; ENABLED: movl -4([[REG2]]), {{.*}}
+; ENABLED: subl ([[REG2]]), {{.*}}
+; ENABLED: addl 4([[REG2]]), {{.*}}
+; DISABLED: subl arr1+4([[REG1]],[[REG1]],2), {{.*}}
+; DISABLED: leaq arr1+8([[REG1]],[[REG1]],2), [[REG3:%[a-z]+]]
+; DISABLED: addl arr1+8([[REG1]],[[REG1]],2), {{.*}}
 ; CHECK: movl ${{[1-4]+}}, ([[REG2]])
-; CHECK: movl ${{[1-4]+}}, ([[REG3]])
+; ENABLED: movl ${{[1-4]+}}, 4([[REG2]])
+; DISABLED: movl ${{[1-4]+}}, ([[REG3]])
 ; CHECK: movl ${{[1-4]+}}, ([[REG2]])
-; CHECK: movl ${{[1-4]+}}, ([[REG3]])
+; ENABLED: movl ${{[1-4]+}}, 4([[REG2]])
+; DISABLED: movl ${{[1-4]+}}, ([[REG3]])
 }

 ; Check that LEA optimization pass takes into account a resultant address
 ; displacement when choosing a LEA instruction for replacing a redundant
 ; address recalculation.

 define void @test3(i64 %x) nounwind optsize {
 entry:
[9 lines not shown]

 sw.bb.1: ; preds = %entry
   store i32 111, i32* %a, align 4
   store i32 222, i32* %b, align 4
   br label %sw.epilog

 sw.bb.2: ; preds = %entry
   store i32 333, i32* %a, align 4
-  store i32 444, i32* %b, align 4
+  ; Make sure the REG3's definition LEA won't be removed as redundant.
+  %cvt = ptrtoint i32* %b to i32
+  store i32 %cvt, i32* %b, align 4
   br label %sw.epilog

 sw.epilog: ; preds = %sw.bb.2, %sw.bb.1, %entry
   ret void
 ; CHECK-LABEL: test3:
 ; CHECK: imulq {{.*}}, [[REG1:%[a-z]+]]
 ; CHECK: leaq arr2+132([[REG1]]), [[REG2:%[a-z]+]]
 ; CHECK: leaq arr2([[REG1]]), [[REG3:%[a-z]+]]

 ; REG3's definition is closer to movl than REG2's, but the pass still chooses
 ; REG2 because it provides the resultant address displacement fitting 1 byte.

-; CHECK: movl ([[REG2]]), {{.*}}
-; CHECK: addl ([[REG3]]), {{.*}}
+; ENABLED: movl ([[REG2]]), {{.*}}
+; ENABLED: addl ([[REG3]]), {{.*}}
+; DISABLED: movl arr2+132([[REG1]]), {{.*}}
+; DISABLED: addl arr2([[REG1]]), {{.*}}
 ; CHECK: movl ${{[1-4]+}}, ([[REG2]])
 ; CHECK: movl ${{[1-4]+}}, ([[REG3]])
 ; CHECK: movl ${{[1-4]+}}, ([[REG2]])
-; CHECK: movl ${{[1-4]+}}, ([[REG3]])
+; CHECK: movl {{.*}}, ([[REG3]])
 }

 define void @test4(i64 %x) nounwind minsize {
 entry:
   %a = getelementptr inbounds [65 x %struct.anon1], [65 x %struct.anon1]* @arr1, i64 0, i64 %x, i32 0
   %tmp = load i32, i32* %a, align 4
   %b = getelementptr inbounds [65 x %struct.anon1], [65 x %struct.anon1]* @arr1, i64 0, i64 %x, i32 1
   %tmp1 = load i32, i32* %b, align 4
[14 lines not shown]
 sw.bb.2: ; preds = %entry
   store i32 333, i32* %b, align 4
   store i32 444, i32* %c, align 4
   br label %sw.epilog

 sw.epilog: ; preds = %sw.bb.2, %sw.bb.1, %entry
   ret void
 ; CHECK-LABEL: test4:
-; CHECK: leaq arr1+4({{.*}}), [[REG2:%[a-z]+]]
-; CHECK: movl -4([[REG2]]), {{.*}}
-; CHECK: subl ([[REG2]]), {{.*}}
-; CHECK: addl 4([[REG2]]), {{.*}}
+; CHECK: imulq {{.*}}, [[REG1:%[a-z]+]]
+; DISABLED: movl arr1([[REG1]]), {{.*}}
+; CHECK: leaq arr1+4([[REG1]]), [[REG2:%[a-z]+]]
+; ENABLED: movl -4([[REG2]]), {{.*}}
+; ENABLED: subl ([[REG2]]), {{.*}}
+; ENABLED: addl 4([[REG2]]), {{.*}}
+; DISABLED: subl arr1+4([[REG1]]), {{.*}}
+; DISABLED: leaq arr1+8([[REG1]]), [[REG3:%[a-z]+]]
+; DISABLED: addl arr1+8([[REG1]]), {{.*}}
 ; CHECK: movl ${{[1-4]+}}, ([[REG2]])
-; CHECK: movl ${{[1-4]+}}, 4([[REG2]])
+; ENABLED: movl ${{[1-4]+}}, 4([[REG2]])
+; DISABLED: movl ${{[1-4]+}}, ([[REG3]])
 ; CHECK: movl ${{[1-4]+}}, ([[REG2]])
-; CHECK: movl ${{[1-4]+}}, 4([[REG2]])
+; ENABLED: movl ${{[1-4]+}}, 4([[REG2]])
+; DISABLED: movl ${{[1-4]+}}, ([[REG3]])
 }
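The difference between the ENABLED and DISABLED expectations in test1 (a store through `([[REG3]])` becoming a store through `4([[REG2]])`) follows from what RRL does: when two LEAs differ only by displacement, the later one is dropped and its uses are rewritten against the earlier one. The sketch below models just that rewrite. It is not the pass's implementation: the names and data layout are hypothetical, and it ignores the conditions under which the real pass declines to remove a LEA (test3 relies on one of them by adding a `ptrtoint` use).

```python
def remove_redundant_leas(leas, uses):
    """Model of redundant-LEA removal inside one basic block.

    leas: list of (reg, base, disp) in program order, where `base` stands for
          everything in the address except the displacement.
    uses: list of (reg, disp) memory operands that address through a LEA result.
    Returns (kept_leas, rewritten_uses).
    """
    first_for_base = {}  # base -> (reg, disp) of the first LEA seen
    replacement = {}     # removed reg -> (kept reg, displacement shift)
    kept = []
    for reg, base, disp in leas:
        if base in first_for_base:
            kept_reg, kept_disp = first_for_base[base]
            replacement[reg] = (kept_reg, disp - kept_disp)
        else:
            first_for_base[base] = (reg, disp)
            kept.append((reg, base, disp))

    rewritten = []
    for reg, disp in uses:
        if reg in replacement:
            kept_reg, shift = replacement[reg]
            rewritten.append((kept_reg, disp + shift))
        else:
            rewritten.append((reg, disp))
    return kept, rewritten


# Modelled on test1: REG2 = arr1+4(...), REG3 = arr1+8(...), same base.
BASE = "arr1(REG1,REG1,2)"
kept, rewritten = remove_redundant_leas(
    leas=[("REG2", BASE, 4), ("REG3", BASE, 8)],
    uses=[("REG2", 0), ("REG3", 0)],
)
assert kept == [("REG2", BASE, 4)]
assert rewritten == [("REG2", 0), ("REG2", 4)]  # ([[REG2]]) and 4([[REG2]])
```

This also shows where the register-pressure concern raised earlier comes from: the surviving LEA result has to stay live across every use that previously belonged to the removed one.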

This is an archive of the discontinued LLVM Phabricator instance.

[X86] Enable RRL part of the LEA optimization pass for -O2
Closed, Public
