Details

Reviewers

samparker
SjoerdMeijer
dmgreen
olista01
efriedma

Commits

rGdc8e4d856615: [ARM] Rearrange SizeReduction when using -Oz

Summary

Move the Thumb2SizeReduce pass to before IfConversion when optimising
for minimal code size.

Running the Thumb2SizeReduction pass before IfConversion allows T1 instructions
to propagate to the final output, rather than the ifConverter modifying T2
instructions and preventing them from being reduced later.

This change does introduce a regression regarding execution time, so it's only
applied when optimising for size.
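
As a rough illustration of the gating described above, the condition under which the early Thumb2SizeReduce run happens can be modelled on its own. This is only a sketch: `SubtargetModel` and `runsSizeReductionBeforeIfConversion` are hypothetical names standing in for the `ARMSubtarget::hasMinSize()` and `ARMSubtarget::restrictIT()` queries that appear in the diff, and nothing here is LLVM API.

```cpp
#include <cassert>

// Hypothetical stand-in for the two ARMSubtarget queries used by the patch.
struct SubtargetModel {
  bool MinSize;    // models hasMinSize(), i.e. the function is built for -Oz
  bool RestrictIT; // models restrictIT(), i.e. IT blocks are restricted
};

// Mirrors the shape of the predicate handed to the early size-reduction pass:
// run before IfConversion when optimising for minimum size, or when IT blocks
// are restricted (the pre-existing condition).
bool runsSizeReductionBeforeIfConversion(const SubtargetModel &ST) {
  return ST.MinSize || ST.RestrictIT;
}

// Small usage check covering the four combinations.
void checkPredicateModel() {
  assert(runsSizeReductionBeforeIfConversion({true, false}));
  assert(runsSizeReductionBeforeIfConversion({false, true}));
  assert(runsSizeReductionBeforeIfConversion({true, true}));
  assert(!runsSizeReductionBeforeIfConversion({false, false}));
}
```

Functions that are neither minsize nor IT-restricted keep the previous ordering, which is how the execution-time regression is confined to size-optimised code.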

Running the LLVM Test Suite with this change produces a geomean
difference of -0.1% for the size..text metric.

LLVM Test suite results:
Tests: 310
Metric: size..text

Program	Initial	Modified	diff
test-suite.../Applications/sgefa/sgefa.test	6152	6096	-0.9%
test-suite...ks/Prolangs-C/agrep/agrep.test	24100	23884	-0.9%
test-suite.../Benchmarks/Stanford/Perm.test	524	520	-0.8%
test-suite...rks/FreeBench/mason/mason.test	1232	1224	-0.6%
test-suite.../Trimaran/enc-rc4/enc-rc4.test	656	652	-0.6%
test-suite...itBench/uudecode/uudecode.test	664	660	-0.6%
test-suite...Benchmarks/Stanford/Oscar.test	1344	1336	-0.6%
test-suite...ks/Shootout/Shootout-hash.test	1360	1352	-0.6%
test-suite...nch/fourinarow/fourinarow.test	4336	4312	-0.6%
test-suite...ch/g721/g721encode/encode.test	3816	3796	-0.5%
test-suite...itBench/uuencode/uuencode.test	796	792	-0.5%
test-suite.../Benchmarks/Dhrystone/dry.test	820	816	-0.5%
test-suite...count/automotive-bitcount.test	1824	1816	-0.4%
test-suite...marks/Olden/bisort/bisort.test	1832	1824	-0.4%
test-suite...CI_Purple/SMG2000/smg2000.test	89904	89512	-0.4%

Geomean difference -0.1%

	Initial	Modified	diff
count	310.000000	310.000000	310.000000
mean	19649.870968	19629.354839	-0.000601
std	48795.053450	48742.740588	0.001539
min	292.000000	292.000000	-0.009103
25%	1090.000000	1090.000000	-0.000648
50%	2770.000000	2772.000000	0.000000
75%	10185.000000	10185.000000	0.000000
max	324040.000000	323540.000000	0.003704
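
For readers who want to reproduce the kind of figure quoted above, a geomean difference can be computed from (Initial, Modified) pairs roughly as follows. This is a sketch under the assumption that the reported value is the geometric mean of the per-program Modified/Initial ratios minus one; `geomeanDiff` is a hypothetical helper, not part of the test-suite tooling.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Geometric mean of the Modified/Initial ratios, expressed as a relative
// difference (negative means the modified build is smaller).
// Each pair is (Initial, Modified) in bytes; sizes must be positive.
double geomeanDiff(const std::vector<std::pair<double, double>> &Sizes) {
  assert(!Sizes.empty());
  double LogSum = 0.0;
  for (const auto &Entry : Sizes) {
    assert(Entry.first > 0.0 && Entry.second > 0.0);
    LogSum += std::log(Entry.second / Entry.first);
  }
  return std::exp(LogSum / static_cast<double>(Sizes.size())) - 1.0;
}
```

Fed a single pair, the function reduces to the plain relative difference: for sgefa (6152 to 6096) that is about -0.009103, which agrees with the `min` row of the summary statistics.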

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

NickGuy created this revision. Jun 24 2020, 12:44 AM

Herald added subscribers: danielkiss, hiraditya, kristof.beyls. Jun 24 2020, 12:44 AM

NickGuy edited the summary of this revision. Jun 24 2020, 1:08 AM

Sounds interesting. I've wanted to do the same thing for different reasons in the past too.

Can you update with full context? -U999999

Reworded/improved summary, and included full patch context

SjoerdMeijer added a reviewer: efriedma. Jun 24 2020, 5:57 AM

Harbormaster completed remote builds in B61536: Diff 273002. Jun 24 2020, 6:27 AM

This sounds good to me. And my results agree with your results that this is a nice improvement.

Ideally I think it would be better to do this all the time, not just for minsize. But my understanding from trying this a long time ago was that it would mess up a lot of the schedulers we have, and that would take quite a bit of work to fix. It might be worth fixing them in the long run (with the way Cortex-M cores can dual issue), but for code size alone that shouldn't block you.

llvm/lib/Target/ARM/ARMTargetMachine.cpp
525–526	Can you combine this comment with the "in v8" one above.
llvm/test/CodeGen/ARM/t2-shrink-ldrpost.ll
32	Can you leave in a comment that says this function shouldn't produce a ldm. And maybe one above that says that we should produce a ldm. It sounds useful to keep that around for future reference.

> But my understanding from trying this a long time ago was that it would mess up a lot of the schedulers we have, and that would take quite a bit of work to fix.

You mean, the scheduling models don't handle Thumb1 instructions well? Or is there an issue with the way the actual CPUs handle Thumb1 instructions?

> You mean, the scheduling models don't handle Thumb1 instructions well?

Yes. Only certain pairs of 16-bit instructions can be dual issued, but we don't know if instructions are Thumb1 or not. So there's room for improvement here.

In D82439#2112588, @efriedma wrote:

> You mean, the scheduling models don't handle Thumb1 instructions well? Or is there an issue with the way the actual CPUs handle Thumb1 instructions?

The models do not handle thumb1 instructions well because they have not come up when scheduling in the past. You can (I presume) always treat them just like the equivalent thumb2 instruction and get similar results, but it would take time to get right.

The cores can often dual issue certain combinations of thumb1 instructions, so properly scheduling them would be useful. We currently, especially pre-ra, have to try and guess at what might become a thumb1 instruction. For older cores this wasn't super interesting due to the exact instructions that could be dual issued but newer cores are always getting better. It is on my list to potentially do something about this, if I can find enough cases of it going wrong to make it look promising, seeing as it's only post-ra that we can easily fix.

Addressing inline comments

Harbormaster failed remote builds in B61915: Diff 273686! Jun 26 2020, 7:37 AM

LGTM

llvm/lib/Target/ARM/ARMTargetMachine.cpp
521–523	Feel free to make this into more of a sentence too, with Capitalization and a full stop.

This revision is now accepted and ready to land. Jun 29 2020, 1:22 AM

Updates some comments. NFC when compared to prior diffs

NickGuy marked an inline comment as done. Jul 1 2020, 2:05 AM

Harbormaster failed remote builds in B62465: Diff 274722! Jul 1 2020, 3:12 AM

Closed by commit rGdc8e4d856615: [ARM] Rearrange SizeReduction when using -Oz (authored by NickGuy). Jul 2 2020, 1:34 AM

This revision was automatically updated to reflect the committed changes.

NickGuy mentioned this in D83667: [ARM] Fix IT block generation after Thumb2SizeReduce with -Oz. Jul 13 2020, 3:36 AM

NickGuy mentioned this in rG18279a54b5d3: [ARM] Fix IT block generation after Thumb2SizeReduce with -Oz. Aug 3 2020, 5:20 AM

NickGuy mentioned this in D88496: [ARM] Fix IT block generation after Thumb2SizeReduce with -Oz. Sep 29 2020, 8:23 AM

NickGuy mentioned this in rGeb9fe24eaf2d: [ARM] Fix IT block generation after Thumb2SizeReduce with -Oz. Oct 29 2020, 8:17 AM

Diff 275019

llvm/lib/Target/ARM/ARMTargetMachine.cpp

 [...]
   if (getOptLevel() != CodeGenOpt::None) {
     addPass(createBreakFalseDeps());
   }
 
   // Expand some pseudo instructions into multiple instructions to allow
   // proper scheduling.
   addPass(createARMExpandPseudoPass());
 
   if (getOptLevel() != CodeGenOpt::None) {
-    // in v8, IfConversion depends on Thumb instruction widths
+    // When optimising for size, always run the Thumb2SizeReduction pass before
+    // IfConversion. Otherwise, check whether IT blocks are restricted
+    // (e.g. in v8, IfConversion depends on Thumb instruction widths)
     addPass(createThumb2SizeReductionPass([this](const Function &F) {
-      return this->TM->getSubtarget<ARMSubtarget>(F).restrictIT();
+      return this->TM->getSubtarget<ARMSubtarget>(F).hasMinSize() ||
+             this->TM->getSubtarget<ARMSubtarget>(F).restrictIT();
     }));
 
     addPass(createIfConverter([](const MachineFunction &MF) {
       return !MF.getSubtarget<ARMSubtarget>().isThumb1Only();
     }));
   }
   addPass(createMVEVPTBlockPass());
   addPass(createThumb2ITBlockPass());
 [...]

llvm/test/CodeGen/ARM/t2-shrink-ldrpost.ll

+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc < %s | FileCheck %s
 
 target datalayout = "e-m:e-p:32:32-i1:8:32-i8:8:32-i16:16:32-f64:32:64-v64:32:64-v128:32:128-a:0:32-n32-S32"
 target triple = "thumbv7m--linux-gnu"
 
-; CHECK-LABEL: f:
-; CHECK: ldm r{{[0-9]}}!, {r[[x:[0-9]]]}
-; CHECK: add.w r[[x]], r[[x]], #3
-; CHECK: stm r{{[0-9]}}!, {r[[x]]}
+; NOTE: When optimising for minimum size, an LDM is expected to be generated
 define void @f(i32 %n, i32* nocapture %a, i32* nocapture readonly %b) optsize minsize {
+; CHECK-LABEL: f:
+; CHECK: @ %bb.0:
+; CHECK-NEXT: cmp r0, #1
+; CHECK-NEXT: blt .LBB0_2
+; CHECK-NEXT: .LBB0_1: @ %.lr.ph
+; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: ldm r2!, {r3}
+; CHECK-NEXT: adds r3, #3
+; CHECK-NEXT: stm r1!, {r3}
+; CHECK-NEXT: subs r0, #1
+; CHECK-NEXT: bne .LBB0_1
+; CHECK-NEXT: .LBB0_2: @ %._crit_edge
+; CHECK-NEXT: bx lr
   %1 = icmp sgt i32 %n, 0
   br i1 %1, label %.lr.ph, label %._crit_edge
 
 .lr.ph: ; preds = %.lr.ph, %0
   %i.04 = phi i32 [ %6, %.lr.ph ], [ 0, %0 ]
   %.03 = phi i32* [ %2, %.lr.ph ], [ %b, %0 ]
   %.012 = phi i32* [ %5, %.lr.ph ], [ %a, %0 ]
   %2 = getelementptr inbounds i32, i32* %.03, i32 1
   %3 = load i32, i32* %.03, align 4
   %4 = add nsw i32 %3, 3
   %5 = getelementptr inbounds i32, i32* %.012, i32 1
   store i32 %4, i32* %.012, align 4
   %6 = add nsw i32 %i.04, 1
   %exitcond = icmp eq i32 %6, %n
   br i1 %exitcond, label %._crit_edge, label %.lr.ph
 
 ._crit_edge: ; preds = %.lr.ph, %0
   ret void
 }
 
-; CHECK-LABEL: f_nominsize:
-; CHECK-NOT: ldm
+; NOTE: When not optimising for minimum size, an LDM is expected not to be generated
 define void @f_nominsize(i32 %n, i32* nocapture %a, i32* nocapture readonly %b) optsize {
+; CHECK-LABEL: f_nominsize:
+; CHECK: @ %bb.0:
+; CHECK-NEXT: cmp r0, #1
+; CHECK-NEXT: it lt
+; CHECK-NEXT: bxlt lr
+; CHECK-NEXT: .LBB1_1: @ %.lr.ph
+; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: ldr r3, [r2], #4
+; CHECK-NEXT: subs r0, #1
+; CHECK-NEXT: add.w r3, r3, #3
+; CHECK-NEXT: str r3, [r1], #4
+; CHECK-NEXT: bne .LBB1_1
+; CHECK-NEXT: @ %bb.2: @ %._crit_edge
+; CHECK-NEXT: bx lr
   %1 = icmp sgt i32 %n, 0
   br i1 %1, label %.lr.ph, label %._crit_edge
 
 .lr.ph: ; preds = %.lr.ph, %0
   %i.04 = phi i32 [ %6, %.lr.ph ], [ 0, %0 ]
   %.03 = phi i32* [ %2, %.lr.ph ], [ %b, %0 ]
   %.012 = phi i32* [ %5, %.lr.ph ], [ %a, %0 ]
   %2 = getelementptr inbounds i32, i32* %.03, i32 1
 [...]
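
To make the size effect in this test concrete, the loop bodies of `f` and `f_nominsize` can be tallied by encoding width. The sketch below is illustrative only: the byte widths are assumptions based on the usual Thumb encoding rules (16-bit for the narrow `ldm`/`stm`/`adds`/`subs` forms and a short conditional branch, 32-bit for post-indexed `ldr`/`str` and `add.w`), and `Insn`, `minsizeLoop`, `noMinsizeLoop` and `totalBytes` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// One instruction of a loop body with its assumed encoding width in bytes.
struct Insn {
  const char *Text;
  int Bytes; // assumption: 2 for a 16-bit encoding, 4 for a 32-bit encoding
};

// Loop body expected for @f (minsize), taken from the new CHECK lines.
std::vector<Insn> minsizeLoop() {
  return {{"ldm r2!, {r3}", 2},
          {"adds r3, #3", 2},
          {"stm r1!, {r3}", 2},
          {"subs r0, #1", 2},
          {"bne .LBB0_1", 2}};
}

// Loop body expected for @f_nominsize, taken from the new CHECK lines.
std::vector<Insn> noMinsizeLoop() {
  return {{"ldr r3, [r2], #4", 4},
          {"subs r0, #1", 2},
          {"add.w r3, r3, #3", 4},
          {"str r3, [r1], #4", 4},
          {"bne .LBB1_1", 2}};
}

// Sums the assumed widths of a loop body.
int totalBytes(const std::vector<Insn> &Body) {
  int Sum = 0;
  for (const Insn &I : Body) {
    assert(I.Bytes == 2 || I.Bytes == 4);
    Sum += I.Bytes;
  }
  return Sum;
}
```

Under these assumptions the minsize loop body comes to 10 bytes against 16 bytes for the non-minsize one.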

llvm/test/CodeGen/Thumb2/constant-hoisting.ll

 [...]
 ; CHECK-V6M-NEXT: .p2align 2
 ; CHECK-V6M-NEXT: .LCPI0_0:
 ; CHECK-V6M-NEXT: .long 537923600
 ;
 ; CHECK-V7M-LABEL: test_values:
 ; CHECK-V7M: mov r2, r0
 ; CHECK-V7M-NEXT: ldr r0, .LCPI0_0
 ; CHECK-V7M-NEXT: cmp r2, #50
-; CHECK-V7M-NEXT: beq .LBB0_3
+; CHECK-V7M-NEXT: beq .LBB0_5
 ; CHECK-V7M-NEXT: cmp r2, #1
-; CHECK-V7M-NEXT: ittt eq
-; CHECK-V7M-NEXT: addeq r0, r1
-; CHECK-V7M-NEXT: addeq r0, #1
-; CHECK-V7M-NEXT: bxeq lr
+; CHECK-V7M-NEXT: beq .LBB0_7
 ; CHECK-V7M-NEXT: cmp r2, #30
-; CHECK-V7M-NEXT: ittt eq
-; CHECK-V7M-NEXT: addeq r0, r1
-; CHECK-V7M-NEXT: addeq r0, #2
-; CHECK-V7M-NEXT: bxeq lr
-; CHECK-V7M-NEXT: cbnz r2, .LBB0_4
-; CHECK-V7M-NEXT: .LBB0_2:
+; CHECK-V7M-NEXT: beq .LBB0_8
+; CHECK-V7M-NEXT: cbnz r2, .LBB0_6
 ; CHECK-V7M-NEXT: add r0, r1
 ; CHECK-V7M-NEXT: bx lr
-; CHECK-V7M-NEXT: .LBB0_3:
+; CHECK-V7M-NEXT: .LBB0_5:
 ; CHECK-V7M-NEXT: add r0, r1
 ; CHECK-V7M-NEXT: adds r0, #4
-; CHECK-V7M-NEXT: .LBB0_4:
+; CHECK-V7M-NEXT: .LBB0_6:
+; CHECK-V7M-NEXT: bx lr
+; CHECK-V7M-NEXT: .LBB0_7:
+; CHECK-V7M-NEXT: add r0, r1
+; CHECK-V7M-NEXT: adds r0, #1
+; CHECK-V7M-NEXT: bx lr
+; CHECK-V7M-NEXT: .LBB0_8:
+; CHECK-V7M-NEXT: add r0, r1
+; CHECK-V7M-NEXT: adds r0, #2
 ; CHECK-V7M-NEXT: bx lr
 ; CHECK-V7M-NEXT: .p2align 2
 ; CHECK-V7M-NEXT: .LCPI0_0:
 ; CHECK-V7M-NEXT: .long 537923600
 entry:
   switch i32 %a, label %return [
     i32 0, label %sw.bb
     i32 1, label %sw.bb1
 [...]

This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Rearrange SizeReduction when using -Oz
Closed, Public
