This is an archive of the discontinued LLVM Phabricator instance.


Differential D59256

[ARM] Disable LDM with offset for thumb2 cortex-m cpus
Needs Review · Public

Authored by dmgreen on Mar 12 2019, 7:55 AM.


Details

Reviewers

efriedma
samparker
fhahn
t.p.northover

Summary

When not optimising for code size, the extra ADD that can be inserted to form the base of an LDM adds a cycle of latency. Using individual LDR's, which are usually pipelined, takes fewer total cycles. On Thumb1 CPUs the loads are not pipelined, so forming the LDM is still profitable there.

This started out as "just turn off the load store optimiser", but has become a little more refined since then. The test case is new; I'm only showing the diff here for clarity.
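
As a rough illustration of the latency argument in the summary, the following C++ sketch compares the two instruction sequences under a deliberately simplified cost model. The 1+N figure for an LDM of N registers is the one quoted in the discussion on this revision; the 1-cycle ADD and the N+1 cycles for N back-to-back pipelined LDR's are assumptions chosen for illustration, and the function names are hypothetical, not LLVM APIs.

```cpp
// Simplified, assumed cycle model; not taken from LLVM's scheduling data.

// N back-to-back single loads, assuming they pipeline:
// 2 cycles for the first load, 1 more for each load after it.
constexpr unsigned pipelinedLdrCycles(unsigned NumRegs) {
  return NumRegs == 0 ? 0 : NumRegs + 1;
}

// One LDM of N registers: 1 + N cycles (figure quoted in the review).
constexpr unsigned ldmCycles(unsigned NumRegs) { return 1 + NumRegs; }

// ADD to materialise the new base (assumed 1 cycle) followed by the LDM.
constexpr unsigned addPlusLdmCycles(unsigned NumRegs) {
  return 1 + ldmCycles(NumRegs);
}

// Under this model the ADD + LDM form is always one cycle slower.
static_assert(addPlusLdmCycles(3) == pipelinedLdrCycles(3) + 1,
              "ADD + LDM costs one extra cycle in this model");
```

Under these assumptions, the three-register case from the test costs 5 cycles as ADD plus LDM and 4 cycles as separate loads. Real timings depend on the core and the memory system.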


Event Timeline

dmgreen created this revision. · Mar 12 2019, 7:55 AM

Herald added a project: Restricted Project. · Mar 12 2019, 7:55 AM

Herald added subscribers: hiraditya, kristof.beyls, javed.absar.

t.p.northover added inline comments. · Mar 12 2019, 8:09 AM

llvm/lib/Target/ARM/ARMLoadStoreOptimizer.cpp
Line 677: I think the check should be `optForMinSize`. Clang interprets -Os in a more performance-oriented way than GCC; something like "don't needlessly bloat code". -Oz is the real option to squash everything as much as possible.

dmgreen marked 2 inline comments as done. · Mar 12 2019, 9:35 AM

dmgreen added inline comments.

llvm/lib/Target/ARM/ARMLoadStoreOptimizer.cpp
Line 677: Sure. This does trade size for performance, in that you will get more LDR's that are not turned into a single LDM (plus the ADD). Happy to change that, though; the example in the test case ends up using an add.w, so I expect the size differences in many cases will not be very large.
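
To make the difference between the two checks concrete, here is a small self-contained sketch. `FnAttrs` and the helper functions are hypothetical stand-ins rather than LLVM's `Function` class. It assumes that -Os marks functions `optsize` only, that -Oz marks them both `optsize` and `minsize`, and that a general size query is true for either attribute while the min-size query is true only for `minsize`.

```cpp
// Hypothetical stand-in for the size-related attributes on a function.
struct FnAttrs {
  bool OptSize; // assumed to be set at -Os and -Oz
  bool MinSize; // assumed to be set at -Oz only
};

// Models a general "optimise for size" query: true for either attribute.
inline bool optForSizeModel(const FnAttrs &A) { return A.OptSize || A.MinSize; }

// Models the stricter query suggested in the review: true only for minsize.
inline bool optForMinSizeModel(const FnAttrs &A) { return A.MinSize; }

// Attribute sets assumed for each optimisation level.
inline FnAttrs attrsForO2() { return {false, false}; }
inline FnAttrs attrsForOs() { return {true, false}; }
inline FnAttrs attrsForOz() { return {true, true}; }
```

With the min-size check, a function built at -Os is treated like a performance build and gets the separate loads; only -Oz keeps the ADD plus LDM form.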

Now minsize

Since we're talking about a Thumb2 core, can we form ldrd here?

It may be possible to create ldrd's, but I don't think that they will be any quicker. An ldrd will take the same time as an ldm (1+N cycles). It could be smaller, depending on whether T1 ldr's are used.

From a quick set of benchmarks I just ran, it looks like on average it did worse with ldrd's than ldr's. Perhaps because of less scheduling freedom?

Revision Contents

Path                                             Size
llvm/lib/Target/ARM/ARMLoadStoreOptimizer.cpp    4 lines
llvm/test/CodeGen/Thumb2/ldstopt-addm.ll         5 lines

Diff 190282

llvm/lib/Target/ARM/ARMLoadStoreOptimizer.cpp

[667 lines not shown; nearest context: if (Offset == 4 && haveIBAndDA) {]
     if (NumRegs <= 2)
       return nullptr;
 
     // On Thumb1, it's not worth materializing a new base register without
     // clobbering the CPSR (i.e. not using ADDS/SUBS).
     if (!SafeToClobberCPSR)
       return nullptr;
 
+    // On M class cores, the extra add will only increase latency
+    if (STI->isMClass() && !isThumb1 && !MF->getFunction().optForMinSize())
+      return nullptr;
+
     unsigned NewBase;
     if (isi32Load(Opcode)) {
       // If it is a load, then just use one of the destination registers
       // as the new base. Will no longer be writeback in Thumb1.
       NewBase = Regs[NumRegs-1].first;
       Writeback = false;
     } else {
       // Find a free register that we can use as scratch register.
[1,774 lines not shown]
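
The added early exit sits after the two existing ones, so the combined gating can be summarised in isolation. The sketch below is a standalone model, not code from the patch: `MergeQuery` and `mayMaterializeNewBase` are hypothetical names, and the boolean fields stand in for the `SafeToClobberCPSR`, `STI->isMClass()`, `isThumb1` and `optForMinSize()` values used in the diff.

```cpp
// Hypothetical bundle of the values the three early exits look at.
struct MergeQuery {
  unsigned NumRegs;       // number of loads/stores being merged
  bool SafeToClobberCPSR; // stands in for the SafeToClobberCPSR flag
  bool IsMClass;          // stands in for STI->isMClass()
  bool IsThumb1;          // stands in for isThumb1
  bool OptForMinSize;     // stands in for optForMinSize()
};

// Returns true when none of the early exits fire, i.e. when a new base
// register may be materialised so the accesses can be merged.
inline bool mayMaterializeNewBase(const MergeQuery &Q) {
  // Existing check: only worthwhile when merging more than two accesses.
  if (Q.NumRegs <= 2)
    return false;
  // Existing check: the base cannot be built without clobbering the CPSR.
  if (!Q.SafeToClobberCPSR)
    return false;
  // Check added by this revision: Thumb2 M-class cores, unless minsize.
  if (Q.IsMClass && !Q.IsThumb1 && !Q.OptForMinSize)
    return false;
  return true;
}
```

In this model a Thumb2 M-class function merging three loads is rejected unless it is marked minsize, while Thumb1 M-class and non-M-class targets are unaffected by the new check.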

llvm/test/CodeGen/Thumb2/ldstopt-addm.ll

[16 lines not shown]
 ; CHECK-T1-NEXT: ldr r0, [r0, #60]
 ; CHECK-T1-NEXT: adds r0, r3, r0
 ; CHECK-T1-NEXT: adds r1, r1, r2
 ; CHECK-T1-NEXT: adds r0, r1, r0
 ; CHECK-T1-NEXT: bx lr
 ;
 ; CHECK-LABEL: test:
 ; CHECK: @ %bb.0: @ %entry
-; CHECK-NEXT: add.w r3, r0, #48
-; CHECK-NEXT: ldm r3, {r1, r2, r3}
+; CHECK-NEXT: ldr r1, [r0, #48]
+; CHECK-NEXT: ldr r2, [r0, #52]
+; CHECK-NEXT: ldr r3, [r0, #56]
 ; CHECK-NEXT: ldr r0, [r0, #60]
 ; CHECK-NEXT: add r0, r3
 ; CHECK-NEXT: add r1, r2
 ; CHECK-NEXT: add r0, r1
 ; CHECK-NEXT: bx lr
 entry:
   %gep1 = getelementptr i32, i32* %src, i32 12
   %gep2 = getelementptr i32, i32* %src, i32 13
[47 lines not shown]