This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Check correct instructions for load/store rescheduling.
Closed, Public

Authored by efriedma on Feb 24 2017, 6:26 PM.

Details

Reviewers

rengolin
t.p.northover
jmolloy
MatzeB

Commits

rG28c2c0e311b3: [ARM] Check correct instructions for load/store rescheduling.
rL296701: [ARM] Check correct instructions for load/store rescheduling.

Summary

See also https://reviews.llvm.org/D30124. I don't think this can actually cause a miscompile, but it's not obvious.

This code starts from the high end of the sorted vector of offsets, and works backwards: it tries to find contiguous offsets, process them, then pops them from the end of the vector. Most of the code agrees with this order of processing, but one loop doesn't: it instead processes elements from the low end of the vector (which are nodes with unrelated offsets). Fix that loop to process the correct elements.

This has a few implications. One, we don't incorrectly return early when processing multiple groups of offsets in the same block (which allows rescheduling prera-ldst-insertpt.mir). Two, we pick the correct insert point for loads, so they're correctly sorted (which affects the scheduling of vldm-liveness.ll). I think it might also impact some of the heuristics slightly.
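
The index mix-up described above can be illustrated with a small standalone sketch. This is not the code from ARMLoadStoreOptimizer.cpp: plain integer offsets stand in for the instructions, the names collectLowEnd and collectHighEnd are hypothetical, and the sample offsets are only illustrative. Only the two loop headers are taken from the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the sorted vector of memory ops. Each entry is
// just an offset; the group currently being processed sits at the back.
using Offsets = std::vector<int>;

// Old loop shape: visits indices NumMove-1 down to 0, i.e. the entries at
// the low end of the vector, which belong to unrelated groups.
Offsets collectLowEnd(const Offsets &ops, std::size_t numMove) {
  assert(numMove <= ops.size());
  Offsets picked;
  for (int i = static_cast<int>(numMove) - 1; i >= 0; --i)
    picked.push_back(ops[i]);
  return picked;
}

// New loop shape: visits the last NumMove entries, i.e. the group that is
// about to be processed and popped from the end of the vector.
Offsets collectHighEnd(const Offsets &ops, std::size_t numMove) {
  assert(numMove <= ops.size());
  Offsets picked;
  for (std::size_t i = ops.size() - numMove, e = ops.size(); i != e; ++i)
    picked.push_back(ops[i]);
  return picked;
}
```

With ops = {20, 16, 4, 0} and numMove = 2, collectHighEnd picks the two entries at the back (4 and 0), while collectLowEnd picks 16 and 20, which are not part of the group being popped.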

Diff Detail

Repository: rL LLVM

Event Timeline

efriedma created this revision. Feb 24 2017, 6:26 PM

Herald added a subscriber: aemerson. Feb 24 2017, 6:26 PM

This change doesn't look obvious to me. Can you add a short description of what's happening here and why this is the fix?

efriedma edited the summary of this revision. Feb 28 2017, 11:25 AM

Sorry; added a better description to the summary. Quoting for the mailing list:

See also https://reviews.llvm.org/D30124. I don't think this can actually cause a miscompile, but it's not obvious.

This code starts from the high end of the sorted vector of offsets, and works backwards: it tries to find contiguous offsets, process them, then pops them from the end of the vector. Most of the code agrees with this order of processing, but one loop doesn't: it instead processes elements from the low end of the vector (which are nodes with unrelated offsets). Fix that loop to process the correct elements.

This has a few implications. One, we don't incorrectly return early when processing multiple groups of offsets in the same block (which allows rescheduling prera-ldst-insertpt.mir). Two, we pick the correct insert point for loads, so they're correctly sorted (which affects the scheduling of vldm-liveness.ll). I think it might also impact some of the heuristics slightly.

Thanks Eli, that makes more sense. A single nit, if you could move the check lines right before/after each test, so that we can easily compare them. Otherwise, LGTM. Thanks!

This revision is now accepted and ready to land. Feb 28 2017, 12:59 PM

Closed by commit rL296701: [ARM] Check correct instructions for load/store rescheduling. (authored by efriedma). Mar 1 2017, 3:08 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                    Size
llvm/trunk/lib/Target/ARM/ARMLoadStoreOptimizer.cpp     2 lines
llvm/trunk/test/CodeGen/ARM/prera-ldst-insertpt.mir     99 lines
llvm/trunk/test/CodeGen/ARM/vldm-liveness.ll            19 lines
llvm/trunk/test/CodeGen/ARM/vldm-liveness.mir           40 lines

Diff 90242

llvm/trunk/lib/Target/ARM/ARMLoadStoreOptimizer.cpp

@@ ... @@ for (int i = Ops.size() - 1; i >= 0; --i) {
         break;
     }
 
     if (NumMove <= 1)
       Ops.pop_back();
     else {
       SmallPtrSet<MachineInstr*, 4> MemOps;
       SmallSet<unsigned, 4> MemRegs;
-      for (int i = NumMove-1; i >= 0; --i) {
+      for (size_t i = Ops.size() - NumMove, e = Ops.size(); i != e; ++i) {
         MemOps.insert(Ops[i]);
         MemRegs.insert(Ops[i]->getOperand(0).getReg());
       }
 
       // Be conservative, if the instructions are too far apart, don't
       // move them. We want to limit the increase of register pressure.
       bool DoMove = (LastLoc - FirstLoc) <= NumMove*4; // FIXME: Tune this.
       if (DoMove)
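
The distance check in the hunk above can be read in isolation as follows. This is a hedged sketch: shouldMove is a hypothetical helper name, and the parameters are assumed to be unsigned position counters within the block, as the names FirstLoc and LastLoc suggest. The factor 4 is the value the FIXME asks to tune.

```cpp
#include <cassert>

// Hypothetical helper mirroring the DoMove expression: only reschedule when
// the span between the first and last instruction of the group is at most
// four positions per instruction being moved.
bool shouldMove(unsigned firstLoc, unsigned lastLoc, unsigned numMove) {
  assert(firstLoc <= lastLoc);
  return (lastLoc - firstLoc) <= numMove * 4;
}
```

For two instructions the limit is a span of 8, so shouldMove(3, 10, 2) is true and shouldMove(3, 12, 2) is false.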

llvm/trunk/test/CodeGen/ARM/prera-ldst-insertpt.mir

				# RUN: llc -run-pass arm-prera-ldst-opt %s -o - | FileCheck %s
				--- |
				target triple = "thumbv7---eabi"

				define void @a(i32* nocapture %x, i32 %y, i32 %z) {
				entry:
				ret void
				}

				define void @b(i32* nocapture %x, i32 %y, i32 %z) {
				entry:
				ret void
				}
				...
				---
				# CHECK-LABEL: name: a
				name: a
				alignment: 1
				tracksRegLiveness: true
				liveins:
				- { reg: '%r0', virtual-reg: '%0' }
				- { reg: '%r1', virtual-reg: '%1' }
				- { reg: '%r2', virtual-reg: '%2' }
				body: |
				bb.0.entry:
				liveins: %r0, %r1, %r2

				%2 : rgpr = COPY %r2
				%1 : rgpr = COPY %r1
				%0 : gpr = COPY %r0
				%3 : rgpr = t2MUL %2, %2, 14, _
				%4 : rgpr = t2MUL %1, %1, 14, _
				%5 : rgpr = t2MOVi32imm -858993459
				%6 : rgpr, %7 : rgpr = t2UMULL killed %3, %5, 14, _
				%8 : rgpr, %9 : rgpr = t2UMULL killed %4, %5, 14, _
				t2STRi12 %1, %0, 0, 14, _ :: (store 4)
				%10 : rgpr = t2LSLri %2, 1, 14, _, _
				t2STRi12 killed %10, %0, 4, 14, _ :: (store 4)
				%11 : rgpr = t2MOVi 55, 14, _, _
				%12 : gprnopc = t2ADDrs %11, killed %7, 19, 14, _, _
				t2STRi12 killed %12, %0, 16, 14, _ :: (store 4)
				%13 : gprnopc = t2ADDrs %11, killed %9, 19, 14, _, _
				t2STRi12 killed %13, %0, 20, 14, _ :: (store 4)

				; Make sure we move the paired stores next to each other.
				; FIXME: Make sure we don't extend the live-range of a store
				; when we don't need to.
				; CHECK: t2STRi12 %1,
				; CHECK-NEXT: t2STRi12 killed %10,
				; CHECK-NEXT: %13 = t2ADDrs %11
				; CHECK-NEXT: t2STRi12 killed %12,
				; CHECK-NEXT: t2STRi12 killed %13,

				tBX_RET 14, _
				---
				# CHECK-LABEL: name: b
				name: b
				alignment: 1
				tracksRegLiveness: true
				liveins:
				- { reg: '%r0', virtual-reg: '%0' }
				- { reg: '%r1', virtual-reg: '%1' }
				- { reg: '%r2', virtual-reg: '%2' }
				body: |
				bb.0.entry:
				liveins: %r0, %r1, %r2

				%2 : rgpr = COPY %r2
				%1 : rgpr = COPY %r1
				%0 : gpr = COPY %r0
				t2STRi12 %1, %0, 0, 14, _ :: (store 4)
				%10 : rgpr = t2LSLri %2, 1, 14, _, _
				t2STRi12 killed %10, %0, 4, 14, _ :: (store 4)
				%3 : rgpr = t2MUL %2, %2, 14, _
				t2STRi12 %3, %0, 8, 14, _ :: (store 4)
				%4 : rgpr = t2MUL %1, %1, 14, _
				%5 : rgpr = t2MOVi32imm -858993459
				%6 : rgpr, %7 : rgpr = t2UMULL killed %3, %5, 14, _
				%8 : rgpr, %9 : rgpr = t2UMULL killed %4, %5, 14, _
				%10 : rgpr = t2LSLri %2, 1, 14, _, _
				%11 : rgpr = t2MOVi 55, 14, _, _
				%12 : gprnopc = t2ADDrs %11, killed %7, 19, 14, _, _
				t2STRi12 killed %12, %0, 16, 14, _ :: (store 4)
				%13 : gprnopc = t2ADDrs %11, killed %9, 19, 14, _, _
				t2STRi12 killed %13, %0, 20, 14, _ :: (store 4)

				; Make sure we move the paired stores next to each other.
				; FIXME: Make sure we don't extend the live-range of a store
				; when we don't need to.
				; CHECK: t2STRi12 {{.*}}, 0
				; CHECK-NEXT: t2STRi12 {{.*}}, 4
				; CHECK-NEXT: t2STRi12 {{.*}}, 8
				; CHECK-NEXT: t2ADDrs
				; CHECK-NEXT: t2STRi12 {{.*}}, 16
				; CHECK-NEXT: t2STRi12 {{.*}}, 20

				tBX_RET 14, _

				...

llvm/trunk/test/CodeGen/ARM/vldm-liveness.ll

 ; RUN: llc -mtriple thumbv7-apple-ios -verify-machineinstrs -o - %s | FileCheck %s
 
-; ARM load store optimizer was dealing with a sequence like:
-; s1 = VLDRS [r0, 1], Q0<imp-def>
-; s3 = VLDRS [r0, 2], Q0<imp-use,kill>, Q0<imp-def>
-; s0 = VLDRS [r0, 0], Q0<imp-use,kill>, Q0<imp-def>
-; s2 = VLDRS [r0, 4], Q0<imp-use,kill>, Q0<imp-def>
-;
-; It decided to combine the {s0, s1} loads into a single instruction in the
-; third position. However, this leaves the instruction defining s3 with a stray
-; imp-use of Q0, which is undefined.
-;
-; The verifier catches this, so this test just makes sure that appropriate
-; liveness flags are added.
-;
-; I believe the change will be tested as long as the vldmia is not the first of
-; the loads. Earlier optimisations may perturb the output over time, but
-; fiddling the indices should be sufficient to restore the test.
+; Make sure we emit the loads in ascending order, and form a vldmia.
+;
+; See vldm-liveness.mir for the bug this file originally testing.
 
 define arm_aapcs_vfpcc <4 x float> @foo(float* %ptr) {
 ; CHECK-LABEL: foo:
-; CHECK: vldr s3, [r0, #8]
 ; CHECK: vldmia r0, {s0, s1}
+; CHECK: vldr s3, [r0, #8]
 ; CHECK: vldr s2, [r0, #16]
   %off0 = getelementptr float, float* %ptr, i32 0
   %val0 = load float, float* %off0
   %off1 = getelementptr float, float* %ptr, i32 1
   %val1 = load float, float* %off1
   %off4 = getelementptr float, float* %ptr, i32 4
   %val4 = load float, float* %off4
   %off2 = getelementptr float, float* %ptr, i32 2
(remaining 9 lines not shown)

llvm/trunk/test/CodeGen/ARM/vldm-liveness.mir

				# RUN: llc -run-pass arm-ldst-opt -verify-machineinstrs %s -o - | FileCheck %s
				# ARM load store optimizer was dealing with a sequence like:
				# s1 = VLDRS [r0, 1], Q0<imp-def>
				# s3 = VLDRS [r0, 2], Q0<imp-use,kill>, Q0<imp-def>
				# s0 = VLDRS [r0, 0], Q0<imp-use,kill>, Q0<imp-def>
				# s2 = VLDRS [r0, 4], Q0<imp-use,kill>, Q0<imp-def>
				#
				# It decided to combine the {s0, s1} loads into a single instruction in the
				# third position. However, this leaves the instruction defining s3 with a stray
				# imp-use of Q0, which is undefined.
				#
				# The verifier catches this, so this test just makes sure that appropriate
				# liveness flags are added.
				--- |
				target triple = "thumbv7-apple-ios"
				define arm_aapcs_vfpcc <4 x float> @foo(float* %ptr) {
				ret <4 x float> undef
				}
				...
				---
				name: foo
				alignment: 1
				liveins:
				- { reg: '%r0' }
				body: |
				bb.0 (%ir-block.0):
				liveins: %r0

				%s1 = VLDRS %r0, 1, 14, _, implicit-def %q0 :: (load 4)
				%s3 = VLDRS %r0, 2, 14, _, implicit killed %q0, implicit-def %q0 :: (load 4)
				; CHECK: %s3 = VLDRS %r0, 2, 14, _, implicit killed undef %q0, implicit-def %q0 :: (load 4)

				%s0 = VLDRS %r0, 0, 14, _, implicit killed %q0, implicit-def %q0 :: (load 4)
				; CHECK: VLDMSIA %r0, 14, _, def %s0, def %s1, implicit-def _

				%s2 = VLDRS killed %r0, 4, 14, _, implicit killed %q0, implicit-def %q0 :: (load 4)
				; CHECK: %s2 = VLDRS killed %r0, 4, 14, _, implicit killed %q0, implicit-def %q0 :: (load 4)

				tBX_RET 14, _, implicit %q0
				...