This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Macro fuse t2LoopDec and t2LoopEnd
Abandoned · Public

Authored by dmgreen on May 11 2020, 11:53 PM.

Details

Reviewers

samparker
SjoerdMeijer
ostannard
efriedma

Summary

Fusing these two pseudo instructions together in the scheduler pushes them closer together in the final assembly. This can help, but on its own it doesn't solve the problem of register allocation going awry and spilling lr in the loop.

Event Timeline

dmgreen created this revision. May 11 2020, 11:53 PM

Herald added a project: Restricted Project. May 11 2020, 11:53 PM

Herald added subscribers: danielkiss, hiraditya, kristof.beyls and 2 others.

Seems like a good idea to me, I can't see this causing any harm, but do you have an example of where it enables LOBs?

In D79767#2031292, @samparker wrote:

Seems like a good idea to me, I can't see this causing any harm, but do you have an example of where it enables LOBs?

Not where it's helping LOBs, exactly. It can help keep the live range of cpsr shorter, allowing us to use more t1 instructions, which can help a little. I've not seen it improve the issue with spilling LR between the t2LoopDec and t2LoopEnd. I think the problem there might be that there is only really a single place for the value to spill at, between those two instructions. Depending on how we try to fix that, this might be useful in its own right or not.

Seems like a good idea to me too. A question on the "fusion" inlined.

llvm/lib/Target/ARM/ARMSubtarget.h
708	Do we need to add LOB here, i.e. are we using that? If so, is "fusing" the right thing? Is the use of LOB here the same as e.g. "fusing AES" instructions?

dmgreen marked an inline comment as done. Jul 21 2020, 3:08 AM

dmgreen added inline comments.

llvm/lib/Target/ARM/ARMSubtarget.h
708	I think the idea is that we only call createARMMacroFusionDAGMutation if we have instructions to fuse, and we would not have ARM::t2LoopEnd if we did not have low overhead branches. If you mean "should we add a hasFuseLOB subtarget feature for this" - I'm not sure. It will depend on how we end up changing the instructions. If we do end up using this method to force t2LoopDec and t2LoopEnd close together in order to reduce the live range of a CPSR, then we should probably do that for all subtargets. But a t2LoopDec and a t2LoopEnd are really modelling a single instruction, and maybe it makes sense to just make them into a single thing? That would obviously make this patch unnecessary.
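For illustration, the gating described in this comment can be sketched as a small standalone model. Everything here is hypothetical: `Opcode` and `ToySubtarget` are stand-ins invented for the sketch, not LLVM types, and only the shape of `hasFusion()` and the pair predicate follows the patch. A null first instruction is treated as a wildcard, as in the diff.

```cpp
// Toy model only: Opcode and ToySubtarget are hypothetical stand-ins,
// not the real ARM::* opcodes or ARMSubtarget.
enum class Opcode { LoopDec, LoopEnd, Add, Mov };

struct ToySubtarget {
  bool HasFuseAES = false;
  bool HasFuseLiterals = false;
  bool HasLOB = false;

  // Mirrors the patched hasFusion(): the fusion mutation is only worth
  // creating when at least one kind of pair can exist on this subtarget.
  bool hasFusion() const { return HasFuseAES || HasFuseLiterals || HasLOB; }
};

// Same shape as isLoopDecEndPair in the patch: a null First means
// "any first instruction", so only Second is checked in that case.
bool isLoopDecEndPair(const Opcode *First, Opcode Second) {
  return (First == nullptr || *First == Opcode::LoopDec) &&
         Second == Opcode::LoopEnd;
}

// Only the LOB case is modelled; the AES and literal pairs are omitted.
bool shouldScheduleAdjacent(const ToySubtarget &ST, const Opcode *First,
                            Opcode Second) {
  return ST.HasLOB && isLoopDecEndPair(First, Second);
}
```

In this model a subtarget without `HasLOB` never fuses the pair, which matches the observation that t2LoopEnd would not exist without low overhead branches.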

SjoerdMeijer added inline comments. Jul 21 2020, 3:14 AM

llvm/lib/Target/ARM/ARMSubtarget.h
708	I was really starting to think the same: "But a t2LoopDec and a t2LoopEnd are really modelling a single instruction, and maybe it makes sense to just make them into a single thing?" What we do here in this patch starts to feel like a convoluted approach, and indeed, since we are modelling a single instruction, shouldn't we go for that then? Or would that have disadvantages?

dmgreen marked an inline comment as done. Jul 21 2020, 3:33 AM

dmgreen added inline comments.

llvm/lib/Target/ARM/ARMSubtarget.h
708	My understanding is that it is a terminator that produces a value, and that does not work very well. If you need to spill the value, you would have no place to spill it. I may be wrong though, I'm not entirely sure why it was done the way it was and am really just guessing. We had an older downstream implementation that worked that way I think, but it always had bugs. Whether they were fixable I'm not sure. It would take some investigation.

My understanding is that it is a terminator that produces a value, and that does not work very well. If you need to spill the value, you would have no place to spill it.

This.
It could also have the benefit of letting other instructions use the value in LR. Maybe it makes reverting easier too, but I can't even remember now... Either way, I think trying to fuse the operations is quite a bit less invasive than trying to re-implement how we do these loops :) and if it can give us some little improvements, let's do it.
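The "no place to spill" argument can be made concrete with a deliberately simplified model. The assumption here, which is a simplification and not a description of how LLVM's register allocator works, is that a spill store has to go directly after the defining instruction, in the same block, and before any terminator. `ToyInstr` and both helper functions are hypothetical names for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy instruction; not MachineInstr.
struct ToyInstr {
  bool IsTerminator;  // must stay grouped at the end of the block
  bool DefinesValue;  // produces a value that might need spilling
};

// Index of the first terminator, or Block.size() if there is none.
std::size_t firstTerminator(const std::vector<ToyInstr> &Block) {
  std::size_t I = 0;
  while (I < Block.size() && !Block[I].IsTerminator)
    ++I;
  return I;
}

// True if a store could be inserted right after the def at DefIdx without
// landing behind a terminator (the simplifying assumption of this sketch).
bool hasSpillSlotAfter(const std::vector<ToyInstr> &Block, std::size_t DefIdx) {
  assert(DefIdx < Block.size() && Block[DefIdx].DefinesValue);
  return DefIdx + 1 <= firstTerminator(Block);
}
```

With the split form (a value-producing decrement followed by a separate terminator) the model finds exactly one slot, between the two instructions. With a single combined terminator that also defines the value, it finds none.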

This revision is now accepted and ready to land. Aug 3 2020, 3:48 AM

The last time me and Sjoerd talked about it, we figured this wouldn't fix the issue exactly (it only fuses them during scheduling, you can still spill between the t2LoopDec and the t2LoopEnd), and as we need a proper solution anyway this might not end up being useful on its own.

But I will run some tests again and see if this is useful anyway. If it is it may be worth getting in.

We went a different way with this, creating a new combined instruction instead.

Revision Contents

Path                                                    Size
llvm/lib/Target/ARM/ARMMacroFusion.cpp                  12 lines
llvm/lib/Target/ARM/ARMSubtarget.h                      2 lines
llvm/test/CodeGen/Thumb2/LowOverheadLoops/regalloc.ll   32 lines

Diff 263362

llvm/lib/Target/ARM/ARMMacroFusion.cpp

@@ (first 39 lines hidden) @@ static bool isLiteralsPair(const MachineInstr *FirstMI,
   // Assume the 1st instr to be a wildcard if it is unspecified.
   if ((FirstMI == nullptr || FirstMI->getOpcode() == ARM::MOVi16) &&
       SecondMI.getOpcode() == ARM::MOVTi16)
     return true;

   return false;
 }

+// Fuse t2LoopDec and t2LoopEnd
+static bool isLoopDecEndPair(const MachineInstr *FirstMI,
+                             const MachineInstr &SecondMI) {
+  // Assume the 1st instr to be a wildcard if it is unspecified.
+  return (FirstMI == nullptr || FirstMI->getOpcode() == ARM::t2LoopDec) &&
+         SecondMI.getOpcode() == ARM::t2LoopEnd;
+}
+
 /// Check if the instr pair, FirstMI and SecondMI, should be fused
 /// together. Given SecondMI, when FirstMI is unspecified, then check if
 /// SecondMI may be part of a fused pair at all.
 static bool shouldScheduleAdjacent(const TargetInstrInfo &TII,
                                    const TargetSubtargetInfo &TSI,
                                    const MachineInstr *FirstMI,
                                    const MachineInstr &SecondMI) {
   const ARMSubtarget &ST = static_cast<const ARMSubtarget&>(TSI);

   if (ST.hasFuseAES() && isAESPair(FirstMI, SecondMI))
     return true;
   if (ST.hasFuseLiterals() && isLiteralsPair(FirstMI, SecondMI))
     return true;
+  if (ST.hasLOB() && isLoopDecEndPair(FirstMI, SecondMI))
+    return true;

   return false;
 }

-std::unique_ptr<ScheduleDAGMutation> createARMMacroFusionDAGMutation () {
+std::unique_ptr<ScheduleDAGMutation> createARMMacroFusionDAGMutation() {
   return createMacroFusionDAGMutation(shouldScheduleAdjacent);
 }

 } // end namespace llvm

llvm/lib/Target/ARM/ARMSubtarget.h

@@ (699 lines hidden) @@ public:
   bool hasFP16() const { return HasFP16; }
   bool hasD32() const { return HasD32; }
   bool hasFullFP16() const { return HasFullFP16; }
   bool hasFP16FML() const { return HasFP16FML; }

   bool hasFuseAES() const { return HasFuseAES; }
   bool hasFuseLiterals() const { return HasFuseLiterals; }
   /// Return true if the CPU supports any kind of instruction fusion.
-  bool hasFusion() const { return hasFuseAES() || hasFuseLiterals(); }
+  bool hasFusion() const { return hasFuseAES() || hasFuseLiterals() || hasLOB(); }

   bool hasMatMulInt8() const { return HasMatMulInt8; }

   const Triple &getTargetTriple() const { return TargetTriple; }

   bool isTargetDarwin() const { return TargetTriple.isOSDarwin(); }
   bool isTargetIOS() const { return TargetTriple.isiOS(); }
   bool isTargetWatchOS() const { return TargetTriple.isWatchOS(); }
@@ (187 lines hidden) @@

llvm/test/CodeGen/Thumb2/LowOverheadLoops/regalloc.ll

@@ (52 lines hidden) @@
 ; CHECK-NEXT: movlt r12, r5
 ; CHECK-NEXT: ldrsb.w r5, [r4, #2]
 ; CHECK-NEXT: sxtb.w lr, r12
 ; CHECK-NEXT: it lt
 ; CHECK-NEXT: addlt r7, r6, #3
 ; CHECK-NEXT: adds r6, #4
 ; CHECK-NEXT: cmp lr, r5
 ; CHECK-NEXT: mov lr, r10
-; CHECK-NEXT: sub.w lr, lr, #1
-; CHECK-NEXT: add.w r4, r4, #4
-; CHECK-NEXT: mov r10, lr
-; CHECK-NEXT: it lt
+; CHECK-NEXT: itt lt
 ; CHECK-NEXT: movlt r12, r5
-; CHECK-NEXT: it lt
 ; CHECK-NEXT: movlt r7, r6
-; CHECK-NEXT: cmp.w lr, #0
+; CHECK-NEXT: adds r4, #4
+; CHECK-NEXT: subs.w lr, lr, #1
+; CHECK-NEXT: mov r10, lr
 ; CHECK-NEXT: bne .LBB0_5
 ; CHECK-NEXT: b .LBB0_6
 ; CHECK-NEXT: .LBB0_6: @ %while.end.loopexit.unr-lcssa.loopexit
 ; CHECK-NEXT: add r0, r6
 ; CHECK-NEXT: sub.w r9, r9, r6
 ; CHECK-NEXT: cmp.w r8, #0
 ; CHECK-NEXT: beq .LBB0_10
 ; CHECK-NEXT: .LBB0_7: @ %while.body.epil
@@ (147 lines hidden) @@
 ; CHECK-NEXT: cmphi r3, r0
 ; CHECK-NEXT: bhi .LBB1_6
 ; CHECK-NEXT: @ %bb.3: @ %vector.ph
 ; CHECK-NEXT: bic r6, r2, #7
 ; CHECK-NEXT: movs r3, #1
 ; CHECK-NEXT: sub.w r12, r6, #8
 ; CHECK-NEXT: and r7, r2, #7
 ; CHECK-NEXT: add.w r4, r3, r12, lsr #3
-; CHECK-NEXT: add.w r3, r0, r6, lsl #1
-; CHECK-NEXT: add.w r12, r1, r6, lsl #1
+; CHECK-NEXT: add.w r12, r0, r6, lsl #1
+; CHECK-NEXT: add.w r3, r1, r6, lsl #1
 ; CHECK-NEXT: mov r5, r4
 ; CHECK-NEXT: .LBB1_4: @ %vector.body
 ; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT: vldrh.u16 q0, [r0], #16
 ; CHECK-NEXT: mov lr, r5
-; CHECK-NEXT: subs.w lr, lr, #1
 ; CHECK-NEXT: vabs.f16 q0, q0
-; CHECK-NEXT: mov r5, lr
 ; CHECK-NEXT: vstrb.8 q0, [r1], #16
+; CHECK-NEXT: subs.w lr, lr, #1
+; CHECK-NEXT: mov r5, lr
 ; CHECK-NEXT: bne .LBB1_4
 ; CHECK-NEXT: b .LBB1_5
 ; CHECK-NEXT: .LBB1_5: @ %middle.block
 ; CHECK-NEXT: cmp r6, r2
 ; CHECK-NEXT: mov lr, r7
 ; CHECK-NEXT: bne .LBB1_7
 ; CHECK-NEXT: b .LBB1_9
 ; CHECK-NEXT: .LBB1_6:
 ; CHECK-NEXT: mov lr, r2
-; CHECK-NEXT: mov r3, r0
-; CHECK-NEXT: mov r12, r1
+; CHECK-NEXT: mov r12, r0
+; CHECK-NEXT: mov r3, r1
 ; CHECK-NEXT: .LBB1_7: @ %while.body.preheader18
 ; CHECK-NEXT: dls lr, lr
 ; CHECK-NEXT: .LBB1_8: @ %while.body
 ; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
-; CHECK-NEXT: vldr.16 s0, [r3]
-; CHECK-NEXT: add.w r0, r12, #2
-; CHECK-NEXT: adds r3, #2
+; CHECK-NEXT: vldr.16 s0, [r12]
+; CHECK-NEXT: adds r0, r3, #2
+; CHECK-NEXT: add.w r12, r12, #2
 ; CHECK-NEXT: vabs.f16 s0, s0
-; CHECK-NEXT: vstr.16 s0, [r12]
-; CHECK-NEXT: mov r12, r0
+; CHECK-NEXT: vstr.16 s0, [r3]
+; CHECK-NEXT: mov r3, r0
 ; CHECK-NEXT: le lr, .LBB1_8
 ; CHECK-NEXT: .LBB1_9: @ %while.end
 ; CHECK-NEXT: pop {r4, r5, r6, r7, pc}
 entry:
   %cmp4 = icmp eq i32 %blockSize, 0
   br i1 %cmp4, label %while.end, label %while.body.preheader

 while.body.preheader: ; preds = %entry
@@ (60 lines hidden) @@