Download Raw Diff

Details

Reviewers

Commits

rGf09b891d4a3f: [SystemZ] Improve foldMemoryOperandImpl()

Summary

A spilled load of an immediate can use MVHI/MVGHI instead.
A compare of a spilled register against an immediate can use CHSI/CGHSI.

On SPEC 17:                          trunk <> patched

chsi           :                53231                59356    +6125
lt             :                19324                14060    -5264
cghsi          :                29368                34598    +5230
ltg            :               166914               161949    -4965
mvhi           :                29323                33923    +4600
lhi            :               262083               257623    -4460
st             :               181993               177559    -4434
mvghi          :                54650                58599    +3949
stg            :               409267               405380    -3887
lghi           :               467915               464077    -3838
l              :               231431               230640     -791
jlh            :               178461               178961     +500
lg             :              1093362              1092985     -377
cijlh          :                83235                82875     -360
je             :               340896               341233     +337
cije           :               111150               110969     -181
jl             :                52685                52808     +123
chi            :                60634                60530     -104
cijl           :                13434                13358      -76

Since LT/LTG and LHI/LGHI use a register write and an extra instruction, while CHSI/CGSI and MVHI/MVGHI do not, this should be a general improvement. I didn't see any big change in spilling/reloading, though (in fact a very slight increase in number of instructions which is probably related to later optimizations).

This is the remaining improvements I could see while looking at imagick. It seems to improve it maybe yet another percent or so.

Diff Detail

Event Timeline

jonpa created this revision.Mar 12 2020, 3:22 AM

Herald added a project: Restricted Project. · View Herald TranscriptMar 12 2020, 3:22 AM

Herald added subscribers: llvm-commits, hiraditya. · View Herald Transcript

Ah, that's a good idea. I agree this makes sense.

I'm wondering if there are other memory-and-immediate instructions that could be handled like that. Looking at the list, I see:

logical compares (CLFHSI, CLGHSI) -- those should be trivial to add
addition (ASI, AGSI, potentially also ALSI, ALGSI) -- it should be possible to add them, I think (it may be a bit different since it would replace a read-modify-write)

Then there are a number of instructions that operate on a byte or halfword level in memory:

MVI, MVHHI,
CLI, CHHSI, CLHHSI
NI, OI, XI

I'm not sure if there is an option to exploit those, given that LLVM will never spill 8- or 16-bit values. But maybe there could be some opportunities in special cases? E.g. an OILL with a small constant could be replaced by an OI targeting just the low byte of the in-memory value.

Maybe this would be something to think about for future extensions.

logical compares (CLFHSI, CLGHSI) -- those should be trivial to add

Patch updated to include these as well, with a check for an uint<16> immediate. This gives +1800 CLGHSI and +200 CLFHSI compared to before.

addition (ASI, AGSI, potentially also ALSI, ALGSI) -- it should be possible to add them, I think (it may be a bit different since it would replace a read-modify-write)

These are already mostly handled, right? Exception is AHIMuxK which I couldn't commit since it is causing a test case to get a lot of uncoalesced COPYs as reported on Bugzilla. AGHIK could also be handled the same way.

Then there are a number of instructions that operate on a byte or halfword level in memory:

The only thing I can see is just a few cases of:
NIFMux -> NI (just a few cases with "small" immediate)
OILMux -> OI few cases with small immediate
XIFMux -> XI 10 cases with small immediate

Given it is rare it seems it can probably wait..

What remains (from looking at benchmarks) seems to be:

We can inspect the allocated physregs in foldMemoryOperandImpl(), so we could try to handle W... FP instructions. We could experiment with doing this only if there is a present allocation that fits, or what happens if we constrain it in case it is unallocated (to FP registers 0-15):

WFA[SDX]B op 1/2 -> two address A[DEX]BR if low16 vecregs -> A[DEX]B if dst same...
WFS[SDX]B                       S[DEX]BR                  -> S[DEX]B 
WFC[SDX]B        ->             C[EDX]BR                  -> C[EDX]B
WFD[SD]B         ->             D[SD]BR                   -> D[SD]B
WFMA[SD]B        ->             MA[ED]BR                  -> MA[ED]B
WFMS[SD]B        ->             MS[ED]BR                  -> MS[ED]B
WFM[SD]B         ->             M[ED]BR                   -> M[EE/D]B

MS(G)RKC -> MS(G) reg/mem mapping seems missing (4000 potential cases)
VLVGP (Ops 1, 2) (justa a dozen cases) -> 2 x VLEG
VLVGP32 (Ops 1, 2 same reg) (just ~20 cases) -> VLREPF ?
LFER / LEFR (88 cases)

When where 'Ops.size() == 2', for ops [0, 1]: A(G)RK, S(G)RK, NRK, ORK, WF[AS][DS]B, MS(G)RKC

Many of these have the same virt reg in operands 0 and 1, but not all.
Since we now don't use a reg/mem instruction, we do
<tmp> = reload
<tmp> = ARK <tmp>
store <tmp>

If we could use a temporary register in foldMemoryOperandImpl(), I think we instead could do
<tmp> = A
store <tmp>

Not sure if that's allowed...

Patch extended to also handle

MS(G)RKC -> MS(G)C

LEFR

Vector W... to fp-mem instructions: binary ops, fused ops, compares and unary ops.

The patch checks the other operands of the single-lane (W) vector instruction. If all are already allocated to an FP (<16) register, then a memfold pseudo for the mapped fp memory instruction is used (since the register allocator has already begun to spill, it seems less promising to try to constrain any not yet allocated register to FP registers and do the fold also in such cases).

New mappings are added from W... to FP instructions. Since any instruction that is mapped to a memory instruction in getMemOpcode() has to be correctly treated, I wanted to be sure that we only see expected opcodes by recognizing them all in foldMemoryOperandImpl() and then having a check to see that there are no stray vector instructions showing up. It seems to me now that it would probably be more reasonable to just check any instruction at that point for vector registers allocated >15 and in that case return nullptr. In other words trust the mapping in SystemZInstrFormats.td and allow any folding by it as long as there are no FP16-31 registers allocated. Waiting for some feedback before changing this...

The MemFoldCopies stat is still low and acceptable, I think: 37 on a SPEC 17 build. This is when the regalloc evicts one of the registers of the memfold pseudo and a COPY then later has to be built in SystemZPostRewrite.cpp.

Added commutation flags on WFA/WFM - I hope this is ok and also that there are no unwanted implications regarding fp semantics with this patch.

LIS->getRegUnit() will compute the LiveInterval for CC. This is now needed in two places so it was moved to the top of the function and so always called (should not be a compile time problem). Because of this, the kill flags on CC in int-cmp-56.mir are removed (which seems to always be done during regalloc).

I think this is most of what can be done for memory operand folding at the moment, looking at a SPEC 17 build.

jonpa marked an inline comment as done.Mar 23 2020, 9:11 AM

jonpa added inline comments.

llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
1125	This handles just some 30-40 cases so I am not sure how useful this is, but I suppose it can be. There are a few more (50?) unhandled cases of LEFR/LFER that appear to require things like new opcodes with special handlings or similar ('%gr64bit = LFER %vr32bit' for instance seems awkward to handle ...)

In D76055#1921620, @jonpa wrote:
logical compares (CLFHSI, CLGHSI) -- those should be trivial to add
Patch updated to include these as well, with a check for an uint<16> immediate. This gives +1800 CLGHSI and +200 CLFHSI compared to before.

OK, good!

addition (ASI, AGSI, potentially also ALSI, ALGSI) -- it should be possible to add them, I think (it may be a bit different since it would replace a read-modify-write)
These are already mostly handled, right? Exception is AHIMuxK which I couldn't commit since it is causing a test case to get a lot of uncoalesced COPYs as reported on Bugzilla. AGHIK could also be handled the same way.

Ah right, I missed those.

A(G)HIK is an interesting case; we already have code to check whether we can replace a three-operand with a two-operand opcode, but that is done *after* the A(G)HI check. I guess in the past, this wasn't very important since most operations were two-operand; but a while back, we changed that to start out everything as three-operand, and only switch back to two-operand later -- this might have caused a regression for foldMemoryOperandImpl ...

Not sure I understand the AHIMuxK problem. In order to do foldMemoryOperandImpl, we'd have do first to the three-operand to two-operand conversion like for A(G)HIK. And then it doesn't matter whether the spilled register is GR32 or GRH32 as we'd always leave it in memory and replace it with an ASI, right?

Then there are a number of instructions that operate on a byte or halfword level in memory:

The only thing I can see is just a few cases of:
NIFMux -> NI (just a few cases with "small" immediate)
OILMux -> OI few cases with small immediate
XIFMux -> XI 10 cases with small immediate

Given it is rare it seems it can probably wait..

Agreed. Thanks for checking!

What remains (from looking at benchmarks) seems to be:

We can inspect the allocated physregs in foldMemoryOperandImpl(), so we could try to handle W... FP instructions. We could experiment with doing this only if there is a present allocation that fits, or what happens if we constrain it in case it is unallocated (to FP registers 0-15):
WFA[SDX]B op 1/2 -> two address A[DEX]BR if low16 vecregs -> A[DEX]B if dst same...

Well ... even for the first step you already need the dst to be the same ... A[DEX]BR themselves are two-operand.

In any case, that seems a similar case to the three- vs. two-operand issue, except that now we'd need to constrain the regclass, so that might be an overall loss. Hard to say. Might be interesting to experiment.

MS(G)RKC -> MS(G) reg/mem mapping seems missing (4000 potential cases)

OK, need to watch out for CC though.

VLVGP (Ops 1, 2) (justa a dozen cases) -> 2 x VLEG

VLVGP32 (Ops 1, 2 same reg) (just ~20 cases) -> VLREPF ?

LFER / LEFR (88 cases)

Not sure whether this is worthwhile, but OK.

When where 'Ops.size() == 2', for ops [0, 1]: A(G)RK, S(G)RK, NRK, ORK, WF[AS][DS]B, MS(G)RKC

Many of these have the same virt reg in operands 0 and 1, but not all.
Since we now don't use a reg/mem instruction, we do
<tmp> = reload
<tmp> = ARK <tmp>
store <tmp>

If we could use a temporary register in foldMemoryOperandImpl(), I think we instead could do
<tmp> = A
store <tmp>

Not sure if that's allowed...

So this is the case where we have a three-operand instruction, and the output and one of the inputs are spilled (but not the same), right?

Can't we do the folding just on the input then, and leave the output as spilled? Not sure how this works with the foldMemoryOperandImpl logic ...

Oops, sorry, missed your update. Will look into that shortly.

The vector -> fp instruction part removed to be posted separately.

See inline comments. Otherwise this looks good, but I'd rather commit this as two separate patches (the memory-immediate changes in one, and the MS(G)RKC changes in the other).

llvm/lib/Target/SystemZ/SystemZInstrFormats.td
3138 ↗	(On Diff #252315)	Ugh, that's annoying. Can you at least move the mnemonic twiddling to the other side, i.e. to MemFoldPseudo? We already have similar twiddling for LOC vs SEL there.
llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
1125	It seems a bit odd to just handle LEFR/LFER (which are simply special forms of VLVGF/VLGVF) while not handling any of the regular VLVG/VLGV forms ... It's probably not worth having just the special case; maybe we should have logic to handle the generic case. But in any case this should also be a separate patch.
1228	I'm wondering if the should just have getTwoOperandOpcode return something for these instead. But it's probably not worth the complication to special-case the CC handling in ShortenInst ...

uweigand added inline comments.Mar 24 2020, 9:28 AM

llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
1228	Hmm. Thinking about this some more, I think what we're really testing for here is simple: does the MemOpcode need its output and first input operand to be tied or not. If yes, we need to verify that they can be tied (potentially via commutation). I believe it should be possibly to simply check for that property by looking at the MCInstrDest for MemOpcode.

MS(G)RKC -> MS(G)C part moved to https://reviews.llvm.org/D76771

jonpa added inline comments.Mar 25 2020, 6:58 AM

llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
1125	OK, removing that for now since that's relatively rare.
1228	That seems to work but then also the MemFoldPseudo needs to be looked-through.

LGTM, thanks!

This revision is now accepted and ready to land.Mar 25 2020, 7:22 AM

Closed by commit rGf09b891d4a3f: [SystemZ] Improve foldMemoryOperandImpl() (authored by jonpa). · Explain WhyMar 25 2020, 8:37 AM

This revision was automatically updated to reflect the committed changes.

Diff 249880

llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp

Show First 20 Lines • Show All 1,072 Lines • ▼ Show 20 Lines	MachineInstr *BuiltMI =
BuildMI(*InsertPt->getParent(), InsertPt, MI.getDebugLoc(), get(Opcode))		BuildMI(*InsertPt->getParent(), InsertPt, MI.getDebugLoc(), get(Opcode))
.addFrameIndex(FrameIndex)		.addFrameIndex(FrameIndex)
.addImm(0)		.addImm(0)
.addImm((int8_t)-MI.getOperand(2).getImm());		.addImm((int8_t)-MI.getOperand(2).getImm());
transferDeadCC(&MI, BuiltMI);		transferDeadCC(&MI, BuiltMI);
return BuiltMI;		return BuiltMI;
}		}

		unsigned MemImmOpc = 0;
		switch (Opcode) {
		case SystemZ::LHIMux:
		case SystemZ::LHI: MemImmOpc = SystemZ::MVHI; break;
		case SystemZ::LGHI: MemImmOpc = SystemZ::MVGHI; break;
		case SystemZ::CHIMux:
		case SystemZ::CHI: MemImmOpc = SystemZ::CHSI; break;
		case SystemZ::CGHI: MemImmOpc = SystemZ::CGHSI; break;
		default: break;
		}
		if (MemImmOpc)
		return BuildMI(*InsertPt->getParent(), InsertPt, MI.getDebugLoc(),
		get(MemImmOpc))
		.addFrameIndex(FrameIndex)
		.addImm(0)
		.addImm(MI.getOperand(1).getImm());

if (Opcode == SystemZ::LGDR \|\| Opcode == SystemZ::LDGR) {		if (Opcode == SystemZ::LGDR \|\| Opcode == SystemZ::LDGR) {
bool Op0IsGPR = (Opcode == SystemZ::LGDR);		bool Op0IsGPR = (Opcode == SystemZ::LGDR);
bool Op1IsGPR = (Opcode == SystemZ::LDGR);		bool Op1IsGPR = (Opcode == SystemZ::LDGR);
// If we're spilling the destination of an LDGR or LGDR, store the		// If we're spilling the destination of an LDGR or LGDR, store the
// source register instead.		// source register instead.
if (OpNum == 0) {		if (OpNum == 0) {
unsigned StoreOpcode = Op1IsGPR ? SystemZ::STG : SystemZ::STD;		unsigned StoreOpcode = Op1IsGPR ? SystemZ::STG : SystemZ::STD;
return BuildMI(*InsertPt->getParent(), InsertPt, MI.getDebugLoc(),		return BuildMI(*InsertPt->getParent(), InsertPt, MI.getDebugLoc(),
Show All 11 Lines	if (OpNum == 1) {
get(LoadOpcode))		get(LoadOpcode))
.add(MI.getOperand(0))		.add(MI.getOperand(0))
.addFrameIndex(FrameIndex)		.addFrameIndex(FrameIndex)
.addImm(0)		.addImm(0)
.addReg(0);		.addReg(0);
}		}
}		}

// Look for cases where the source of a simple store or the destination		// Look for cases where the source of a simple store or the destination
		jonpaAuthorUnsubmitted Done Reply Inline Actions This handles just some 30-40 cases so I am not sure how useful this is, but I suppose it can be. There are a few more (50?) unhandled cases of LEFR/LFER that appear to require things like new opcodes with special handlings or similar ('%gr64bit = LFER %vr32bit' for instance seems awkward to handle ...) jonpa: This handles just some 30-40 cases so I am not sure how useful this is, but I suppose it can be.
		uweigandUnsubmitted Done Reply Inline Actions It seems a bit odd to just handle LEFR/LFER (which are simply special forms of VLVGF/VLGVF) while not handling any of the regular VLVG/VLGV forms ... It's probably not worth having just the special case; maybe we should have logic to handle the generic case. But in any case this should also be a separate patch. uweigand: It seems a bit odd to just handle LEFR/LFER (which are simply special forms of VLVGF/VLGVF)…
		jonpaAuthorUnsubmitted Done Reply Inline Actions OK, removing that for now since that's relatively rare. jonpa: OK, removing that for now since that's relatively rare.
// of a simple load is being spilled. Try to use MVC instead.		// of a simple load is being spilled. Try to use MVC instead.
//		//
// Although MVC is in practice a fast choice in these cases, it is still		// Although MVC is in practice a fast choice in these cases, it is still
// logically a bytewise copy. This means that we cannot use it if the		// logically a bytewise copy. This means that we cannot use it if the
// load or store is volatile. We also wouldn't be able to use MVC if		// load or store is volatile. We also wouldn't be able to use MVC if
// the two memories partially overlap, but that case cannot occur here,		// the two memories partially overlap, but that case cannot occur here,
// because we know that one of the memories is a full frame index.		// because we know that one of the memories is a full frame index.
//		//
▲ Show 20 Lines • Show All 86 Lines • ▼ Show 20 Lines	MachineInstrBuilder MIB = BuildMI(*InsertPt->getParent(), InsertPt,
MI.getDebugLoc(), get(MemOpcode));		MI.getDebugLoc(), get(MemOpcode));
if (MI.isCompare()) {		if (MI.isCompare()) {
assert(NumOps == 2 && "Expected 2 register operands for a compare.");		assert(NumOps == 2 && "Expected 2 register operands for a compare.");
MIB.add(MI.getOperand(NeedsCommute ? 1 : 0));		MIB.add(MI.getOperand(NeedsCommute ? 1 : 0));
}		}
else {		else {
MIB.add(MI.getOperand(0));		MIB.add(MI.getOperand(0));
if (NeedsCommute)		if (NeedsCommute)
MIB.add(MI.getOperand(2));		MIB.add(MI.getOperand(2));
		uweigandUnsubmitted Done Reply Inline Actions I'm wondering if the should just have getTwoOperandOpcode return something for these instead. But it's probably not worth the complication to special-case the CC handling in ShortenInst ... uweigand: I'm wondering if the should just have getTwoOperandOpcode return something for these instead.
		uweigandUnsubmitted Done Reply Inline Actions Hmm. Thinking about this some more, I think what we're really testing for here is simple: does the MemOpcode need its output and first input operand to be tied or not. If yes, we need to verify that they can be tied (potentially via commutation). I believe it should be possibly to simply check for that property by looking at the MCInstrDest for MemOpcode. uweigand: Hmm. Thinking about this some more, I think what we're really testing for here is simple: does…
		jonpaAuthorUnsubmitted Done Reply Inline Actions That seems to work but then also the MemFoldPseudo needs to be looked-through. jonpa: That seems to work but then also the MemFoldPseudo needs to be looked-through.
else		else
for (unsigned I = 1; I < OpNum; ++I)		for (unsigned I = 1; I < OpNum; ++I)
MIB.add(MI.getOperand(I));		MIB.add(MI.getOperand(I));
}		}
MIB.addFrameIndex(FrameIndex).addImm(Offset);		MIB.addFrameIndex(FrameIndex).addImm(Offset);
if (MemDesc.TSFlags & SystemZII::HasIndex)		if (MemDesc.TSFlags & SystemZII::HasIndex)
MIB.addReg(0);		MIB.addReg(0);
if (CCOperands) {		if (CCOperands) {
▲ Show 20 Lines • Show All 660 Lines • Show Last 20 Lines

llvm/test/CodeGen/SystemZ/foldmemop-imm.ll

This file was added.

				; RUN: llc < %s -mtriple=s390x-linux-gnu -O3 -mcpu=z10 \| FileCheck %s
				; RUN: llc < %s -mtriple=s390x-linux-gnu -O3 -mcpu=z14 \| FileCheck %s
				;
				; Test folding of spilled immediate loads and compares.

				define i32 @fun0(i32 *%src, i32 %arg) nounwind {
				; CHECK-LABEL: fun0:
				; CHECK: mvhi 164(%r15), 0 # 4-byte Folded Spill
				; CHECK: mvc 164(4,%r15), 0(%r2) # 4-byte Folded Spill
				; CHECK-LABEL: .LBB0_2:
				; CHECK: chsi 164(%r15), 2 # 4-byte Folded Reload

				entry:
				%cmp = icmp eq i32 %arg, 0
				br i1 %cmp, label %cond, label %exit

				cond:
				%val0 = load i32, i32 *%src
				call void asm sideeffect "", "~{r0},~{r1},~{r2},~{r3},~{r4},~{r5},~{r6},~{r7},~{r8},~{r9},~{r10},~{r11},~{r12},~{r13},~{r14},~{r15}"() nounwind
				br label %exit

				exit:
				%tmp0 = phi i32 [0, %entry], [%val0, %cond]
				call void asm sideeffect "", "~{r0},~{r1},~{r2},~{r3},~{r4},~{r5},~{r6},~{r7},~{r8},~{r9},~{r10},~{r11},~{r12},~{r13},~{r14},~{r15}"() nounwind
				%cmp0 = icmp ne i32 %tmp0, 2
				%zxt0 = zext i1 %cmp0 to i32
				%and0 = and i32 %arg, %zxt0

				ret i32 %and0
				}

				define i64 @fun1(i64 *%src, i64 %arg) nounwind {
				; CHECK-LABEL: fun1:
				; CHECK: mvghi 168(%r15), 0 # 8-byte Folded Spill
				; CHECK: mvc 168(8,%r15), 0(%r2) # 8-byte Folded Spill
				; CHECK-LABEL: .LBB1_2:
				; CHECK: cghsi 168(%r15), 2 # 8-byte Folded Reload
				entry:
				%cmp = icmp eq i64 %arg, 0
				br i1 %cmp, label %cond, label %exit

				cond:
				%val0 = load i64, i64 *%src
				call void asm sideeffect "", "~{r0},~{r1},~{r2},~{r3},~{r4},~{r5},~{r6},~{r7},~{r8},~{r9},~{r10},~{r11},~{r12},~{r13},~{r14},~{r15}"() nounwind
				br label %exit

				exit:
				%tmp0 = phi i64 [0, %entry], [%val0, %cond]
				call void asm sideeffect "", "~{r0},~{r1},~{r2},~{r3},~{r4},~{r5},~{r6},~{r7},~{r8},~{r9},~{r10},~{r11},~{r12},~{r13},~{r14},~{r15}"() nounwind
				%cmp0 = icmp ne i64 %tmp0, 2
				%zxt0 = zext i1 %cmp0 to i64
				%and0 = and i64 %arg, %zxt0

				ret i64 %and0
				}

This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Improve foldMemoryOperandImpl().
ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 249880

llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp

llvm/test/CodeGen/SystemZ/foldmemop-imm.ll

This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Improve foldMemoryOperandImpl().ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 249880

llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp

llvm/test/CodeGen/SystemZ/foldmemop-imm.ll

[SystemZ] Improve foldMemoryOperandImpl().
ClosedPublic