This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Improve foldMemoryOperandImpl().
ClosedPublic

Authored by jonpa on Mar 12 2020, 3:22 AM.

Details

Summary

A spilled load of an immediate can use MVHI/MVGHI instead.
A compare of a spilled register against an immediate can use CHSI/CGHSI.

On SPEC 17:                          trunk <> patched

chsi           :                53231                59356    +6125
lt             :                19324                14060    -5264
cghsi          :                29368                34598    +5230
ltg            :               166914               161949    -4965
mvhi           :                29323                33923    +4600
lhi            :               262083               257623    -4460
st             :               181993               177559    -4434
mvghi          :                54650                58599    +3949
stg            :               409267               405380    -3887
lghi           :               467915               464077    -3838
l              :               231431               230640     -791
jlh            :               178461               178961     +500
lg             :              1093362              1092985     -377
cijlh          :                83235                82875     -360
je             :               340896               341233     +337
cije           :               111150               110969     -181
jl             :                52685                52808     +123
chi            :                60634                60530     -104
cijl           :                13434                13358      -76

Since LT/LTG and LHI/LGHI use a register write and an extra instruction, while CHSI/CGSI and MVHI/MVGHI do not, this should be a general improvement. I didn't see any big change in spilling/reloading, though (in fact a very slight increase in number of instructions which is probably related to later optimizations).

This is the remaining improvements I could see while looking at imagick. It seems to improve it maybe yet another percent or so.

Diff Detail

Event Timeline

jonpa created this revision.Mar 12 2020, 3:22 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 12 2020, 3:22 AM

Ah, that's a good idea. I agree this makes sense.

I'm wondering if there are other memory-and-immediate instructions that could be handled like that. Looking at the list, I see:

  • logical compares (CLFHSI, CLGHSI) -- those should be trivial to add
  • addition (ASI, AGSI, potentially also ALSI, ALGSI) -- it should be possible to add them, I think (it may be a bit different since it would replace a read-modify-write)

Then there are a number of instructions that operate on a byte or halfword level in memory:

  • MVI, MVHHI,
  • CLI, CHHSI, CLHHSI
  • NI, OI, XI

I'm not sure if there is an option to exploit those, given that LLVM will never spill 8- or 16-bit values. But maybe there could be some opportunities in special cases? E.g. an OILL with a small constant could be replaced by an OI targeting just the low byte of the in-memory value.

Maybe this would be something to think about for future extensions.

jonpa updated this revision to Diff 250206.Mar 13 2020, 8:00 AM
logical compares (CLFHSI, CLGHSI) -- those should be trivial to add

Patch updated to include these as well, with a check for an uint<16> immediate. This gives +1800 CLGHSI and +200 CLFHSI compared to before.

addition (ASI, AGSI, potentially also ALSI, ALGSI) -- it should be possible to add them, I think (it may be a bit different since it would replace a read-modify-write)

These are already mostly handled, right? Exception is AHIMuxK which I couldn't commit since it is causing a test case to get a lot of uncoalesced COPYs as reported on Bugzilla. AGHIK could also be handled the same way.

Then there are a number of instructions that operate on a byte or halfword level in memory:

The only thing I can see is just a few cases of:
NIFMux -> NI (just a few cases with "small" immediate)
OILMux -> OI few cases with small immediate
XIFMux -> XI 10 cases with small immediate

Given it is rare it seems it can probably wait..

What remains (from looking at benchmarks) seems to be:

  • We can inspect the allocated physregs in foldMemoryOperandImpl(), so we could try to handle W... FP instructions. We could experiment with doing this only if there is a present allocation that fits, or what happens if we constrain it in case it is unallocated (to FP registers 0-15):
WFA[SDX]B op 1/2 -> two address A[DEX]BR if low16 vecregs -> A[DEX]B if dst same...
WFS[SDX]B                       S[DEX]BR                  -> S[DEX]B 
WFC[SDX]B        ->             C[EDX]BR                  -> C[EDX]B
WFD[SD]B         ->             D[SD]BR                   -> D[SD]B
WFMA[SD]B        ->             MA[ED]BR                  -> MA[ED]B
WFMS[SD]B        ->             MS[ED]BR                  -> MS[ED]B
WFM[SD]B         ->             M[ED]BR                   -> M[EE/D]B
  • MS(G)RKC -> MS(G) reg/mem mapping seems missing (4000 potential cases)
  • VLVGP (Ops 1, 2) (justa a dozen cases) -> 2 x VLEG
  • VLVGP32 (Ops 1, 2 same reg) (just ~20 cases) -> VLREPF ?
  • LFER / LEFR (88 cases)
  • When where 'Ops.size() == 2', for ops [0, 1]: A(G)RK, S(G)RK, NRK, ORK, WF[AS][DS]B, MS(G)RKC

Many of these have the same virt reg in operands 0 and 1, but not all.
Since we now don't use a reg/mem instruction, we do
<tmp> = reload
<tmp> = ARK <tmp>
store <tmp>

If we could use a temporary register in foldMemoryOperandImpl(), I think we instead could do
<tmp> = A
store <tmp>

Not sure if that's allowed...

jonpa updated this revision to Diff 252055.Mar 23 2020, 8:47 AM

Patch extended to also handle

  • MS(G)RKC -> MS(G)C
  • LEFR
  • Vector W... to fp-mem instructions: binary ops, fused ops, compares and unary ops.

The patch checks the other operands of the single-lane (W) vector instruction. If all are already allocated to an FP (<16) register, then a memfold pseudo for the mapped fp memory instruction is used (since the register allocator has already begun to spill, it seems less promising to try to constrain any not yet allocated register to FP registers and do the fold also in such cases).

New mappings are added from W... to FP instructions. Since any instruction that is mapped to a memory instruction in getMemOpcode() has to be correctly treated, I wanted to be sure that we only see expected opcodes by recognizing them all in foldMemoryOperandImpl() and then having a check to see that there are no stray vector instructions showing up. It seems to me now that it would probably be more reasonable to just check any instruction at that point for vector registers allocated >15 and in that case return nullptr. In other words trust the mapping in SystemZInstrFormats.td and allow any folding by it as long as there are no FP16-31 registers allocated. Waiting for some feedback before changing this...

The MemFoldCopies stat is still low and acceptable, I think: 37 on a SPEC 17 build. This is when the regalloc evicts one of the registers of the memfold pseudo and a COPY then later has to be built in SystemZPostRewrite.cpp.

Added commutation flags on WFA/WFM - I hope this is ok and also that there are no unwanted implications regarding fp semantics with this patch.

LIS->getRegUnit() will compute the LiveInterval for CC. This is now needed in two places so it was moved to the top of the function and so always called (should not be a compile time problem). Because of this, the kill flags on CC in int-cmp-56.mir are removed (which seems to always be done during regalloc).

I think this is most of what can be done for memory operand folding at the moment, looking at a SPEC 17 build.

jonpa marked an inline comment as done.Mar 23 2020, 9:11 AM
jonpa added inline comments.
llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
1134

This handles just some 30-40 cases so I am not sure how useful this is, but I suppose it can be.

There are a few more (50?) unhandled cases of LEFR/LFER that appear to require things like new opcodes with special handlings or similar ('%gr64bit = LFER %vr32bit' for instance seems awkward to handle ...)

logical compares (CLFHSI, CLGHSI) -- those should be trivial to add

Patch updated to include these as well, with a check for an uint<16> immediate. This gives +1800 CLGHSI and +200 CLFHSI compared to before.

OK, good!

addition (ASI, AGSI, potentially also ALSI, ALGSI) -- it should be possible to add them, I think (it may be a bit different since it would replace a read-modify-write)

These are already mostly handled, right? Exception is AHIMuxK which I couldn't commit since it is causing a test case to get a lot of uncoalesced COPYs as reported on Bugzilla. AGHIK could also be handled the same way.

Ah right, I missed those.

A(G)HIK is an interesting case; we already have code to check whether we can replace a three-operand with a two-operand opcode, but that is done *after* the A(G)HI check. I guess in the past, this wasn't very important since most operations were two-operand; but a while back, we changed that to start out everything as three-operand, and only switch back to two-operand later -- this might have caused a regression for foldMemoryOperandImpl ...

Not sure I understand the AHIMuxK problem. In order to do foldMemoryOperandImpl, we'd have do first to the three-operand to two-operand conversion like for A(G)HIK. And then it doesn't matter whether the spilled register is GR32 or GRH32 as we'd always leave it in memory and replace it with an ASI, right?

Then there are a number of instructions that operate on a byte or halfword level in memory:

The only thing I can see is just a few cases of:
NIFMux -> NI (just a few cases with "small" immediate)
OILMux -> OI few cases with small immediate
XIFMux -> XI 10 cases with small immediate

Given it is rare it seems it can probably wait..

Agreed. Thanks for checking!

What remains (from looking at benchmarks) seems to be:

  • We can inspect the allocated physregs in foldMemoryOperandImpl(), so we could try to handle W... FP instructions. We could experiment with doing this only if there is a present allocation that fits, or what happens if we constrain it in case it is unallocated (to FP registers 0-15):
WFA[SDX]B op 1/2 -> two address A[DEX]BR if low16 vecregs -> A[DEX]B if dst same...

Well ... even for the first step you already need the dst to be the same ... A[DEX]BR themselves are two-operand.

In any case, that seems a similar case to the three- vs. two-operand issue, except that now we'd need to constrain the regclass, so that might be an overall loss. Hard to say. Might be interesting to experiment.

  • MS(G)RKC -> MS(G) reg/mem mapping seems missing (4000 potential cases)

OK, need to watch out for CC though.

  • VLVGP (Ops 1, 2) (justa a dozen cases) -> 2 x VLEG
  • VLVGP32 (Ops 1, 2 same reg) (just ~20 cases) -> VLREPF ?
  • LFER / LEFR (88 cases)

Not sure whether this is worthwhile, but OK.

  • When where 'Ops.size() == 2', for ops [0, 1]: A(G)RK, S(G)RK, NRK, ORK, WF[AS][DS]B, MS(G)RKC

Many of these have the same virt reg in operands 0 and 1, but not all.
Since we now don't use a reg/mem instruction, we do
<tmp> = reload
<tmp> = ARK <tmp>
store <tmp>

If we could use a temporary register in foldMemoryOperandImpl(), I think we instead could do
<tmp> = A
store <tmp>

Not sure if that's allowed...

So this is the case where we have a three-operand instruction, and the output and one of the inputs are spilled (but not the same), right?

Can't we do the folding just on the input then, and leave the output as spilled? Not sure how this works with the foldMemoryOperandImpl logic ...

Oops, sorry, missed your update. Will look into that shortly.

jonpa updated this revision to Diff 252315.Mar 24 2020, 7:33 AM

The vector -> fp instruction part removed to be posted separately.

See inline comments. Otherwise this looks good, but I'd rather commit this as two separate patches (the memory-immediate changes in one, and the MS(G)RKC changes in the other).

llvm/lib/Target/SystemZ/SystemZInstrFormats.td
3138 ↗(On Diff #252315)

Ugh, that's annoying. Can you at least move the mnemonic twiddling to the other side, i.e. to MemFoldPseudo? We already have similar twiddling for LOC vs SEL there.

llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
1134

It seems a bit odd to just handle LEFR/LFER (which are simply special forms of VLVGF/VLGVF) while not handling any of the regular VLVG/VLGV forms ...

It's probably not worth having just the special case; maybe we should have logic to handle the generic case. But in any case this should also be a separate patch.

1228

I'm wondering if the should just have getTwoOperandOpcode return something for these instead.

But it's probably not worth the complication to special-case the CC handling in ShortenInst ...

uweigand added inline comments.Mar 24 2020, 9:28 AM
llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
1228

Hmm. Thinking about this some more, I think what we're really testing for here is simple: does the MemOpcode need its output and first input operand to be tied or not. If yes, we need to verify that they can be tied (potentially via commutation).

I believe it should be possibly to simply check for that property by looking at the MCInstrDest for MemOpcode.

jonpa updated this revision to Diff 252565.Mar 25 2020, 6:54 AM
jonpa marked 6 inline comments as done.

MS(G)RKC -> MS(G)C part moved to https://reviews.llvm.org/D76771

jonpa added inline comments.Mar 25 2020, 6:58 AM
llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
1134

OK, removing that for now since that's relatively rare.

1228

That seems to work but then also the MemFoldPseudo needs to be looked-through.

uweigand accepted this revision.Mar 25 2020, 7:22 AM

LGTM, thanks!

This revision is now accepted and ready to land.Mar 25 2020, 7:22 AM
This revision was automatically updated to reflect the committed changes.