This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Improve foldMemoryOperandImpl: vec->FP conversions
ClosedPublic

Authored by jonpa on Mar 24 2020, 7:35 AM.

Details

Summary

(copied from original post:)

The patch checks the other operands of the single-lane (W) vector instruction. If all are already allocated to an FP (<16) register, then a memfold pseudo for the mapped fp memory instruction is used (since the register allocator has already begun to spill, it seems less promising to try to constrain any not yet allocated register to FP registers and do the fold also in such cases).

New mappings are added from W... to FP instructions. Since any instruction that is mapped to a memory instruction in getMemOpcode() has to be correctly treated, I wanted to be sure that we only see expected opcodes by recognizing them all in foldMemoryOperandImpl() and then having a check to see that there are no stray vector instructions showing up. It seems to me now that it would probably be more reasonable to just check any instruction at that point for vector registers allocated >15 and in that case return nullptr. In other words trust the mapping in SystemZInstrFormats.td and allow any folding by it as long as there are no FP16-31 registers allocated. Waiting for some feedback before changing this...

The MemFoldCopies stat is still low and acceptable, I think: 37 on a SPEC 17 build. This is when the regalloc evicts one of the registers of the memfold pseudo and a COPY then later has to be built in SystemZPostRewrite.cpp.

Added commutation flags on WFA/WFM - I hope this is ok and also that there are no unwanted implications regarding fp semantics with this patch.

LIS->getRegUnit() will compute the LiveInterval for CC. This is now needed in two places so it was moved to the top of the function and so always called (should not be a compile time problem). Because of this, the kill flags on CC in int-cmp-56.mir are removed (which seems to always be done during regalloc).

Diff Detail

Event Timeline

jonpa created this revision.Mar 24 2020, 7:35 AM
jonpa added a comment.Mar 24 2020, 7:37 AM

This is a patch on-top of the other part of this improvement, so the diff is not against trunk exclusively, although nearly all of it should be...

jonpa updated this revision to Diff 252566.Mar 25 2020, 6:58 AM

Patch rebased

jonpa updated this revision to Diff 253897.Mar 31 2020, 8:27 AM

patch rebased.

jonpa updated this revision to Diff 258929.Apr 21 2020, 2:00 AM
  • Added a new test case that shows the use of a VR128 virtual register operand. This is fun12 in foldmemop-vec-cmp.mir, where a float 1.0 is generated with a VGMF.
  • As discussed, the new mapping between vector instructions and FP instructions is closely related to an already existing mapping in SystemZShortenInst.cpp, and it might be nice to merge them.

The patch currently adds this type of mapping:

WFADB -> ADB_MemFoldPseudo -> ADB
WFCDB                      -> CDB
WFSQDB                     -> SQDB

SystemZShorten uses a mapping to the FP register opcode:

WFADB  -> ADBR
WFCDB  -> CDBR
WFSQDB -> SQDBR

If we added a generated function getRegOpcode(), we could in Shortening use that to do

WFADB -> ADB_MemFoldPseudo -> ADB -> ADBR

If we instead replaced the mapping in Shortening with a generated function getFPOpcode(), foldMemoryOperandImpl() would need a new getPseudoMemOpcode() function to do:

WFADB -> ADBR -> ADB -> ADB_MemFoldPseudo

The existing getTargetMemOpcode() would still be needed in PostRewrite.

On the other hand, looking at this patch it might be possible to skip the added opcode checks and deduce that information for foldMemoryOperandImpl() looking at the register classes and number of operands.

jonpa updated this revision to Diff 260071.Apr 24 2020, 11:45 PM

Instead of checking a lot of newly mapped opcodes, use a general heuristic that simply demands that any operand which is a vector register per the descriptor must be allocated to an FP register. The opcodes for the fused fp instructions are still checked for explicitly, even though it might be ok to count the number of operands and their register classes.

Some minor fixing as well compared to last time.

On SPEC'17, I see:

ld             :                70304                67606    -2698
adb            :                13118                14401    +1283
lghi           :               463193               464385    +1192
lay            :                71431                70314    -1117
sdb            :                 4039                 5063    +1024
sdbr           :                 7641                 6640    -1001
adbr           :                20458                19473     -985
mdb            :                17530                17927     +397
wfmdb          :                39703                39374     -329
wfadb          :                24564                24266     -298
lde            :                52890                52603     -287
vlrepg         :                19277                19187      -90
aeb            :                10104                10188      +84
meeb           :                18323                18404      +81
ddb            :                  676                  755      +79
ldy            :                 3602                 3526      -76
ddbr           :                 2736                 2661      -75
cdb            :                 3556                 3627      +71
cebr           :                15036                14966      -70
...

, with -ffp-contract=fast:

ld             :                77018                75490    -1528
adb            :                 9949                10576     +627
adbr           :                 8877                 8369     -508
madb           :                 6408                 6879     +471
madbr          :                 9262                 8895     -367
lghi           :               463190               463500     +310
lay            :                72607                72306     -301
lde            :                56246                55963     -283
mdb            :                 7795                 8037     +242
wfmdb          :                23881                23686     -195
wfadb          :                11031                10912     -119
wfmadb         :                18341                18237     -104
cdb            :                 3556                 3638      +82
ddbr           :                 2723                 2641      -82
ddb            :                  676                  758      +82
cdbr           :                12034                11957      -77
ceb            :                 8184                 8257      +73
cebr           :                15044                14971      -73
vlrepg         :                19038                18970      -68
...
jonpa updated this revision to Diff 260631.Apr 28 2020, 7:30 AM

Remove new _r instruction classes. This did not change the resulting tablegen mappings.

Herald added a project: Restricted Project. · View Herald TranscriptApr 28 2020, 7:30 AM
jonpa added a comment.Apr 28 2020, 7:39 AM

Maybe there should be a test that checks that a live CC value isn't clobbered by introducing an FP-mem instruction, such as ADB...

Maybe there should be a test that checks that a live CC value isn't clobbered by introducing an FP-mem instruction, such as ADB...

Yes, of course -- but isn't that already in your patch (line 1184)?

Looks good to me now, just one cosmetic comment inline.

And also a question about this:

LIS->getRegUnit() will compute the LiveInterval for CC. This is now needed in two places so it was moved to the top of the function and so always called (should not be a compile time problem). Because of this, the kill flags on CC in int-cmp-56.mir are removed (which seems to always be done during regalloc).

I don't think I understand this -- looking at the code in question, CC does appear to be killed in that instruction. Is this not right? If it is, why does recomputing live intervals remove the flag?

llvm/lib/Target/SystemZ/SystemZInstrFormats.td
2923

So it is a bit weird that the vec->fp renaming is done here for Unary and Compare, but is not done (and instead done in the reverse direction by the MemFold pseudos) for Binary and Ternary. It would be nicer if this were symmetric.

jonpa updated this revision to Diff 260644.Apr 28 2020, 8:17 AM

Added test case for a live CC: foldmemop-vec-cc.mir

jonpa updated this revision to Diff 260688.Apr 28 2020, 10:20 AM

Yes, of course -- but isn't that already in your patch (line 1184)?

Ah - I meant a test as in "test case", which I could not find in the new tests...

I don't think I understand this -- looking at the code in question, CC does appear to be killed in that instruction. Is this not right? If it is, why does recomputing live intervals remove the flag?

It will be eventually added again after regalloc (not until PostRA Machine Scheduler since it is not allocatable), but this particular test case only runs the greedy pass so that will be seen in the output.

So it is a bit weird that the vec->fp renaming is done here for Unary and Compare, but is not done (and instead done in the reverse direction by the MemFold pseudos) for Binary and Ternary. It would be nicer if this were symmetric.

It seems that (as long as we don't want to change existing mappings on the scalar instructions)

WFSQDB has to change the name since SQDBR is already mapped to SQDB:

SQDBR ->OpKey:SQDBR-> SQDB
WFSQDB ->OpKey:SQDBR-> SQDB

So in trying to do the OpKey processing on the register instruction side to make it more symmetric, I find that

ADBR is mapped to ADB, via OpKey:ADBR.

In order to map WFADB to ADB_MemFoldPseudo, I needed to remove "wf" and also use a "MemFold" string, since the plain "ADB" would map it to the target memory instruction.

The ternary instructions also similarly needed a "MemFold" key substring.

Is this better than previously?

> Is this better than previously?

Well, not really :-)

Maybe the cleanest way would be to actually have MemFold pseudos for the *vector* instructions, and have those handle the remapping. So for example:

WFADB -> OpKey:WFADB -> WFADB_MemFold -> MemKey:ADB -> ADB
and then just the same for
WFSQDB ->OpKey:WFSQDB->WFSQDB_MemFold -> MemKey:SQDB -> SQDB

Does this make sense?

jonpa updated this revision to Diff 260925.Apr 29 2020, 8:23 AM

Does this make sense?

I tried this... One drawback is that the flags of the memfold pseudo must now be carefully transferred from the target memory instruction. I found that only ADB and SDB sets CC, while MDB and DDB do not - it was nice before in a way to have all of that defined in only one place. So I had to use one more class that defines a pseudo that sets CC... I also had to cancel out the isCommutableFlag on the memfold-pseudo.

I am not sure exactly which way is the cleanest - what do you think?

jonpa updated this revision to Diff 262103.May 5 2020, 7:13 AM

Reverted back to the first version of new instruction classes providing the mapping from W...-reg -> FP-mem instructions.

Reverted back to the first version of new instruction classes providing the mapping from W...-reg -> FP-mem instructions.

Thanks. Let my try one final attempt to simply this confusing logic :-)

Right now, we have:

  • For UnaryVRRa and CompareVRRa, the OpKey holds the FP-reg instruction name (e.g. wldeb->ldebr or wfsqsb->sqebr), which the matches the "mem" OpKey for the normal FP-mem instruction.
  • For BinaryVRRa and TernaryVRRe, the OpKey holds the W-reg instruction name, which matches the "mem" OpKey of the MemFold-pseudo, which has the FP-reg instruction name as MemKey.

Can we instead move the W->FP mnemonic conversion back into BinaryVRRa/TernaryVRRe instead, so that the OpKey for all these vector instructions consistently holds the FP-reg instruction name? The MemFold-pseudos would then have the same mnemonic as OpKey and MemKey.

Because that conversion is still awkward as it has to handle many not-quite-consistent name combinations, maybe we should then even make this explicit by giving UnaryVRRa and friends an optional "fp_mnemonic" operand, so that the mapping is fully explicit in the main .td file, e.g. like so:

def WLDEB : UnaryVRRa<"wldeb", 0xE7C4, any_fpextend, v64db, v32sb, 2, 8, "ldebr">;
def WFSQSB : UnaryVRRa<"wfsqsb", 0xE7CE, any_fsqrt, v32sb, v32sb, 2, 8, "sqebr">;

Ah, that still runs into the same name collisions with the MemFold pattern as you pointed out earlier. Sorry for missing that!

So if we use an extra modifier in the OpKey for those case, then we're basically back to your version of Apr 28, 7:15 PM, except possibly with the explicit mnemonics in the .td file.

Hmm, talking about MemFold, I'm wondering about this:
​ // Fused multiply and add/sub need to have the same dst and accumulator reg.

Given that this check is necessary, what's then the point of having a MemFold for those ternary instructions in the first place? Can't we then directly emit the target instruction?

jonpa updated this revision to Diff 263157.May 11 2020, 6:29 AM

Can we instead move the W->FP mnemonic conversion back into BinaryVRRa/TernaryVRRe instead, so that the OpKey for all these vector instructions consistently holds the FP-reg instruction name? The MemFold-pseudos would then have the same mnemonic as OpKey and MemKey.
... Ah, that still runs into the same name collisions with the MemFold pattern as you pointed out earlier. Sorry for missing that!

Yeah, WFADB:ADBR would map to ADB, and not ADB_MemFoldPseudo which is what we need...

So if we use an extra modifier in the OpKey for those case, then we're basically back to your version of Apr 28, 7:15 PM, except possibly with the explicit mnemonics in the .td file.

Ok, went back to that version plus added the explicit fp_mnemonic fields, which seems nice to me as they eliminate those ugly string substitutions.

Hmm, talking about MemFold, I'm wondering about this:

​> // Fused multiply and add/sub need to have the same dst and accumulator reg.

Given that this check is necessary, what's then the point of having a MemFold for those ternary instructions in the first place? Can't we then directly emit the target instruction?

I believe this follows the same pattern as the other MemFold cases: A WFMADB has four register operands, which all may be different virtual registers at this point. In the case where we find that DstReg and AccReg has already been allocated to the same physical register, we proceed with a MemFold pseudo. The register allocator may still however evict live ranges and reallocate these operands, which is why we can't go directly to MADB.

uweigand accepted this revision.May 11 2020, 8:20 AM

Can we instead move the W->FP mnemonic conversion back into BinaryVRRa/TernaryVRRe instead, so that the OpKey for all these vector instructions consistently holds the FP-reg instruction name? The MemFold-pseudos would then have the same mnemonic as OpKey and MemKey.
... Ah, that still runs into the same name collisions with the MemFold pattern as you pointed out earlier. Sorry for missing that!

Yeah, WFADB:ADBR would map to ADB, and not ADB_MemFoldPseudo which is what we need...

So if we use an extra modifier in the OpKey for those case, then we're basically back to your version of Apr 28, 7:15 PM, except possibly with the explicit mnemonics in the .td file.

Ok, went back to that version plus added the explicit fp_mnemonic fields, which seems nice to me as they eliminate those ugly string substitutions.

Yes, this does look a lot more reasonable. Thanks for going through all those iterations!

Hmm, talking about MemFold, I'm wondering about this:

​> // Fused multiply and add/sub need to have the same dst and accumulator reg.

Given that this check is necessary, what's then the point of having a MemFold for those ternary instructions in the first place? Can't we then directly emit the target instruction?

I believe this follows the same pattern as the other MemFold cases: A WFMADB has four register operands, which all may be different virtual registers at this point. In the case where we find that DstReg and AccReg has already been allocated to the same physical register, we proceed with a MemFold pseudo. The register allocator may still however evict live ranges and reallocate these operands, which is why we can't go directly to MADB.

Ah, I see. Makes sense.

See one final inline comment, otherwise this now looks good to me.

llvm/lib/Target/SystemZ/SystemZInstrFormats.td
4426

That first line setting OpKey seems superfluous?

This revision is now accepted and ready to land.May 11 2020, 8:20 AM
jonpa marked 2 inline comments as done.May 12 2020, 12:24 AM
jonpa added inline comments.
llvm/lib/Target/SystemZ/SystemZInstrFormats.td
4426

oh, sorry...

Thanks for review!

This revision was automatically updated to reflect the committed changes.
jonpa marked an inline comment as done.