Page MenuHomePhabricator

[SystemZ] Generate XC loop for memset 0 of variable length.
ClosedPublic

Authored by jonpa on Jun 7 2021, 5:45 PM.

Details

Summary

Tried first inserting new MBBs for the EXRL and target instruction, but that has some problems:

  • Need a guarantee that the target instruction is not moved by later passes. hasSideEffects flag and surrounding the target instruction with MemBarrier:s might should do the trick but that is a bit clumsy. And to add hasSideEffects on all MemorySS instructions will cause many files to change since that will cause a new scheduler region to be created.
  • The MBB of the target instruction should be reachable, or I think it might be simply removed by some optimizer. In order to maintain CFG edges between the EXRL MBB and TargetIns MBB, analyzeBranch() needs to detect these blocks and return them as unanalyzable, since EXRL is not really a branch or terminator. This is also a bit of extra work...

I Instead tried using an EXRL_Pseudo that is kept intact all the way to AsmPrinter. My newbie approach of creating symbols and MCInsts and then emit them at the end of the function seems to work fine both for assembly and obj streaming. Not sure if there is another more correct way...

A new method emitMemMemLoopVarLen() which shares code with emitMemMemWrapper() via new class MemMemBuilder (since there is no public default constructor for MachineOperand the DestBase and SrcBase must be passed to its constructor rather than via a single MI argument).

The target instruction will never be out-of-range for the EXRL instruction with a 32 bit signed range, right?

Pardon the typo-fixes mixed in here (MMB -> MBB). Maybe pre-commit them?

Diff Detail

Event Timeline

jonpa created this revision.Jun 7 2021, 5:45 PM
jonpa requested review of this revision.Jun 7 2021, 5:45 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 7 2021, 5:45 PM
jonpa added a comment.Jun 11 2021, 5:37 PM

There is something wrong somewhere here.. some benchmarks are seg-faulting. investigating

jonpa updated this revision to Diff 351673.Jun 12 2021, 10:17 AM

I think the problem was that EXRL cannot use %r0 as that means no or:ing will take place.

jonpa added a comment.Jun 15 2021, 6:49 AM

Benchmarking:

Z14:

reduced:

Improvements:
0.984: f511.povray_r 
0.988: f510.parest_r 
0.992: i523.xalancbmk_r 
0.993: i531.deepsjeng_r 
0.993: f508.namd_r 
...

Regressions:
1.027: i557.xz_r 
1.006: f507.cactuBSSN_r 
...

full (4):

Improvements:
0.990: f511.povray_r 
0.991: i557.xz_r 
0.991: f510.parest_r 
0.998: i523.xalancbmk_r

Z15:

reduced:

Improvements:
0.823: i531.deepsjeng_r 
0.985: f538.imagick_r 
0.990: i500.perlbench_r 
0.993: f510.parest_r 

Regressions:
(none)

full (4):

Improvements:
0.994: i531.deepsjeng_r 
0.995: f510.parest_r 
0.997: f538.imagick_r 

Regressions:
1.002: i500.perlbench_r

This looks pretty good to me. One thing that is odd is the extra always-zero parameter to the LoopVarLen insns. This really should be removed.

As a suggestion that might remove a bit of the duplication between emitMemMemWrapper and emitMemMemLoopVarLen: the (old) routine is currently called for both the Sequence and Loop insns, and it distinguishes between them based on the operand count. However, for the LoopVarLen case, your patch now introduces a separate new routine. I think it might be simpler to continue to use the same routine, and just internally distinguish between the cases based on whether the length operand is an immediate or a register.

Finally, I'm wondering if it wouldn't make sense to extend support for the variable-length case to the other block operations, now that we already have the EXRL_Pseudo infrastructure. (I guess that could also be done in a separate patch.)

Pardon the typo-fixes mixed in here (MMB -> MBB). Maybe pre-commit them?

Yes, please pre-commit those.

llvm/lib/Target/SystemZ/SystemZAsmPrinter.h
14

The above clang-format check looks valid (includes should be kept sorted).

29

This as well.

jonpa updated this revision to Diff 353483.Jun 21 2021, 2:11 PM
jonpa marked 2 inline comments as done.

One thing that is odd is the extra always-zero parameter to the LoopVarLen insns. This really should be removed.

Done - it was easy to remove on the MachineInstr, but on the SystemZISD node it seems reasonable to keep a dummy zero there and reuse the existing opcode, or?

As a suggestion that might remove a bit of the duplication between emitMemMemWrapper and emitMemMemLoopVarLen: the (old) routine is currently called for both the Sequence and Loop insns, and it distinguishes between them based on the operand count. However, for the LoopVarLen case, your patch now introduces a separate new routine. I think it might be simpler to continue to use the same routine, and just internally distinguish between the cases based on whether the length operand is an immediate or a register.

OK - I merged the two functions instead.

Finally, I'm wondering if it wouldn't make sense to extend support for the variable-length case to the other block operations, now that we already have the EXRL_Pseudo infrastructure. (I guess that could also be done in a separate patch.)

I would like to do that afterwards then with some more benchmarking as well...

Pardon the typo-fixes mixed in here (MMB -> MBB). Maybe pre-commit them?

Yes, please pre-commit those.

b2cd98d

llvm/lib/Target/SystemZ/SystemZAsmPrinter.h
The above clang-format check looks valid (includes should be kept sorted).
This as well.

done

Also added handling/test for i32 case (test/CodeGen/Generic/ForceStackAlign.ll was actually failing) - if the type of length is <64 bits, it needs zero-extension.

jonpa added a comment.Jun 21 2021, 2:43 PM

On spec, I see now 8869 exrl instructions, each with its own xc target instruction. I wonder if we should try to avoid duplicating target instructions. I see in some files hundreds of identical xc targets (worst case is 845 identical ones in one file: f510.parest_r/build/vectors.s !). 67 files have more than 20 identical xc targets...

One thing that is odd is the extra always-zero parameter to the LoopVarLen insns. This really should be removed.

Done - it was easy to remove on the MachineInstr, but on the SystemZISD node it seems reasonable to keep a dummy zero there and reuse the existing opcode, or?

Aren't there already two opcodes, one with 4 operands and one with 5? Can't you just use the one with 4 operands instead?

That said, maybe it actually is better to use the version with 5 operands, and actually pass in two values: the length (minus one), and the loop trip count (length / 256). These two values can easily be computed at the SelectionDAG level already, and if they are, that might open opportunities to optimize / reuse the values at that level. (Also, then it is more obvious to re-use the same opcode, since the operands do in fact have the same semantics then, it's just that in one case they're immediates and in the other they are registers.)

As a suggestion that might remove a bit of the duplication between emitMemMemWrapper and emitMemMemLoopVarLen: the (old) routine is currently called for both the Sequence and Loop insns, and it distinguishes between them based on the operand count. However, for the LoopVarLen case, your patch now introduces a separate new routine. I think it might be simpler to continue to use the same routine, and just internally distinguish between the cases based on whether the length operand is an immediate or a register.

OK - I merged the two functions instead.

Hmm, that's not quite what I had in mind, sorry. This code still has two completely separate code paths, just now in one function. What I had been thinking of is to actually share the same code path that e.g. emits the loop -- they should be identical except that the trip count is an immediate vs. a register.

The point is to merge the two paths to the extent where there is no longer any point in extracting buildMemMemLoop as a subroutine as it actually is used only once.

On spec, I see now 8869 exrl instructions, each with its own xc target instruction. I wonder if we should try to avoid duplicating target instructions. I see in some files hundreds of identical xc targets (worst case is 845 identical ones in one file: f510.parest_r/build/vectors.s !). 67 files have more than 20 identical xc targets...

That makes sense. It should be straightforward to sort and de-duplicate the target instructions at final emission time.

jonpa updated this revision to Diff 354114.Jun 23 2021, 5:15 PM

Aren't there already two opcodes, one with 4 operands and one with 5? Can't you just use the one with 4 operands instead?

Ah, yes, I guess I was looking at the names: 'Length' vs 'Loop'...

That said, maybe it actually is better to use the version with 5 operands, and actually pass in two values: the length (minus one), and the loop trip count (length / 256). These two values can easily be computed at the SelectionDAG level already, and if they are, that might open opportunities to optimize / reuse the values at that level. (Also, then it is more obvious to re-use the same opcode, since the operands do in fact have the same semantics then, it's just that in one case they're immediates and in the other they are registers.)

I updated the patch to create the length-1 and trip count on the DAG instead.

I see that this means that they are initially placed before the zero-length check, while before they were put in the preheader of the loop. The SRLG is actually moved to the preheader by MachineSink later, and this could probably be done for the AGHI as well if the comparison would be made with the AGHI source against 0.

To keep things simple, I just changed the comparison to be against -1 instead for now. What do you think about that - is the zero-length case rare enough to ignore where the AGHI is placed? We could do the extra work of finding the AGHI/-1 and comparing with 0 of that source reg instead, and expect the AGHI will be sunk later if there are no other users I would hope.

Hmm, that's not quite what I had in mind, sorry. This code still has two completely separate code paths, just now in one function. What I had been thinking of is to actually share the same code path that e.g. emits the loop -- they should be identical except that the trip count is an immediate vs. a register.

The point is to merge the two paths to the extent where there is no longer any point in extracting buildMemMemLoop as a subroutine as it actually is used only once.

I gave it another try: First let the reg/immediate cases set up the needed MBBs and then build the loop afterwards for both cases in one place. I guess that seems more simple than having a separate class for creating the loop...

That makes sense. It should be straightforward to sort and de-duplicate the target instructions at final emission time.

AsmPrinter::EmitToStreamer() calls getSubtargetInfo() which needs an MF, so it did not seem quite simple to emit all the EXRL targets in emitEndOfAsmFile(). That method seems to be intended for other things than emitting instructions. Does it seem right to try to create a new section at that point after all else and emit the target instructions?

For now, I tried just improving the output per function. This saves 4110 XC target instructions (prior count was ~9000). Worst case is now 168 in one file, and 38 files have >20, so it is an improvement, but still some room for improvement if reusing per file...

I updated the patch to create the length-1 and trip count on the DAG instead.

Thanks!

I see that this means that they are initially placed before the zero-length check, while before they were put in the preheader of the loop. The SRLG is actually moved to the preheader by MachineSink later, and this could probably be done for the AGHI as well if the comparison would be made with the AGHI source against 0.

To keep things simple, I just changed the comparison to be against -1 instead for now. What do you think about that - is the zero-length case rare enough to ignore where the AGHI is placed? We could do the extra work of finding the AGHI/-1 and comparing with 0 of that source reg instead, and expect the AGHI will be sunk later if there are no other users I would hope.

I think this is probably OK. GCC does the compare with 0, but there's probably not much difference. It's a pity we cannot move that check up to SystemZSelectionDAGInfo::EmitTargetCodeForMemset, but I believe we cannot create new basic blocks on the DAG level.

Hmm, that's not quite what I had in mind, sorry. This code still has two completely separate code paths, just now in one function. What I had been thinking of is to actually share the same code path that e.g. emits the loop -- they should be identical except that the trip count is an immediate vs. a register.

The point is to merge the two paths to the extent where there is no longer any point in extracting buildMemMemLoop as a subroutine as it actually is used only once.

I gave it another try: First let the reg/immediate cases set up the needed MBBs and then build the loop afterwards for both cases in one place. I guess that seems more simple than having a separate class for creating the loop...

That's better, thanks! I think readability might be even better if you continue to create the MBBs in order, i.e. emit the EXRL at the end instead of at the beginning. (That means a duplicated LengthMO.isReg() test, but that should still be better.)

That makes sense. It should be straightforward to sort and de-duplicate the target instructions at final emission time.

AsmPrinter::EmitToStreamer() calls getSubtargetInfo() which needs an MF, so it did not seem quite simple to emit all the EXRL targets in emitEndOfAsmFile(). That method seems to be intended for other things than emitting instructions. Does it seem right to try to create a new section at that point after all else and emit the target instructions?

Ah, that's unfortunate, but I guess understandable. The assumption is that code emission needs subtarget-specific info, and the subtarget is defined by attributes (like target CPU) that change between functions. At the end of the asm file, we're outside of all functions ...

There are ways around that. We do have a TargetMachine in the AsmPrinter, and there used to be a function getSubtargetImpl() that returns a "default" subtarget for the compilation unit. This has been deprecated (for the above reasons), but could probably be re-activated, and then be used as the subtarget to pass to emitInstruction during emitEndOfAsmFile.

Another option would be to simply "assemble" those instructions by hand, i.e. compute the 6-byte integer value from the constituent fields. Something like:

Opcode =  TargetInsOpc << 40 | LenMinus1Reg << 32 | DestReg << 28 | DestDisp << 16 | SrcReg << 12 | SrcDisp;

and then simply emit that value as 6 bytes of "data" (but still into the .text section, of course). This is also somewhat ugly, but maybe less ugly than the default Subtarget ...

llvm/lib/Target/SystemZ/SystemZAsmPrinter.cpp
568

Ah, if you do it that way, you don't actually have to have multiple labels for the same instruction. Just compute the instruction (or just its opcode, see discussion in the main comment) first, and look it up in the table. If found, simply return its (one) associated symbol; if not, create a new symbol and add the pair of symbol, instruction to the table.

Actually, that table can then just be a std::map (from instruction to symbol) instead of a vector, this implements the lookup in a more efficient way ...

jonpa updated this revision to Diff 355400.Jun 29 2021, 4:52 PM
jonpa marked an inline comment as done.

I think readability might be even better if you continue to create the MBBs in order, i.e. emit the EXRL at the end instead of at the beginning. (That means a duplicated LengthMO.isReg() test, but that should still be better.)

Makes sense... fixed.

Ah, if you do it that way, you don't actually have to have multiple labels for the same instruction. Just compute the instruction (or just its opcode, see discussion in the main comment) first, and look it up in the table. If found, simply return its (one) associated symbol; if not, create a new symbol and add the pair of symbol, instruction to the table.

Actually, that table can then just be a std::map (from instruction to symbol) instead of a vector, this implements the lookup in a more efficient way ...

ah, of course - only one label emitted now. They by this get sorted per the operands in the comparator for the std::map object, but a little detail then is that the labels now get emitted in any order (non-sorted) - but I suppose that's ok, right?

There are ways around that. We do have a TargetMachine in the AsmPrinter, and there used to be a function getSubtargetImpl() that returns a "default" subtarget for the compilation unit. This has been deprecated (for the above reasons), but could probably be re-activated, and then be used as the subtarget to pass to emitInstruction during emitEndOfAsmFile.

Another option would be to simply "assemble" those instructions by hand, i.e. compute the 6-byte integer value from the constituent fields.

I started out with trying to "assemble" by hand per your suggestion, but it seemed like quite some work, especially considering that we also have to be able to print it as text. So I tried instead with adding back the GenericSubtarget, which was simple enough.

This then eliminates another 3786 XC targets: there are no duplicates in any file, and at most I see 13 different XC targets in a single file :-)

SPEC master <> patch

cgije          :               115009               132379   +17370
xc             :                13334                23198    +9864
exrl           :                    0                 8873    +8873
brctg          :                 8356                17228    +8872
srlg           :                54407                63229    +8822
brasl          :               640463               631658    -8805
...

ah, of course - only one label emitted now. They by this get sorted per the operands in the comparator for the std::map object, but a little detail then is that the labels now get emitted in any order (non-sorted) - but I suppose that's ok, right?

That shouldn't matter.

I started out with trying to "assemble" by hand per your suggestion, but it seemed like quite some work, especially considering that we also have to be able to print it as text.

I was thinking of just using emitInt16 for the three words, that will show as .word in the textual assembler output, which may not be pretty but is certainly correct. (You could in addition use an emitRawComment to print a textual representation to make the file more readable).

So I tried instead with adding back the GenericSubtarget, which was simple enough.

See the comments inline on that option.

In the meantime I thought of yet another option: the instructions used via EXRL are really part of the function containing the EXRL, and should therefore be emitted using the same Subtarget that is in effect for that function (this is not *currently* a problem, but it might be a potential issue if we want to use an EXRL target instruction that is only available in some ISA levels).

To correctly model this, one option might be to not have a single EXRLT2SymMap, but one per Subtarget, and sort the target instructions into the appropriate subtarget for the current function as they're added to the map. (Or, even simpler, just make the Subtarget part of the "key" for the map.)

In the common case where all functions use the same Subtarget, this would actually lead to the same result as now.

This then eliminates another 3786 XC targets: there are no duplicates in any file, and at most I see 13 different XC targets in a single file :-)

Excellent!

llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
7895

All these changes seem unnecessary now.

llvm/lib/Target/SystemZ/SystemZSelectionDAGInfo.cpp
142

The above suggestion actually makes sense, we don't need a else here.

llvm/lib/Target/SystemZ/SystemZTargetMachine.h
48 ↗(On Diff #355400)

So your new "getGenericSubtarget" does exactly the same that the no-argument getSubtargetImpl used to do. This is a bit odd, in particular given the comment here.

If we do want to go that route, I think we should actually use the getSubtargetImpl name, but change the comment to explain what the "default" subtarget is, and what we use it for.

jonpa updated this revision to Diff 355670.Jun 30 2021, 12:56 PM
jonpa marked 2 inline comments as done.

Updated per review.

I was thinking of just using emitInt16 for the three words, that will show as .word in the textual assembler output, which may not be pretty but is certainly correct. (You could in addition use an emitRawComment to print a textual representation to make the file more readable).

Ah, ok, I did not realize at the time I could just print the instruction as numbers... I tried this instead now, since we don't really want to have a generic subtarget around. Is there any reason to print 3 x Int16? Would it be better to print it as one (or three) hex if possible?

In the meantime I thought of yet another option: the instructions used via EXRL are really part of the function containing the EXRL, and should therefore be emitted using the same Subtarget that is in effect for that function (this is not *currently* a problem, but it might be a potential issue if we want to use an EXRL target instruction that is only available in some ISA levels).

This seems like the logical solution, but it is a bit unfortunate to have to go through all of that trouble just for this case where it is actually not really needed at all.

Not sure if the best thing is to do it the manual way, or to have an extra subtarget map on the side in the SystemZAsmPrinter...

I wonder if perhaps the MC layer could have an emitInstruction() / assembleInstruction() method without the STI argument, that was only very basic that could work in cases like this... It would be nice to reuse the bits encoding already available... On the other hand if we only have EXRL targets of this one format, maybe it's ok to do it on the side...

Updated per review.

I was thinking of just using emitInt16 for the three words, that will show as .word in the textual assembler output, which may not be pretty but is certainly correct. (You could in addition use an emitRawComment to print a textual representation to make the file more readable).

Ah, ok, I did not realize at the time I could just print the instruction as numbers... I tried this instead now, since we don't really want to have a generic subtarget around. Is there any reason to print 3 x Int16? Would it be better to print it as one (or three) hex if possible?

Well, there is no emitInt48 ... You could use one emitInt16 and one emitInt32, but that's not really better I think.

I must admin that manual encoding code is longer and uglier than I had initially thought ...

In the meantime I thought of yet another option: the instructions used via EXRL are really part of the function containing the EXRL, and should therefore be emitted using the same Subtarget that is in effect for that function (this is not *currently* a problem, but it might be a potential issue if we want to use an EXRL target instruction that is only available in some ISA levels).

This seems like the logical solution, but it is a bit unfortunate to have to go through all of that trouble just for this case where it is actually not really needed at all.

I think it wouldn't be all that much effort. You should simply change EXRLT2SymMap to be a map from a pair of (const SystemZSubtarget *, MCInst) to MCSymbol *, add the current subtarget pointer to the key when processing SystemZ::EXRL_Pseudo in the AsmPrinter, and in emitEXRLTargetInstructions just iterate over the map (as now) -- you'll get the pairs of (const SystemZSubtarget *, MCInst) back as key, and simply emit that instruction using that subtarget.

Otherwise, this is now looking good to me.

jonpa updated this revision to Diff 356488.Jul 5 2021, 5:52 AM

Well, there is no emitInt48 ...

(I saw emitIntValueInHex(uint64_t Value, unsigned Size), but haven't tried it...)

I must admin that manual encoding code is longer and uglier than I had initially thought ...
I think it wouldn't be all that much effort. You should simply change EXRLT2SymMap to be a map from a pair of (const SystemZSubtarget *, MCInst) to MCSymbol *, add the current subtarget pointer to the key when processing SystemZ::EXRL_Pseudo in the AsmPrinter, and in emitEXRLTargetInstructions just iterate over the map (as now) -- you'll get the pairs of (const SystemZSubtarget *, MCInst) back as key, and simply emit that instruction using that subtarget.

OK, I tried this now. It is in fact the identical number of target instructions to before, so using different Subtarget do not seem to play a role on SPEC...

uweigand accepted this revision.Jul 6 2021, 4:52 AM

See the minor comment inside. Otherwise, this LGTM now. Thanks!

llvm/lib/Target/SystemZ/SystemZAsmPrinter.h
35

This comparison might involve undefined behavior given that these aren't pointers into the same array. Probably best to convert to uintptr_t before comparing.

This revision is now accepted and ready to land.Jul 6 2021, 4:52 AM
This revision was landed with ongoing or failed builds.Jul 6 2021, 9:07 AM
This revision was automatically updated to reflect the committed changes.
jonpa marked an inline comment as done.

@jonpa @uweigand
It seems that this change causes clang to crash on s390x on Ubuntu bionic:
https://bugs.llvm.org/show_bug.cgi?id=51026