This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Enable MachineCombiner for FP reassociation.
Needs ReviewPublic

Authored by jonpa on Apr 7 2023, 11:36 AM.

Details

Reviewers
uweigand
Summary

Have been experimenting with enabling MachineCombiner.

It seemed very likely better to me to let the MachineCombiner work on reg/reg opcodes as it didn't work at all when the input had ADBs. It is expecting opcodes to be the same LHS/RHS, and of course that didn't work when the RHS was a memory operand.

To do this I tried to first select all FP64 adds directly to WFADB always. The MachineCombiner now could do its work, and after that the Peephole optimizer will help with folding the loads again to ADB:

main <> patched: isel + peep of f64 adds, machine combiner disabled.

adb            :                10796                 9057    -1739
...
wfadb          :                10733                11691     +958
adbr           :                 4345                 5203     +858
...

Spill|Reload   :               642581               641631     -950
Copies         :              1011928              1011420     -508

Most of the loads are folded, but not all (84%). Spilling/copies however improves, so not sure if it's worth further effort to fold everything, or even if that's better.

Preliminary benchmarking shows 20% (!) improvement on LBM - twice as much as with the SLP vectorizer (confirmed with a preliminary "full" run also).

Slight improvement on namd and imagick (~1%), without any reassociation, just with the add reg/mem change, which makes it seem like this worked out pretty well.

A complication, though, with this: changing VL64;WFADB -> ADB introduces a CC clobbering. Scanning from the WFADB and forward many times lead to a quick termination when a new CC def was encountered. There were however also many longer searches.

Measuring compile time for the Peephole pass shows a slight average increase, with a bad worst case or two:

main <> patch (without running machine combiner)

Num stats: 3465         Num stats: 3465
Average Wall: 0.49%   | Average Wall: 0.50%
Wall %   Count          Wall %   Count
4.0      1            | 9.3      1
1.5      1            | 4.3      1
1.4      4            | 3.3      1
                      > 2.7      1
                      > 1.5      2
                      > 1.4      5
1.3      4              1.3      4
1.2      3            | 1.2      2
1.1      4            | 1.1      6
1.0      10           | 1.0      11
0.9      58           | 0.9      69
0.8      152          | 0.8      148
0.7      270          | 0.7      281
0.6      586          | 0.6      611
0.5      894          | 0.5      921
0.4      842          | 0.4      801
0.3      480          | 0.3      460
0.2      139          | 0.2      121
0.1      14           | 0.1      10

This might be acceptable perhaps.

Selecting WFADB (the real instruction, not a pseudo) directly in Select() lead to one file having a few cases of LOC instructions moved around in the block crossing the places where ADB would no longer be possible. Just in one file, but a dozen or so times inside a loop, so not quite ideal. Using an isel pseudo that clobbers CC remedied this as it was the isel scheduler that were causing this difference. With this, there are no cases of CC clobbering that prevents optimizeLoadInstr() on SPEC, but the scan to make sure still has to be made of course.

  • One idea I tried was to simply add the CC def to WFADB, which then would be trivially replacable with ADB. It did cause some changes in ~50 files, but not sure if that would make a difference.
  • A compromise would be to have a WFADB_CC pseudo live all the way through optimizeLoadInst() and then convert to the real opcode before scheduling. That would however require a new pass it seems, as now *all* of them has to be handled, while optimizeLoadInst() only sees the candidates for the load folding. If that new pass would have LiveIntervals available, the scan for CC would not be necessary as that query should now be available.

In short, this seems promising performance-wise even compared to the SLP vectorizer generating reductions, but I'm not quite sure which way to handle the CC clobbering problem is the best fit.

Diff Detail

Unit TestsFailed

Event Timeline

jonpa created this revision.Apr 7 2023, 11:36 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 7 2023, 11:36 AM
jonpa requested review of this revision.Apr 7 2023, 11:36 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 7 2023, 11:36 AM
jonpa updated this revision to Diff 518670.May 2 2023, 3:31 AM
  • Using a pseudo reg/reg that pretends to clobber CC to simplify the later reg/mem folding. This seems to work well.
  • Handling all floating point add, sub and mul, which are the set of operations that depend on the reassociable/nsz flags.
  • Having scheduler info for the _CCPseudo:s turned out to be important as MachineCombiner looks at the latencies.
  • New target hook "processFunctionAfterPeepholeOpt()", which lets target do any post-processing after peephole optimizations. This is needed in order to substitute the _CCPseudo instructions that did not get optimized (into reg/mem), with the target instruction without the CC operand.
  • Using a pattern for selecting the _CCPseudo instructions instead of doing it manually in select(). This change made it obvious that the NoFPExcept flag is to be added here, which the pattern based selector does. This looks ok to me, but not quite sure if it would make more sense to instead predicate the reg/mem patterns or not.

FMin, FMax and the integer instructions are the remaining instructions that need to be handled.

Preliminary benchmarking looks good - lbm improved another percent with the subtractions (now 21% improvement).

I have not yet looked further into any effects of this way of doing the instruction selection with later peephole folding of loads. One thing I noticed is some cases where MDEBR is not used, instead WLDEB + WFMDB. It seems two separate instructions is slower in this case. Not sure if it would be worth handling that in processFunctionAfterPeepholeOpt(), or if perhaps just not handling those cases with reassociation (relatively few cases):

main <> patched

mdebr          :                  170                   16     -154
wldeb          :                  661                  793     +132
ldebr          :                 8803                 8790      -13
ldeb           :                 5599                 5596       -3
jonpa updated this revision to Diff 523291.May 18 2023, 1:32 AM
  • MachineSink behaved differently with the new pseudos that clobber CC - fixed with a patch in MachineSink plus making sure to mark the CC def as dead on the newly created instructions in MachineCombiner.
  • Previously tried selecting _CCPseudo:s by using a pattern with added complexity for them, and then also an even higher complexity for MDEBR. Seemed better to instead predicate the reg/mem pattern with "no reassociation flags", and selecting the _CCPseudo in case of "reassociation flags", or else the target instruction, which the patch now does. These two alternatives gave identical output on SPEC.
  • Experimented with MDEBR but it seemed that those cases are rare and there was not any more reassociation done on benchmarks - so I removed the folding I had working in FinalizaeReassocication (ldebr; wfmdb -> mdebr).
  • PeepholeOptimizer does not fold loads across basic blocks but it seems good to fold them in FinalizeReassociation. Tried doing this first with only loads from constant pool, but it seemed to be even better to do it on any load.
  • Removed the check (in optimizeLoad()) when folding into reg/mem that there is no other user in MBB. With this restriction:
                                 main                patch
mdb            :                 9667                 5507    -4160
meeb           :                 8838                 4831    -4007
adb            :                10787                 8591    -2196
aeb            :                 7322                 5534    -1788
sdb            :                 4271                 4409     +138
seb            :                 4706                 4094     -612
Copies         :              1006170               999666    -6504

Without it (as patch is now), the number of reg/mem instructions are much closer to main:

mdb            :                 9667                 9061     -606
meeb           :                 8838                 8210     -628
adb            :                10787                 9467    -1320
aeb            :                 7322                 6992     -330
sdb            :                 4271                 4637     +366
seb            :                 4706                 4697       -9
Copies         :              1006170              1006497     +327

As seen in the number of register moves (copies) in the output, the folding into 2-addres reg/mem has a price of copying the source reg. The lesser number of copies didn't seem to matter in performance. With the extra folding I see a great improvement in f538.imagick_r (~15%), which is probably be the same improvement as if disabling the pre-ra machine-scheduler, so it seems that the increased spilling there is avoided also this way. LBM also gains another 2% with this (now ~20%), so it looks preferable at least the moment. If the scheduler is improved to improve on the register pressure consistently, perhaps this could be reevaluated.

  • With nightly full runs I now see three big improvements on z15:
Improvements:
0.794: f519.lbm_r 
0.855: f538.imagick_r 
0.906: f510.parest_r

Will now give it a try to find further improvements with fused add/sub and multiply.

jonpa updated this revision to Diff 523381.May 18 2023, 7:49 AM

Better not reject reg/mem folding during isel (like adb) for older machines.

jonpa updated this revision to Diff 528098.Jun 3 2023, 5:30 AM

The Add/Sub/Multiply reassociation seems to work well, so this is adding a handling of FMAs on top of that. This is still under development as it is not yet clear exactly what is the best approach.

  • SystemZReassocAdditions.cpp: New experimental optimization that finds chains of FMAs and Adds and reorders computations to minimize stalls. This should help MachineCombiner reassociation but could also perhaps be good on its own. A little unsure of how aggressive this pass should/could be. Right now assuming that if the FmContract flag is passed in addition to FmReassoc and FmNsz, it is ok to take apart FMAs/Adds and form new a new FMA on the top of the chain.
  • Reg/mem handling for WFMADB and WFMASB the same way as for the binary instructions.
  • Experimental FMA patterns added with tests. Two of them are inspired by the PPC ILP patterns. Have not yet tried these on benchmarks so still mostly unclear what will be the best in the end. One issue might be with longer FMA chains where it might be good to have MachineCombiner step back and revisit the resulting chain of Adds. If it could do that, using FMA2 or FM4 should be the same on a chain of 4 FMAs, for example.

The current idea is to let MachineCombiner decide if a transformation is beneficial, looking at the actual depths of the MachineInstrs using the SchedModel. For the binops, TargetInstrInfo reassociation tries two variants to let MachineCombiner decide which one is improving the Critical Path. For FMA patterns, this is not done so it is kind of arbitrary if a pattern is beneficial - it depends on the incoming operand latencies of the multiplication factors. Maybe that could be done on smaller patterns, like having three or four variants of it, or perhaps the target hook could use the BlockTrace to decide that directly instead. Compile time has been mentioned in comments, so perhaps that would be a good idea (if useful). Then again, the ReassocAdditions pass would be the alternative to doing this, if that turns out to be of good use.