This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] New pass for domain reassignment from integer to vector.
Needs ReviewPublic

Authored by jonpa on Aug 25 2020, 3:50 AM.

Details

Reviewers
uweigand
Summary

This was inspired by the X86 pass but adapted to the needs of the SystemZ backend. The aim is to reduce spilling of registers by using free vector registers if available. Other benefits are also achieved, such as avoiding expensive vector element extractions.

Vector lanes management has been added, which means that the pass can for instance extend/truncate a value and still keep track of which element it will live in.

Diff Detail

Event Timeline

jonpa created this revision.Aug 25 2020, 3:50 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 25 2020, 3:50 AM
jonpa requested review of this revision.Aug 25 2020, 3:50 AM

Note: This patch still has a way to go before being ready to commit...

lkail added a subscriber: lkail.Aug 25 2020, 5:14 AM
jonpa updated this revision to Diff 288841.Aug 30 2020, 1:45 AM

Latest improvements and fixes.

Support added for insertion of register defined by a non-convertible instruction. The rationale is that it might be worth doing a VLVG or two in order to be able to convert a closure. The number of re-assignable closures (SPEC) goes from 170k to 285k with this.

jonpa updated this revision to Diff 297819.Oct 13 2020, 4:10 AM

Latest improvements (still in progress):

  • VirtRegLiveness: A simple MRI-based class for tracking liveness of virtual registers
  • LiveClosuresTracker: A class to use when iterating over the function that keeps track of which of the potentially convertible closures are live at the current point.
  • Pre scan a closure to see if it spans across a call, in which case it is deemed non-profitable as vector registers are always call-clobbered.
  • Run EarlyMachineLICM again afterwards. It would perhaps be possible to adjust insertion points instead when loading immediates.
jonpa updated this revision to Diff 299336.Oct 20 2020, 4:55 AM

Patch still experimental.

Latest improvements include:

  • only do a reassignment when it seems beneficial (given the number of live virtual regs)
  • managed to enable MachineCSE for the generated VLEI instructions.
jonpa updated this revision to Diff 302539.Nov 3 2020, 4:08 AM

Patch still experimental.

Tried

  • Isolate loops so that closures with instruction inside them are reassigned only when necessary inside the loop.
  • Don't load extra immediate operands in tiny blocks.

Saw no difference on benchmarks with these and other experiments...

In order to get some kind of idea of the success of this patch, I improved the debug dumping so that the reasons for not being able to reassign a closure is given explicitly at the point of high register pressure. With this, closures that really would have been reassigned based on register pressure but were not can be categorized (as opposed to just looking at all illegal closures). The results are:

Number of closures reassigned: 17209
Number of closures not reassigned when deemed needed: 288869

=> Only 6% successfully reassigned

Reasons:

Address             : 118922   41%  (Out of the 289k)
Physreg             :  73828   25%
Interfering call    :  47629   17%
RISBGN              :  27953   10%
INSERT_SUBREG       :  14650    5%  (Probably involves quite some work to handle subreg cases)
Insertion(s) needed :  11762    4%
CR                  :   3718
CGR                 :   2352

Address or Physreg  : 183395   63%
Addr/PhysReg/Call   : 206325   71%
Physreg or Call     : 100860   35%

RISBGN, no a/p/c    : 18375     6%

Ideas to try next:

  • Part of the address register operands is due to LA/LAY being used for addition. It could be worth trying to handle those cases.
  • RISBGN:

RISBGN operands:

  ...                       Steps rotated   High bit I4    Interpretation                     Reassignment
 3503  29, 188, 3       5%  3                1
 4117   1, 189, 0       6%  0                1
 4389  63, 191, 0       6%  0                1             copy 63:63, "zero extend LSB"      mask + VN ?
 8903  32, 189, 0      13%  0                1             "Zero hi32, copy 32:61"
21666  62, 191, 0      31%  0                1             "zero extend two lowermost bits"   mask + VN ?
  • Register compares (CR/CGR)
jonpa added a comment.Nov 10 2020, 2:59 AM

I have performed some experiments that does not seem very beneficial, so I include them here in the patch as a separate file instead of updating the patch itself:

This is what I did:

  • LoadAddressReplacer: Check uses of ADDR64/32 register to see if it is really used by a MEMORY operand before deciding it is reassignable or not (as opposed to assuming they never are). (The addresses used by the shift instructions are not used to access memory, but they would not be easily reassigned since the scalar version handles them modulo 64 and the vector element shifts module 'element size'. They have not shown up frequently as a bottleneck either.)
  • ExperimentalReplacer: Enabled with -domain-experiments. This is a quick way of enabling conversions for more instructions to evaluate the impact without going through all the steps of making sure the generated code is actually correct (output with this option is incorrect, so only used for debugging).

Results:

  • There seemed to be only some rare cases of Load Address. For example 'lg; sllg 8; la 255; In total just 78 LAs reassigned with this.
  • Even if *all* RISBGN would be convertable (which they are probably not), adding this opcode does not give a big change - only 62 RISBGNs were reassigned.

(Only 14 closures are found illegal due to vector lane conflicts, so that is not a factor generally.)

Legal closures found: 199223
Reassigned closures:   17286  (~9%)

Closures not possible to reassign at point of query: 200910
Due to
Used as address :  74575
COPY to physreg :  59076
Crossing a call :  42212
Any of above    : 145609   (~ 72%)

Other reasons seen at this point, following the above:
COPY from physreg :  9531
Compare w/0       :  5502
Any of above      : 15033  (~ 7%)

CR                :  1719

Conclusions:

Even with reassignment added for Load Address, RISBGN, and INSERT_SUBREG which were the major missing items seen, there is less than 10% of a chance of reassigning a closure at the points where register pressure seems to indicate a need for it.

The reasons for this seems to be that in the majority of the cases, there is either a call, a value used as address or COPY:ed to a physreg that makes reassignment undesireable.

Only closures containing at least one reassignable instruction are included in the counts above. The statistics should probably be taken with a grain of salt, since the algorithm for finding reassignable closures are dictating how this all turns out in the end. For example, the search stops at the point of a non-legal instruction, but if that instruction would have had a converter the closure would have grown further. So generally the illegal closures are smaller and greater in number since non convertable instructions make up dividing points. However, the results look quite similar to the last time around and the conclusions seem to make sense since it is probably exactly around calls and many memory accesses that register pressure is at its highest.

Just 1846 less spill/reload instructions less in total over benchmarks (0.3%), which is not motivation enough for this new pass at least as it looks now.

jonpa updated this revision to Diff 305744.Nov 17 2020, 4:51 AM

Patch still experimental, but updated with latest changes.

Instead of making a closure illegal at the point of an impossible conversion, keep sets of Insertions and Extractions and allow more cases of inserting and extracting registers. In particular, if a reassigned register is defined inside a loop, an extraction is allowed (only) if the scalar instruction is outside of the loop.

findExtractionInsPt() determines if a closure with needed extraction is legal, and if so where to place the extraction.

Legal closures found: 1450688   (7x from last week)
Reassigned closures:    21415   (+30% from last week, but still very little)
Closures not possible to reassign at point of query: 225077
Due to
Extraction not allowed unless outside of loop: 221314
  - COPY (to physreg) : 25572
Crossing a call :  62587

The picture in the debug dump is now very different since a lot of closures now depend on an extraction of the result..

It seems that it could be worth trying to do more extractions - for instance also if a defining instruction is before the loop and not just for those inside of it...

jonpa updated this revision to Diff 308612.Dec 1 2020, 4:09 AM

Improvements since last update:

  • Allow reassigned registers around calls by keeping them in a callee-saved FP lane.

While working on this, it seemed wrong to either add anew pseudo callee-saved vector physregs or to change the regmask on the call instruction. It also seemed like a lot of work to have duplicate opcodes with FP reg operands - e.g. VAG_FP64, VAG_FP32. Instead, COPYing FP64 out of the defined (reassigned) vector reg and then back into a new vector reg before each user is what was tried.

Copying into FP64 subregs is somewhat problematic: the register coalescer typically simply rewrites that back again to the full vector register, unless those COPYs are specifically checked for and handled in SystemZRegisterInfo::shouldCoalesce(), which was added.

Due to this, I saw more LDRs in the output, which was partially remedied by adding regalloc hints for those cases. That helped in the simple case of one user, but with more users, unfortunately two FP64 regs were used: FP8D was COPYed into V8 at the first use, but for the second use FP9D was now kept since V8 overlaps with FP8D and regalloc is not aware that V8 is only used, it seems. Due to this, only the case of a single user is currently handled.

In order to make this work a new VF128Saved regclass containing V8-V15 was needed, to keep the FP64 regs in the callee-saved ones.

A drawback is that now the function has to save/restore those FP regs even in small functions with no other vector registers in use.

I also added a tuning parameter for this ("vecsavedlim") to avoid increased spilling due to VF128Saved registers if doing too much of this.

The static effect of this was a rise in the number of reassigned closures from 8000 to 9000. 16000 closures were now not marked illegal due to calls.

For this to work, all the reassigned regs (around calls) have to be in the FP lane. Out of the closures rejected due to calls, there were some closures with this reason (15%), but the main problem was when the defining instruction was outside of the loop with a call in the closure loop (+50%). In that case the LDR would remain in the loop if reassigned. That may still be beneficial, but hasn't been tried yet (see "To try next" below).

  • Vector Element Compare

After the experiments with Vector Test under Mask proved to be fruitless, I replaced that to instead use VEC whenever possible (lane G0 / F1). Given that comparisons are typically used before branches and are optimized in SystemZEliminateCompares, and since reassigned arithmetic instructions (e.g. VAG) cannot be used for comparison elimination like scalars (e.g. AGR), I have not tried running this on benchmarks. The idea here was mainly to eliminate VLGVs (extractions) already present in the code after isel. With TM, there were +2000 eliminated (however not giving any speedup), but with VEC only 145 were eliminated so far...

  • Experiments on benchmarks:

I added options to add different types of converters in order to be more selective about what type of instructions are being converted. With -allconv=false, only loads and stores are converted. With -regreg, register-register instructions are also converted, e.g. AGRK -> VAG, and so on.

In order to maximize the saved spilling/reloading, I started experimenting with the register pressure limits and available conversions. The result was a bit surprising...

I started with just load/store conversions (which is what I initially believed would be a simple and good improvement). I could improve a bit on the initial estimates of gprlim=16 ("when this many gpr regs are live, try to reassign a closure"), veclim=28 ("reassign as long as no more than this many vector regs are live"), vecsavedlim=("reassign around calls using fp-lanes as long as no more than this many vector regs are live").

(gprlim / veclim / vecsavedlim)
16 / 28 / 16 : 1305 less spill/reload instructions.
20 / 40 / 12 : 1571 less...

I could not get it better than that, so I then added "reg-reg" converters:

16 / 40 / 12 : -9078

At this point it was also clear that the trick of inserting scalars that had no converter played a key here, and some 8000 move VLVG instructions were present. Without doing insertions, the save in spilling did not happen.

Adding just "shifts" without "reg-reg" actually made things worse, but with "reg-reg" + "shifts" I got

20 / 40 / 12 : -9478

+regreg +shifts +regexts, was for some reason much worse:

20 / 40 / 12 : -3118

+regreg +shifts +fpconv, however was a nice improvement, where it also seemed right to limit the number of scalar insertions per closure to 1 instead of 4 since that gave similar result with less VLGVs:

20 / 40 / 12 / 4 : -13468 (~ +8100 VLVGs)
20 / 40 / 12 / 1 : -12800 (~ +5400 VLVGs)

I was getting ready to add the reg-imm and reg-mem converters, but they did not seem to improve the way I was expecting:

+regreg +shifts +fpconv +reg-imm:
20 / 40 / 12 / 4 : -13614 (~ +8400 VLVGs)

+regreg +shifts +fpconv + reg-mem:
20 / 40 / 12 / 1 : -3724 (~ +3000 VLVGs) !

+regreg +shifts +fpconv +reg-imm +reg-mem:
20 / 40 / 12 / 1 : -3827 (~ +3500 VLVGs) !

reg-imm was improving things just a little, but not enough to motivate the needed handling to support it... And it certainly seems that it is not helping register pressure to reassign reg-mem instructions. I don't know exactly yet why that is, but for some reason it seems to work better to instead do a VLVG after the scalar reg-mem instruction.

Adding compare conversions did not seem to save any register pressure either, on the contrary... It also gave more insertions, so maybe this means that a scalar value was inserted to only then be used in a comparison, which may not really help register pressure...

+regreg +shifts +fpconv +compares
20 / 40 / 12 / 1 : -8200 (~+10000 VLVGs)

LoadAddress conversions also did not seem to be good at all here.

This meant that the best setting I have so far is "+regreg +shifts +fpconv", which saves ~13000 spill/reload instructions, which is about 1.8% (there was also (currently) some more register moves: +1000, or 0.1%, which I haven't looked into more.)

I tried this over-night:

master vs "+regreg +shifts +fpconv":

Improvements:
0.981: f538.imagick_r
0.995: f508.namd_r

Regressions:
1.033: i523.xalancbmk_r
1.016: i500.perlbench_r
1.015: i505.mcf_r
1.010: i520.omnetpp_r
1.009: f526.blender_r
1.009: i541.leela_r
1.007: f507.cactuBSSN_r
...

master vs "+regreg +shifts +fpconv +immloads":
Improvements 2017_dev_loopstats_with_immloads:
0.985: i541.leela_r
0.996: f538.imagick_r

Regressions 2017_dev_loopstats_with_immloads:
1.013: i520.omnetpp_r
1.012: i505.mcf_r
1.010: f526.blender_r
1.010: i500.perlbench_r
1.007: f519.lbm_r
1.006: i523.xalancbmk_r
...

These (and a few more variations inluding only "reduce VLGVs") showed nothing but mixed results :-/

For the functions that got a different spill count with "+regreg +shifts +fpconv, 20 / 40 / 12 / 1", I made a graph:

It shows that there are cases that are actually getting more GPR spilling (to the right), and also some VEC spilling (to the left)...

Ideas to try next:

calls handling:
 - Allow LDR in loop (don't reject closures with a def before loop and call in loop).
 - Allow multiple users (gives more LDRs).

Try to be more agressive in big regions: extractions, comparisons, ..?  If the liveness is long, it may not be so bad to do an extraction...

Try to limit the number of insertions somehow when they are not useful. There are quite a lot of VLGVs added, and I suspect some of those closures using insertion is not really helping register pressure - especially when the register is killed soon after the insertion.

Try to optimize settings for innermost loops. So far only values for whole functions have been used for finding best tuning.

Look more into worst-cases (see graph) for further improvements and tuning.