This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ / TII] Peephole optimization of zero-extension of i1.
AbandonedPublic

Authored by jonpa on Apr 10 2021, 3:12 AM.

Details

Reviewers
uweigand
Summary

This is yet another attempt to eliminate unnecessary loads of immediates in case where it is already known by the preceding comparison (https://reviews.llvm.org/D98905, https://reviews.llvm.org/D100039).

SystemZ:

  • Added isSelect flag on LOCHIMux and LOCGHI.
  • Implemented analyzeSelect() and optimizeSelect() for them.

TargetInstrInfo - analyzeSelect() and optimizeSelect():

Changed the handling of optimizeSelect() so that target can return a modified instrution in which case it is *not* deleted.

If (as it appears to me) PeepholeOptimizer.cpp is the only user of these hooks (and there are no downstream out-of-tree targets that have requested this), maybe we could merge these two hooks? It seems this could more or less be just one 'optimizeSelect()' method as there appears to be no use for the arguments to analyzeSelect(), or?

If the arguments to analyzeSelect() are indeed needed to be filled out, the current patch makes sense, by doing a careful analysis in that method. Otherwise, it is a waste as it has to be redone in optimizeSelect() (It would probably be better to return true from analyzeSelect() from the interesting opcodes and then do the work in optimizeSelect()).

Benchmarks:

I tried four combinations of two options: "single use of compare operand" and "find LHIMux/LGHI via MRI if not found locally" (experimental options in the patch):

master <> "multiple users" + "only cases with local LHIMux/LGHI"
lhi            :               225040               222044    -2996
lghi           :               445603               444910     -693
lr             :                61869                62276     +407
lgr            :               853946               854211     +265
...
master <> "single uer" + "only cases with local LHIMux/LGHI"
lhi            :               225040               222702    -2338
lghi           :               445603               445263     -340
lgr            :               853946               854083     +137
lr             :                61869                61928      +59
...
master <> "multiple users" + "use MRI to find LHIMux/LGHI"
lhi            :               225040               220319    -4721
lghi           :               445603               443104    -2499
lr             :                61869                62808     +939
lgr            :               853946               854436     +490
...
master <> "single user" + "use MRI to find LHIMux/LGHI"
lhi            :               225040               221788    -3252
lghi           :               445603               443556    -2047
lgr            :               853946               854352     +406
lr             :                61869                61942      +73
...

Initial measurements do not show any bigger performance changes either way...

Diff Detail

Event Timeline

jonpa created this revision.Apr 10 2021, 3:12 AM
jonpa requested review of this revision.Apr 10 2021, 3:12 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 10 2021, 3:12 AM
jonpa edited the summary of this revision. (Show Details)Apr 10 2021, 3:12 AM
jonpa updated this revision to Diff 338071.Apr 16 2021, 5:38 AM
  • Only check for opcodes in analyzeSelect() and avoid common-code changes by returning a new instruction from optimizeSelect().
  • seemed best to use MRI to find the LHIMux / LGHI (as opposed to looking for it in MBB). Even if there is a load-immedate that has several users, the LOC is 2-address so it is still beneficial.
  • I tried a simplistic search to handle cases with multiple users but where the LOC use should be the kill ("last user"). The kill-flags did not really help much, so this is not trivial to handle. This gave just some additional eliminations (~5300 -> ~6000), so it is probably acceptable to just check for a single use and not worry about those extra cases. Possibly some simple check could be used rather than a full CFG-search...
  • I found out that the new extra LGR/LR instructions relates very much to physregs/calls: The LOC-imm serves as a "natural" change of registers if the immediate is loaded into a register, while if the compare operand is reused there will be a need for a COPY if for instance the compare register comes from a COPY of a physreg, while the LOC-def needs to be live across a call. This is also not trivial to detect - I had to use slow experimental functions to determine if the vregs (and connected vregs) crossed a call. This got rid so far of most of the extra moves, but not all.
  • I also tried another idea: instead of detecting the cases to avoid (per previous point), do all cases but return false in SystemZRegisterInfo::shouldCoalesce() for the COPY created by TwoAddress for the LOC(G)HI. In many cases regalloc can then eliminate the COPY without the help of RegisterCoalescer, and in the remaining cases SystemZInstrInfo::copyPhysReg() could then lower the COPY with a L(G)HI instead of L(G)R.

This however didn't seem as good as good as I had hoped:

  • There are with this many cases of CGHI+LGHI which previously became LTGR (not all of those COPYs become coalesced on trunk to begin with, so many of them get the LGR which then becomes an LTGR in SystemZElimCompare).
  • This didn't so far really eliminate the extra COPYs (LGR/LR:s), but it may be possible to investigate further in shouldCoalesce() using LiveIntervals that is available there and fine-tune this even more.

With the patch as it is we trade ~5000 immediate loads for ~500 register moves, which seems good just looking at the instruction count, but not if a register move is potentially more costly than an immediate load..?

In summary:

    • There are relatively few extra cases to be handled if the interesting multiple use cases are searched for.
  • It is relatively hard to get rid of the extra L(G)R:s as it depends on presence of calls in the function (global search).
jonpa updated this revision to Diff 339965.EditedApr 23 2021, 3:26 AM

I did some further experiments with using a 3-address pseudo, utilizing the already in-place regalloc-hints towards 2-address opcodes and then handling the 3-address pseudos during Post RA pseudo expansion.

I am not sure which variant is really the best - see tables below. The "single user only / 3-address pseudo" is nice: it's less aggressive but seems to have only positive effects. Looking at preliminary benchmark runs, the "rewrite register (no pseudo)" versions may be slightly preferred on the other hand... If there really are any benefits, that would be nice, but I suspect these differences are so small that instead we should expect to see no improvements, and then look at the patch/opcode differences mostly.

Opcode diffs, SPEC 2017:

trunk <> patch, multiple users

lhi            :               226044               221311    -4733   // ~7000: highest number of eliminated immediate-loads
lghi           :               445650               443171    -2479
lr             :                61854                62741     +887   // Increase in register moves
lgr            :               853624               854133     +509
ltr            :                 6173                 6387     +214
lochilh        :                 9162                 9361     +199
cih            :                 8103                 7935     -168
ltgr           :                 9394                 9548     +154
chi            :                53571                53420     -151
lochie         :                13917                13794     -123
cghi           :                14071                13954     -117
iihf           :                 4263                 4163     -100
l              :               177406               177487      +81
...

trunk <> patch, single user

lhi            :               226044               222775    -3269  // ~5000
lghi           :               445650               443624    -2026
lgr            :               853624               854050     +426  // some new register moves near calls/argument copys.
cih            :                 8103                 7974     -129
lochilh        :                 9162                 9265     +103
stg            :               375242               375320      +78
...

trunk <> patch, multiple users, emit a 3-address pseudo

lhi            :               226044               222916    -3128  // ~4000
lg             :               986749               985733    -1016  // - Mostly due to a single file where many reloads
lghi           :               445650               444656     -994  //   turned into (slightly fewer) copys instead.
cghsi          :                32839                32395     -444
cih            :                 8103                 7721     -382
cgijle         :                 7698                 8059     +361
jle            :                36186                35849     -337
chsi           :                57211                57445     +234
lochilh        :                 9162                 9317     +155
jlh            :               164726               164574     -152
mvghi          :                57787                57638     -149
l              :               177406               177554     +148
lochie         :                13917                13779     -138
cijlh          :                78622                78745     +123
cgije          :               118679               118798     +119
je             :               336281               336165     -116
ltg            :               157803               157693     -110
lgr            :               853624               853724     +100
risbhg         :                 6313                 6413     +100
...

trunk <> patch, single user, emit a 3-address pseudo

lhi            :               226044               223291    -2753  // ~4000
lghi           :               445650               444558    -1092
lg             :               986749               986531     -218
lgr            :               853624               853467     -157
ltg            :               157803               157663     -140
cgije          :               118679               118792     +113
je             :               336281               336172     -109
...

Benchmark measurements (quick runs):

These show only slight variations in performance. It's hard to say which one is really best, if any. I have made previously full runs with just B and C, where they both seemed slightly profitable and possibly B was the better one...

z14:

Overall results (integral over benchmarks):
Build:                                                                    Result       Improvements Regressions
2017_C_PeepLOCI_peep_multiple_users_false                                 0.986        0.960        1.026
2017_B_PeepLOCI                                                           0.989        0.958        1.031
2017_D_PeepLOCI_peep_pseudo                                               0.996        0.962        1.034
2017_E_PeepLOCI_peep_pseudo_peep_multiple_users_false                     1.007        0.985        1.022

Overall results (by average over benchmarks):
Build:                                                                    Average result
2017_C_PeepLOCI_peep_multiple_users_false                                 99.927 %
2017_B_PeepLOCI                                                           99.942 %
2017_D_PeepLOCI_peep_pseudo                                               99.980 %
2017_E_PeepLOCI_peep_pseudo_peep_multiple_users_false                     100.038 %

z15:

Overall results (integral over benchmarks):
Build:                                                                    Result       Improvements Regressions
2017_D_PeepLOCI_peep_pseudo                                               0.989        0.967        1.021
2017_B_PeepLOCI                                                           0.998        0.966        1.032
2017_E_PeepLOCI_peep_pseudo_peep_multiple_users_false                     0.999        0.980        1.019
2017_C_PeepLOCI_peep_multiple_users_false                                 1.007        0.979        1.028

Overall results (by average over benchmarks):
Build:                                                                    Average result
2017_D_PeepLOCI_peep_pseudo                                               99.941 %
2017_B_PeepLOCI                                                           99.990 %
2017_E_PeepLOCI_peep_pseudo_peep_multiple_users_false                     99.997 %
2017_C_PeepLOCI_peep_multiple_users_false                                 100.039 %
jonpa abandoned this revision.Jun 3 2021, 9:35 AM

Results not so promising..