This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Reuse known zeros/ones after zero-extension of i1.
AbandonedPublic

Authored by jonpa on Mar 18 2021, 4:02 PM.

Details

Reviewers
uweigand
Summary

This is an optimization for zero extensions of i1:s, which resulted from looking into the perl regression against GCC. I noticed a lot of LHI 0, LHI 1, LOC sequences, which gcc did not seem to have.

Basically, after an ICMP NE 0, or an ICMP EQ 1, those known constants are already the ones needed as the CC result, so there is no need to load them with LHI (LGHI).

The i32 cases where quite straightforward, but then doing the same for i64 was a bit more of an effort. Since we always return i32 from getSetCCResultType(), these cases needed different handling depending on the user. If an i64 use is needed, then I chose to promote the setcc result in combineZERO_EXTEND(). For the i32 user (of i64 comparison), the reused operand was instead truncated.

For the case of i64 user, a lot of llgfr:s remained without the handling in combineZERO_EXTEND():

Optimized legalized selection DAG: %bb.0 'prototype_p:bb'
SelectionDAG has 15 nodes:
  t0: ch = EntryToken
  t5: i64,ch = load<(load 8 from `%0** undef`)> t0, undef:i64, undef:i64
        t25: i32 = truncate t5
        t24: i32 = SystemZISD::ICMP t5, Constant:i64<0>, TargetConstant:i32<0>
      t29: i32 = SystemZISD::SELECT_CCMASK Constant:i32<1>, t25, TargetConstant:i32<14>, TargetConstant:i32<6>, t24
    t20: i64 = zero_extend t29
  t12: ch,glue = CopyToReg t0, Register:i64 $r2d, t20
  t13: ch = SystemZISD::RET_FLAG t12, Register:i64 $r2d, t12:1

->
        ltg     %r0, 0(%r1)
        lochilh %r0, 1
        llgfr   %r2, %r0

The general effects on SPEC'17 output:

  1. Just the i32/i32 cases of NE 0:
lhi            :               225081               221486    -3595
lghi           :               445509               444420    -1089
locghilh       :                 3717                 2676    -1041
chsi           :                57297                56385     -912
lt             :                13672                14348     +676
lochilh        :                 8796                 9424     +628
llgfr          :                90010                90533     +523
risbgn         :               137540               137980     +440
tmll           :                53266                53693     +427
ltr            :                 6140                 6550     +410
lgr            :               849527               849890     +363
llc            :                39671                39994     +323
locrlh         :                 1492                 1807     +315
...
  1. Also the i64 cases, compared to (1):
lghi           :               444420               441828    -2592
lhi            :               221486               219782    -1704
lr             :                62223                62878     +655
cghsi          :                32665                32175     -490
tmll           :                53693                54181     +488
llgfr          :                90533                90066     -467
ltg            :               157760               158133     +373
risbgn         :               137980               138334     +354
jne            :                42684                42990     +306
lg             :               982786               982931     +145
je             :               335154               335281     +127
cije           :               107363               107237     -126
lgfr           :                91442                91565     +123
lgr            :               849890               850006     +116
ltgr           :                10951                11067     +116
...
  1. Also EQ 1 (both i32 and i64), compared to (2):
lochilh        :                 9492                 9787     +295
lhi            :               219782               219567     -215
lochie         :                14183                13975     -208
chi            :                53350                53448      +98
chsi           :                56337                56259      -78
lr             :                62878                62950      +72
tmll           :                54181                54243      +62
lghi           :               441828               441770      -58
locghie        :                 7174                 7116      -58
risbgn         :               138334               138383      +49
...

In total, master <> (3):

lhi            :               225081               219567    -5514
lghi           :               445509               441770    -3739
locghilh       :                 3717                 2673    -1044
chsi           :                57297                56259    -1038
lochilh        :                 8796                 9787     +991
tmll           :                53266                54243     +977
risbgn         :               137540               138383     +843
lr             :                62152                62950     +798
lt             :                13672                14343     +671
jne            :                42430                43028     +598
cghsi          :                32644                32170     -474
lgr            :               849527               849994     +467
ltr            :                 6140                 6557     +417
...

I see some more LR:s, which I think is when the reused constant also has another user. I did not manage to avoid these cases when working with the DAGs (local only), so some kind of pseudo-expander might be more powerful here. That probably requires more work, and I am not sure if trading an LHI for an LR is bad, since the comparison does not clobber the register...

There are less comparisons with memory - the value is now loaded, compared and reused (see fun9 below).

Does this seem like a good idea to try?

New tests, master <> patched (skipped functions identical):

fun0:                                   fun0:
        chi     %r2, 0                          chi     %r2, 0
        lhi     %r2, 0                <
        lochilh %r2, 1                          lochilh %r2, 1
                                      >
        br      %r14                            br      %r14


fun4:                                   fun4:
        cghi    %r2, 0                          cghi    %r2, 0
        lhi     %r2, 0                <
        lochilh %r2, 1                          lochilh %r2, 1
                                      >
        br      %r14                            br      %r14


fun5:                                   fun5:
        cghsi   0(%r1), 0             |         ltg     %r0, 0(%r1)
        lghi    %r0, 0                <
        locghilh        %r0, 1                  locghilh        %r0, 1
        stg     %r0, 0(%r1)                     stg     %r0, 0(%r1)
        br      %r14                            br      %r14


fun6:                                   fun6:
        chi     %r2, 1                          chi     %r2, 1
        lhi     %r2, 0                |         lochilh %r2, 0
        lochie  %r2, 1                |
        br      %r14                            br      %r14


fun8:                                   fun8:
        cghi    %r2, 1                          cghi    %r2, 1
        lhi     %r2, 0                |         lochilh %r2, 0
        lochie  %r2, 1                |
        br      %r14                            br      %r14


fun9:                                   fun9:
        cghsi   0(%r1), 1             |         lg      %r0, 0(%r1)
        lghi    %r0, 0                |         cghi    %r0, 1
        locghie %r0, 1                |         locghilh        %r0, 0
        stg     %r0, 0(%r1)                     stg     %r0, 0(%r1)
        br      %r14                            br      %r14

Diff Detail

Event Timeline

jonpa created this revision.Mar 18 2021, 4:02 PM
jonpa requested review of this revision.Mar 18 2021, 4:02 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 18 2021, 4:02 PM
jonpa updated this revision to Diff 336398.Apr 9 2021, 4:59 AM

Patch improved:

  • Avoid transforming single-use loads - it is probably better to do a compare with memory.
  • Add an AssertZext node on the reused register so that it is known that this is an i1.
  • Avoid some cases where i32 setcc is zero-extended.
  • Option to try single user only

Benchmarking (also showing the sibling patch "ISelCleanup.cpp"):

master <> SETCC (DAG patch)

lhi            :               225040               220226    -4814
lghi           :               445603               443127    -2476
lr             :                61869                62835     +966
lgr            :               853946               854446     +500
...

master <> SETCC (DAG patch), no multiple users

lhi            :               225040               221382    -3658
lghi           :               445603               443220    -2383
lgr            :               853946               854483     +537
lr             :                61869                62170     +301
...

master <> ISelCleanup pass

lhi            :               225040               220233    -4807
lghi           :               445603               443111    -2492
lr             :                61869                62836     +967
lgr            :               853946               854446     +500
...

master <> ISelCleanup pass, no multiple users

lhi            :               225040               221740    -3300
lghi           :               445603               443562    -2041
lgr            :               853946               854384     +438
lr             :                61869                61961      +92
...

Static improvements above show reduced number of immediate loads and some extra register moves. Limiting to only the cases of one user of compare operand gives less new register moves, but still some... (not sure why).

This patch is now closer to the ISelCleanup pass in the output differences above.

Measurements (quick nightly run):
Not much changes - have seen some 1% improvements but also some regressions.

Xalancbmk:
Z14: regresses with a couple of percent except with "IselCleanup, no multiple users", which is the least aggressive change.
Z15: improves with one percent with multiple users, regresses with one percent with no multiple users. Random effect, it seems...

Deepsjeng, Z15: regresses with "DAG patch", but only with "no multiple users", which indicates some random effect since that is a lesser change than with "multiple users"...

The only version so far that has no regressions on both machines (and is if anything a very slight improvement) is the last one "IselCleanup, no multiple users".

Conclusions:

  • There seems to be no clear direct benchmark improvement, although there are a few thousand less immediate loads. This is probably because the immediate loads should generally be invisible on the OOO machine.
  • The patches are non-trivial so it seems perhaps difficult to motivate them given these measurements.

Remaining ideas:

  • Perhaps the ISelCleanup code could be used as a basis to use with the PeepholeOptimizer. I will try that next.
  • Maybe look into more in detail where the remaining extra register moves come from.
jonpa abandoned this revision.Apr 13 2021, 2:51 AM

Seems better to do this with PeeholeOptimizer instead (https://reviews.llvm.org/D100242).