This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Enable AtomicExpandPass for i128
Needs Review · Public

Authored by jonpa on Mar 20 2023, 7:54 AM.

Details

Reviewers
uweigand
Summary
  • Set MaxAtomicPromoteWidth and MaxAtomicInlineWidth to 128 in the front end. This results in native atomic LLVM IR in the aligned cases and libcalls otherwise (see the sketch after this list).
  • Some FE tests for smaller integer types as well.
  • Enable AtomicExpandPass.
  • Set MaxAtomicSizeInBitsSupported to 128.
  • Fix RegCoalescer for the (CDSG) loops this produces; the SystemZ backend already has an i128 hack for this in shouldCoalesce().
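
For reference, a rough sketch of what the backend bullets amount to (my illustration, not code from the diff; setMaxAtomicSizeInBitsSupported() and createAtomicExpandPass() are the existing LLVM hooks):

  // In the SystemZTargetLowering constructor: atomics up to 128 bits are
  // handled by the backend; anything wider would be turned into __atomic_*
  // libcalls by AtomicExpandPass.
  setMaxAtomicSizeInBitsSupported(128);

  // In SystemZTargetMachine::addIRPasses(): run AtomicExpandPass before
  // instruction selection so that unsupported or underaligned atomic
  // operations are rewritten into legal forms or libcalls.
  addPass(createAtomicExpandPass());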

Diff Detail

Event Timeline

jonpa created this revision. Mar 20 2023, 7:54 AM
Herald added a project: Restricted Project. Mar 20 2023, 7:54 AM
Herald added a subscriber: hiraditya.
jonpa requested review of this revision. Mar 20 2023, 7:54 AM
jonpa updated this revision to Diff 506928. Mar 21 2023, 5:33 AM

New tests:

  • gnu-atomic_is_lock_free.c

I wanted to write a test here for different alignments as well as for a null pointer: I expected True for a 16-byte aligned pointer and for a null pointer, and False for underaligned pointers, but that is not what happens with either compiler.

Clang:
atomic_is_lock_free: emits a libcall when given a real pointer; True with a null pointer (same for __c11_atomic_is_lock_free).
atomic_always_lock_free: False with both real pointers; True with a null pointer. Shouldn't it return True for a 16-byte aligned pointer?

GCC:
Returns True in all cases, except for __c11_atomic_is_lock_free, which it does not seem to recognize (it is a clang-specific builtin).
Shouldn't it return False (a libcall) for an 8-byte aligned pointer?
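
For reference, the kind of queries meant here, as a minimal sketch (mine, not the actual test file; the expectations in the comments are the ones discussed above):

  #include <cstdio>

  alignas(16) static char Buf[32];

  int main() {
    void *A16 = Buf;     // 16-byte aligned
    void *A8 = Buf + 8;  // only 8-byte aligned
    printf("%d\n", __atomic_is_lock_free(16, A16));        // expected True
    printf("%d\n", __atomic_is_lock_free(16, A8));         // expected False/libcall
    printf("%d\n", __atomic_is_lock_free(16, (void *)0));  // null pointer: True
    printf("%d\n", (int)__atomic_always_lock_free(16, A16));
    printf("%d\n", (int)__atomic_always_lock_free(16, (void *)0));
    return 0;
  }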

  • atomic-alignment.c

Clang: aligns to 16 bytes.
GCC: aligns to 8 bytes.
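
One way to observe this (a sketch of mine, not the test file itself; the C `_Atomic __int128` case is the one where the compilers disagree):

  #include <atomic>
  #include <cstdio>

  int main() {
    // Both clang++ and g++ report 16 for the C++ type.
    printf("%zu\n", alignof(std::atomic<__int128>));
    // Plain __int128 only has 8-byte natural alignment on SystemZ, which is
    // presumably why GCC's C-mode _Atomic __int128 stays at 8 bytes.
    printf("%zu\n", alignof(__int128));
    return 0;
  }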

Per the above, I see these shortcomings in clang:

  • Not returning True for atomic_is_lock_free / atomic_always_lock_free with a 16-byte aligned pointer (not sure if this is strictly needed for correctness).
  • Clang aligns the atomic __int128 to 16 bytes, while GCC only uses 8 bytes.

At least the alignment difference in the latter test case needs to be fixed, right?

jonpa updated this revision to Diff 509957. Mar 31 2023, 3:30 AM
  • Reworked shouldCoalesce()

The code there for i128 was supposed to simply check for clobbered GR128 regs inside the small region of the two combined live ranges. This was limited to only allow coalescing in small regions (within a single MBB), as it seemed (and still seems) that coalescing these register pairs easily creates spilling.

As it seemed unacceptable not to coalesce away multiple COPYs inside a CDSG loop, a first attempt extended this to also allow coalescing specifically in those loops, but this has now been merged into simply scanning the involved LiveIntervals for phys-reg clobbers, roughly as in the sketch below. This still handles the register allocator problem (not running out of registers), and also enables a bit more coalescing, including in the CDSG loops.
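
Roughly the shape such a scan could take (a sketch only, not the code in the diff; the helper name and the SearchLim parameter are mine):

  // Walk the instructions covered by a live interval and report whether any
  // of them clobbers a physical register in RC (e.g.
  // SystemZ::GR128BitRegClass), either via an explicit def or via a call's
  // register-mask operand. SearchLim bounds the scan for compile time.
  static bool intervalHasPhysRegClobber(const LiveInterval &LI,
                                        const SlotIndexes &Indexes,
                                        const TargetRegisterClass &RC,
                                        unsigned SearchLim) {
    unsigned Count = 0;
    for (const LiveInterval::Segment &S : LI) {
      for (SlotIndex I = S.start.getBaseIndex(), E = S.end.getBaseIndex();
           I != E; I = I.getNextIndex()) {
        if (++Count > SearchLim)
          return true; // treat "too far to tell" as clobbered (conservative)
        const MachineInstr *MI = Indexes.getInstructionFromIndex(I);
        if (!MI)
          continue;
        for (const MachineOperand &MO : MI->operands()) {
          if (MO.isRegMask()) {
            // Calls clobber registers via a regmask operand.
            for (MCPhysReg Reg : RC)
              if (MachineOperand::clobbersPhysReg(MO.getRegMask(), Reg))
                return true;
          } else if (MO.isReg() && MO.isDef() && MO.getReg().isPhysical() &&
                     RC.contains(MO.getReg())) {
            return true; // explicit def of a phys reg in the class
          }
        }
      }
    }
    return false;
  }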

However, this "improved" coalescing now seems to create a bit more spilling again:

main <> patched

Spill|Reload   :               642820               643901    +1081
Copies         :              1010858              1010342     -516

In theory it would of course be very nice if the register allocator did not run out of registers (it still does without this), and also if it could split ("uncoalesce") live ranges as needed to minimize spilling. Unfortunately it does neither, so what remains is the option to try to avoid spilling in shouldCoalesce(). Should we keep it simple here, or experiment a little and try to at least not increase spilling compared to main? Not sure there is a natural heuristic that covers the CDSG loop without explicitly checking for that case...

  • Comparing to GCC:

Both clang++ and g++ align this to 16 bytes:

std::atomic<__int128> Atomic_int128;         // C++

However, while clang also aligns this to 16 bytes, GCC aligns it only to 8, which I suspect is a GCC error(?):

_Atomic __int128 Atomic_int128;              // C

@Andreas-Krebbel any comments on this discrepancy between C and C++ in GCC? Was this intentional?

uweigand added a comment.

> Should we keep it simple here, or experiment a little and try to at least not increase spilling compared to main? Not sure there is a natural heuristic that covers the CDSG loop without explicitly checking for that case...

While the raw spill counts give some indication, of course, it would be good to also look at actual performance impact of the change. A few inline comments about the heuristics ...

Also, shouldn't there be some generic heuristics of whether or not coalescing is worthwhile? It seems to me this would always be a tradeoff for any register class, not just for the 128-bit pairs?

llvm/lib/Target/SystemZ/SystemZRegisterInfo.cpp
Line 397:

Does it help tweaking this heuristic a bit? What if we use 4 or 2 instead of 3?

Line 398:

This introduces yet another weird heuristic. Is this even necessary at all? What are the compile-time impacts of just not doing this check?

jonpa added a comment. Apr 3 2023, 5:27 AM

> While the raw spill counts give some indication, of course, it would be good to also look at actual performance impact of the change. A few inline comments about the heuristics ...

> Also, shouldn't there be some generic heuristics of whether or not coalescing is worthwhile? It seems to me this would always be a tradeoff for any register class, not just for the 128-bit pairs?

> Does it help tweaking this heuristic a bit? What if we use 4 or 2 instead of 3?
> This introduces yet another weird heuristic. Is this even necessary at all? What are the compile-time impacts of just not doing this check?

I looked into the compile time with various settings, and found that it does not really seem to make much difference what we do in shouldCoalesce():

                                        Average Wall:    3 worst:
main                                    1.75%            38.3%, 20.4%, 19.3%
main, no SystemZ shouldCoalesce()       1.75%            38.0%, 20.2%, 19.4%
Patch (3 / 50)                          1.76%            38.6%, 20.6%, 19.4%
Patch (3 / unlim)                       1.76%            38.2%, 20.3%, 19.5%
Patch (2 / 50)                          1.75%            38.8%, 20.4%, 19.3%
Patch (6 / unlim)                       1.76%            38.3%, 20.3%, 19.4%
Patch (7 / unlim)                       1.76%            38.1%, 20.4%, 19.5%

So it seems that compile time is not a concern whichever way we do this check.

The setting does, however, affect the number of COPYs coalesced, and thereby the spilling:

main <> "3 / 50"
Spill|Reload   :               635680               636698    +1018
Copies         :              1010870              1010271     -599

main <> "2 / 50":  Identical to "3 / 50"
main <> "4 / 50":  Identical to "3 / 50"
main <> "5 / 50":  Very little difference to "3 / 50": -41 Spill|Reload / +18 Copies.

main <> "3 / unlim"
Spill|Reload   :               635680               642379    +6699
Copies         :              1010870              1008390    -2480

main <> "6 / unlim"
Spill|Reload   :               635680               640658    +4978
Copies         :              1010870              1009061    -1809

main <> "7 / unlim"
Spill|Reload   :               635680               639023    +3343
Copies         :              1010870              1009719    -1151

It seems the unlimited search allows for more coalescing but also more spill/reload instructions. However, looking at the nightly benchmarking, I see:

Overall results (by average over benchmarks):
Build:                                                                    Average result
2017_D_6_unlim                                                            99.787 %
2017_E_7_unlim                                                            99.892 %
2017_B_3_50                                                               99.953 %
2017_C_3_unlim                                                            100.027 %

Improvements 2017_D_6_unlim:
0.976: i523.xalancbmk_r 
0.981: i502.gcc_r 
0.986: f526.blender_r 

Regressions 2017_D_6_unlim:
1.011: i505.mcf_r

The "3 / 50" is "ok, but there may be some improvement actually by doing something like just having a check against 1 or 2 GR128 clobbers (and no search limit), like "D". This is still preliminary though ("mini").

I did not find any generic mechanism along the lines of your suggestion, but looking at Hexagon I saw that they do something similar, which I tried:

// Adapted from Hexagon: reject coalescing if either live interval crosses a
// call. (The standard TargetRegisterInfo::shouldCoalesce() hook signature is
// added here for context; the body is what was tried.)
bool SystemZRegisterInfo::shouldCoalesce(MachineInstr *MI,
                                         const TargetRegisterClass *SrcRC,
                                         unsigned SubReg,
                                         const TargetRegisterClass *DstRC,
                                         unsigned DstSubReg,
                                         const TargetRegisterClass *NewRC,
                                         LiveIntervals &LIS) const {
  const SlotIndexes &Indexes = *LIS.getSlotIndexes();
  // Return true if any instruction inside the segment is a call.
  auto HasCall = [&Indexes] (const LiveInterval::Segment &S) {
    for (SlotIndex I = S.start.getBaseIndex(), E = S.end.getBaseIndex();
         I != E; I = I.getNextIndex()) {
      if (const MachineInstr *MI = Indexes.getInstructionFromIndex(I))
        if (MI->isCall())
          return true;
    }
    return false;
  };

  // MI is the COPY being coalesced: operand 0 is the destination vreg,
  // operand 1 the source vreg.
  LiveInterval &DstInterval = LIS.getInterval(MI->getOperand(0).getReg());
  LiveInterval &SrcInterval = LIS.getInterval(MI->getOperand(1).getReg());
  return !any_of(DstInterval, HasCall) && !any_of(SrcInterval, HasCall);
}

First, just for the i128 case, this gave:

Spill|Reload   :               635680               639026    +3346
Copies         :              1010870              1009736    -1134

This seems then pretty close to "7 / unlim".

Unfortunately, this doesn't help the register allocator enough on its own: it runs out of registers on an inline-assembly test (CodeGen/SystemZ/regalloc-GR128.ll).

I also tried this on all register classes, thinking that perhaps more COPYs would remain but the COPY hints in the register allocator would help enough. The answer is no:

Spill|Reload   :               635680               640701    +5021
Copies         :              1010870              1188763  +177893

Finally, I tried the current main heuristic generally, i.e. only coalescing intervals within the same MBB (which gives less spilling for GR128 than "3 / 50"):

Spill|Reload   :               635680               638633    +2953
Copies         :              1010870              1804531  +793661

As expected this is not a good idea. So the general aspects of this beyond GR128 remain unclear...

jonpa added a comment. Apr 4 2023, 2:54 AM

With full benchmark runs on a few tests I see:

Improvements "3/50":
0.994: i523.xalancbmk_r 

Regressions "3/50":
1.005: f526.blender_r 
1.004: i502.gcc_r 
1.003: i505.mcf_r 
1.002: f507.cactuBSSN_r

Improvements "no preg clobbers / no search lim":
1.000: i502.gcc_r 

Regressions "no preg clobbers / no search lim":
1.006: f526.blender_r 
1.005: f507.cactuBSSN_r 
1.005: i505.mcf_r 
1.004: i523.xalancbmk_r

These are small variations. Looking at the spill/COPY counts, 'blender' is a bit of a surprise, but generally increased spilling is of course not good:

main <> "no preg clobbers / no search lim"

f526.blender_r
Spill|Reload   :                90817                90761      -56
Copies         :               208906               208506     -400

f507.cactuBSSN_r
Spill|Reload   :               148023               148036      +13
Copies         :                59083                59083       +0

i505.mcf_r
Spill|Reload   :                  702                  706       +4
Copies         :                  562                  565       +3

i523.xalancbmk_r
Spill|Reload   :                14979                15029      +50
Copies         :               134345               134185     -160

So it seems that we may get a few very slight regressions by rewriting shouldCoalesce(). Being careful with a search limit (at 50) helps somewhat, but not much. We could:

  • keep it simple and accept the slight regressions above.
  • keep the original version, with special-case handling added for the CDSG loops (which have intervals spanning multiple blocks).
  • constrain this new version further to get results closer to the unmodified version.

It depends a little on what behavior we think would be ideal. Normally coalescing is "good", but coalescing a GR128 live range with a subreg makes the whole live range 128 bits, which costs a GR64 register. On the other hand, that GR64 subreg usage is typically just a short extension of the live interval (at least per my old notes from 2017), so it may still be better to avoid the register move.

To me, it seems we probably want to coalesce away the COPY when it is just a matter of a use a few instructions later, which is the common case. We also want to coalesce the four COPYs inside the CDSG loop which result from the PHI lowering (see the sketch below). Prolonging the GR128 interval over any long distance is probably not worthwhile, as it costs a full GR64 register. That is why the search limit seems like a good idea to me, and not only to guard against a worst-case compile-time issue (of which there is none on SPEC).
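
To make the CDSG case concrete, here is an illustration (mine, not from the patch) of where those COPYs come from:

  // A 128-bit atomic RMW like this is expanded by AtomicExpandPass into a
  // load followed by a compare-and-swap retry loop (CDSG on SystemZ),
  // assuming P is sufficiently aligned:
  __int128 fetch_add(__int128 *P, __int128 V) {
    return __atomic_fetch_add(P, V, __ATOMIC_SEQ_CST);
  }
  // The retry loop carries the current 128-bit value in a PHI, and PHI
  // lowering turns that into COPYs of GR128 register pairs on the loop
  // edges; coalescing those away is what keeps the loop body down to the
  // CDSG plus the compare/branch.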