This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Implement orderFrameObjects().
ClosedPublic

Authored by jonpa on Dec 13 2021, 6:11 PM.

Details

Summary

By reordering the objects on the stack frame after looking at the users, a better utilization of displacement operands will result. This means less needed Load Address instructions for the loading/storing of those objects.

This is still a work in progress, but the new test case works where an LAY is eliminated.

(Inspired by X86FrameLowering::orderFrameObjects().)

Diff Detail

Event Timeline

jonpa created this revision.Dec 13 2021, 6:11 PM
jonpa requested review of this revision.Dec 13 2021, 6:11 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 13 2021, 6:11 PM
jonpa updated this revision to Diff 398776.EditedJan 10 2022, 4:49 PM

I have experimented further with the sorting and improved the patch in several ways.

Common code change: Pass (and run) MachineBlockFrequencyInfo from PrologEpilogInserter to orderFrameObjects(). This helps in prioritizing instructions inside loops.

The "first try", which just did one sort by U12Freq (and secondarily by DPairFreq), gave this improvement on SPEC:

lay            :                62128                55002    -7126
la             :               532263               533346    +1083
lghi           :               446000               445758     -242
ldy            :                 3416                 3234     -182
ld             :                98097                98279     +182
std            :                66248                66412     +164
stdy           :                 1227                 1063     -164
...
OPCDIFFS: -6333

To improve on this I worked on an iterative sorting approach like:

  1. Sort all objects as if they are all in range.
  2. Sort the ones within S20 offset by D12Freq, and add D12Freq objects below U12. (2b. Take extra care with an (U12) overlapping D12Freq object and see if it should remain.)
  3. Sort again within S20 for DPairFreq and add below U12 for shortening.
  4. Now sort above S20 without D12Freq, since no short displacements are in range.
  5. Finally sort the different ranges by alignment.

After all this looked nice I found to my surprise that it was only a slight improvement to the first simple version:

lay            :                62128                54715    -7413
la             :               532263               533630    +1367
...
OPCDIFFS: -6346

The reason became quite clear: There are no frame objects ever out of range for the long displacement, so there could not be any gain by sorting for this. 14334 (sorted) frames were within U12, and then all of the rest (411) where within S20.

I looked into the cases that actually had still improved, and it turned out that these were cases where a big D12Freq object that was not fully within U12 had been moved to let smaller objects closer to SP. In many cases then the offsets used into that big object were small (like 0) and therefore still in range, even though the object was further away from SP.

So I trimmed down my patch again to do the first initial sorting, and also if not all objects fit within U12, "pack" smaller objects below the last object partially within U12. While doing this the bigger object is still kept in range as its max offset is kept track of.

This version (as posted) now gives a slightly better result yet:

lay            :                62128                54005    -8123
la             :               532263               534200    +1937
lghi           :               446000               445756     -244
ldy            :                 3416                 3225     -191
ld             :                98097                98288     +191
stdy           :                 1227                 1056     -171
std            :                66248                66419     +171
...
OPCDIFFS: -6487
Stack size differences:
Byte diff       Count
-72             1
-56             1
-48             2
-40             16
-32             41
-24             96
-16             260
 -8             1141
  0             60788   (This includes the majority of cases that were actually not sorted. IIRC ~15000 had at least two objects and were sorted)
  8             2

This is the best result so far. The code changes seen above all then come from the bigger functions/stacks since if all objects fit within U12, only the stack space savings after sorting by alignment could be the benefit.

The sorting of objects into U12/S20 ranges depends on predicting those ranges accurately. The exact size of the final stack is not easy to predict as it depends on the final order of objects and their alignment requirements. The PrologEpilogInserter pass adds objects in a descending order (starting at highest address), and adds alignments to each object on the way. However, the sorting of the frame objects naturally begins starting from the lowest address, which in this patch is 'BottomMargin'. It includes the outgoing reg save area, the MaxCallFrameSize, and an extra 8 bytes of alignment margin. In addition, each object has its size rounded up to the next multiple of its alignment as a way to simplify. This leads to all 'EstimatedStackSize':es being slightly bigger than the actual final stack adjustments in emitPrologue():

Counts Number of bytes extra
   452 2
  2957 4
   240 6
 11057 8
     5 10
    34 12

(E.g. 11057 stacks were overestimated by 8 bytes).

This is not quite optimal, but the important thing is that it is never underestimated, since the reordering for alignments depends on all objects actually ending up being within the same range as they were in their sorted order. In other words, the most dense object may end up near the U12 range if it has a low alignment value, so it better not end up actually going out of range.

This is checked with assertions in emitPrologue() and eliminateFrameIndex(). The SystemZMachineFunctionInfo is used for these values since orderFrameObjects() is 'const'. These checks could be simplified - for example perhaps a simple thing would be to drop the const from orderFrameObjects() and assert the EstimatedStackSize within SystemZELFFrameLowering.

Typical sizes of the objects (that are actually sorted):

Num     Bytes
  72620 8
  14974 4
   8281 16
   5529 24
   4336 32
   1853 64
    ...

(E.g. 5529 of the sorted objects were 24 bytes big). There were many more sizes, up to 160k.

jonpa added inline comments.
llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp
190

The effect of the packing of smaller objects below the bigger one is relatively small. Commenting this out I see:

lay            :                54005                54970     +965
la             :               534200               533379     -821
l              :               178549               178356     -193
ly             :                 3574                 3767     +193
sty            :                 3528                 3663     +135
st             :               122536               122401     -135
ley            :                  369                  410      +41
lde            :                68636                68595      -41
...
OPCDIFFS: 139
212

These checks were useful during experiments, but could perhaps be reduced if the patch is considered more trustworthy...

253

This dumping looks nice to me, but is not really necessary...

386

Perhaps this is also no longer needed...?

470

Maybe it would be possible to first sort the objects and then decide if the slot(s) are needed. I guess however this would typically only be for big functions where that wouldn't matter much..?

jonpa added a comment.EditedJan 13 2022, 4:27 PM

Initial benchmarking shows no real big changes either way, nor any particular difference from using the extra "packing" trick (smaller objects below the last big one).

If we do not care about sorting by alignments (not sure how much that space saving is worth, if anything), we could simplify the patch considerably by just doing one sort for D12Freq and DPairFreq...

jonpa updated this revision to Diff 402099.Jan 21 2022, 1:48 PM

Simplified version which is nearly as good as the full patch, but much less work.

Agreed that this version looks much more reasonable. A few remaining comments inline.

llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp
165

Why two calls to stable_sort here? It seems this could be done in a single call with an appropriate comparison routine.

llvm/test/CodeGen/SystemZ/stackmap.ll
474 ↗(On Diff #402099)

This change seems unexpected - do you understand where this is coming from?

jonpa marked 2 inline comments as done.Jan 24 2022, 1:19 PM
jonpa added inline comments.
llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp
165

Ah, I guess that's true. For readability I thought it was better to have them separate as the first sort is not relating to anything SystemZ specific.

I guess now that the patch only does one additional sort, it makes sense to merge the two comparators and remove the IE iterator.

llvm/test/CodeGen/SystemZ/stackmap.ll
474 ↗(On Diff #402099)

hmm - sorry that must have gone undetected as I simplified the patch. The test actually passes with and without this test change. I think I probably needed to modify this when the alignments sorting was done previously. Removed.

jonpa updated this revision to Diff 402643.Jan 24 2022, 1:21 PM
jonpa marked 2 inline comments as done.

updated per review.

uweigand accepted this revision.Jan 27 2022, 11:52 AM

LGTM, thanks!

This revision is now accepted and ready to land.Jan 27 2022, 11:52 AM
This revision was landed with ongoing or failed builds.Jan 27 2022, 2:11 PM
This revision was automatically updated to reflect the committed changes.