This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Implement the packed stack layout
ClosedPublic

Authored by jonpa on Nov 28 2019, 6:41 AM.

Details

Reviewers
uweigand
Summary

(In progress)

  • Store GPRs topmost in the Register Save Area, and then any FP arg regs below them. Make sure to update the offset for each register.
  • Do not always increase the stack size with CallFrameSize, but instead with the amount used of the Register save area.
  • Extend getFrameIndexReference() to increase all offsets into the local area so that in effect the Local area now begins below the last used byte of the register save area.

Command line option not yet implemented - I suppose this could be done the same way as with -mnop-mcount?

Diff Detail

Event Timeline

jonpa created this revision.Nov 28 2019, 6:41 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 28 2019, 6:41 AM

I don't think this approach will really do everything -mpacked-stack is supposed to do. Fundamentally, with the packed stack layout, only some N (< 160) bytes of the incoming register save area are reserved for register save slots, instead of all 160 bytes. Everything below "(incoming SP) + 160 - N" is supposed to be freely usable for any use, including FPR/VR save slots and outgoing function args / register save area, but also just local stack variables and reload spill slots in general.

I don't see where in this patch you actually make this area available for this type of general allocation. I believe the fundamental problem is that you still set the TargetFrameLowering's "LocalAreaOffset" to -160, which still makes all those bytes reserved. You apparently try to still force in some of the special areas (like FPR spill slots or the outgoing register save area), but this doesn't address the general case. (Conversely, if the general case were addressed, you actually wouldn't have to implement anything special for those special cases.

So ideally, you'd want to set LocalAreaOffset to -N (where N depends on the number of GPRs to be saved) instead of -160 for the packed stack layout. Unfortunately, I believe the LocalAreaOffset mechanism doesn't support a variable value for this field. One option would be to find (or create) some way to instantiate a per-function FrameLowering. Another way might be to just always set the LocalAreaOffset to 0 and then force common code to still leave the register save area fully (non-packed layout) or partially (packed layout) empty by placing fixed frame objects there (cf. the logic in PEI::calculateFrameObjectOffsets).

jonpa updated this revision to Diff 231502.EditedNov 29 2019, 2:34 AM

I don't think this approach will really do everything -mpacked-stack is supposed to do. Fundamentally, with the packed stack layout, only some N (< 160) bytes of the incoming register save area are reserved for register save slots, instead of all 160 bytes. Everything below "(incoming SP) + 160 - N" is supposed to be freely usable for any use, including FPR/VR save slots and outgoing function args / register save area, but also just local stack variables and reload spill slots in general.

I think it does achieve that... The local area offset is not changed, but my intention is to instead move up all the slots in the local area with the amount that was packed:

  • First, the GPRs are stored topmost and then any (anyregcc) FP arg regs below them in the reg save area. The amount used of the reg save area is tracked during this with ZFI->incRegSaveAreaUsage(). So any registers which already belong in the reg save area still go there, but with new offets. The amount used is then not 160, but ZFI->getRegSaveAreaUsage().
  • The stack is then grown by this amount instead of CallFrameSize. This means that we should have just enough space to pass over to callee.
  • When Frame indexes are replaced, getFrameIndexReference() still returns the right offsets for all the fixed objects (incoming stack args and regs in the reg save area), since the stack size is part of that computation. The outgoing stack arguments do not have any frame objects, they are hard coded to begin at SP+160 at the call site, so the remaining objects are those that are non-fixed and all now ordinarily begin at CFA-160 (Local Area Offset). So by simply moving up any of these non-fixed objects by the amount that was packed/saved in the newly derived SystemZFrameLowering::getFrameIndexReference(), we end up with the 160 bytes bottom most for callee...

This for me seems like the most natural way to extend the current machinery while changing as little as possible...

(Conversely, if the general case were addressed, you actually wouldn't have to implement anything special for those special cases.

It would have been simpler to just spill everything without any fixed offets, but since we want to do STMG I wasn't really sure that was a good idea...

So ideally, you'd want to set LocalAreaOffset to -N (where N depends on the number of GPRs to be saved) instead of -160 for the packed stack layout. Unfortunately, I believe the LocalAreaOffset mechanism doesn't support a variable value for this field. One option would be to find (or create) some way to instantiate a per-function FrameLowering. Another way might be to just always set the LocalAreaOffset to 0 and then force common code to still leave the register save area fully (non-packed layout) or partially (packed layout) empty by placing fixed frame objects there (cf. the logic in PEI::calculateFrameObjectOffsets).

Perhaps we could just set a new smaller LAO value once we know it in emitPrologue(), instead of adding the amount for each offset in SystemZFrameLowering::getFrameIndexReference()?

Getting rid of the LAO and creating fixed objects as needed also sounds like a nice idea. That way those offsets wouldn't have to be changed again as per this current patch. But then I think we would have to first rework all the places that assume the LAO as per the patch I first suggested...

jonpa added a comment.Nov 29 2019, 2:43 AM

Forgot: Patch was updated with a bugfix so that the RegSaveAreaUsage is set in the constructor already, so that this is always done.

Everything seems NFC now with

   bool usePackedStack(MachineFunction &MF) {
+    return false;
     bool HasPackedStackAttr = true;

I don't think this approach will really do everything -mpacked-stack is supposed to do. Fundamentally, with the packed stack layout, only some N (< 160) bytes of the incoming register save area are reserved for register save slots, instead of all 160 bytes. Everything below "(incoming SP) + 160 - N" is supposed to be freely usable for any use, including FPR/VR save slots and outgoing function args / register save area, but also just local stack variables and reload spill slots in general.

I think it does achieve that... The local area offset is not changed, but my intention is to instead move up all the slots in the local area with the amount that was packed:

Hmm, I missed that. That would probably indeed work, but it does appear confusing to have the frame object offsets not really match what's being emitted in the end ...

First, the GPRs are stored topmost and then any (anyregcc) FP arg regs below them in the reg save area. The amount used of the reg save area is tracked during this with ZFI->incRegSaveAreaUsage(). So any registers which already belong in the reg save area still go there, but with new offets. The amount used is then not 160, but ZFI->getRegSaveAreaUsage().

I still don't understand why the FP special case is necessary. Currently, FP regs are just spilled into freely-allocatable frame objects. Given that you move those frame object (the part I originally missed), shouldn't that simply move the FP regs just the same? (The only difference is the anyregcc FP regs, which should not go into the fixed slot; instead, they can just be spilled into nonfixed frame objects like the others.)

This for me seems like the most natural way to extend the current machinery while changing as little as possible...

Changing as little as possible isn't really a primary goal :-) The primary goal is make the final result as easily understandable as possible ...

It would have been simpler to just spill everything without any fixed offets, but since we want to do STMG I wasn't really sure that was a good idea...

Well, not the GPRs -- they should stay at fixed offsets, of course. But the FPRs don't need fixed offsets.

So ideally, you'd want to set LocalAreaOffset to -N (where N depends on the number of GPRs to be saved) instead of -160 for the packed stack layout. Unfortunately, I believe the LocalAreaOffset mechanism doesn't support a variable value for this field. One option would be to find (or create) some way to instantiate a per-function FrameLowering. Another way might be to just always set the LocalAreaOffset to 0 and then force common code to still leave the register save area fully (non-packed layout) or partially (packed layout) empty by placing fixed frame objects there (cf. the logic in PEI::calculateFrameObjectOffsets).

Perhaps we could just set a new smaller LAO value once we know it in emitPrologue(), instead of adding the amount for each offset in SystemZFrameLowering::getFrameIndexReference()?

I don't think there is currently any way to ever change the LAO value.

Getting rid of the LAO and creating fixed objects as needed also sounds like a nice idea. That way those offsets wouldn't have to be changed again as per this current patch. But then I think we would have to first rework all the places that assume the LAO as per the patch I first suggested...

I don't see why anything here would need to be changed. Frame index offsets would remain relative to the CFA with this approach, we'd just be using a different method to keep the first 160 byte unoccupied. (You first patch changed frame index offsets to be relative to incoming SP, which required all those changes.)

jonpa added a comment.Nov 29 2019, 3:54 AM

I don't think this approach will really do everything -mpacked-stack is supposed to do. Fundamentally, with the packed stack layout, only some N (< 160) bytes of the incoming register save area are reserved for register save slots, instead of all 160 bytes. Everything below "(incoming SP) + 160 - N" is supposed to be freely usable for any use, including FPR/VR save slots and outgoing function args / register save area, but also just local stack variables and reload spill slots in general.

I think it does achieve that... The local area offset is not changed, but my intention is to instead move up all the slots in the local area with the amount that was packed:

Hmm, I missed that. That would probably indeed work, but it does appear confusing to have the frame object offsets not really match what's being emitted in the end ...

Yes, I agree...

First, the GPRs are stored topmost and then any (anyregcc) FP arg regs below them in the reg save area. The amount used of the reg save area is tracked during this with ZFI->incRegSaveAreaUsage(). So any registers which already belong in the reg save area still go there, but with new offets. The amount used is then not 160, but ZFI->getRegSaveAreaUsage().

I still don't understand why the FP special case is necessary. Currently, FP regs are just spilled into freely-allocatable frame objects. Given that you move those frame object (the part I originally missed), shouldn't that simply move the FP regs just the same? (The only difference is the anyregcc FP regs, which should not go into the fixed slot; instead, they can just be spilled into nonfixed frame objects like the others.)

As long as we return our table of fixed offsets from getCalleeSavedSpillSlots() a register included there will get that offset in assignCalleeSavedSpillSlots(), instead of a created on-fixed object. I thought about changing SystemZFrameLowering::getCalleeSavedSpillSlots(), for instance by returning a NumEntries that could just be 4 less since the FP arg regs are last... But that method does currently not have MF as an argument, and I thought maybe it wouldn't be so bad to just put them below the GPRs... This is called before spillCalleeSavedRegisters() which is where we compute the range of GPRs to save...

This for me seems like the most natural way to extend the current machinery while changing as little as possible...

Changing as little as possible isn't really a primary goal :-) The primary goal is make the final result as easily understandable as possible ...

good point :-)

So ideally, you'd want to set LocalAreaOffset to -N (where N depends on the number of GPRs to be saved) instead of -160 for the packed stack layout. Unfortunately, I believe the LocalAreaOffset mechanism doesn't support a variable value for this field. One option would be to find (or create) some way to instantiate a per-function FrameLowering. Another way might be to just always set the LocalAreaOffset to 0 and then force common code to still leave the register save area fully (non-packed layout) or partially (packed layout) empty by placing fixed frame objects there (cf. the logic in PEI::calculateFrameObjectOffsets).

Perhaps we could just set a new smaller LAO value once we know it in emitPrologue(), instead of adding the amount for each offset in SystemZFrameLowering::getFrameIndexReference()?

I don't think there is currently any way to ever change the LAO value.

Getting rid of the LAO and creating fixed objects as needed also sounds like a nice idea. That way those offsets wouldn't have to be changed again as per this current patch. But then I think we would have to first rework all the places that assume the LAO as per the patch I first suggested...

I don't see why anything here would need to be changed. Frame index offsets would remain relative to the CFA with this approach, we'd just be using a different method to keep the first 160 byte unoccupied. (You first patch changed frame index offsets to be relative to incoming SP, which required all those changes.)

We could add a setOffsetOfLocalArea(), to set it to a new smaller value if we also could make sure to always set it correctly for each function. But I think your idea of just adding the fixed objects as needed sounds better, so maybe I should give that a try..?

As long as we return our table of fixed offsets from getCalleeSavedSpillSlots() a register included there will get that offset in assignCalleeSavedSpillSlots(), instead of a created on-fixed object. I thought about changing SystemZFrameLowering::getCalleeSavedSpillSlots(), for instance by returning a NumEntries that could just be 4 less since the FP arg regs are last... But that method does currently not have MF as an argument, and I thought maybe it wouldn't be so bad to just put them below the GPRs... This is called before spillCalleeSavedRegisters() which is where we compute the range of GPRs to save...

I see. Maybe it would be better then to just not use the static "getCalleeSavedSpillSlots" method at all, and instead use the dynamic "assignCalleeSavedSpillSlots" method instead (like most other targets) ...

We could add a setOffsetOfLocalArea(), to set it to a new smaller value if we also could make sure to always set it correctly for each function. But I think your idea of just adding the fixed objects as needed sounds better, so maybe I should give that a try..?

Agreed.

jonpa updated this revision to Diff 231832.Dec 3 2019, 12:16 AM

This version of the patch is the experiment of setting the Local Area Offset to 0, and then creating fixed stack objects to represent the incoming register save area. This means that we are pretending that incoming SP and CFA are the same, and debug dumps show [SP - ...] where SP really is CFA. In order to make this work, the offset between the incoming SP and the CFA now has to be added in SystemZFrameLowering::getFrameIndexReference() instead. I am not sure if this is simpler or not compared to the previous version...

One question I have is about the decision as to when to create of the outgoing Reg Save Area. It seems that on trunk, this is to be done if any stack space is needed outside the incoming register save area. I am a little unsure about the reasoning here, as the only reason I can think of to allocate it in this case is in the case of an exception in which case that routine would need the incoming reg save area. But if we are already using our incoming reg save area, then those registers would be overwritten... I so far just mimic this current behavior and allocate the outgoing reg save area only when CSRs stack space usage includes registers that would not go into the incoming reg save area.

  • In SystemZFrameLowering::determineCalleeSaves() the incoming use of the full reg save area in is always added for the default stack layout.
  • Implement SystemZFrameLowering::assignCalleeSavedSpillSlots() (instead of getCalleeSavedSpillSlots()):
    • Create fixed stack objects for those registers that do not go into the Register Save Area, in the same way as currently is the result of the common code (CSR saves topmost).
    • Remember the RegSaveAreaUsage() with packed stack. This is used later when deciding if to allocate a new Reg save area.
  • processFunctionBeforeFrameFinalized() should now be correct but it is not giving the exact same results on SPEC. In one case it looked like trunk was adding too much, while being conservatively correct:
# Machine code for function scan_one_insn: NoPHIs, TracksLiveness, NoVRegs
Frame Objects:
  fi#-3: size=8, align=8, fixed, at location [SP+120]
  fi#-2: size=8, align=8, fixed, at location [SP+112]
  fi#-1: size=8, align=8, fixed, at location [SP+104]
  fi#0: size=240, align=8, at location [SP+160]
  fi#1: size=120, align=4, at location [SP+160]
  fi#2: size=240, align=8, at location [SP+160]
  fi#3: size=3120, align=4, at location [SP+160]

(gdb) p StackSize
$519 = 3936
(gdb) p MaxArgOffset
$520 = 160
(gdb) p MaxReach
$521 = 4096

StackSize + CallFrameSize = 3776 + 160 = 3936
MFFrame.estimateStackSize(MF) adds 56 to the stack size based on the -56 offset for %R13. This shouldn't be done, since CallFrameSize is added later...

The patch seems to be NFC except for a few cases were spill emergency slots are no longer allocated, per above.

processFunctionBeforeFrameFinalized() should now be correct but it is not giving the exact same results on SPEC.

I believe the original processFunctionBeforeFrameFinalized computation was correct:

// Get the size of our stack frame to be allocated ...
uint64_t StackSize = (MFFrame.estimateStackSize(MF) +
                      SystemZMC::CallFrameSize);

At this point, estimateStackSize is supposed to return the size of the objects allocated in the local area, excluding both incoming and outgoing register save areas. We add back the size of the outgoing register save area (which we are certain to need in all nontrivial cases).

MFFrame.estimateStackSize(MF) adds 56 to the stack size based on the -56 offset for %R13. This shouldn't be done, since CallFrameSize is added later...

This should not have happened in the old code because the 56 was within the "LocalAreaOffset" and therefore did not count. It is true that with the new logic this will have to change now ...

// ... and the maximum offset we may need to reach into the
 // caller's frame to access the save area or stack arguments.
 int64_t MaxArgOffset = SystemZMC::CallFrameSize;
 for (int I = MFFrame.getObjectIndexBegin(); I != 0; ++I)
   if (MFFrame.getObjectOffset(I) >= 0) {
     int64_t ArgOffset = SystemZMC::CallFrameSize +
                         MFFrame.getObjectOffset(I) +
                         MFFrame.getObjectSize(I);
     MaxArgOffset = std::max(MaxArgOffset, ArgOffset);
   }

Given the above, we know current SP + StackSize suffices to reach the bottom of the incoming register save area. MaxArgOffset is now supposed to cover everything we may need to reach beyond that. This includes everything within the incoming register save are (as we might need to restore from there), and all the incoming arguments.

Now, the difference with the new logic is that estimateStackSize will include the incoming register save area already (all of it in the default case, the parts that are used in the -mpacked-stack case). Therefore, we need to change the computation of MaxArgOffset, exactly as your patch already does. Everything else should be the same.

Therefore, I'm not sure why you're seeing any difference in results -- it should be exactly the same.

llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp
51

Why is the RegSpillOffsets change above needed?

62–74

Since this is now the only user of the SpillOffsetTable, we can simplify it again.

88

This seems strange. The packed-stack case should ignore all predefined offsets and just place the GPRs that need to be saved at the top of the area.

92

The RegSpillOffsets value is only ever valid for the non-packed case, so I think everything would be clearer if you only looked at it in that case.

150

There's already MFI below?

359

Moving the + SystemZMC::CallFrameSize down seems a no-op and doesn't really make anything clearer IMO ...

469

It would be nice if we could avoid the duplicate bookkeeping in RegSaveAreaUsage here.

What we really want to check is whether there is any non-dead, non-fixed object on the stack or not. Maybe something along those lines:

bool haveStackObject = false;
for (unsigned i = 0, e = MFI.getObjectIndexEnd(); i != e; ++i) {
  if (!MFI.isDeadObjectIndex(i)) {
    HaveStackObject = true;
    break;
  }
jonpa added a comment.Dec 6 2019, 9:31 AM

processFunctionBeforeFrameFinalized() should now be correct but it is not giving the exact same results on SPEC.

I believe the original processFunctionBeforeFrameFinalized computation was correct:

// Get the size of our stack frame to be allocated ...
uint64_t StackSize = (MFFrame.estimateStackSize(MF) +
                      SystemZMC::CallFrameSize);

At this point, estimateStackSize is supposed to return the size of the objects allocated in the local area, excluding both incoming and outgoing register save areas. We add back the size of the outgoing register save area (which we are certain to need in all nontrivial cases).

MFFrame.estimateStackSize(MF) adds 56 to the stack size based on the -56 offset for %R13. This shouldn't be done, since CallFrameSize is added later...

This should not have happened in the old code because the 56 was within the "LocalAreaOffset" and therefore did not count. It is true that with the new logic this will have to change now ...

// ... and the maximum offset we may need to reach into the
 // caller's frame to access the save area or stack arguments.
 int64_t MaxArgOffset = SystemZMC::CallFrameSize;
 for (int I = MFFrame.getObjectIndexBegin(); I != 0; ++I)
   if (MFFrame.getObjectOffset(I) >= 0) {
     int64_t ArgOffset = SystemZMC::CallFrameSize +
                         MFFrame.getObjectOffset(I) +
                         MFFrame.getObjectSize(I);
     MaxArgOffset = std::max(MaxArgOffset, ArgOffset);
   }

Given the above, we know current SP + StackSize suffices to reach the bottom of the incoming register save area. MaxArgOffset is now supposed to cover everything we may need to reach beyond that. This includes everything within the incoming register save are (as we might need to restore from there), and all the incoming arguments.

Now, the difference with the new logic is that estimateStackSize will include the incoming register save area already (all of it in the default case, the parts that are used in the -mpacked-stack case). Therefore, we need to change the computation of MaxArgOffset, exactly as your patch already does. Everything else should be the same.

Therefore, I'm not sure why you're seeing any difference in results -- it should be exactly the same.


llc -mtriple=s390x-linux-gnu ./tc_estimatestacksize.ll -o -

This is the reduced test case I was looking at. It seems that PEI::calculateFrameObjectOffsets() is taking the Local Area Offset into account, while in fact the very "similar" MFFrame.estimateStackSize() in fact does not (but probably should).

On entry to SystemZFrameLowering::processFunctionBeforeFrameFinalized(), the MF looks like

(gdb) p MF.dump()
# Machine code for function scan_one_insn: NoPHIs, TracksLiveness, NoVRegs
Frame Objects:
  fi#-3: size=8, align=8, fixed, at location [SP+120]
  fi#-2: size=8, align=8, fixed, at location [SP+112]
  fi#-1: size=8, align=8, fixed, at location [SP+104]
  fi#0: size=240, align=8, at location [SP+160]
  fi#1: size=120, align=4, at location [SP+160]
  fi#2: size=240, align=8, at location [SP+160]
  fi#3: size=3120, align=4, at location [SP+160]

The allocas occupy 3720 bytes, and there are fixed frame objects for %R13-%R15. MachineFrameInfo::estimateStackSize() starts with Offset = 0, then checks the greatest fixed offset, and then adds the remaining object sizes, so instead of 3720 (like you expected), it returns 3776...

It seems that PEI::calculateFrameObjectOffsets() is taking the Local Area Offset into account, while in fact the very "similar" MFFrame.estimateStackSize() in fact does not (but probably should).

Ah, I see. Yes, the current MFFrame.estimateStackSize looks incorrect -- it does not take account the Local Area Offset, but it still accounts for whatever fixed objects happen to be placed inside that offset; that's inconsistent. Either it should always fully include the Local Area Offset in the stack size, or else it should always exclude it, but then it must also exclude objects that happen to be placed there.

It seems this problem wasn't noticed so far since MFFrame.estimateStackSize is never used by common code, it is simply a helper routine that can be used by targets. And all targets that use it (except SystemZ!) just happen to use a zero Local Area Offset, so the problem doesn't occur. And even we didn't use the helper until recently; I didn't notice that problem when I added it.

jonpa updated this revision to Diff 232712.Dec 7 2019, 10:37 AM
jonpa marked 9 inline comments as done.

Patch updated per review.

Instead of caring religiously about which saved CSRs actually belonged in the reg save area and based on that deciding if the outgoing reg save are should be allocated, pach has been simplified per review comments:

  • Save all the CSRs in fixed stack slots
    • packed stack: GPRs on top, then the rest in any order.
    • default: Per the offsets in table, and the rest below the incoming reg save area.
  • Allocate the outgoing reg save area in the precence of any non-dead, non-fixed stack object per your suggestion ("HasStackObject"). This changes a few test cases, since e.g. F8 is now saved on a fixed stack slot instead of a non-fixed one as was done previously. The reason all CSRs now have fixed stack objects relates to MinCSFrameIndex / MaxCSFrameIndex as explained previously. This also makes a dozen or so files on SPEC not get the outgoing reg save area allocated anymore.
  • Don't use the offset table with packed stack, instead rely on register numbers. Assert added to check that this is really ok.
  • A new getOrCreateFramePointerSaveIndex() was created to share with SystemZISelLowering.
  • I am not sure usePackedStack() is complete/correct, but it's a good start...
  • More tests are probably needed.
  • Front End patch yet todo.
jonpa added inline comments.Dec 7 2019, 10:38 AM
llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp
51

I'm not sure this is needed, but since I am using this table also with CSRs that are not in it, I wanted to make sure that zero is returned in those cases.

IIUC, nullVal_ will by this be initialized to 0, which will then be the initial value of all the elements. Perhaps this is redundant with a default constructor for unsigned?

92

OK - I suppose that if a table isn't needed, then the register numbers themselves can be used to determine the LowGPR and the offsets...?

This changes a few test cases, since e.g. F8 is now saved on a fixed stack slot instead of a non-fixed one as was done previously.

I'm not quite sure why this should be the case: F8 should in fact still be saved in the same location (even though LLVM now considers it a fixed slot, it'll still end up at the same location, right?).

Don't use the offset table with packed stack, instead rely on register numbers. Assert added to check that this is really ok.

Instead of relying on LLVM register numbers, it would be better to use SystemZMC::getFirstReg to map the GPR number to an integer in the range 0..15, and then just use that. Then you don't need the assert either.

llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp
478

The hasVarSizedObjects check is probably unnecessary now, since if there are any variable-sized stack objects, HasStackObject will now already be true.

llvm/lib/Target/SystemZ/SystemZMachineFunctionInfo.h
30

This seems another bit of double bookkeeping. Can't users simply look up the offset of the frame object associated with LowSavedGPR to determine this?

jonpa updated this revision to Diff 232720.Dec 7 2019, 1:47 PM
jonpa marked an inline comment as done.

Patch updated.

jonpa marked an inline comment as done.Dec 7 2019, 1:54 PM

This changes a few test cases, since e.g. F8 is now saved on a fixed stack slot instead of a non-fixed one as was done previously.

I'm not quite sure why this should be the case: F8 should in fact still be saved in the same location (even though LLVM now considers it a fixed slot, it'll still end up at the same location, right?).

That's because now that they are fixed they don't get considered for HasStackObject, and even though the offset for F8 is the same from CFA, since the outgoing reg save area is in these cases not allocated, the final SP offset will be 160 less.

Don't use the offset table with packed stack, instead rely on register numbers. Assert added to check that this is really ok.

Instead of relying on LLVM register numbers, it would be better to use SystemZMC::getFirstReg to map the GPR number to an integer in the range 0..15, and then just use that. Then you don't need the assert either.

Ah, seems like a good idea...

llvm/lib/Target/SystemZ/SystemZMachineFunctionInfo.h
30

I am not sure there is an easy way to do this... It seems to me then we would have to loop over the frame objects and find the reg. And since it is needed in two different functions that should be a function, so I figured it might be simpler to just save the value like this. Or would it be better to write the loop in both functions?

This now looks all functionally correct to me, just a few more cosmetic issues ...

llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp
51

Well, it can't hurt, so this is fine with me.

141

This ignores alignment. That's in fact OK here since for all register save slots we only require 8-byte alignment and all register sizes are a multiple of 8, so this is always OK. An assert (or at least a comment) would still be good.

236

I still don't quite like how like logic is now split between two routines. At the very least, the comment needs to be updated (there no longer is any "loop above"). Even better would be to find a way to move the vararg handling to assignCalleeSavedSpillSlots as well ...

llvm/lib/Target/SystemZ/SystemZMachineFunctionInfo.h
15

Having to add all these includes here seems to indicate a bit of a layering violation. This is about the getOrCreateFramePointerSaveIndex routine below? I think that should probably be a SystemZFrameLowering function instead ...

jonpa updated this revision to Diff 233149.Dec 10 2019, 10:54 AM
jonpa marked 6 inline comments as done.

Updated per review. See inline comments...

llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp
141

Added assert.

236

I just updated the comment for now.
It kind of makes a little sense to me to have the modification of LowGPR and StartSPOffset as it is in spillCalleeSavedRegisters() since this is only a local modification while the unmodified values are used later by restoreCalleeSavedRegisters().
If that was moved into assignCalleeSavedSpillSlots, it seems we would either have to extend SystemZMachineFunctionInfo with e.g. getLowSavedGPR_Varargs() etc, or perhaps do some handling in restoreCalleeSavedRegisters instead.

llvm/lib/Target/SystemZ/SystemZMachineFunctionInfo.h
15

Moved the two new methods to SystemZFrameLowering (which both needed the includes).

uweigand added inline comments.Dec 10 2019, 12:10 PM
llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp
48

Why does this even need to be a SystemZFrameLowering member function? Could be just static in this file ...

236

Yes, we'd use something like LowSpillGPR/HighSpillGPR/SpillGPROffset and LowRestoreGPR/HighRestoreGPR/RestoreGPROffset (could also be grouped instead of having separate accessors) to indicate the different register ranges & offset for the STMG vs. LMG.

Basically, have assignCalledSavedSpillSlots fully prepare the arguments of STMG and LMG, and then have spillCalleeSavedRegister/restoreCalleeSavedRegister just emit those instructions.

The reason I'm concerned is that the logic below really makes assumptions about how the LowGPR and the offset are associated. In fact, it would not work with the packed layout (which is why this is disabled above for now ...). It just seems cleaner to have this in one place.

llvm/lib/Target/SystemZ/SystemZFrameLowering.h
12

What is this needed for?

jonpa updated this revision to Diff 233201.Dec 10 2019, 3:12 PM
jonpa marked 6 inline comments as done.

Updated per review.

llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp
48

right.

236

Followed your suggestions including grouping into a struct which seemed neater.

llvm/lib/Target/SystemZ/SystemZFrameLowering.h
12

removed

I like this much better, thanks for making that change. Just two more questions:

  • Why are you calling getOrCreateFramePointerSaveIndex in determineCalleeSaves? Wouldn't it be preferable to move that to assignCalleeSavedSpillSlots? Then all the code depending on usePackedStack would be in one place -- and maybe this wouldn't even to be a seperate subroutine at all, but just a local decision in assignCalleeSavedSpillSlots.
  • Maybe it would be better to move the IsVarArg code even within assignCalleeSavedSpillSlots to inside the usePackedStack block (after all, in the !usePackedStack block that code would be incorrect -- and yes, currently we never use packed stack for varargs, but that's just a limitation of the current code; in principle, it would of course be possible to do so, as GCC does as well). (But that's just a minor issue at this point and could also be addressed later.)
jonpa updated this revision to Diff 233374.Dec 11 2019, 8:18 AM

Varargs part of argument handling in assignCalleeSavedSpillSlots() moved into the !usePackedStack section.

Why are you calling getOrCreateFramePointerSaveIndex in determineCalleeSaves?

This has to always be done, and since the PrologEpilogInserter.cpp:assignCalleeSavedSpillSlots() has an early exit if SavedRegs is empty, TFI->assignCalleeSavedSpillSlots() will not always be reached. So to me that wouldn't work...
Maybe the constructor of SystemZMachineFunctionInfo would actually be the best place, although that would force the usePackedStack() to be moved back there as well...

Varargs part of argument handling in assignCalleeSavedSpillSlots() moved into the !usePackedStack section.

Thanks!

Why are you calling getOrCreateFramePointerSaveIndex in determineCalleeSaves?

This has to always be done, and since the PrologEpilogInserter.cpp:assignCalleeSavedSpillSlots() has an early exit if SavedRegs is empty, TFI->assignCalleeSavedSpillSlots() will not always be reached. So to me that wouldn't work...
Maybe the constructor of SystemZMachineFunctionInfo would actually be the best place, although that would force the usePackedStack() to be moved back there as well...

Ah, I see. I'm still a bit concerned about using determineCalleeSaves, as it is a bit unexpected for this routine to do anything else than filling in the SavedRegs array. For example, there are other callers of determineCalleeSaves like in ShrinkWrap.cpp, I'm not sure if this (or some future caller) might get confused.

Possibly the best choice then is to put it at the top of the processFunctionBeforeFrameFinalized routine? That routine seems to be the place for target-specific stuff.

jonpa updated this revision to Diff 233457.Dec 11 2019, 2:58 PM

getOrCreateFramePointerSaveIndex() call moved to processFunctionBeforeFrameFinalized() per suggestion.

Renamed 'O' arguments to 'Offs' in setSpillGPRRegs/setRestoreGPRRegs.

uweigand accepted this revision.Dec 12 2019, 3:22 AM

Ok, I think now this all looks good to me. Thanks!

This revision is now accepted and ready to land.Dec 12 2019, 3:22 AM
jonpa closed this revision.Dec 12 2019, 10:31 AM

Thanks for review!

61f5ba5