Page MenuHomePhabricator

[SVE] Deal with SVE tuple call arguments correctly when running out of registers

Authored by david-arm on Oct 27 2020, 4:25 AM.



When passing SVE types as arguments to function calls we can run
out of hardware SVE registers. This is normally fine, since we
switch to an indirect mode where we pass a pointer to a SVE stack
object in a GPR. However, if we switch over part-way through
processing a SVE tuple then part of it will be in registers and
the other part will be on the stack.

I've fixed this by ensuring that:

  1. When we don't have enough registers to allocate the whole block we mark any remaining SVE registers temporarily as allocated.
  2. We temporarily remove the InConsecutiveRegs flags from the last tuple part argument and reinvoke the autogenerated calling convention handler. Doing this prevents the code from entering an infinite recursion and, in combination with 1), ensures we switch over to the Indirect mode.
  3. After allocating a GPR register for the pointer to the tuple we then deallocate any SVE registers we marked as allocated in 1). We also set the InConsecutiveRegs flags back how they were before.
  4. I've changed the AArch64ISelLowering LowerCALL and LowerFormalArguments functions to detect the start of a tuple, which involves allocating a single stack object and doing the correct numbers of legal loads and stores.

Diff Detail

Event Timeline

david-arm created this revision.Oct 27 2020, 4:25 AM
Herald added a reviewer: efriedma. · View Herald Transcript
Herald added a project: Restricted Project. · View Herald Transcript
david-arm requested review of this revision.Oct 27 2020, 4:25 AM
sdesmalen added inline comments.Nov 4 2020, 1:37 PM

Can you add some comments to explain what this code is doing?

From what I understand, it saves the state (i.e. which registers were used/allocated), then marks all registers as used so that when it calls the CCAssignFunction again all Z registers are used, which forces the aggregate to be passed indirectly. When the control flow returns, it restores the state, so that remaining unused Z regs can still be used.


Please address the clang-tidy suggestion here and below.


nit: To avoid (void) Res;, you can write:

if (AssignFn(....))
  llvm_unreachable("Call operand has unhandled type");

You can drop k, and write:

while(!Ins[j + NumParts -1].Flags.isInConsecutiveRegsLast())

That makes it more clear that this code is calculating only NumParts.


Can you rewrite this into a single loop, in a way that it only adds the increment by VScale if NumParts > 1 (and same for increment of ExtraArgLocs). That makes the loop a bit more readable.

Also, a one-line comment describing what this code does would be useful too.


Like above, you can rewrite this to not need k.


For all the callee-tests, I think the tests would be simpler if they'd just store to some pointer, e.g.

define double @foo4(double %x0, double * %ptr1, double * %ptr2, double * %ptr3, <vscale x 8 x double> %x1, <vscale x 8 x double> %x2, <vscale x 2 x double> %x3) {
  %ptr1.bc = bitcast double * %ptr1 to <vscale x 8 x double> *
  store volatile <vscale x 8 x double> %x1, <vscale x 8 x double>* %ptr1.bc
  %ptr2.bc = bitcast double * %ptr2 to <vscale x 8 x double> *
  store volatile <vscale x 8 x double> %x2, <vscale x 8 x double>* %ptr2.bc
  %ptr3.bc = bitcast double * %ptr3 to <vscale x 2 x double> *
  store volatile <vscale x 2 x double> %x3, <vscale x 2 x double>* %ptr3.bc
  ret double undef

Then you can also easily verify that all four values are loaded from the indirect pointer (rather than just testing one of the 4 from the tuple).

david-arm added inline comments.Nov 4 2020, 11:42 PM

OK sure, I thought this was the preferred way in LLVM as I copied this code from AArch64ISelLowering.cpp::LowerCALL/LowerFormalArguments.


Sure I can take a look. However, I did write this originally as a single loop and thought it looked messy due to the fact we have 4 loads and 3 pointer increments, so the loop body needs additional control flow to avoid the final pointer addition. I think other places in SelectionDAG do it this way too.

david-arm updated this revision to Diff 303812.Nov 9 2020, 3:37 AM
  • Added more comments documenting the algorithm in AArch64CallingConvention.cpp.
  • Rewrote loops in AArch64ISelLowering.cpp
david-arm marked 7 inline comments as done.Nov 9 2020, 3:38 AM

Thanks for the changes @david-arm. Just a few more questions/nits and then I'm happy!


Is there a possibility to express j as i when combining it with ExtraArgLocs?


nit: I see you missed my previous suggestion.

Also, a one-line comment describing what this code does would be useful too.


Same question about unsigned j as above.


nit: purely a suggestion, feel free to ignore, but if you add nounwind you don't have to add check lines for the CFI directives.

david-arm added inline comments.Nov 9 2020, 8:12 AM

Sure, the least ugly solution is to change everywhere from 'j' to 'i' except the statement below where we would have:

CCValAssign &VA = ArgLocs[i - ExtraArgLocs];

if this is acceptable?


So here I don't have the benefit of ExtraArgLocs (which was needed in order to keep the assert at the end of the loop). I can certainly get rid of 'j' if I add an extra variable such as ExtraArgLocs as I did for the other loop, but there needs to be two variables I think.

sdesmalen added inline comments.Nov 10 2020, 12:28 AM

Yes, that sounds good to me.


My main concern with j is that the name doesn't tell anything about what it is, and it's not obvious from the code why both j is needed. So adding ExtraArgLocs here too would be better imho.

david-arm updated this revision to Diff 304440.Nov 11 2020, 2:32 AM
david-arm marked 6 inline comments as done.
sdesmalen accepted this revision.Nov 11 2020, 3:53 AM

LGTM! Thanks @david-arm

This revision is now accepted and ready to land.Nov 11 2020, 3:53 AM