This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Supporting lowering of half-precision FP arguments and returns in AArch32's backend
ClosedPublic

Authored by pratlucas on Feb 26 2020, 5:19 AM.

Details

Summary

Half-precision floating point arguments and returns are currently
promoted to either float or int32 in clang's CodeGen and there's
no existing support for the lowering of half arguments and returns
from IR in AArch32's backend.

Such frontend coercions, implemented as coercion through memory
in clang, can cause a series of issues in argument lowering, as causing
arguments to be stored on the wrong bits on big-endian architectures
and incurring in missing overflow detections in the return of certain
functions.

This patch introduces the handling of half-precision arguments and returns in
the backend using the actual "half" type on the IR. Using the "half"
type the backend is able to properly enforce the AAPCS' directions for
those arguments, making sure they are stored on the proper bits of the
registers and performing the necessary floating point convertions.

Diff Detail

Event Timeline

pratlucas created this revision.Feb 26 2020, 5:19 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 26 2020, 5:20 AM
lattner resigned from this revision.Mar 12 2020, 6:33 PM

I'm not a qualified reviewer for this at this point.

Ugh. I'd hate to introduce yet another weird little tweak to ABIArgInfo that's used on exactly one target. For 16-bit composite types, we seem to coerce to a type like [1 x i32]; would that be okay here?

You don't have a test that checks that you get the IR you want.

Hi @rjmccall,
I agree those kind of tweaks do not look good. The issue here, though, is that argument coercion currently ignores the target's endian information when performing coercion through memory.
This happens for any type that requires memory coercion, so unfortunately using [1 x i32] does not do the trick.
Let me know if you have any other sugestions for handling this, I'd be glad to avoid the ABIArgInfo approach.

Hi @rjmccall,
I agree those kind of tweaks do not look good. The issue here, though, is that argument coercion currently ignores the target's endian information when performing coercion through memory.
This happens for any type that requires memory coercion, so unfortunately using [1 x i32] does not do the trick.

Oh, wait, AAPCS wants half values to be passed in the *least* significant bits of a GPR, even on big-endian machines? That's certainly more convenient, but it's a weird inconsistency with the otherwise iron rule of the calling convention, which that it's exactly as if you laid all of the arguments out in memory and then popped the first four 32-bit values off. We're talking about a calling convention here that literally skips registers in order to "align" arguments.

Can we not just coerce to i16? Will LLVM not pass an i16 in the least-significant bits of a register?

Oh, wait, AAPCS wants half values to be passed in the *least* significant bits of a GPR, even on big-endian machines? That's certainly more convenient, but it's a weird inconsistency with the otherwise iron rule of the calling convention, which that it's exactly as if you laid all of the arguments out in memory and then popped the first four 32-bit values off. We're talking about a calling convention here that literally skips registers in order to "align" arguments.

Can we not just coerce to i16? Will LLVM not pass an i16 in the least-significant bits of a register?

Yes, AAPCS specifies that they should go into the LSBs:

B.2 [...] If the argument is a Half-precision Floating Point Type its size is set to 4 bytes as if it had been copied to the least significant bits of a 32-bit register and the remaining bits filled with unspecified values.

Coercing to i16 solves it for the general case, when the argumetns are going into GPRs, but is not suficient when those arguments are required to go into FP registers - e.g. -mfloat-abi=hard.

Oh, wait, AAPCS wants half values to be passed in the *least* significant bits of a GPR, even on big-endian machines? That's certainly more convenient, but it's a weird inconsistency with the otherwise iron rule of the calling convention, which that it's exactly as if you laid all of the arguments out in memory and then popped the first four 32-bit values off. We're talking about a calling convention here that literally skips registers in order to "align" arguments.

Can we not just coerce to i16? Will LLVM not pass an i16 in the least-significant bits of a register?

Yes, AAPCS specifies that they should go into the LSBs:

B.2 [...] If the argument is a Half-precision Floating Point Type its size is set to 4 bytes as if it had been copied to the least significant bits of a 32-bit register and the remaining bits filled with unspecified values.

Coercing to i16 solves it for the general case, when the argumetns are going into GPRs, but is not suficient when those arguments are required to go into FP registers - e.g. -mfloat-abi=hard.

Why not just make half as an argument do the right thing for that case?

rudkx resigned from this revision.Mar 18 2020, 10:25 AM

I did some refactoring here years ago but I'm not that familiar with the ABIs or the handling in clang.

Why not just make half as an argument do the right thing for that case?

That would be the ideal approach, but currently there's a limitation on the backend's calling convention lowering that gets in the way.
The lowering of calls in SelectionDAGBuilder includes a target-independent step that is responsible for spliting or promoting each argument into "legal registers" and takes place before the targets' calling convention lowering.
As f16 is not a legal type on many of the AAPCS_VFP targets, it gets promoted to f32 before the target's lowering code has a chance to define how to handle it.
Ideally, this stpe should only take place if lowering calling conventions after type legalization - there's a FIXME there already capturing that -, but that would involve a major rewriting that would impact multiple targets.
Inserting a hacky target-dependent fix in this step also didn't look very good.
Do you see other alternatives for handling it? If not, which approach would you suggest?

pratlucas edited reviewers, added: asl; removed: lattner, rudkx.May 5 2020, 2:21 AM

Ping.

Why not just make half as an argument do the right thing for that case?

That would be the ideal approach, but currently there's a limitation on the backend's calling convention lowering that gets in the way.
The lowering of calls in SelectionDAGBuilder includes a target-independent step that is responsible for spliting or promoting each argument into "legal registers" and takes place before the targets' calling convention lowering.
As f16 is not a legal type on many of the AAPCS_VFP targets, it gets promoted to f32 before the target's lowering code has a chance to define how to handle it.
Ideally, this stpe should only take place if lowering calling conventions after type legalization - there's a FIXME there already capturing that -, but that would involve a major rewriting that would impact multiple targets.
Inserting a hacky target-dependent fix in this step also didn't look very good.
Do you see other alternatives for handling it? If not, which approach would you suggest?

Would it be possible to pass a half argument and fix-it-up at CodeGenPrepare?

chill added a subscriber: chill.Jun 2 2020, 8:25 AM
pratlucas updated this revision to Diff 269343.Jun 8 2020, 1:47 PM

Re-writing the handling of fp16 arguments, moving their lowering to be performed
in the backend.

This removes the necesity of coercing the arguments in clang and allows the
backend to have full control over how those arguments should be lowered according
to each calling convention.

To achieve this, a couple of new hooks had to be introduced to allow the target
to have a say on how an argument value should be split or combined together into
a legal type.

There are still two tests failing - one due to the lack of support for the fp16
handling in the GlobalISel flow and one due to an issue lowering the result of
an fp_round library call - but I would like to start getting feedback while
I work on the fix for those.

Herald added a project: Restricted Project. · View Herald TranscriptJun 8 2020, 1:47 PM
pratlucas updated this revision to Diff 269450.Jun 9 2020, 2:21 AM

Splitting the patch into two parts: one for introducing the half-precision
handling into AArch32's backend and one for removing the existing coercion
of those arguments from Clang.

pratlucas retitled this revision from [ARM] Enforcing calling convention for half-precision FP arguments and returns for big-endian AArch32 to [ARM] Supporting lowering of half-precision FP arguments and returns in AArch32's backend.Jun 9 2020, 2:33 AM
pratlucas edited the summary of this revision. (Show Details)
pratlucas updated this revision to Diff 269466.Jun 9 2020, 2:50 AM

Formatting patch.

Love it. This will help optimise these more too. We can currently get into places where we can't prove only the bottom bits of a returned value are demanded so can't remove unnecessary vmovs. Test look good too from my understanding of how fp16 calling should work.

I unfortunately can't really claim to be an expert on calling conventions code or if these new functions are needed. Hopefully someone I added can take a look, and if not I can always go find out how it works.

efriedma added inline comments.Jun 9 2020, 4:06 PM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
524

I'm not sure I understand why the standard getCopyFromParts/getCopyToParts codepath doesn't work. Is the issue that it uses FP_ROUND/FP_EXTEND to promote from f16 to f32?

llvm/lib/Target/ARM/ARMCallingConv.cpp
294

It isn't obvious to me why you need this; can you not use CCBitConvertToType/CCAssignToReg?

pratlucas marked 2 inline comments as done.

Clean-ups + fixing failure in CodeGen/ARM/half.ll test.

pratlucas marked an inline comment as done.Jun 10 2020, 11:18 AM
pratlucas added inline comments.
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
524

Yes, the issue is the usage of FP_ROUND/FP_EXTEND indeed. Those cause the argument to be converted from f16 into f32 - with a vcvtb.f16.f32 for instance - instead of simply being placed the value in the LSBs as required by the AAPCS.

llvm/lib/Target/ARM/ARMCallingConv.cpp
294

For hard floats, using CCAssingToReg would indeed work well in the majority of the scenarios, but would get in the way of the CMSE handling from D81428. Using the f16 loc type causes the clearing of the top 16 bits to be optimized out in the DAG.
Also, the AAPCS expects the argument sized to be extended to 4 bytes, so using the f32 loc type attends to that rule.

For soft floats, on the other hand, simply convering it to i32 causes the code on ARMISel lowering to be quite cumbersome. The loc info becomes either CCValAssign::BCvt (f16 -> f32 - >i32) or CCValAssign::AExt ( f16 -> i16 -> i32), so checking for when we need to handle things differently for f16 becomes less clear.
Using this flow we have the isCustom flag assigned and can have a more explicit handling of this.

pratlucas marked an inline comment as done.Jun 10 2020, 11:18 AM
efriedma added inline comments.Jun 10 2020, 7:05 PM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
524

That makes sense.

It feels a little weird to have a TLI method to do the splitting, as opposed to adding an extra check to the shared codepath, but I guess this way is more flexible if someone else needs a similar change in the future.

One other thing to consider is that we could make f16 a "legal" type for all ARM subtargets with floating-point registers, regardless of whether the target actually has native f16 arithmetic instructions. We do this on AArch64. That would reduce the number of different ways to handle f16 values, and I think this change would be unnecessary.

916

(Accidental change?)

pratlucas updated this revision to Diff 270143.Jun 11 2020, 7:55 AM

Fixing failure on CodeGen/ARM/GlobalISel/arm-unsupported.ll and making clang-format happy.

pratlucas marked an inline comment as done.Jun 11 2020, 7:57 AM
SjoerdMeijer added inline comments.Jun 11 2020, 9:33 AM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
524

One other thing to consider is that we could make f16 a "legal" type for all ARM subtargets with floating-point registers, regardless of whether the target actually has native f16 arithmetic instructions. We do this on AArch64.

I am partly guilty here and there when I added _Float16 and v8.2 FP16 support.
When I did this, life was easy for AArch64, because it doesn't have a soft float ABI support and it has or hasn't got FP16 support, and so everything is straightforward. Life became an awful lot less pleasant for AArch32, because of the hard/soft float and different FP16 support, i.e. there are early FPU versions with limited fp16 support for the storage only type (some conversion), and from v8.2 and up the native fp16 instruction. It's been a few years now so can't remember exactly, which is a bit unhelpful, but somewhere here for these corner cases I got into trouble by treating f16 as a legal type.

But in case you're interested / this might be useful, I think this is the mail that I wrote to the llvm dev list when I got into trouble here (which I actually need to reread too for details):

http://lists.llvm.org/pipermail/llvm-dev/2018-January/120537.html

and as referred in that mail, earlier revisions of this:

https://reviews.llvm.org/D38315

might have taken exactly that approach. Long story short, I am not saying we shouldn't do it, just pointing out some background info. And since we're all a few years wiser now since that happened, perhaps we should try again ;-)

Perhaps we could move to making half a valid type for the arm back-end as follow up patches. Allowing half as argument through the IR is already a step to that direction.
IMO this patch is already quite big and it excels in fixing the bugs it proposed.

stuij added a subscriber: stuij.Jun 12 2020, 4:24 AM

Hi @efriedma,

From @SjoerdMeijer's comment and the links he pointed to, it seems to me that making f16 types legal for all ARM subtargets would be a major undertaking and far from trivial to implement. It's also not clear to me how significant would be the returns of this effort.
My feeling is that we could proceed with the current approach and discuss the possbility of making f16 legal in a separate follow up effort, as mentioned by @dnsampaio.

What's your view on this?

ostannard accepted this revision.Jun 17 2020, 6:26 AM

I don't think it makes sense to make f16 legal for targets which don't have any arithmetic operations on it, since that would be contrary to the definition of "legal". I'd also expect doing so to introduce a lot more complexity than this patch.

I think all of the previous comments have been addressed now, and this LGTM.

This revision is now accepted and ready to land.Jun 17 2020, 6:26 AM
This revision was automatically updated to reflect the committed changes.
plotfi added a subscriber: plotfi.Jun 24 2020, 1:31 AM

@pratlucas @ostannard @rjmccall

I have noticed this change break on the building of the Swift standard library in downstream apple/swift/master-next. I reduced the test case and found that the chain of bitcasts/extends in ARMTargetLowering::splitValueIntoRegisterParts ends up not being legal and causing an assert in SelectionDag.cpp when compiling the following with llc:

target datalayout = "e-m:o-p:32:32-Fi8-f64:32:64-v64:32:64-v128:32:128-a:0:32-n32-S32"
target triple = "thumbv7s-apple-ios7.0.0"

define fastcc { <8 x half>, <8 x half> } @f() {
  ret { <8 x half>, <8 x half> } zeroinitializer
}

This was originally swiftcc, but I changed it to fastcc so it would compile with upstream llvm.org/main while also having a calling convention that would satisfy the conditional here: https://github.com/llvm/llvm-project/blob/bfec030e69afc73b29aa1b66902ae802a448fc19/llvm/lib/Target/ARM/ARMISelLowering.cpp#L4155

The assert is as follows:

llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp:1307: llvm::SDValue llvm::SelectionDAG::getConstant(const llvm::ConstantInt &, const llvm::SDLoc &, llvm::EVT, bool, bool): Assertion `Elt->getBitWidth() == EltVT.getSizeInBits() && "APInt size does not match type size!"' failed.

The DAG nodes generated before:

t5: f16 = extract_vector_elt t3, Constant:i32<0>
  t3: v8f16,v8f16 = merge_values t2, t2
    t2: v8f16 = BUILD_VECTOR ConstantFP:f16<APFloat(0)>, ConstantFP:f16<APFloat(0)>, ConstantFP:f16<APFloat(0)>, ConstantFP:f16<APFloat(0)>, ConstantFP:f16<APFloat(0)>, ConstantFP:f16<APFloat(0)>, ConstantFP:f16<APFloat(0)>, ConstantFP:f16<APFloat(0)>

And After:

t22: f32 = bitcast t21
  t21: i32 = any_extend t20
    t20: i16 = bitcast t5
      t5: f16 = extract_vector_elt t3, Constant:i32<0>
        t3: v8f16,v8f16 = merge_values t2, t2
          t2: v8f16 = BUILD_VECTOR ConstantFP:f16<APFloat(0)>, ConstantFP:f16<APFloat(0)>, ConstantFP:f16<APFloat(0)>, ConstantFP:f16<APFloat(0)>, ConstantFP:f16<APFloat(0)>, ConstantFP:f16<APFloat(0)>, ConstantFP:f16<APFloat(0)>, ConstantFP:f16<APFloat(0)>

I suspect here there could be a the EltVT in the assert was expecting a 16-bit size, but is now getting 32.

I hope this is helpful.

plotfi added inline comments.Jun 24 2020, 1:41 AM
llvm/lib/Target/ARM/ARMISelLowering.cpp
4106

I'm not well versed in arm CCs, but I suspect you might want this to check if the CC is arm_apcscc, arm_aapcscc, or arm_aapcs_vfpcc instead of checking merely if it has a CC.

@pratlucas @ostannard @rjmccall

I've posted a diff D82443 to address what I think could be a potential fix for the assert I was seeing on the provided reduced IR.

hans added a subscriber: hans.Aug 25 2020, 7:58 AM

Lucas, this seems to have casued https://bugs.llvm.org/show_bug.cgi?id=47001. Can you take a look? (I would cc you on the bug, but I couldn't find your email in bugzilla.)

Hi @hans , I'll have a look at it!