This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Armv8.2-A FP16 code generation (part 1/3)
ClosedPublic

Authored by SjoerdMeijer on Sep 27 2017, 6:40 AM.

Details

Summary

This is the groundwork for Armv8.2-A FP16 code generation .

Clang passes and returns _Float16 values as floats, together with the required
bitconverts and truncs etc. to implement correct AAPCS behaviour, see D42318.
We will implement half-precision argument passing/returning lowering
in the ARM backend soon, but for now this means that this:

_Float16 sub(_Float16 a, _Float16 b) {

return a + b;

}

gets lowered to this:

define float @sub(float %a.coerce, float %b.coerce) {
entry:

%0 = bitcast float %a.coerce to i32
%tmp.0.extract.trunc = trunc i32 %0 to i16
%1 = bitcast i16 %tmp.0.extract.trunc to half
<SNIP>
%add = fadd half %1, %3
<SNIP>

}

When FullFP16 is *not* supported, we don't make f16 a
legal type, and we get legalization for "free", i.e. nothing changes
and everything works as before. And also f16 argument passing/returning
is handled.

When FullFP16 is supported, we do make f16 a legal type,
and have 2 places that we need to patch up: f16 argument passing and
returning, which involves minor tweaks to avoid unnecessary code generation
for some bitcasts.

As a "demonstrator" that this works for the different FP16, FullFP16, softfp
modes, etc., I've added match rules to the VSUB instruction description showing that
we can codegen this instruction from IR, but more importantly, also to some
conversion instructions. These conversions were causing issue before in the FP16
and FullFP16 cases.

I've also added match rules to the VLDRH and VSTRH desriptions, so that we can
actually compile the entire half-precision sub code example above. This showed that
these loads and stores had the wrong addressing mode specified: AddrMode5 instead
of AddrMode5FP16, which turned out not be implemented at all, so that has also been added.

This is the minimal patch that shows all the different moving parts. In patch 2/3 I will
add some efficient lowering of bitcasts, and in 2/3 I will add the remaining Armv8.2-A
FP16 instruction descriptions.

Diff Detail

Repository
rL LLVM

Event Timeline

SjoerdMeijer created this revision.Sep 27 2017, 6:40 AM

Realised that I am not testing hard float abi without fullfp16 support. And now see that code generation is not really optimal: it is generating eabi calls for the fp16 <-> fp conversions, whereas it should use the normal convert instructions. Fixing this.

samparker edited edge metadata.Oct 3 2017, 12:16 AM

Hi Sjoerd,

Just one inline comment from me, but this approach looks good to me.

cheers,
sam

lib/Target/ARM/ARMCallingConv.td
24 ↗(On Diff #116803)

should this use CCBitConvertToType instead, as you've done for the return values?

127 ↗(On Diff #116803)

Same as above.

SjoerdMeijer edited the summary of this revision. (Show Details)Oct 4 2017, 2:16 AM
SjoerdMeijer added a reviewer: olista01.

I have:

  • removed the f16 type from the calling conventions except for ARM_AAPCS_VFP. It's necessary to add them elsewhere, the default behaviour is what we want.
  • added a test case using the hard float abi, but without fullfp16 support.
  • added some comments about the HPR register class.

Just clarifying a typo in my previous comment:

It's *not* necessary to add f16 elsewhere...

olista01 added inline comments.Oct 4 2017, 2:51 AM
test/CodeGen/ARM/fp16-instructions.ll
14 ↗(On Diff #117638)

This looks like it's returning the result as a 32-bit float, which is wrong. It should be a 16-bit float in the least-significant half of s0.

Also, are there any conversion instructions before the vsub? If so, it would be better to include them in the test, and if not then the arguments are being passed incorrectly too.

SjoerdMeijer added inline comments.Oct 4 2017, 5:41 AM
test/CodeGen/ARM/fp16-instructions.ll
14 ↗(On Diff #117638)

Thanks Oliver. You're right and I do need to change the implementation of the calling conventions. For return values, the aapcs says:
"A Half-precision Floating Point Type is returned in the least significant 16 bits of r0", so yes we want to see an vcvtb.f16.f32 s0, s0 for the return value here.

And there are converts for the arguments at the moment, in fact 2 for each arguments. It thinks the args are passed in f32 regs. So there is a convert to f16, but then there's another convert to f32 because the operations is done in f32 (in this case). The aapcs says that:
"If the argument is a Half-precision Floating Point Type its size is set to 4 bytes as if it had been copied to the least significant bits of a 32-bit
register and the remaining bits filled with unspecified values", and so yes the args passing needs changing too.

SjoerdMeijer retitled this revision from [ARM] Add f16 type support and code generation (part 1/2) to WIP: [ARM] Add f16 type support and code generation (part 1/2).
SjoerdMeijer updated this revision to Diff 129957.EditedJan 16 2018, 7:42 AM
SjoerdMeijer added a reviewer: eli.friedman.
SjoerdMeijer added a subscriber: eli.friedman.

This fixes most issues, now I am working on the only remaining failure
fp16-promote.ll, which is missing an upconvert.

I've written an RFC to list here explaining the general approach:
http://lists.llvm.org/pipermail/llvm-dev/2017-December/119467.html
And as I explained there, I want to add f16 as a legal type, and don't want any
regressions for the storage-only type behaviour. That is the case, with one
exception in fp16-v3.ll, where I accept one minor performance
regression (not correctness) for now. That's a bitcast being codegen'd,
which I should avoid but still need to look into.

I've added @eli.friedman as a reviewer here who commented on the RFC.
And as the patch is becoming big, and while I am fixing the last issue, I was wondering
if @eli.friedman and @olista01 can double-check the approach and look if we think this patch
is how we want to approach this. Of course, any other comments, nit picks are also
welcome too.

And just as another reminder, the motivation to do all this groundwork and plumbing,
was to enable instructions selection for Armv8.2-A FP16 instructions, i.e. let instruction selection work on the f16 types. This patch, just a demonstrator (and a check), includes
instruction selection for some native f16 add and sub instructions.

This looks like a lot of additional complexity to deal with the case where we only have the conversion instructions, but you have made f16 a legal type.

Have you looked into the option of instead adjusting the calling convention lowering to check the original (pre-legalization) type of the argument? This would allow you to leave f16 illegal when we don't have the arithmetic instructions. There is some discussion ongoing at the moment [1] about some backends that already do something like this, and options for making that target-independent.

[1] http://lists.llvm.org/pipermail/llvm-dev/2018-January/120098.html

test/CodeGen/ARM/fp16-args.ll
36 ↗(On Diff #129957)

This looks like an ABI change: we previously returned an f16 value packed into the bottom half of s0, now we return an f32 value in the whole of s0.

SjoerdMeijer added a comment.EditedJan 17 2018, 4:05 AM

Hi Oliver,
Thanks for pointing this out and this alternative approach to adjust the calling convention lowering looks more robust. What I mean by that is:

  • although that approach does not come for free and involves creating a custom CCState and implementation,
  • it avoids custom lowering of the loads/stores, and the few other corner cases that needed changing. As we don't need this custom lowering anymore, we are not risking having missed another corner case.

So I will start exploring the CCState approach.

SjoerdMeijer retitled this revision from WIP: [ARM] Add f16 type support and code generation (part 1/2) to [ARM] Armv8.2-A FP16 code generation (part 1/2).

This is a rewrite, implementing the new approach:

  1. Clang now passes and returns _Float16 values as floats, together with the required

bitconverts and truncs etc. to implement correct AAPCS behaviour, see also
https://reviews.llvm.org/D42318.
We will implement half-precision argument passing/returning lowering
in the ARM backend soon, but for now this means that this:

_Float16 sub(_Float16 a, _Float16 b) {
   return a + b;
}

gets lowered to this:

define float @sub(float %a.coerce, float %b.coerce)  {
entry:
  %0 = bitcast float %a.coerce to i32
  %tmp.0.extract.trunc = trunc i32 %0 to i16
  %1 = bitcast i16 %tmp.0.extract.trunc to half
  <SNIP>
  %add = fadd half %1, %3
  <SNIP>
}
  1. When FullFP16 is *not* supported, we don't make f16 a

legal type, and we get legalization for "free", i.e. nothing changes
and everything works as before. And also f16 argument passing/returning
is handled (by the Clang patch, see 1. above).

3.1. When FullFP16 is supported, we do make f16 a legal type,
and have 2 places that we need to patch up: f16 argument passing and
returning, which involves minor tweaks to avoid unnecessary code generation
for some bitcasts.

3.2. As a "demonstrator" that this works for the different FP16, FullFP16, softfp
modes, etc., I've added match rules to the VSUB instruction description showing that
we can codegen this instruction from IR, but more importantly, also to some
conversion instructions. These conversions were causing issue before in the FP16
and FullFP16 cases.

3.3 I've also added match rules to the VLDRH and VSTRH desriptions, so that we can
actually compile the entire half-precision sub code example above. This showed that
these loads and stores had the wrong addressing mode specified: AddrMode5 instead
of AddrMode5FP16, which turned out not be implemented at all, so that has also been added.
Splitting this out in a separate doesn't make sense I think, because if it is not used, as
was the case, we're also not testing it.

Therefore, I think this is the minimal patch that shows all the different moving parts.
Once we are happy with this patch, I would like to commit it first, just to make sure
we are happy with this groundwork. And then in part 2/2, I will add the remaining FP16
instruction descriptions.

olista01 added inline comments.Jan 24 2018, 7:26 AM
lib/Target/ARM/ARMCallingConv.td
159 ↗(On Diff #131252)

Are the changes in this file needed if we're still using the clang hack for the calling convention?

lib/Target/ARM/ARMISelLowering.cpp
529 ↗(On Diff #131252)

If this custom lowering is correct and profitable, why not have it always enabled?

2485 ↗(On Diff #131252)

I think it would be better to fix the code-generation for bitcasts generally, rather than putting in these special cases for arguments and returns. Also, this doesn't seem to be working in one of your test cases (we get store/load pairs), do you know why?

4949 ↗(On Diff #131252)

Again, it would be better to fix the code-generation for bitcasts, and let the usual DAG optimisations remove the unnecessary bitcasts/truncates/extends, than to put in special cases for arguments and returns.

lib/Target/ARM/MCTargetDesc/ARMBaseInfo.h
273 ↗(On Diff #131252)

Why is this needed? The new value 17 still fits in 5 bits. If this is actually needed, then do the *Shift values below need to be increased?

test/CodeGen/ARM/fp16-instructions.ll
46 ↗(On Diff #131252)

This looks like we're getting the default legalisation for bitcasts, so we get worse code-gen than the SOFTFP-FP16 case. Could this be fixed by adding some tablegen patterns for bitcast?

SjoerdMeijer added inline comments.Jan 24 2018, 8:02 AM
lib/Target/ARM/ARMCallingConv.td
159 ↗(On Diff #131252)

No, you're right, we don't need it. I will remove it.

lib/Target/ARM/ARMISelLowering.cpp
2485 ↗(On Diff #131252)

I wanted to do this fix up in ExpandBITCAST as well, like I do for the f16 function arguments, but I simply was not able to get it working. Recognising the pattern is easy, but fixing up the subsequent CopyToReg and the ARMISD::RET_FLAG nodes, i.e. replace uses, chains, glues, for these nodes was such a pain and while the rewrite looked okay, I kept running in very late segfaults because there was some funny state left. I think doing the rewrite here is equally fine: it is straightforward and we get generation of the CopyToReg and return nodes for free by this simple rewrite here, the DAG replaces we would have to do are a lot more ugly I think.

This rewrite has to be done here, or in ExpandBITCAST, because otherwise it is too late. Very early in the DAG creation and in the legalizer, this gets legalized to stack Stores and loads. Thus, tablegen rewrite rules to rewrite "bitcasts" is not going to help, because long before we do instruction selection, these bitconverts no longer exist.

I will look into the loads/stores that you mentioned.

lib/Target/ARM/MCTargetDesc/ARMBaseInfo.h
273 ↗(On Diff #131252)

Thanks, will fix.

This addressing the straightforward comments:

  • the unnecessary calling conv change,
  • the mask for the imm8

I agree that the bitcast optimisation is generic, but for now I prefer to have
it working locally first, and we can address this in a follow-up patch.

I am now looking into the SOFTFP-FP16 case which looks like to be doing
default legalisation for bitcasts.

I've looked into it, and yes you're right, default legalisation for bitcasts in the SOFTFP-FULLFP16 is happening
resulting slightly worse codegen than SOFTFP-FP16.
In my previous approach and implementation, I had to custom lower Bitcasts (for a different reason though).
In a quick experiment, I copy-pasted the code, and that does what we want. However, I need to make a few
tweaks to it. For reviewing and testing purposes, i.e. not to change too many things at the same time, I am
suggesting this approach:

  • This groundwork is the 1st of 3 patches, let's see how this passes testing first.
  • Patch 2/3: improve lowering for Bitcasts. FP16 codegen is unaffected by patch 1/3, and the the custom lowering addresses an inefficiency and not a correctness issue for the new FULLFP16 codegen cases.
  • Patch 3/3: fill in the remaining FP16 match rules.

Does that sound ok?

Ok, I agree with the idea of committing this as a starting point and developing it gradually. Just a few nits left.

lib/Target/ARM/ARMCallingConv.td
191 ↗(On Diff #131289)

There are still some changes in this file, are they required for this patch?

lib/Target/ARM/ARMISelLowering.cpp
4939 ↗(On Diff #131289)

The new argument is unused, should the change below be guarded using this?

lib/Target/ARM/ARMInstrVFP.td
712 ↗(On Diff #131289)

Commented-out code

samparker added inline comments.Jan 25 2018, 4:15 AM
lib/Target/ARM/ARMISelDAGToDAG.cpp
935 ↗(On Diff #131289)

Please refactor this with the above function.

lib/Target/ARM/ARMInstrVFP.td
712 ↗(On Diff #131289)

Lines to remove.

All feedback addressed. Thanks for the reviews Sam and Oliver!

samparker accepted this revision.Jan 26 2018, 12:59 AM

Thanks for the changes. I'm dubious about the bitcast handling, but lets get it in and then iterate upon it. LGTM.

cheers!

This revision is now accepted and ready to land.Jan 26 2018, 12:59 AM
SjoerdMeijer retitled this revision from [ARM] Armv8.2-A FP16 code generation (part 1/2) to [ARM] Armv8.2-A FP16 code generation (part 1/3).Jan 26 2018, 1:20 AM
SjoerdMeijer edited the summary of this revision. (Show Details)
This revision was automatically updated to reflect the committed changes.