Half-precision arguments and return values are passed as if they
were ints or floats on ARM. This results in truncates and
bitcasts to/from i16 and f16 values, which are legalized very
early to stack stores/loads. When FullFP16 is enabled, we want
to avoid codegen for these bitcasts, as it is unnecessary and
inefficient.
Have you tried adding tablegen patterns for bitconvert nodes between i16 and f16? That's how it currently works for f32<->i32; see the VMOVRS and VMOVSR instructions in ARMInstrVFP.td.
That should give us a better lowering of bitcasts (not using the default store/load lowering), but we might need some additional optimisations to remove the integer truncations where they are not needed.
I'm also concerned that this code might not be correct if it triggers on code other than that generated by the calling convention lowering. I'm thinking of the case where the source contains bitcasts i32->f16->i32. Would this change optimise away the truncation, changing the behaviour of that code?
lib/Target/ARM/ARMInstrVFP.td:754
This doesn't look right - f16_to_fp is a conversion from f16 to f32, but COPY_TO_REGCLASS doesn't do that.
> Have you tried adding tablegen patterns for bitconvert nodes between i16 and f16?
> That's how it currently works for f32<->i32; see the VMOVRS and VMOVSR instructions in ARMInstrVFP.td.
Yes, I've tried that. I think this case is different from VMOVRS because there are no illegal
types involved. We are trying to match something like this:
t2: i32,ch = CopyFromReg t0, Register:i32 %0
t7: i16 = truncate t2
t8: f16 = bitcast t7
and the truncate (as an operand of the bitcast) is legalized to a stack store/load very early because i16
is illegal. Thus, instruction selection never gets to see i16 <-> f16 bitcast patterns.
Also, because i16 is illegal, any match rule or rewrite pattern involving i16 is not going to work.
But let me know if I'm missing something here.
> I'm also concerned that this code might not be correct if it triggers on code other than
> that generated by the calling convention lowering.
I share(d) this concern a bit. But this should be a very local peephole optimisation: it applies
only to function arguments and the return value, and I rely on the CopyToReg and CopyFromReg
nodes for that (only those patterns are matched). But for this reason it is probably better to move
this logic to the argument and return value lowering functions.
It might be intended to only apply to function arguments and returns, but those patterns for f16_to_fp and fp_to_f16 could match anywhere.
> It might be intended to only apply to function arguments and returns, but those patterns for f16_to_fp and fp_to_f16 could match anywhere.
Just checking to see if I understand this correctly.
I am matching this:
t2: i32,ch = CopyFromReg t0, Register:i32 %0
t7: i16 = truncate t2
t8: f16 = bitcast t7
and with the custom lowering in this patch, using the fp16_to_fp node,
I am generating this:
t2: i32,ch = CopyFromReg t0, Register:i32 %0
t18: f32 = fp16_to_fp t2
and using this rewrite pattern:
def : Pat<(f16_to_fp GPR:$a), (f32 (COPY_TO_REGCLASS GPR:$a, HPR))>;
results in moves from int to fp registers:
vmov s0, r1
vmov s2, r0
vadd.f16 s0, s2, s0
...
That's what I meant by the comment:
// We use FP16_TO_FP just to model a GPR -> HPR move
I got inspiration for this approach from the existing test case:
test/CodeGen/ARM/fp16-args.ll
which generates exactly the same DAG for its incoming half arguments:
t2: i32,ch = CopyFromReg t0, Register:i32 %0
t18: f32 = fp16_to_fp t2
Thus, I am (re)using the same approach, except that I am not doing the convert when
FullFP16 is enabled. Is your concern that I am changing the semantics of these nodes because
I am omitting this convert? The "definition" of these nodes reads:
/// FP16_TO_FP, FP_TO_FP16 - These operators are used to perform promotions
/// and truncation for half-precision (16 bit) floating numbers. These nodes
/// form a semi-softened interface for dealing with f16 (as an i16), which
/// is often a storage-only type but has native conversions.
I liked the "semi-softened interface" part here, because that's how I am using
it in a new context; I was/am reluctant to introduce a new node here.
> Is your concern that I am changing the semantics of these nodes because I am omitting this convert?
Yes, it looks like you are using the same nodes, but giving them the semantics of a bitcast between i16 and f16. That's a problem, because there is existing code that treats them as floating-point extends/truncates between f32 and f16 (represented as i16 for legalisation reasons). Sooner or later, one of these nodes is going to be created by your new code and consumed by existing code, or vice versa, and we will emit the wrong code.
Okay, got it. Yes, I wanted to stretch the definition a little bit (for the FullFP16 case), but I do
see that it might be confusing and not so nice. A new ARM-specific node modeling this move
should be easy to add here, and much cleaner and clearer.
Addressed comments:
- New ARM ISD nodes VMOVhr and VMOVrh have been introduced
to model moving a half argument from int to fp registers and vice versa.
For example, for reading half arguments, the DAG looks like this:
t2: i32,ch = CopyFromReg t0, Register:i32 %0
t18: f16 = ARMISD::VMOVhr t2
...
and for writing the return value:
...
t20: i32 = ARMISD::VMOVrh t11
t16: ch,glue = CopyToReg t0, Register:i32 %r0, t20
- Restricted the rewrite of bitcasts further, to avoid it triggering
where it shouldn't, by checking for the EntryNode and RET_FLAG nodes (see the sketch below).
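Purely as an illustration of the shape such a check could take (the helper names here are invented, and this is not the patch's literal code):

// Incoming argument: the bitcast source is a CopyFromReg whose chain is
// the EntryNode, i.e. it was produced by the calling convention lowering.
static bool isEntryArgCopy(SDValue V) {
  return V.getOpcode() == ISD::CopyFromReg &&
         V.getOperand(0).getOpcode() == ISD::EntryToken;
}

// Outgoing return value: the bitcast result is only consumed by CopyToReg
// nodes that in turn only feed the ARMISD::RET_FLAG return node.
static bool onlyFeedsReturn(SDNode *N) {
  for (SDNode *U : N->uses()) {
    if (U->getOpcode() != ISD::CopyToReg)
      return false;
    for (SDNode *RetU : U->uses())
      if (RetU->getOpcode() != ARMISD::RET_FLAG)
        return false;
  }
  return true;
}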
I'm still not convinced about the correctness of this transformation: you are turning code which contains truncates and extends into code that doesn't, without checking whether the top 16 bits could be relevant. This happens to be OK if the value is coming from/going to an fp16 arithmetic instruction, which ignores/clears the top 16 bits, but I don't think it's correct in all cases.
I think a better way to do this would be:
- Define the new DAG nodes as clearing/ignoring the top 16 bits on the i32 side, and lower them to the vmov.i16 instructions which do this.
- Lower bitcasts involving f16 to these DAG nodes, without checking what instructions are around them.
- Add DAG combines to fold zexts and truncates into the new nodes where that is legal (see the sketch after this list).
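To make that last point concrete, here is a rough sketch of one such combine, assuming VMOVrh is defined to zero its top 16 bits as proposed (the function name is invented, and the hook-up into the ARM DAG combining is omitted):

// Fold (zext (trunc (ARMISD::VMOVrh x))) -> (ARMISD::VMOVrh x).
// This is legal because VMOVrh already zeroes the top 16 bits, so the
// truncate/zero-extend round trip is a no-op.
static SDValue PerformVMOVrhZExtCombine(SDNode *N) {
  if (N->getOpcode() != ISD::ZERO_EXTEND)
    return SDValue();
  SDValue Trunc = N->getOperand(0);
  if (Trunc.getOpcode() != ISD::TRUNCATE)
    return SDValue();
  SDValue Mov = Trunc.getOperand(0);
  if (Mov.getOpcode() == ARMISD::VMOVrh &&
      Mov.getValueType() == N->getValueType(0))
    return Mov;
  return SDValue();
}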
*) New nodes ARMISD::VMOVhr and ARMISD::VMOVrh are now defined to clear
the top 16 bits.
*) The match rules to instruction select vmov.f16 use the ARMISD::VMOVhr and
ARMISD::VMOVrh nodes; thus they map directly onto vmov.f16.
*) The logic in ExpandBITCAST has been simplified. An ARMISD::VMOVhr node is
created for i32->i16->f16 truncate/bitcast patterns, so this:
t2: i32 = ..
t7: i16 = truncate t2
t8: f16 = bitcast t7
now becomes:
t2: i32 = ..
t18: f16 = ARMISD::VMOVhr t2
This is what we want for the soft-float ABI when half args are passed as ints. And
it is simpler than before: we don't have to look at the CopyFromReg node. Thus,
we now generate this for an f16 add that works on 2 half operands that are
passed as integer args (a sketch of this fold follows the assembly below):
vmov.f16 s0, r1
vmov.f16 s2, r0
vadd.f16 s0, s2, s0
vmov.f16 r0, s0
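As an illustration only (not the patch's literal code), the core of that fold in ExpandBITCAST could look roughly like this:

// Recognise (f16 (bitcast (i16 (truncate i32:x)))) and emit
// (f16 (ARMISD::VMOVhr x)) instead of the default store/load expansion.
static SDValue ExpandHalfBitcast(SDNode *N, SelectionDAG &DAG) {
  SDValue Src = N->getOperand(0);
  if (N->getOpcode() == ISD::BITCAST && N->getValueType(0) == MVT::f16 &&
      Src.getOpcode() == ISD::TRUNCATE && Src.getValueType() == MVT::i16 &&
      Src.getOperand(0).getValueType() == MVT::i32)
    return DAG.getNode(ARMISD::VMOVhr, SDLoc(N), MVT::f16,
                       Src.getOperand(0));
  return SDValue(); // fall back to the default legalization (store/load)
}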
*) For hard-float ABI and FullFP16, the initial pattern is a bit different:
t2: f32,ch = CopyFromReg t0, Register:f32 %0
t5: i32 = bitcast t2
t6: i16 = truncate t5
t7: f16 = bitcast t6
This is now a 2-step approach. First, the i32->i16->f16 truncate/bitcast
pattern matches:
t2: f32,ch = CopyFromReg t0, Register:f32 %0
t5: i32 = bitcast t2
t18: f16 = ARMISD::VMOVhr t5
And then, in the second step, the f32->i32 bitcast and the move are rewritten to just
this:
f16 = CopyFromReg t0, Register:f32 %1
which is what we need to avoid generating unnecessary moves for hard-float
FullFP16, and just generate the data processing instruction:
vadd.f16 s0, s0, s1
I've kept the logic for this last rewrite in ExpandBITCAST as well (as opposed to
moving it to a DAG combine), because I am matching the bitcast and this looks like the
right place to do this (a rough sketch of this second step follows below).
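Again purely as a hypothetical sketch of this second step (the real patch may handle the chain and the remaining users differently):

// (ARMISD::VMOVhr (i32 (bitcast f32:x))), where x is a CopyFromReg from
// an FP argument register, becomes a direct f16 CopyFromReg from that
// register, eliminating the GPR round trip.
static SDValue FoldVMOVhrOfBitcast(SDNode *N, SelectionDAG &DAG) {
  SDValue Cast = N->getOperand(0);
  if (N->getOpcode() != ARMISD::VMOVhr || Cast.getOpcode() != ISD::BITCAST)
    return SDValue();
  SDValue Copy = Cast.getOperand(0);
  if (Copy.getOpcode() != ISD::CopyFromReg ||
      Copy.getValueType() != MVT::f32)
    return SDValue();
  unsigned Reg = cast<RegisterSDNode>(Copy.getOperand(1))->getReg();
  // Re-read the argument register directly as f16.
  return DAG.getCopyFromReg(Copy.getOperand(0), SDLoc(N), Reg, MVT::f16);
}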
*) Any missed bitcast rewrite opportunities will get the default legalization
treatment and result in stack stores/loads; thus, these missed opportunities are
not correctness issues.