This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Add FP16 vector insert/extract patterns
Closed Public

Authored by miyuki on May 30 2019, 4:42 AM.

Details

Summary

This change adds two FP16 extraction and two insertion patterns
(one per possible vector length).
Extractions are handled by copying a Q/D register into a register of one
of the VFP2 classes, where the individual FP32 sub-registers can be accessed.
The extraction of an even lane is then a simple sub-register extraction
(because we don't care about the top halves of registers for FP16
operations). Odd lanes need an additional VMOVX instruction.
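As an illustration of the lane mapping described above, the following is a minimal behavioural sketch in Python, not the TableGen patterns from the patch. It assumes lane 2k occupies the low half and lane 2k+1 the high half of the k-th 32-bit sub-register; the helper names (`pack_lanes`, `vmovx`, `extract_f16`) are hypothetical and exist only for this example.

```python
import struct

def f16_bits(x):
    """Return the IEEE half-precision bit pattern of x as an int."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def pack_lanes(lanes):
    """Pack 16-bit lanes into 32-bit words (lane 2k is the low half of word k)."""
    return [lanes[i] | (lanes[i + 1] << 16) for i in range(0, len(lanes), 2)]

def vmovx(s):
    """Model of VMOVX.F16: the top half of the source moves to the bottom
    half of the result, and the top half of the result is zeroed."""
    return (s >> 16) & 0xFFFF

def extract_f16(s_words, lane):
    """Return a 32-bit word whose low 16 bits hold the requested lane."""
    s = s_words[lane // 2]      # plain sub-register extraction
    if lane % 2 == 0:
        return s                # top half is left as-is (don't care)
    return vmovx(s)             # odd lane: one extra instruction

lanes = [f16_bits(v) for v in (1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)]
q = pack_lanes(lanes)           # models a Q register as four S sub-registers
extracted = [extract_f16(q, i) & 0xFFFF for i in range(8)]
```

Masking with `0xFFFF` at the end stands in for the fact that FP16 operations only read the bottom half of the register.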

Unfortunately, insertions cannot be handled in the same way, because:

  • There is no instruction to insert FP16 into an even lane (VINS only works with odd lanes)
  • The patterns for odd lanes would have the form of a DAG (not a tree), and would not be implementable in pure TableGen

Because of this, insertions are handled in the same way as 16-bit
integer insertions (with conversions between FP registers and GPRs
using VMOVHR instructions).
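The difference between the two insertion routes can be sketched with a small Python model. This is only an illustration under the same lane-layout assumption as the summary (even lane in the low half of a 32-bit sub-register, odd lane in the high half); `vins` and `insert_via_gpr` are hypothetical helper names, not functions from the patch.

```python
def vins(sd, sm):
    """Model of VINS.F16: the low half of sm is written to the top half
    of sd, and the bottom half of sd is preserved."""
    return ((sm & 0xFFFF) << 16) | (sd & 0xFFFF)

def insert_via_gpr(lanes, lane, s_value):
    """Model of the GPR route: move the FP16 bits into a GPR, then insert
    them into the vector as if they were a 16-bit integer."""
    gpr = s_value & 0xFFFF      # FP register -> GPR keeps the low 16 bits
    out = list(lanes)
    out[lane] = gpr             # 16-bit lane insert, any lane index
    return out

# VINS can only ever change the top half, i.e. an odd lane.
s0 = 0x22221111                 # lanes 0 and 1 of a vector
after_vins = vins(s0, 0x0000BEEF)

# The GPR route works for even and odd lanes alike.
vec = [0x1111, 0x2222, 0x3333, 0x4444]
even_insert = insert_via_gpr(vec, 0, 0x0000BEEF)
odd_insert = insert_via_gpr(vec, 3, 0x0000BEEF)
```

In this model `vins` has no way to replace the bottom half while keeping the top, which is the first limitation listed above.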

Without these patterns the ARM backend would sometimes fail during
instruction selection.

This patch also adds patterns which combine:

  • an FP16 element extraction and a store into a single VST1 instruction
  • an FP16 load and insertion into a single VLD1 instruction
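The effect of the two combined forms can be modelled as follows. This is a rough Python sketch of single-lane store and load behaviour, assuming little-endian memory; `vst1_lane` and `vld1_lane` are hypothetical names for this example and say nothing about how the patterns are written in the patch.

```python
def vst1_lane(mem, addr, lanes, lane):
    """Model of a single-lane store: write one 16-bit lane to memory
    without first extracting it into a scalar register."""
    mem[addr:addr + 2] = lanes[lane].to_bytes(2, "little")

def vld1_lane(mem, addr, lanes, lane):
    """Model of a single-lane load: read 16 bits from memory into one
    lane, leaving all other lanes unchanged."""
    out = list(lanes)
    out[lane] = int.from_bytes(mem[addr:addr + 2], "little")
    return out

mem = bytearray(8)
vec = [0x3C00, 0x4000, 0x4200, 0x4400]   # FP16 bit patterns of 1.0 .. 4.0

vst1_lane(mem, 2, vec, 3)                 # extract lane 3 + store, in one step
reloaded = vld1_lane(mem, 2, [0, 0, 0, 0], 1)   # load + insert into lane 1
```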

Event Timeline

miyuki created this revision. May 30 2019, 4:42 AM

Could these be done without having to move to the GPR register file and back? The v8.2A FP16 extension added the VINS and VMOVX instructions, which move data between the top and bottom halves of S registers and look ideal for this.

Extracting an element clearly shouldn't go through GPRs, yes; it can always be done as a no-op copy or a VMOVX.

For inserting an element, I'm not sure there are better sequences in all cases.

miyuki updated this revision to Diff 202412. May 31 2019, 3:55 AM
miyuki edited the summary of this revision.

Changed extraction patterns to avoid using GPRs as intermediate registers.

efriedma accepted this revision. May 31 2019, 1:58 PM

LGTM

We could possibly use a custom inserter to generate the vins sequence, but it would probably involve some benchmarking to make sure there aren't any unexpected performance penalties due to the weird register usage. So I'm happy to put that off for now.

(On a side-note, I think you can insert a float into element zero of a vector with two vext instructions, which is the same number of instructions, but maybe lower latency.)
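One possible reading of the two-vext idea, sketched as a Python model on element lists. It assumes the float already sits in element zero of some vector register, and it models VEXT as taking a window out of the concatenation of its two operands; `vext` and `insert_elem0` are hypothetical names for this illustration, and the actual sequence intended may differ.

```python
def vext(n, m, imm):
    """Model of VEXT: take len(n) consecutive elements, starting at
    index imm, from the concatenation of n followed by m."""
    return (n + m)[imm:imm + len(n)]

def insert_elem0(v, x_vec):
    """Replace element 0 of v with element 0 of x_vec using two VEXTs."""
    t = vext(v, x_vec, 1)           # [v1, v2, v3, x0]
    return vext(t, t, len(v) - 1)   # rotate: [x0, v1, v2, v3]

v = [10.0, 11.0, 12.0, 13.0]
x_vec = [99.0, 0.0, 0.0, 0.0]       # only element 0 matters
result = insert_elem0(v, x_vec)
```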

This revision is now accepted and ready to land. May 31 2019, 1:58 PM
This revision was automatically updated to reflect the committed changes.
miyuki added a comment. Jun 4 2019, 2:53 AM

Thanks for the suggestion. I've raised a ticket in our internal issue tracking system so that we can return to it later.