This is an archive of the discontinued LLVM Phabricator instance.

[X86] AVX512FP16 instructions enabling 1/6
ClosedPublic

Authored by pengfei on Jun 30 2021, 10:39 PM.

Details

Summary
  1. Enable FP16 type support and basic declarations used by following patches.
  2. Enable new instructions VMOVW and VMOVSH.

Ref.: https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html

Diff Detail

Event Timeline

pengfei created this revision.Jun 30 2021, 10:39 PM
pengfei requested review of this revision.Jun 30 2021, 10:39 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptJun 30 2021, 10:39 PM

Could you add a link to a reference?

pengfei edited the summary of this revision. (Show Details)Jul 1 2021, 12:58 AM

Could you add a link to a reference?

Done. Thanks for reminding.

pengfei updated this revision to Diff 355844.Jul 1 2021, 5:11 AM

Minor fixups on comments.

craig.topper added inline comments.Jul 1 2021, 12:17 PM
clang/lib/Headers/avx512fp16intrin.h
52

I think this should be _mm256_undefined_ph(void)

62

I think this should be _mm_undefined_ph(void)

66

I think this should be _mm512_undefined_ph(void)

clang/test/CodeGen/X86/avx512fp16-complex.c
1 ↗(On Diff #355844)

Can we split _Complex out of this patch? This affects other targets that have _Float16 right? So probably needs a different set of reviewers.

clang/test/Sema/Float16.c
13 ↗(On Diff #355844)

It's odd to change behavior and then have a FIXME asking if the old behavior was correct.

llvm/lib/Support/X86TargetParser.cpp
204–208

I think FeaturesICLServer should still be at the beginning of the list. FeatureAVX512FP16 should be alphabetized with the other AVX512 features. Looks like FeatureAVXNNI was already incorrectly alphabetized.

llvm/lib/Target/X86/X86ISelLowering.cpp
18960–18961

I think this comment should mention movsh now.

19110

movsh

23042–23045

This should probably include EltVT==MVT::f16 for the FP16 override?

llvm/lib/Target/X86/X86InstrFragmentsSIMD.td
410

Add a blank line above this to match the original formatting

997

This should be with fp32imm0 and friends.

pengfei updated this revision to Diff 356090.Jul 1 2021, 7:49 PM
pengfei marked 10 inline comments as done.

Address review comments. Thanks Craig!

clang/test/CodeGen/X86/avx512fp16-complex.c
1 ↗(On Diff #355844)

Sure. Split to D105331. Do you know someone who is familiar with or may be interested in it?

llvm/lib/Target/X86/X86ISelLowering.cpp
23042–23045

Maybe we can only check EltVT == MVT::f16 like this?

pengfei updated this revision to Diff 356093.Jul 1 2021, 9:01 PM

Remove complex handing code.

pengfei updated this revision to Diff 356098.Jul 1 2021, 10:08 PM

Remove complex test.

pengfei updated this revision to Diff 356132.Jul 2 2021, 2:08 AM

Update doc and add X86 to the target list that supports _Float16.

RKSimon added inline comments.Jul 2 2021, 6:23 AM
clang/lib/Headers/avx512fp16intrin.h
39

I realize its a lot of work, but is there any chance that we could get doxygen comments to document these intrinsics?

llvm/lib/Target/X86/X86Subtarget.h
748

I'm a little worried this might get confused with hasF16C - am I just being over cautious?

pengfei added inline comments.Jul 2 2021, 9:03 AM
clang/lib/Headers/avx512fp16intrin.h
39

I'm hesitating not only for the work but also the effect. We have about 1K new intrinsics and more than 5K LOC in total in the two header files. Adding the doxygen comments will make the readability worse and increase the difficulty in review. It's also a burden in maintaining the correctness.
Do you think it's feasible to only add a link to intrinsic guide? We have decided to only using link that points intrinsic guide in our product compiler. Using one source is friendly to maintainess. And I think intrinsic guide is also easy to use that doxygen.

llvm/lib/Target/X86/X86Subtarget.h
748

Make sense. How about hasAVX512FP16? I can update the name as a followup patch once these patches merged.

craig.topper added inline comments.Jul 2 2021, 9:10 AM
llvm/lib/Target/X86/X86Subtarget.h
748

That sounds good to me. We should maybe go back and update some of the others. Especially VNNI since we also have AVXVNNI.

RKSimon added inline comments.Jul 2 2021, 9:28 AM
clang/lib/Headers/avx512fp16intrin.h
39

I completely understand where you're coming from. What we do lose is the ability for code editors to display the doxygen when using the intrinsic (or mouseover the code). Are there any particular intrinsics that we could do with having comments closer at hand - ones that take rounding modes that its tricky to remember the enum/defines for or implicit load/store alignments come to mind?

I'm not sure about the idea of linking to external docs for specs - do we have a style guide policy on this?

pengfei updated this revision to Diff 356376.Jul 4 2021, 6:21 AM

Add a few doxygen comments.

pengfei added inline comments.Jul 4 2021, 6:22 AM
clang/lib/Headers/avx512fp16intrin.h
39

Are there any particular intrinsics that we could do with having comments closer at hand

I only found 3 ones from avx512fintrin.h, anyway, I copied here.

ones that take rounding modes that its tricky to remember the enum/defines for or implicit load/store alignments come to mind

Unfortunately, we didn't add doc for them when enabling avx512 intrinsics.

I'm not sure about the idea of linking to external docs for specs - do we have a style guide policy on this?

I was thinking some thing like "See https://llvm.org/LICENSE.txt for license information." in most source files. But I agree doxygen helps for code editors. I didn't think of them simply because I never used them :)
I had some thought about writing a tool to help transporting intrinsic guide info to doxygen, but haven't yet found time to do it.

Anyway, I guess this is not the block issue for this series patches, right?

skan added a subscriber: skan.Jul 5 2021, 11:50 PM
craig.topper added inline comments.Jul 6 2021, 10:17 AM
clang/lib/Headers/avx512fp16intrin.h
255

256-bit

clang/lib/Headers/avx512vlfp16intrin.h
75

Why do we use _mm256_set1_epi32 instead of _mm256_set1_epi16?

79

Same question

llvm/include/llvm/IR/RuntimeLibcalls.def
290

Is this tested in this patch?

llvm/lib/Target/X86/X86FastISel.cpp
58

AVX here should maybe be AVX512, but maybe this is pointing out that this name is bad. Would X86ScalarXMMf* be better?

llvm/lib/Target/X86/X86ISelLowering.cpp
13537

Drop curly braces on these.

pengfei updated this revision to Diff 357695.Jul 9 2021, 10:53 PM
pengfei marked 2 inline comments as done.
  1. Address Craig's comments.
  2. Add more patterns for i16 lowering.
clang/lib/Headers/avx512vlfp16intrin.h
75

There's no difference in assembly for immediate value. https://godbolt.org/z/sMbrM611d. But the latency of vpbroadcastd is better than vpbroadcastw in Skylake according to intrinsic guide. Here the only effect is consist with _mm256_and_epi32. Do you think it's better to use _mm256_set1_epi16?

llvm/include/llvm/IR/RuntimeLibcalls.def
290

No. I'll move it to the 3rd patch and test it there.

llvm/lib/Target/X86/X86FastISel.cpp
58

Maybe we can use X86ScalarSSEf16, here SSE means SSE registers? Especially GCC community proposing to support FP16 since SSE2.

pengfei updated this revision to Diff 357715.Jul 10 2021, 7:06 AM

Fix another regression caused by last update.

pengfei updated this revision to Diff 363946.Aug 3 2021, 7:49 PM

Rebased.

LuoYuanke added inline comments.Aug 4 2021, 7:01 AM
clang/lib/CodeGen/TargetInfo.cpp
3413

float -> half?

clang/lib/Headers/avx512fp16intrin.h
293

Just be curious, why not directly use __W?

320

What is may_alias used for?

351

I see in _mm_mask_load_sh(), we create a __m128h with upper bits zero, not sure we also need it in store intrinsic.

420

Why not return __a[0] directly?

clang/test/CodeGen/X86/avx512fp16-abi.c
90

Any false test case that have padding between a and b?

llvm/include/llvm/IR/Intrinsics.td
315

Not sure about the legacy comments, should it be _Float16 now?

llvm/include/llvm/Target/TargetSelectionDAG.td
1055

I notice it is true for other extload. Is it same to "true"?

llvm/lib/Target/X86/Disassembler/X86Disassembler.cpp
341

This is the same to ((byte1 & 0x8) == 0x0)?

LuoYuanke added inline comments.Aug 5 2021, 6:44 AM
llvm/lib/Target/X86/MCTargetDesc/X86MCCodeEmitter.cpp
801

Add comments for map5 and map6?

llvm/lib/Target/X86/X86.td
189

customize?

llvm/lib/Target/X86/X86FastISel.cpp
2291

Also add it in isCMOVPseudo()?

llvm/lib/Target/X86/X86ISelLowering.cpp
1944

Drop the brace.

10523–10524

Need check Subtarget.hasFP16()?

10525

Why handle i16? Isn't it handled by movw?

10716

Why exclude f16? Is there better choice for fp16?

18961–18963

movss/movsh

LuoYuanke added inline comments.Aug 6 2021, 6:56 AM
llvm/lib/Target/X86/X86InstrAVX512.td
82

indent

3878

Not sure this can be merged to 512 version load/store pattern with muticlass by abstract type info.

4159

Why there is no OptForSize for vmovsh?

4478

Sorry, I forgot what REV stand for. Do you know it?
Is this just encoding difference for register operand compared with VMOVSHZrr? What is it used for?

llvm/lib/Target/X86/X86RegisterInfo.td
570

Given there is only EVEX instructions for fp16, is it necessary to add f16 type to it?

572

Ditto.

LuoYuanke added inline comments.Aug 6 2021, 7:04 AM
llvm/test/CodeGen/X86/vector-reduce-fmax-nnan.ll
374

Why this test case changes? Shall we add -mattr=+avx512fp16 to run?

llvm/test/CodeGen/X86/vector-reduce-fmin-nnan.ll
373

Ditto.

pengfei marked 7 inline comments as done.Aug 6 2021, 9:12 AM

Thanks Yuanke.

clang/lib/Headers/avx512fp16intrin.h
293

First, this is a simple mimic of _mm_mask_load_ss.
I think the reason is the intrinsic requests dst[MAX:16] := 0, while the builtin returns with src[MAX:16].
So we need to explicitly clear the upper bits.

320

This is used for preventing type-based alias analysis.
https://gcc.gnu.org/onlinedocs/gcc/Common-Type-Attributes.html#Common-Type-Attributes

"In the context of section 6.5 paragraph 7 of the C99 standard, an lvalue expression dereferencing such a pointer is treated like having a character type."
"This extension exists to support some vector APIs, in which pointers to one vector type are permitted to alias pointers to a different vector type."

351

Both load and store intrinsics only access 16bit memory, the different is the load intrinsic needs to set up the high bits of the XMM register (because we do return a 128 bits result). We don't need to do that for a store.

420

Because __m128i is defined as <2 x i64>. __a[0] is correct only for i64 type.

clang/test/CodeGen/X86/avx512fp16-abi.c
90

This is the one with padding, since _Float16 aligns to 2 bytes while float aligns to 4.

llvm/include/llvm/IR/Intrinsics.td
315

LLVM IR serves for not only one type. __fp16 is still usable in Clang. Besides, OpenCL half type also use half in IR. And maybe we have other FE types too. So I'd like to keep it as is unless we have a better way to cover all other FE types.

llvm/include/llvm/Target/TargetSelectionDAG.td
1055

Good catch. I noticed it too, but forgot to change it.

llvm/lib/Target/X86/Disassembler/X86Disassembler.cpp
341

Yes, but I'm not sure if this is intentional. Maybe it keeps the shape in & X == X?

llvm/lib/Target/X86/X86.td
189

customise seems correct too. Anyway, I can change it.

llvm/lib/Target/X86/X86ISelLowering.cpp
10523–10524

No, f16 is legal here, so it implies the feature.

10525

No, we don't have a movw instruction.

10716

We prefer to using shuffle vector rather than insert_vector_elt here, because we don't have a insert instruction for half type.

llvm/lib/Target/X86/X86InstrAVX512.td
3878

I think it is probably feasible. We may add a codegen only opcode to reuse VMOVDQU instruction defination.
But that may need careful tune, so I think we can do it as a followup.

4159

Good catch. I think we should add it here.

4478

I think REV is short for revert. Which allows a different encoding when operands order are reverted.
Yes. It's used for a different encoding.

llvm/lib/Target/X86/X86RegisterInfo.td
570

I think so. For example, we may use some i16 instructions which may be or may finally turn into AVX2 ones. Adding to it is useful for them since VR128 is subset of VR128X.

llvm/test/CodeGen/X86/vector-reduce-fmax-nnan.ll
374

Because we allowed one combine after X86ISelLowering.cpp:41180 without check the feature.
Although it seems the code here is correct and better, I'll add the check for feature in case any confusing.
We do have the test for avx512fp16 in D105264.

pengfei updated this revision to Diff 364817.Aug 6 2021, 9:13 AM

Address Yuanke's comments.

craig.topper added inline comments.Aug 6 2021, 9:24 AM
llvm/lib/Target/X86/X86InstrAVX512.td
4478

It is short for "reverse". Meaing the operands are in the reversed order. There are two valid encodings moving from one register to another. This happens because there are separate opcodes for moving register to memory(Store) and moving memory to register(load). The memory operand for both of those opcodes can be a register as well. The assembler and isel always uses the register to register version of the load opcode. The reversed version is only used by the disassembler

There is an exception to that. For VEX encoded AVX/AVX2 instructions, X86MCInstLowering will use an _REV move if it allows a 2 byte VEX prefix instead of a 3 byte VEX prefix. This doesn't apply to any AVX512 instructions though.

pengfei added inline comments.Aug 6 2021, 5:33 PM
llvm/lib/Target/X86/X86InstrAVX512.td
4478

Thanks Craig for the information.

LuoYuanke added inline comments.Aug 6 2021, 5:37 PM
llvm/lib/Target/X86/X86InstrAVX512.td
4478

It is short for "reverse". Meaing the operands are in the reversed order. There are two valid encodings moving from one register to another. This happens because there are separate opcodes for moving register to memory(Store) and moving memory to register(load). The memory operand for both of those opcodes can be a register as well. The assembler and isel always uses the register to register version of the load opcode. The reversed version is only used by the disassembler

There is an exception to that. For VEX encoded AVX/AVX2 instructions, X86MCInstLowering will use an _REV move if it allows a 2 byte VEX prefix instead of a 3 byte VEX prefix. This doesn't apply to any AVX512 instructions though.

I understand now. Thanks, Craig and Pengfei.

pengfei updated this revision to Diff 364960.Aug 7 2021, 7:59 AM

Add missing changes from Yuanke's comments.

pengfei updated this revision to Diff 365019.Aug 8 2021, 7:17 AM

Fix ABI incompatibility issue when a structure has three half or a float and half.

LuoYuanke added inline comments.Aug 8 2021, 7:47 AM
clang/lib/CodeGen/TargetInfo.cpp
3487

For 2 float, return <2xfloat> to be compatible to previous ABI?

pengfei added inline comments.Aug 8 2021, 7:54 AM
clang/lib/CodeGen/TargetInfo.cpp
3487

It is already handled in line 3456.

pengfei updated this revision to Diff 365024.Aug 8 2021, 7:56 AM

Fix a Lint warning.

pengfei updated this revision to Diff 365070.Aug 8 2021, 8:32 PM

Add override for <3 x half>.

LuoYuanke accepted this revision.Aug 9 2021, 1:01 AM

LGTM, but may wait 1 or 2 days for the comments from others.

This revision is now accepted and ready to land.Aug 9 2021, 1:01 AM
craig.topper added inline comments.Aug 9 2021, 9:36 AM
clang/docs/LanguageExtensions.rst
599 ↗(On Diff #365070)

Might be worth mentioning that it requires AVX512FP16 here

clang/lib/CodeGen/TargetInfo.cpp
2825

Merge with the previous if?

2956

Merge with the FloatTy if?

pengfei updated this revision to Diff 365326.Aug 9 2021, 6:02 PM
pengfei marked 3 inline comments as done.

Address review comments. Thanks Craig.

This revision was landed with ongoing or failed builds.Aug 9 2021, 9:46 PM
This revision was automatically updated to reflect the committed changes.
pengfei added inline comments.Aug 11 2021, 12:55 AM
llvm/lib/Target/X86/X86InstrAVX512.td
4159

Sorry, I think we should not add OptForSize here.
This predicate is used to force to select blend instead of mov due to performance consideration.
E.g.: https://godbolt.org/z/W4v38K6va

Since we don't have a blendph instruction, I think we can always select it to movsh. Not sure if using pblendw is beneficial.
I'll change it back in next patch.

I was tracking back a recent ABI break (also failing now in gcc 12, so maybe this irregularity is intentional), and was concerned that this commit is observed to cause the platform ABI to change depending on the feature flags of the current compilation unit. Prior to this change, f16 was always treated as i16 for the purpose of the calling-convention (e.g. returned in %ax). But after this change, the ABI of the value is now inconsistent between compile units. I made a small change to one of the existing tests to show this. Note how the callq result was in %ax without this mattr flag, and in %xmm0 with this mattr flag added. But the function known as "identity.half" is external, and did not change between those two calls to the llvm.

diff --git a/llvm/test/CodeGen/X86/half.ll b/llvm/test/CodeGen/X86/half.ll
index 46179e7d9113..8c1b8c4b76ff 100644
--- a/llvm/test/CodeGen/X86/half.ll
+++ b/llvm/test/CodeGen/X86/half.ll
@@ -5,6 +5,8 @@
 ; RUN:   | FileCheck %s -check-prefixes=CHECK,CHECK-LIBCALL,BWOFF
 ; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+f16c -fixup-byte-word-insts=1 \
 ; RUN:    | FileCheck %s -check-prefixes=CHECK,BWON,BWON-F16C
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+avx512fp16 -fixup-byte-word-insts=0 \
+; RUN:    | FileCheck %s -check-prefixes=CHECK-CC
 ; RUN: llc < %s -mtriple=i686-unknown-linux-gnu -mattr +sse2 -fixup-byte-word-insts=0  \
 ; RUN:    | FileCheck %s -check-prefixes=CHECK-I686

@@ -163,16 +199,31 @@ define void @test_trunc32(float %in, half* %addr) #0 {
   ret void
 }
 
+declare half @identity.half(half)
+
 define void @test_trunc64(double %in, half* %addr) #0 {
 ; CHECK-LABEL: test_trunc64:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    pushq %rbx
 ; CHECK-NEXT:    movq %rdi, %rbx
 ; CHECK-NEXT:    callq __truncdfhf2@PLT
+; CHECK-NEXT:    # kill: def $ax killed $ax def $eax
+; CHECK-NEXT:    movl %eax, %edi
+; CHECK-NEXT:    callq identity.half@PLT
 ; CHECK-NEXT:    movw %ax, (%rbx)
 ; CHECK-NEXT:    popq %rbx
 ; CHECK-NEXT:    retq
 ;
+; CHECK-CC-LABEL: test_trunc64:
+; CHECK-CC:       # %bb.0:
+; CHECK-CC-NEXT:    pushq %rbx
+; CHECK-CC-NEXT:    movq %rdi, %rbx
+; CHECK-CC-NEXT:    vcvtsd2sh %xmm0, %xmm0, %xmm0
+; CHECK-CC-NEXT:    callq identity.half@PLT
+; CHECK-CC-NEXT:    vmovsh %xmm0, (%rbx)
+; CHECK-CC-NEXT:    popq %rbx
+; CHECK-CC-NEXT:    retq
+;
 ; CHECK-I686-LABEL: test_trunc64:
 ; CHECK-I686:       # %bb.0:
 ; CHECK-I686-NEXT:    pushl %esi
@@ -181,12 +232,16 @@ define void @test_trunc64(double %in, half* %addr) #0 {
 ; CHECK-I686-NEXT:    movsd {{.*#+}} xmm0 = mem[0],zero
 ; CHECK-I686-NEXT:    movsd %xmm0, (%esp)
 ; CHECK-I686-NEXT:    calll __truncdfhf2
+; CHECK-I686-NEXT:    # kill: def $ax killed $ax def $eax
+; CHECK-I686-NEXT:    movl %eax, (%esp)
+; CHECK-I686-NEXT:    calll identity.half@PLT
 ; CHECK-I686-NEXT:    movw %ax, (%esi)
 ; CHECK-I686-NEXT:    addl $8, %esp
 ; CHECK-I686-NEXT:    popl %esi
 ; CHECK-I686-NEXT:    retl
   %val16 = fptrunc double %in to half
-  store half %val16, half* %addr
+  %val16b = call half @identity.half(half %val16)
+  store half %val16b, half* %addr
   ret void
 }

Is this intentional? We do already have code to handle the ABI dependency on vector-sizes, and could add this to the list of flags that change the ABI (i.e. we disable it if it will break the ABI), but wanted to confirm first if that was the intent here.

discovered from https://github.com/JuliaLang/julia/issues/44829

Herald added a project: Restricted Project. · View Herald TranscriptApr 12 2022, 3:41 PM
Herald added a subscriber: StephenFan. · View Herald Transcript