Page MenuHomePhabricator

[X86] AVX512FP16 instructions enabling 1/6
ClosedPublic

Authored by pengfei on Jun 30 2021, 10:39 PM.

Details

Summary
  1. Enable FP16 type support and basic declarations used by following patches.
  2. Enable new instructions VMOVW and VMOVSH.

Ref.: https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
pengfei marked 2 inline comments as done.Jul 9 2021, 10:53 PM
  1. Address Craig's comments.
  2. Add more patterns for i16 lowering.
clang/lib/Headers/avx512vlfp16intrin.h
75

There's no difference in assembly for immediate value. https://godbolt.org/z/sMbrM611d. But the latency of vpbroadcastd is better than vpbroadcastw in Skylake according to intrinsic guide. Here the only effect is consist with _mm256_and_epi32. Do you think it's better to use _mm256_set1_epi16?

llvm/include/llvm/IR/RuntimeLibcalls.def
290 ↗(On Diff #356376)

No. I'll move it to the 3rd patch and test it there.

llvm/lib/Target/X86/X86FastISel.cpp
58

Maybe we can use X86ScalarSSEf16, here SSE means SSE registers? Especially GCC community proposing to support FP16 since SSE2.

pengfei updated this revision to Diff 357715.Jul 10 2021, 7:06 AM

Fix another regression caused by last update.

pengfei updated this revision to Diff 363946.Aug 3 2021, 7:49 PM

Rebased.

LuoYuanke added inline comments.Aug 4 2021, 7:01 AM
clang/lib/CodeGen/TargetInfo.cpp
3405

float -> half?

clang/lib/Headers/avx512fp16intrin.h
292

Just be curious, why not directly use __W?

319

What is may_alias used for?

350

I see in _mm_mask_load_sh(), we create a __m128h with upper bits zero, not sure we also need it in store intrinsic.

419

Why not return __a[0] directly?

clang/test/CodeGen/X86/avx512fp16-abi.c
89

Any false test case that have padding between a and b?

llvm/include/llvm/IR/Intrinsics.td
315

Not sure about the legacy comments, should it be _Float16 now?

llvm/include/llvm/Target/TargetSelectionDAG.td
1054

I notice it is true for other extload. Is it same to "true"?

llvm/lib/Target/X86/Disassembler/X86Disassembler.cpp
341

This is the same to ((byte1 & 0x8) == 0x0)?

LuoYuanke added inline comments.Aug 5 2021, 6:44 AM
llvm/lib/Target/X86/MCTargetDesc/X86MCCodeEmitter.cpp
801

Add comments for map5 and map6?

llvm/lib/Target/X86/X86.td
189

customize?

llvm/lib/Target/X86/X86FastISel.cpp
2291

Also add it in isCMOVPseudo()?

llvm/lib/Target/X86/X86ISelLowering.cpp
1946

Drop the brace.

10549

Need check Subtarget.hasFP16()?

10551

Why handle i16? Isn't it handled by movw?

10744

Why exclude f16? Is there better choice for fp16?

19023

movss/movsh

LuoYuanke added inline comments.Aug 6 2021, 6:56 AM
llvm/lib/Target/X86/X86InstrAVX512.td
82

indent

3878

Not sure this can be merged to 512 version load/store pattern with muticlass by abstract type info.

4159

Why there is no OptForSize for vmovsh?

4478

Sorry, I forgot what REV stand for. Do you know it?
Is this just encoding difference for register operand compared with VMOVSHZrr? What is it used for?

llvm/lib/Target/X86/X86RegisterInfo.td
570

Given there is only EVEX instructions for fp16, is it necessary to add f16 type to it?

572

Ditto.

LuoYuanke added inline comments.Aug 6 2021, 7:04 AM
llvm/test/CodeGen/X86/vector-reduce-fmax-nnan.ll
374

Why this test case changes? Shall we add -mattr=+avx512fp16 to run?

llvm/test/CodeGen/X86/vector-reduce-fmin-nnan.ll
373

Ditto.

pengfei marked 7 inline comments as done.Aug 6 2021, 9:12 AM

Thanks Yuanke.

clang/lib/Headers/avx512fp16intrin.h
292

First, this is a simple mimic of _mm_mask_load_ss.
I think the reason is the intrinsic requests dst[MAX:16] := 0, while the builtin returns with src[MAX:16].
So we need to explicitly clear the upper bits.

319

This is used for preventing type-based alias analysis.
https://gcc.gnu.org/onlinedocs/gcc/Common-Type-Attributes.html#Common-Type-Attributes

"In the context of section 6.5 paragraph 7 of the C99 standard, an lvalue expression dereferencing such a pointer is treated like having a character type."
"This extension exists to support some vector APIs, in which pointers to one vector type are permitted to alias pointers to a different vector type."

350

Both load and store intrinsics only access 16bit memory, the different is the load intrinsic needs to set up the high bits of the XMM register (because we do return a 128 bits result). We don't need to do that for a store.

419

Because __m128i is defined as <2 x i64>. __a[0] is correct only for i64 type.

clang/test/CodeGen/X86/avx512fp16-abi.c
89

This is the one with padding, since _Float16 aligns to 2 bytes while float aligns to 4.

llvm/include/llvm/IR/Intrinsics.td
315

LLVM IR serves for not only one type. __fp16 is still usable in Clang. Besides, OpenCL half type also use half in IR. And maybe we have other FE types too. So I'd like to keep it as is unless we have a better way to cover all other FE types.

llvm/include/llvm/Target/TargetSelectionDAG.td
1054

Good catch. I noticed it too, but forgot to change it.

llvm/lib/Target/X86/Disassembler/X86Disassembler.cpp
341

Yes, but I'm not sure if this is intentional. Maybe it keeps the shape in & X == X?

llvm/lib/Target/X86/X86.td
189

customise seems correct too. Anyway, I can change it.

llvm/lib/Target/X86/X86ISelLowering.cpp
10549

No, f16 is legal here, so it implies the feature.

10551

No, we don't have a movw instruction.

10744

We prefer to using shuffle vector rather than insert_vector_elt here, because we don't have a insert instruction for half type.

llvm/lib/Target/X86/X86InstrAVX512.td
3878

I think it is probably feasible. We may add a codegen only opcode to reuse VMOVDQU instruction defination.
But that may need careful tune, so I think we can do it as a followup.

4159

Good catch. I think we should add it here.

4478

I think REV is short for revert. Which allows a different encoding when operands order are reverted.
Yes. It's used for a different encoding.

llvm/lib/Target/X86/X86RegisterInfo.td
570

I think so. For example, we may use some i16 instructions which may be or may finally turn into AVX2 ones. Adding to it is useful for them since VR128 is subset of VR128X.

llvm/test/CodeGen/X86/vector-reduce-fmax-nnan.ll
374

Because we allowed one combine after X86ISelLowering.cpp:41180 without check the feature.
Although it seems the code here is correct and better, I'll add the check for feature in case any confusing.
We do have the test for avx512fp16 in D105264.

pengfei updated this revision to Diff 364817.Aug 6 2021, 9:13 AM

Address Yuanke's comments.

craig.topper added inline comments.Aug 6 2021, 9:24 AM
llvm/lib/Target/X86/X86InstrAVX512.td
4478

It is short for "reverse". Meaing the operands are in the reversed order. There are two valid encodings moving from one register to another. This happens because there are separate opcodes for moving register to memory(Store) and moving memory to register(load). The memory operand for both of those opcodes can be a register as well. The assembler and isel always uses the register to register version of the load opcode. The reversed version is only used by the disassembler

There is an exception to that. For VEX encoded AVX/AVX2 instructions, X86MCInstLowering will use an _REV move if it allows a 2 byte VEX prefix instead of a 3 byte VEX prefix. This doesn't apply to any AVX512 instructions though.

pengfei added inline comments.Aug 6 2021, 5:33 PM
llvm/lib/Target/X86/X86InstrAVX512.td
4478

Thanks Craig for the information.

LuoYuanke added inline comments.Aug 6 2021, 5:37 PM
llvm/lib/Target/X86/X86InstrAVX512.td
4478

It is short for "reverse". Meaing the operands are in the reversed order. There are two valid encodings moving from one register to another. This happens because there are separate opcodes for moving register to memory(Store) and moving memory to register(load). The memory operand for both of those opcodes can be a register as well. The assembler and isel always uses the register to register version of the load opcode. The reversed version is only used by the disassembler

There is an exception to that. For VEX encoded AVX/AVX2 instructions, X86MCInstLowering will use an _REV move if it allows a 2 byte VEX prefix instead of a 3 byte VEX prefix. This doesn't apply to any AVX512 instructions though.

I understand now. Thanks, Craig and Pengfei.

pengfei updated this revision to Diff 364960.Aug 7 2021, 7:59 AM

Add missing changes from Yuanke's comments.

pengfei updated this revision to Diff 365019.Aug 8 2021, 7:17 AM

Fix ABI incompatibility issue when a structure has three half or a float and half.

LuoYuanke added inline comments.Aug 8 2021, 7:47 AM
clang/lib/CodeGen/TargetInfo.cpp
3479

For 2 float, return <2xfloat> to be compatible to previous ABI?

pengfei added inline comments.Aug 8 2021, 7:54 AM
clang/lib/CodeGen/TargetInfo.cpp
3479

It is already handled in line 3456.

pengfei updated this revision to Diff 365024.Aug 8 2021, 7:56 AM

Fix a Lint warning.

pengfei updated this revision to Diff 365070.Aug 8 2021, 8:32 PM

Add override for <3 x half>.

LuoYuanke accepted this revision.Aug 9 2021, 1:01 AM

LGTM, but may wait 1 or 2 days for the comments from others.

This revision is now accepted and ready to land.Aug 9 2021, 1:01 AM
craig.topper added inline comments.Aug 9 2021, 9:36 AM
clang/docs/LanguageExtensions.rst
599

Might be worth mentioning that it requires AVX512FP16 here

clang/lib/CodeGen/TargetInfo.cpp
2817

Merge with the previous if?

2948

Merge with the FloatTy if?

pengfei updated this revision to Diff 365326.Aug 9 2021, 6:02 PM
pengfei marked 3 inline comments as done.

Address review comments. Thanks Craig.

This revision was landed with ongoing or failed builds.Aug 9 2021, 9:46 PM
This revision was automatically updated to reflect the committed changes.
pengfei added inline comments.Aug 11 2021, 12:55 AM
llvm/lib/Target/X86/X86InstrAVX512.td
4159

Sorry, I think we should not add OptForSize here.
This predicate is used to force to select blend instead of mov due to performance consideration.
E.g.: https://godbolt.org/z/W4v38K6va

Since we don't have a blendph instruction, I think we can always select it to movsh. Not sure if using pblendw is beneficial.
I'll change it back in next patch.