This is an archive of the discontinued LLVM Phabricator instance.

[X86] AVX512FP16 instructions enabling 5/6
ClosedPublic

Authored by pengfei on Jul 1 2021, 12:23 AM.

Diff Detail

Event Timeline

pengfei created this revision.Jul 1 2021, 12:23 AM
pengfei requested review of this revision.Jul 1 2021, 12:23 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptJul 1 2021, 12:23 AM
pengfei edited the summary of this revision. (Show Details)Jul 1 2021, 1:00 AM
spatel added inline comments.Jul 5 2021, 11:32 AM
llvm/test/CodeGen/X86/stack-folding-fp-avx512fp16vl-fma.ll
194–195

I was just scanning through this patch and noticed the capitalization mismatch on these lines and others. This test has no valid checks as written?

pengfei added inline comments.Jul 5 2021, 6:35 PM
llvm/test/CodeGen/X86/stack-folding-fp-avx512fp16vl-fma.ll
194–195

Good catch! Yes, these two tests were manually written and lit doesn't report fail for such mismatch.

pengfei updated this revision to Diff 356585.Jul 5 2021, 6:35 PM

Fix the capitalization mismatch in tests. Thanks Sanjay!

RKSimon added inline comments.Aug 17 2021, 2:40 AM
clang/lib/Headers/avx512fp16intrin.h
2429

Add outer brackets to all the defines to prevent precedence issues:

#define _mm512_fmadd_round_ph(A, B, C, R)                                      \
 ((__m512h) __builtin_ia32_vfmaddph512_mask(                                   \
      (__v32hf)(__m512h)(A), (__v32hf)(__m512h)(B), (__v32hf)(__m512h)(C),     \
      (__mmask32)-1, (int)(R)))
pengfei updated this revision to Diff 366873.Aug 17 2021, 6:00 AM

Rebased.
Add extra parentheses for macro.

clang/lib/Headers/avx512fp16intrin.h
2429

Thanks Simon.

LuoYuanke added inline comments.Aug 19 2021, 8:00 AM
clang/include/clang/Basic/BuiltinsX86.def
2010

Can we arrange the vfmaddph variant together? Move it to line 1997?
Why there is no mask version for 128 and 256?

2014

What does "3" stand for?

clang/lib/Headers/avx512vlfp16intrin.h
1385

Sorry, I'm confused sometimes we use mask builtin, sometimes we use select builtin. Any guideline on it?

llvm/include/llvm/IR/IntrinsicsX86.td
5713

I notice there is no builtin bound to this intrinsic. What is it used for?

5731

ph?

llvm/lib/Target/X86/X86InstrFMA3Info.cpp
145

Looks some redundant logic. Only X86II::EVEX and X86II::T_MAP6 is special for FP16?

160

Can we integrate it to FMA3GROUP_PACKED_AVX512() with PH extended?

163

Ditto.

166

Ditto.

170

Ditto.

174

Seems we only need FP16Groups be separate table.

179

Ditto.

llvm/test/CodeGen/X86/avx512fp16-fma-commute.ll
10

The name 123 is not the same with the generated instruction (213sh). Is it expected?

llvm/test/CodeGen/X86/vec-strict-128-fp16.ll
105

Is it necessary to test 132, 231 version?

llvm/test/CodeGen/X86/vec-strict-256-fp16.ll
105

Ditto.

llvm/test/CodeGen/X86/vec-strict-512-fp16.ll
104

Ditto.

craig.topper added inline comments.Aug 19 2021, 8:23 AM
clang/include/clang/Basic/BuiltinsX86.def
2014

The 3 is there because AMD's 4 operand fma used vfmaddss/vfmaddsd. So Intel's 3 operand used vfmaddss3/vfmaddsd3. That naming is being carried forward here.

clang/lib/Headers/avx512vlfp16intrin.h
1385

Ideally FP should never use select because it doesn't convey that exceptions should be masked for strictfp. But the mistake was already made for add/sub/mul/div/fma/etc years ago before strictfp support existed in llvm. fp16 is intentionally following float/double for consistency.

llvm/include/llvm/IR/IntrinsicsX86.td
5713

It is manually selected in CGBuiltin.cpp

5731

It's scalar so it shouldn't be ph. This matches int_x86_avx512_vfmadd_f32 and int_x86_avx512_vfmadd_f64. They don't use ss/sd because the ss/sd names are usually used for intrinsics that 128-bit operands and only modify the lower element. int_x86_avx512_vfmadd_f32 and int_x86_avx512_vfmadd_f64 have float/double inputs and produce float/double results.

llvm/test/CodeGen/X86/avx512fp16-fma-commute.ll
10

123 represents how the 3 arguments to the fucntion are mapped to the 3 intrinsic arguments that it calls. There are 6 possible permutations which are all tested here, but only 3 instruction mnemonics.

I understand now. Thanks, Craig.

pengfei updated this revision to Diff 367957.Aug 21 2021, 8:03 AM
pengfei marked 8 inline comments as done.

Address Yuanke's comments. Thanks Yuanke and Craig.

clang/include/clang/Basic/BuiltinsX86.def
2010

We followed what're ps/pd doing. As Craig explained, this is history's legacy. We should fix them in future.

llvm/test/CodeGen/X86/vec-strict-128-fp16.ll
105

213 is the preferred version due to its order in SelectCode table. We can test others by using memory input, but they are covered in stack-folding-fp-avx512fp16vl-fma.ll. I don't think it's necessary to test here.

LuoYuanke accepted this revision.Aug 21 2021, 7:23 PM

LGTM, may wait 1 or 2 days for comments from others.

This revision is now accepted and ready to land.Aug 21 2021, 7:23 PM
This revision was landed with ongoing or failed builds.Aug 23 2021, 6:40 PM
This revision was automatically updated to reflect the committed changes.
llvm/test/CodeGen/X86/fp-strict-scalar-fp16.ll