This is an archive of the discontinued LLVM Phabricator instance.

[X86][RFC] Using `__bf16` for AVX512_BF16 intrinsics
ClosedPublic

Authored by pengfei on Aug 21 2022, 8:47 AM.

Details

Summary

This is an alternative of D120395 and D120411.

Previously we use __bfloat16 as a typedef of unsigned short. The
name may give user an impression it is a brand new type to represent
BF16. So that they may use it in arithmetic operations and we don't have
a good way to block it.

To solve the problem, we introduced __bf16 to X86 psABI and landed the
support in Clang by D130964. Now we can solve the problem by switching
intrinsics to the new type.

Diff Detail

Event Timeline

pengfei created this revision.Aug 21 2022, 8:47 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 21 2022, 8:47 AM
pengfei requested review of this revision.Aug 21 2022, 8:47 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptAug 21 2022, 8:47 AM
LuoYuanke added inline comments.Aug 21 2022, 11:57 PM
clang/lib/AST/MicrosoftMangle.cpp
2472 ↗(On Diff #454316)

This looks irrelative to the patch.

clang/test/CodeGen/X86/avx512bf16-builtins.c
7–8

Add a test case for __bfloat16 to test compatibility?

llvm/include/llvm/IR/IntrinsicsX86.td
4904

Probably need to upgrade the old intrinsics to new version for IR compatibility or we can keep IR unchanged and just generate bitcast from bfloat16 to i16.

llvm/lib/Target/X86/X86ISelLowering.cpp
2182

Not sure about this. Does it make bf16 legal type?

pengfei updated this revision to Diff 455906.Aug 26 2022, 7:40 AM

Address Yuanke's comments.

clang/lib/AST/MicrosoftMangle.cpp
2472 ↗(On Diff #454316)

The use of __bf16 in intrinsics leads to new lit fails due to no mangling support on Windows. But I can do it in a separate patch.

clang/test/CodeGen/X86/avx512bf16-builtins.c
7–8

GCC folks prefer to not providing __bfloat16, but I'd like to deprecate it first. So we don't need test for it.

llvm/include/llvm/IR/IntrinsicsX86.td
4904

Good suggestion!

llvm/lib/Target/X86/X86ISelLowering.cpp
2182

Good catch! I made it legal to lower BUILD_VECTOR. But yes, it results in the scalar lowering failing with AVX512BF16.
I fixed the problem by adding customized code. It works for both scalar lowering and AVX512BF16 intrinsics lowering now.

pengfei updated this revision to Diff 455911.Aug 26 2022, 8:01 AM

Added upgrade tests and fixed a bug found by the test.

pengfei updated this revision to Diff 461480.Sep 20 2022, 12:58 AM

Rebase and ping.

A few minors - and this probably needs a release notes entry for 16.x?

llvm/lib/Target/X86/X86ISelLowering.cpp
2185

Isn't MVT::bf16 scalar?

llvm/test/CodeGen/X86/avx512bf16-intrinsics-upgrade.ll
30

any chance we can recover the predicated instruction?

pengfei updated this revision to Diff 461812.Sep 21 2022, 1:47 AM

Make vector operations of bf16 Expand.

llvm/lib/Target/X86/X86ISelLowering.cpp
2185

Yes, when legalize the source operand, the legalizer gets action as the type action, i.e., TypeSoftPromoteHalf. However, we don't provide methods to handle any vector actions in soft promote. So we need to set it Custom here to do the customization.

pengfei added inline comments.Sep 21 2022, 1:48 AM
llvm/test/CodeGen/X86/avx512bf16-intrinsics-upgrade.ll
30

It's possible, e.g., iterate all users of the intrinsic, bitcast all the select operands as well; or add patterns for i16; or make vselect peek through bitcast etc.
But I think the small performance regression is not a critical requirement as the backward compatibility for the old intrinsics. It may not worth the code complexity.

RKSimon added inline comments.Sep 21 2022, 3:47 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
2185

OK - please can you add a short comment explaining that

pengfei updated this revision to Diff 461875.Sep 21 2022, 6:26 AM

Add comment for BUILD_VECTOR.

pengfei marked an inline comment as done.Sep 21 2022, 6:27 AM
RKSimon added inline comments.Sep 21 2022, 7:18 AM
llvm/test/CodeGen/X86/avx512bf16-intrinsics-upgrade.ll
30

OK - how come the mask_move_lowering_f16_bf16 refactoring in X86InstrAVX512.td didn't fix this?

pengfei added inline comments.Sep 21 2022, 7:33 AM
llvm/test/CodeGen/X86/avx512bf16-intrinsics-upgrade.ll
30

The mask_move_lowering_f16_bf16 should do nothing with it. I think the problem is after AutoUpgrade the IR becomes:

%0 = tail call <32 x bfloat> @llvm.x86.avx512bf16.cvtne2ps2bf16.512(<16 x float> %A, <16 x float> %B)
%1 = bitcast i32 %U to <32 x i1>
%2 = bitcast <32 x bfloat> %0 to <32 x i16>
%3 = select <32 x i1> %1, <32 x i16> %2, <32 x i16> zeroinitializer
%4 = bitcast <32 x i16> %3 to <8 x i64>
ret <8 x i64> %4

And after refactoring of X86InstrAVX512.td, we are able to match

%0 = tail call <32 x bfloat> @llvm.x86.avx512bf16.cvtne2ps2bf16.512(<16 x float> %A, <16 x float> %B)
... ...
%2 = select <32 x i1> %1, <32 x bfloat> %0, <32 x bfloat> zeroinitializer

So leaving the upgraded IRs failed to match the predicated instruction.

LuoYuanke added inline comments.Oct 18 2022, 11:50 PM
clang/lib/Headers/avx512bf16intrin.h
13

What is this macro check used for?

clang/test/CodeGen/X86/avx512bf16-error.c
15

Need test for other operations (-, *, /) as well?

llvm/include/llvm/IR/IntrinsicsX86.td
4928

It seems we still use i32 to represent <2 x bf16>, but we don't have a better way since 1 bit mask cover a pair of bf16 elements.

llvm/lib/IR/AutoUpgrade.cpp
4095

Why there is no bitcast for the input for the other intrinsics? I expect to see the bitcast from vXi16 to vXbf16.

llvm/lib/Target/X86/X86InstrAVX512.td
3916

Not sure the indent is correct or not.

llvm/test/CodeGen/X86/bfloat.ll
32 ↗(On Diff #461875)

It seems the difference between SSE2 and BF16 is using SSE instruction or AVX instruction. What do we expect to test for BF16?

pengfei updated this revision to Diff 468891.Oct 19 2022, 6:38 AM

Update types of dpbf16ps intrinsics too.

clang/lib/Headers/avx512bf16intrin.h
13

__bf16 is not available without SSE2. This is to make sure no error generated if user include <immintrin.h>

clang/test/CodeGen/X86/avx512bf16-error.c
15

I don't think so. This is to check __bfloat16 is deprecated. We should check them when enabling __bf16 on SSE2.

llvm/include/llvm/IR/IntrinsicsX86.td
4928

I think mask is not an issue because both the passthru and dst are <4 x float>.

llvm/lib/IR/AutoUpgrade.cpp
4095

Others don't have vXbf16 in inputs.

llvm/lib/Target/X86/X86InstrAVX512.td
3916

The format is chaos in td files, at least we have code using in this way :)

llvm/test/CodeGen/X86/bfloat.ll
32 ↗(On Diff #461875)

This is to make sure the scalar type works under AVX512-BF16. We may optimize it with vcvtneps2bf16 in future.

This revision is now accepted and ready to land.Oct 19 2022, 7:20 AM

Add a short description to clang ReleaseNotes about the new bf/bh types?

pengfei updated this revision to Diff 468919.Oct 19 2022, 8:08 AM

Add description in Clang ReleaseNotes.

RKSimon accepted this revision.Oct 19 2022, 8:10 AM

LGTM - cheers

This revision was landed with ongoing or failed builds.Oct 19 2022, 8:47 AM
This revision was automatically updated to reflect the committed changes.