This is an archive of the discontinued LLVM Phabricator instance.

[X86] Prohibit arithmetic operations on type `__bfloat16`
AbandonedPublic

Authored by pengfei on Feb 23 2022, 1:21 AM.

Details

Summary

__bfloat16 is defined as an X86-specific type that represents the brain
floating-point format. It is only usable with X86 intrinsics. Arithmetic
operations on this type should be forbidden.

Diff Detail

Event Timeline

pengfei requested review of this revision. Feb 23 2022, 1:21 AM
pengfei created this revision.

but its still OK to perform arithmetic with __m128bh ? https://simd.godbolt.org/z/Ef59Ws4M3

but its still OK to perform arithmetic with __m128bh ? https://simd.godbolt.org/z/Ef59Ws4M3

Good point! I think the current definition of __m128bh is the wrong direction. We should use __m128i, as we do with the F16C intrinsics, and reserve __m128bh as an ABI type, as we do with AVX512FP16.
I tried to warn for it using deprecated, but it didn't report a warning at all. Any thoughts?

diff --git a/clang/lib/Headers/avx512bf16intrin.h b/clang/lib/Headers/avx512bf16intrin.h
index 54f0cb9cfbf1..2f9cda6b32f2 100644
--- a/clang/lib/Headers/avx512bf16intrin.h
+++ b/clang/lib/Headers/avx512bf16intrin.h
@@ -13,8 +13,10 @@
 #ifndef __AVX512BF16INTRIN_H
 #define __AVX512BF16INTRIN_H

-typedef short __m512bh __attribute__((__vector_size__(64), __aligned__(64)));
-typedef short __m256bh __attribute__((__vector_size__(32), __aligned__(32)));
+typedef short __m512bh __attribute__((__vector_size__(64), __aligned__(64),
+                                      deprecated("use __m512i instead")));
+typedef short __m256bh __attribute__((__vector_size__(32), __aligned__(32),
+                                      deprecated("use __m256i instead")));

 /// \typedef __bfloat16
 ///    A target specific type to represent the storage only brain floating-point
diff --git a/clang/lib/Headers/avx512vlbf16intrin.h b/clang/lib/Headers/avx512vlbf16intrin.h
index d42f8eb0f0f5..0e47a930ebd0 100644
--- a/clang/lib/Headers/avx512vlbf16intrin.h
+++ b/clang/lib/Headers/avx512vlbf16intrin.h
@@ -13,7 +13,8 @@
 #ifndef __AVX512VLBF16INTRIN_H
 #define __AVX512VLBF16INTRIN_H

-typedef short __m128bh __attribute__((__vector_size__(16), __aligned__(16)));
+typedef short __m128bh __attribute__((__vector_size__(16), __aligned__(16),
+                                      deprecated("use __m128i instead")));

 #define __DEFAULT_FN_ATTRS128 \
   __attribute__((__always_inline__, __nodebug__, \

I tried to warn for it using deprecated but it didn't report warning at all. Any thought?

Sorry, it works.

pengfei updated this revision to Diff 410827. Feb 23 2022, 7:41 AM

Update LangRef. We use i16 type to represent bfloat16.


Update LangRef. We use i16 type to represent bfloat16.

Why are we using i16 to represent bfloat16? The bfloat type is better.

Why are we using i16 to represent bfloat16? The bfloat type is better.

These intrinsics pre-date the existence of the bfloat type in LLVM. To use bfloat we would have to make __bf16 a legal type in C. This means we need to support loads, stores, and arguments of that type. I think that would create a bunch of backend complexity, because we don't have good 16-bit load/store support for XMM registers. I think we only have a load that inserts into a specific element. It's doable, but I'm not sure what we gain from it.

clang/lib/Headers/avx512bf16intrin.h
22

arithmatic -> arithmetic

These intrinsics pre-date the existence of the bfloat type in LLVM. To use bfloat we would have to make __bf16 a legal type in C. This means we need to support loads, stores, and arguments of that type. I think that would create a bunch of backend complexity, because we don't have good 16-bit load/store support for XMM registers. I think we only have a load that inserts into a specific element. It's doable, but I'm not sure what we gain from it.

My motivation for wanting to use 'bfloat' started with wanting to use '__bf16' as the front end type. It just doesn't make sense to me to define a new type when we have an existing built-in type that has the same semantics and binary representation. The argument for introducing a new IR type was made here: https://reviews.llvm.org/D76077 It doesn't seem like a particularly strong argument, but it's what was decided then. Using bfloat rather than i16 in the IR has the benefit that it expresses what the type actually is, instead of just using something that has the same size. Using i16, the semantics of the type are known only to the front end, and we have to rely on what the front end did for enforcement of the semantics. That's generally going to be OK, but it seems to me like it works for the wrong reason. That is, i16 is not a storage-only type, and the only reason we don't notice is that the front end doesn't generate IR that violates the implicit semantics of the type.

I think there's a minor argument to be made concerning TBAA (short and bfloat16 look like compatible types). Perhaps a more significant argument is that using the bf16 built-in type would allow us to define a type like __m256bh like this:

typedef __bf16 __m256bh __attribute__((__vector_size__(32), __aligned__(32)));

So my question would be, how much work are we talking about to make this work with the x86 backend?

So my question would be, how much work are we talking about to make this work with the x86 backend?

I don't see much value in supporting __bf16 in the front end for X86. I guess you may want something like __fp16, but the design of __fp16 doesn't look great to me. GCC doesn't support __fp16 for X86, and the existing implementation of __fp16 has become an obstacle to supporting _Float16, especially for targets without avx512fp16. Not to mention the functionality of __bf16 isn't as complete as __fp16: https://godbolt.org/z/WzKPrYTYP I think we're far from being able to evaluate the backend work.
I believe the right approach is to define the ABI type first, as with _Float16; then we can do something in the backend to support it.

Anyway, whether we want to support __bf16 or not doesn't matter to the intrinsics we are supporting here. We are free to define and use a target-specific type for target intrinsics. For mature intrinsics, our focus is backward compatibility and cross-compiler compatibility. Both stop us from defining these with __bf16.

pengfei updated this revision to Diff 411334. Feb 25 2022, 12:03 AM

Discussed with GCC folks. We think it's better to do the same as D120411 and replace it with short int.

pengfei retitled this revision from [X86] Prohibit arithmatic operations on type `__bfloat16` to [X86] Prohibit arithmetic operations on type `__bfloat16`. Feb 25 2022, 12:03 AM
pengfei edited the summary of this revision. (Show Details)
andrew.w.kaylor added a comment. Edited Feb 25 2022, 11:07 AM

Discussed with GCC folks. We think it's better to do the same as D120411 and replace it with short int.

Which GCC folks did you discuss it with? I ask because GCC currently considers __m128bh and __m128i to be incompatible types, which is what I would expect: https://godbolt.org/z/z9nefbrEc

clang/lib/Headers/avx512bf16intrin.h
37

Are we trying to make our intrinsics weakly typed? I don't like this change at all.

scanon added a subscriber: scanon. Feb 25 2022, 2:20 PM

There's a lot of churn around proposed "solutions" on this and related PRs, but not a very clear analysis of what problem we're trying to solve.

Concretely, what are the semantics that we want for the BF16 types and intrinsics? Unlike the other floating-point types, there's no standard to guide this, so it's even more important to clearly specify how these types are to be used, instead of having an ad-hoc semantics of whatever someone happens to implement.

There's a lot of churn around proposed "solutions" on this and related PR, but not a very clear analysis of what the problem we're trying to solve is.

I thought the problem that the patch was originally trying to solve is that __bfloat16 variables could be used in arithmetic operations (using the __bfloat16 type defined in the avx512bf16intrin.h header). For example, clang currently compiles this code without any diagnostics if the target processor has the required features (avx512bf16 and avx512vl).

#include <immintrin.h>
float f(float x, float y) {
  __bfloat16 x16 = _mm_cvtness_sbh(x);
  __bfloat16 y16 = _mm_cvtness_sbh(y);
  __bfloat16 z16 = x16 + y16;
  return _mm_cvtsbh_ss(z16);
}

https://godbolt.org/z/vcbcGsPPx

The problem is that the instructions generated for that code are completely wrong, because __bfloat16 is defined as unsigned short. It relies on the user knowing that they shouldn't use this type in arithmetic operations.

Like I said, I thought that was the original intention of this patch. However, the latest version of the patch doesn't prevent this at all. In fact, it makes the problem worse by asking the user to define the BF16 variables as unsigned short in their code. Getting correct behavior from this point

@pengfei Please correct me if I misunderstood the purpose of this patch.

Concretely, what are the semantics that we want for the BF16 types and intrinsics? Unlike the other floating-point types, there's no standard to guide this, so it's even more important to clearly specify how these types are to be used, instead of having an ad-hoc semantics of whatever someone happens to implement.

The binary representation of a BF16 value (such as the value returned by _mm_cvtness_sbh) is, as Phoebe mentioned, the "brain floating point type" as described here: https://en.wikichip.org/wiki/brain_floating-point_format

Unfortunately, what you can do with it seems to depend on the target architecture. For very recent x86 processors, you can convert vectors of this type to and from single precision floating point and you can do a SIMD dot product and accumulate operation (VDPBF16PS), but the only way to do this is with intrinsics. Some ARM processors support other operations, but I think with similar restrictions (i.e. only accessible through intrinsics). Apart from intrinsics, it is treated as a storage-only type.


Thanks @andrew.w.kaylor ! You are totally correct about the intention and current implementations.

Concretely, what are the semantics that we want for the BF16 types and intrinsics? Unlike the other floating-point types, there's no standard to guide this, so it's even more important to clearly specify how these types are to be used, instead of having an ad-hoc semantics of whatever someone happens to implement.

Good question! This is actually in the scope of the ABI. Unfortunately, we don't have a BF16 ABI at present. We can't assume which physical registers arguments are passed and returned in before we have such hardware. For example, ARM has a soft-FP ABI that supports FP arithmetic operations and passes and returns arguments in integer registers. When enabling an ISA extension whose type has no ABI representation, e.g., F16C, we borrowed that concept. As a trade-off, we used an integer type rather than introducing a new IR type, since we don't need to support arithmetic operations.
This patch, as well as D120411, tries to follow what we do for F16C. The difference is that we are supporting a scalar type too. That's why I split them into two patches.
Back to the question:

  1. There's no BF16 type, and no semantics for it, before the ABI is ready on X86.
  2. All intrinsics take BF16 as an integer type.

Good question! This is actually in the scope of the ABI. Unfortunately, we don't have a BF16 ABI at present. We can't assume which physical registers arguments are passed and returned in before we have such hardware. For example, ARM has a soft-FP ABI that supports FP arithmetic operations and passes and returns arguments in integer registers. When enabling an ISA extension whose type has no ABI representation, e.g., F16C, we borrowed that concept. As a trade-off, we used an integer type rather than introducing a new IR type, since we don't need to support arithmetic operations.

I don't see the point of the ARM soft-float comparison, given that X86 doesn't have the strict distinction between integer and floating point registers that ARM has, at least not for the XMM/YMM/ZMM registers. Consider the following code:

__m128bh foo(__m128 x) {
  return _mm_cvtneps_pbh(x);
}
__m128 bar(__m128bh x) {
  return _mm_cvtpbh_ps(x);
}

Currently, both clang and gcc will use XMM0 for the argument and return value in both functions. Is XMM0 an integer register or a floating point register? There is no such distinction. It's true that the x86_64 psABI does talk about the general purpose registers as integer registers, and both clang and gcc will use one of these registers for __bfloat16 values, but that's an implementation detail (and a dubious one, considering that nearly anything useful that you can do with a __bfloat16 will require moving it into an SSE register).

Also, you say we can't assume what registers will be used (in the eventual ABI?) but we are assuming exactly that. If the ABI is ever defined differently than what clang and gcc are currently doing, they will both be wrong.

But all of this only applies to the backend code generation. It has very little to do with the intrinsic definition in the header file or the IR generated by the front end. If we continue to define __bfloat16 as an unsigned short in the header file, the front end will treat it as an unsigned short and it will use its rules for unsigned short to generate IR. If the ABI is ever defined to treat BF16 differently than unsigned short, the front end won't be able to do anything about that because we've told the front end that the value is an unsigned short.

On the other hand, if we define the __bfloat16 type as the built-in __bf16 type, then the front end can apply whatever rules it has for that type, including adding whatever ABI handling is needed for BF16 values. If that ends up being the same as the rules for unsigned short, that's no problem. The front end can implement it that way. If it ends up being something different, the front end can apply rules for whatever the alternative is. The point is, by telling the front end that this is a BF16 value, we allow the front end to control the semantics for it. This would then result in the front end generating IR using bfloat as the type for BF16 values. Again, this is a correct and accurate description of the value. It allows the optimizer to reason about it correctly in any way it needs to.

I don't see why we would treat BF16 values as unsigned short and i16 throughout the compiler just to make the backend implementation easier when we already have types available for BF16.

I don't see why we would treat BF16 values as unsigned short and i16 throughout the compiler just to make the backend implementation easier when we already have types available for BF16.

__m256bh should not have been a new type. It should have been an alias of __m256i. We don't have load/store intrinsics for __m256bh, so if you can even get the __m256bh type in and out of memory using load/store intrinsics, it is only because we allow lax vector conversion by default. -fno-lax-vector-conversions will probably break any code trying to load/store it using a load/store intrinsic. If __m256bh had been made a struct, as was at one point proposed, this would have been broken.

If we want __m256bh to be a unique type using bf16, we must define load, store, and cast intrinsics for it. We would probably want insert/extract element intrinsics as well.

craig.topper added a comment. Edited Mar 3 2022, 4:29 PM

-fno-lax-vector-conversions does indeed break the load/store intrinsics: https://godbolt.org/z/b5WPj84Pa

You can fix it with a C-style cast, but that wouldn't work for a struct. ICC and MSVC both use structs for the vector types, and both alias __m256bh to __m256i.

if we define the __bfloat16 type as the built-in __bf16 type, then the front end can apply whatever rules it has for that type, including adding whatever ABI handling is needed for BF16 values.

I don't agree. Unlike __fp16, __bf16 is simply an ARM-specific type. I don't see why we need to define with it. https://clang.llvm.org/docs/LanguageExtensions.html#half-precision-floating-point Besides, the doc says it "is only available when supported in hardware". Since we only have vector instructions, I don't think we can support the scalar type here.

If that ends up being the same as the rules for unsigned short, that's no problem. The front end can implement it that way.

Doesn't that mean the front end still generates i16 in IR for it?

The point is, by telling the front end that this is a BF16 value, we allow the front end to control the semantics for it.

If you are saying we should support a separate __bfloat16 type in the front end and generate i16 in IR just for diagnostics, I'm fine with it. The problem is we can't reuse __bf16's representation, BFloat16Ty. Instead, we would have to follow the approach we use for fp16, e.g., HalfTy for __fp16 (no ABI) and Float16Ty for _Float16 (has an ABI). But it doesn't seem worth the effort to me.

Also, you say we can't assume what registers will be used (in the eventual ABI?) but we are assuming exactly that. If the ABI is ever defined differently than what clang and gcc are currently doing, they will both be wrong.

No new IR types, no ABI issues. Declarations of double-underscore types are free of ABI constraints. __fp16 is a good example: https://godbolt.org/z/6qKqGc6Gj

I don't see why we would treat BF16 values as unsigned short and i16 throughout the compiler just to make the backend implementation easier when we already have types available for BF16.

So, it is not a question of easy or difficult. It's a wrong direction we can't go in (introducing a new IR type without a clear ABI definition). We made that mistake with the half type: https://godbolt.org/z/K55s5zqPG We shouldn't make it again.

__m256bh should not have been a new type. It should have been an alias of __m256i. We don't have load/store intrinsics for __m256bh, so if you can even get the __m256bh type in and out of memory using load/store intrinsics, it is only because we allow lax vector conversion by default. -fno-lax-vector-conversions will probably break any code trying to load/store it using a load/store intrinsic. If __m256bh had been made a struct, as was at one point proposed, this would have been broken.

If we want __m256bh to be a unique type using bf16, we must define load, store, and cast intrinsics for it. We would probably want insert/extract element intrinsics as well.

From this, it sounds like our intrinsics support is incomplete for this type. Even if it is defined as an alias of some existing type (such as __m256i), I would have to do the mental gymnastics of mixing and matching intrinsics to get the behavior I want. I think this gets back to the question @scanon asked about the semantics of this type. What should I be able to do with it? Can I load a vector of these values from memory? Can I store them to memory? Can I assign one vector to another? Can I pass it as an argument? Can I use it as a return type? It looks like the only things we have intrinsics for (for the vector types) are converting to and from single-precision vectors and performing the dot product accumulate operation.

I was wondering about similar issues when I was looking at what we had for the __bfloat16 type. We have intrinsics to convert this type to and from single precision floating point values, but I can't do anything else with it. Nothing else at all, including inserting it into a vector of bf16 values.

So @pengfei is trying to press ahead with the backend implementation, but our front end support is incomplete. That might explain why Phoebe and I haven't been able to agree on what should be done here.

This patch is strictly a front end patch, but it's trying to just wedge definitions into header files to get the desired outcome in the code generation. From the user's perspective, it feels totally broken.

Consider this function.

__m128 f(__bfloat16 *p1, __bfloat16 *p2) {
  // Load the vectors using the integer load intrinsics??
  __m128i temp1 = _mm_loadu_epi32(p1);
  __m128i temp2 = _mm_loadu_epi32(p2);

  // Zero-initialize a base value vector
  __m128 base = _mm_set_ps1(0.0f);

  // Perform the dot product
  return _mm_dpbf16_ps(base, temp1, temp2);
}

Is this what you'd expect with the current definitions? It looks like it produces the instructions I expected, but with -fno-lax-vector-conversions I get an error unless I explicitly bitcast the arguments from __m128i to __m128bh.

I think that just brings me up to speed with what Craig was saying, right?

So at this point we have these options:

  1. Make the __m[128|256|512]bh types aliases of __m[128|256|512]i
  2. Deprecate the __m[128|256|512]bh types and replace them with __m[128|256|512]i
  3. Add load/store/insert/extract intrinsics for the __bfloat16 type

Of these, I'd prefer the third option, because both of the first two require an overloaded use of the vector-integer type. I already don't like that we use the same type for integer vectors of every element width. Using it for BF16 vectors just seems wrong.

For the example above, I'd like to write code like this:

__m128 f(__bfloat16 *p1, __bfloat16 *p2) {
  // Load the BF16 vectors
  __m128bh v1 = _mm_load_pbh(p1);
  __m128bh v2 = _mm_load_pbh(p2);

  // Zero-initialize a base value vector
  __m128 base = _mm_set_ps1(0.0f);

  // Perform the dot product
  return _mm_dpbf16_ps(base, v1, v2);
}

That's more work, but it has the merit of allowing me to use types that match what the program is doing.

The fact that you can't pass a __m128i value to a function that is expecting __m128bh is a good thing. We shouldn't be making changes that prevent this diagnostic.

I don't agree. Unlike __fp16, __bf16 is simply an ARM-specific type.

Why is __bf16 an ARM-specific type? It's a type that describes a floating point value with a specific binary representation that is supported on some ARM and some Intel processors. Why should the type be ARM-specific? What the clang documentation says is just a description of the current implementation. We can change the documentation based on what the compiler supports.

So, it is not a problem of easy or difficult. It's a wrong direction we can't go with it (introducing new IR type without clear ABI declarations). We made such mistake with the half type. https://godbolt.org/z/K55s5zqPG We shouldn't make it again.

If people are writing code that uses variables of this type, we aren't really solving anything by pretending it's a different type because the ABI for the type they're using hasn't been defined. The solution here is to work with HJ and others to get the ABI defined for this type.

So at this point we have these options:

  1. Make the __m[128|256|512]bh types aliases of __m[128|256|512]i
  2. Deprecate the __m[128|256|512]bh types and replace them with __m[128|256|512]i
  3. Add load/store/insert/extract intrinsics for the __bfloat16 type

Of these, I'd prefer the third option, because both of the first two require an overloaded use of the vector-integer type. I already don't like that we use the same type for integer vectors of every element width. Using it for BF16 vectors just seems wrong.

The third option also needs bitcast intrinsics to and from the same-sized ps/pd/i types. We already have similar casts between the other three types.

skan added a comment. Mar 16 2022, 12:00 AM

So at this point we have these options:

  1. Make the __m[128|256|512]bh types aliases of __m[128|256|512]i
  2. Deprecate the __m[128|256|512]bh types and replace them with __m[128|256|512]i
  3. Add load/store/insert/extract intrinsics for the __bfloat16 type

Of these, I'd prefer the third option, because both of the first two require an overloaded use of the vector-integer type. I already don't like that we use the same type for integer vectors of every element width. Using it for BF16 vectors just seems wrong.

The third option also needs bitcast intrinsics to/from the same sized ps/pd/i type. We have similar casts between the other 3 types already.

If we'd like to avoid conflicting with a future BF16 ABI, the second option is the right direction.

@pengfei Do we still need this?

pengfei abandoned this revision. Oct 20 2022, 11:59 PM

This is not needed anymore, thanks @RKSimon