Page MenuHomePhabricator

[X86] Prohibit arithmetic operations on type `__bfloat16`
AbandonedPublic

Authored by pengfei on Feb 23 2022, 1:21 AM.

Details

Summary

__bfloat16 is defined as X86 specific type that represents the brain
floating-point format. It is only usable with X86 intrinsics. Arithmetic
operations with this type need to be forbidden.

Diff Detail

Event Timeline

pengfei requested review of this revision.Feb 23 2022, 1:21 AM
pengfei created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptFeb 23 2022, 1:21 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript

but its still OK to perform arithmetic with __m128bh ? https://simd.godbolt.org/z/Ef59Ws4M3

but its still OK to perform arithmetic with __m128bh ? https://simd.godbolt.org/z/Ef59Ws4M3

Good point! I'd think the define of __m128bh is wrong direction. We should use __m128i like we doing with f16c intrinsics and reserve __m128bh for ABI type like we doing with avx512fp16.
I tried to warn for it using deprecated but it didn't report warning at all. Any thought?

diff --git a/clang/lib/Headers/avx512bf16intrin.h b/clang/lib/Headers/avx512bf16intrin.h
index 54f0cb9cfbf1..2f9cda6b32f2 100644
--- a/clang/lib/Headers/avx512bf16intrin.h
+++ b/clang/lib/Headers/avx512bf16intrin.h
@@ -13,8 +13,10 @@
 #ifndef __AVX512BF16INTRIN_H
 #define __AVX512BF16INTRIN_H

-typedef short __m512bh __attribute__((__vector_size__(64), __aligned__(64)));
-typedef short __m256bh __attribute__((__vector_size__(32), __aligned__(32)));
+typedef short __m512bh __attribute__((__vector_size__(64), __aligned__(64),
+                                      deprecated("use __m512i instead")));
+typedef short __m256bh __attribute__((__vector_size__(32), __aligned__(32),
+                                      deprecated("use __m256i instead")));

 /// \typedef __bfloat16
 ///    A target specific type to represent the storage only brain floating-point
diff --git a/clang/lib/Headers/avx512vlbf16intrin.h b/clang/lib/Headers/avx512vlbf16intrin.h
index d42f8eb0f0f5..0e47a930ebd0 100644
--- a/clang/lib/Headers/avx512vlbf16intrin.h
+++ b/clang/lib/Headers/avx512vlbf16intrin.h
@@ -13,7 +13,8 @@
 #ifndef __AVX512VLBF16INTRIN_H
 #define __AVX512VLBF16INTRIN_H

-typedef short __m128bh __attribute__((__vector_size__(16), __aligned__(16)));
+typedef short __m128bh __attribute__((__vector_size__(16), __aligned__(16),
+                                      deprecated("use __m128i instead")));

 #define __DEFAULT_FN_ATTRS128 \
   __attribute__((__always_inline__, __nodebug__, \

but its still OK to perform arithmetic with __m128bh ? https://simd.godbolt.org/z/Ef59Ws4M3

Good point! I'd think the define of __m128bh is wrong direction. We should use __m128i like we doing with f16c intrinsics and reserve __m128bh for ABI type like we doing with avx512fp16.
I tried to warn for it using deprecated but it didn't report warning at all. Any thought?

diff --git a/clang/lib/Headers/avx512bf16intrin.h b/clang/lib/Headers/avx512bf16intrin.h
index 54f0cb9cfbf1..2f9cda6b32f2 100644
--- a/clang/lib/Headers/avx512bf16intrin.h
+++ b/clang/lib/Headers/avx512bf16intrin.h
@@ -13,8 +13,10 @@
 #ifndef __AVX512BF16INTRIN_H
 #define __AVX512BF16INTRIN_H

-typedef short __m512bh __attribute__((__vector_size__(64), __aligned__(64)));
-typedef short __m256bh __attribute__((__vector_size__(32), __aligned__(32)));
+typedef short __m512bh __attribute__((__vector_size__(64), __aligned__(64),
+                                      deprecated("use __m512i instead")));
+typedef short __m256bh __attribute__((__vector_size__(32), __aligned__(32),
+                                      deprecated("use __m256i instead")));

 /// \typedef __bfloat16
 ///    A target specific type to represent the storage only brain floating-point
diff --git a/clang/lib/Headers/avx512vlbf16intrin.h b/clang/lib/Headers/avx512vlbf16intrin.h
index d42f8eb0f0f5..0e47a930ebd0 100644
--- a/clang/lib/Headers/avx512vlbf16intrin.h
+++ b/clang/lib/Headers/avx512vlbf16intrin.h
@@ -13,7 +13,8 @@
 #ifndef __AVX512VLBF16INTRIN_H
 #define __AVX512VLBF16INTRIN_H

-typedef short __m128bh __attribute__((__vector_size__(16), __aligned__(16)));
+typedef short __m128bh __attribute__((__vector_size__(16), __aligned__(16),
+                                      deprecated("use __m128i instead")));

 #define __DEFAULT_FN_ATTRS128 \
   __attribute__((__always_inline__, __nodebug__, \

Sorry, it works.

pengfei updated this revision to Diff 410827.Feb 23 2022, 7:41 AM

Update LangRef. We use i16 type to represent bfloat16.

Herald added a project: Restricted Project. · View Herald TranscriptFeb 23 2022, 7:41 AM

Update LangRef. We use i16 type to represent bfloat16.

Why are we using i16 to represent bfloat16? The bfloat type is better.

Update LangRef. We use i16 type to represent bfloat16.

Why are we using i16 to represent bfloat16? The bfloat type is better.

These intrinsics pre-date the existence of the bfloat type in LLVM. To use bfloat we have to make __bf16 a legal type in C. This means we need to support loads, stores, and arguments of that type. I think that would create bunch of backend complexity because we don't have could 16-bit load/store support to XMM registers. I think we only have load that inserts into a specific element. It's doable, but I'm not sure what we gain from it.

clang/lib/Headers/avx512bf16intrin.h
22

arithmatic -> arithmetic

These intrinsics pre-date the existence of the bfloat type in LLVM. To use bfloat we have to make __bf16 a legal type in C. This means we need to support loads, stores, and arguments of that type. I think that would create bunch of backend complexity because we don't have could 16-bit load/store support to XMM registers. I think we only have load that inserts into a specific element. It's doable, but I'm not sure what we gain from it.

My motivation for wanting to use 'bloat' started with wanting to use '__bf16' as the front end type. It just doesn't make sense to me to define a new type when we have an existing built-in type that has the same semantics and binary representation. The argument for introducing a new IR type was made here: https://reviews.llvm.org/D76077 It doesn't seem like a particularly strong argument, but it's what was decided then. Using bfloat rather than i16 in the IR has the benefit that it expresses what the type actually is instead of just using something that has the same size. Using i16, the semantics of the type are known only to the front end and we have to rely on what the front end did for enforcement of the semantics. That's generally going to be OK, but it seems to me like it works for the wrong reason. That is, i16 is not a storage-only type and the only reason we don't notice is that the front end doesn't generate IR that violates the implicit semantics of the type.

I think there's a minor argument to be made concerning TBAA (short and bfloat16 look like compatible types). Perhaps a more significant argument is that using the bf16 built-in type would allow us to define a type like __m256bh like this:

typedef __bf16 __m256bh __attribute__((__vector_size__(32), __aligned__(32)));

So my question would be, how much work are we talking about to make this work with the x86 backend?

These intrinsics pre-date the existence of the bfloat type in LLVM. To use bfloat we have to make __bf16 a legal type in C. This means we need to support loads, stores, and arguments of that type. I think that would create bunch of backend complexity because we don't have could 16-bit load/store support to XMM registers. I think we only have load that inserts into a specific element. It's doable, but I'm not sure what we gain from it.

My motivation for wanting to use 'bloat' started with wanting to use '__bf16' as the front end type. It just doesn't make sense to me to define a new type when we have an existing built-in type that has the same semantics and binary representation. The argument for introducing a new IR type was made here: https://reviews.llvm.org/D76077 It doesn't seem like a particularly strong argument, but it's what was decided then. Using bfloat rather than i16 in the IR has the benefit that it expresses what the type actually is instead of just using something that has the same size. Using i16, the semantics of the type are known only to the front end and we have to rely on what the front end did for enforcement of the semantics. That's generally going to be OK, but it seems to me like it works for the wrong reason. That is, i16 is not a storage-only type and the only reason we don't notice is that the front end doesn't generate IR that violates the implicit semantics of the type.

I think there's a minor argument to be made concerning TBAA (short and bfloat16 look like compatible types). Perhaps a more significant argument is that using the bf16 built-in type would allow us to define a type like __m256bh like this:

typedef __bf16 __m256bh __attribute__((__vector_size__(32), __aligned__(32)));

So my question would be, how much work are we talking about to make this work with the x86 backend?

I don't see much value to support __bf16 in front end for X86. I guess you may want something like __fp16. But the design of __fp16 doesn't look great to me. GCC doesn't support __fp16 for X86. And the existing implementation of __fp16 somehow becomes obstacle for us to support _Float16, especially when we want to support for targets without avx512fp16. Not to mention the functionality of __bf16 isn't as complete as __fp16: https://godbolt.org/z/WzKPrYTYP I think it's far from evaluating the backend work.
I believe the right approch is to define the ABI type firstly like _Float16, then we can do something in backend to support it.

Anyway, it doesn't matter to the intrinsics we are supporting here whether we want to support __bf16 or not. We are free to define and use target specific type for target intrinsics. As mature intrinsics, our focuses are backward compatibilities and cross compiler compatibilities. Both stop us from defining with __bf16.

pengfei updated this revision to Diff 411334.Feb 25 2022, 12:03 AM

Disscussed with GCC folks. We think it's better to use the same way as D120411 that replacing it with short int.

pengfei retitled this revision from [X86] Prohibit arithmatic operations on type `__bfloat16` to [X86] Prohibit arithmetic operations on type `__bfloat16`.Feb 25 2022, 12:03 AM
pengfei edited the summary of this revision. (Show Details)
andrew.w.kaylor added a comment.EditedFeb 25 2022, 11:07 AM

Disscussed with GCC folks. We think it's better to use the same way as D120411 that replacing it with short int.

Which GCC folks did you discuss it with? I ask because GCC currently considers __m128bh and __m128i to be incompatible types, which is what I would expect: https://godbolt.org/z/z9nefbrEc

clang/lib/Headers/avx512bf16intrin.h
43

Are we trying to make our intrinsics weakly typed? I don't like this change at all.

scanon added a subscriber: scanon.Feb 25 2022, 2:20 PM

There's a lot of churn around proposed "solutions" on this and related PR, but not a very clear analysis of what the problem we're trying to solve is.

Concretely, what are the semantics that we want for the BF16 types and intrinsics? Unlike the other floating-point types, there's no standard to guide this, so it's even more important to clearly specify how these types are to be used, instead of having an ad-hoc semantics of whatever someone happens to implement.

There's a lot of churn around proposed "solutions" on this and related PR, but not a very clear analysis of what the problem we're trying to solve is.

I thought the problem that the patch was originally trying to solve is that __bfloat16 variables could be used in arithmetic operations (using the __bfloat16 type defined in the avx512bf16intrin.h header). For example, clang currently compiles this code without any diagnostics if the target processor has the required features (avx512bf16 and avx512vl).

#include <immintrin.h>
float f(float x, float y) {
  __bfloat16 x16 = _mm_cvtness_sbh(x);
  __bfloat16 y16 = _mm_cvtness_sbh(y);
  __bfloat16 z16 = x16 + y16;
  return _mm_cvtsbh_ss(z16);
}

https://godbolt.org/z/vcbcGsPPx

The problem is that the instructions generated for that code are completely wrong because __bfloat is defined as unsigned short. It relies on the user knowing that they shouldn't use this type in arithmetic operations.

Like I said, I thought that was the original intention of this patch. However, the latest version of the patch doesn't prevent this at all. In fact, it makes the problem worse by asking the user to define the BF16 variables as unsigned short in their code. Getting correct behavior from this point

@pengfei Please correct me if I misunderstood the purpose of this patch.

Concretely, what are the semantics that we want for the BF16 types and intrinsics? Unlike the other floating-point types, there's no standard to guide this, so it's even more important to clearly specify how these types are to be used, instead of having an ad-hoc semantics of whatever someone happens to implement.

The binary representation of a BF16 value (such as the value returned by _mm_cvtness_sbh) is, as Phoebe mentioned, the "brain floating point type" as described here: https://en.wikichip.org/wiki/brain_floating-point_format

Unfortunately, what you can do with it seems to depend on the target architecture. For very recent x86 processors, you can convert vectors of this type to and from single precision floating point and you can do a SIMD dot product and accumulate operation (VDPBF16PS), but the only way to do this is with intrinsics. Some ARM processors support other operations, but I think with similar restrictions (i.e. only accessible through intrinsics). Apart from intrinsics, it is treated as a storage-only type.

Herald added a project: Restricted Project. · View Herald TranscriptMar 2 2022, 6:21 PM

Thanks @andrew.w.kaylor ! You are totally correct about the intention and current implementations.

Concretely, what are the semantics that we want for the BF16 types and intrinsics? Unlike the other floating-point types, there's no standard to guide this, so it's even more important to clearly specify how these types are to be used, instead of having an ad-hoc semantics of whatever someone happens to implement.

Good question! This is actually the scope of ABI. Unfortunately, we don't have the BF16 ABI at the present. We can't assume what are the physical registers the arguments been passed and returned before we have such a hardware. For example, ARM has soft FP ABI that supports FP arithmetic operations and passes and returns arguments by integer registers. When we enabling some ISA set whose type doesn't have ABI representation, e.g., F16C, we borrowed such conception. And as a trade off, we used integer rather than introducing a new IR type, since we don't need to support the arithmetic operations.
This patch as well as D120411 are trying to follow what we are doing on F16C. The difference is we are supporting scalar type too. That's why I put them into two patches.
Back to the question:

  1. There's no BF16 type and its semantics before ABI ready on X86.
  2. All intrinsics take BF16 as integer type.

Good question! This is actually the scope of ABI. Unfortunately, we don't have the BF16 ABI at the present. We can't assume what are the physical registers the arguments been passed and returned before we have such a hardware. For example, ARM has soft FP ABI that supports FP arithmetic operations and passes and returns arguments by integer registers. When we enabling some ISA set whose type doesn't have ABI representation, e.g., F16C, we borrowed such conception. And as a trade off, we used integer rather than introducing a new IR type, since we don't need to support the arithmetic operations.

I don't see the point of the ARM soft-float comparison, given that X86 doesn't have the strict distinction between integer and floating point registers that ARM has, at least not for the XMM/YMM/ZMM registers. Consider the following code:

__m128bh foo(__m128 x) {
  return _mm_cvtneps_pbh(x);
}
__m128 bar(__m128bh x) {
  return _mm_cvtpbh_ps(x);
}

Currently, both clang and gcc will use XMM0 for the argument and return value in both functions. Is XMM0 an integer register or a floating point register? There is no such distinction. It's true that the x86_64 psABI does talk about the general purpose registers as integer registers, and both clang and gcc will use one of these registers for __bfloat16 values, but that's an implementation detail (and a dubious one, considering that nearly anything useful that you can do with a __bfloat16 will require moving it into an SSE register).

Also, you say we can't assume what registers will be used (in the eventual ABI?) but we are assuming exactly that. If the ABI is ever defined differently than what clang and gcc are currently doing, they will both be wrong.

But all of this only applies to the backend code generation. It has very little to do with the intrinsic definition in the header file or the IR generated by the front end. If we continue to define __bfloat16 as an unsigned short in the header file, the front end will treat it as an unsigned short and it will use its rules for unsigned short to generate IR. If the ABI is ever defined to treat BF16 differently than unsigned short, the front end won't be able to do anything about that because we've told the front end that the value is an unsigned short.

On the other hand, if we define the __bfloat16 type as the built-in __bf16 type, then the front end can apply whatever rules it has for that type, including adding whatever ABI handling is needed for BF16 values. If that ends up being the same as the rules for unsigned short, that's no problem. The front end can implement it that way. If it ends up being something different, the front end can apply rules for whatever the alternative is. The point is, by telling the front end that this is a BF16 value, we allow the front end to control the semantics for it. This would then result in the front end generating IR using bfloat as the type for BF16 values. Again, this is a correct and accurate description of the value. It allows the optimizer to reason about it correctly in any way it needs to.

I don't see why we would treat BF16 values as unsigned short and i16 throughout the compiler just to make the backend implementation easier when we already have types available for BF16.

Good question! This is actually the scope of ABI. Unfortunately, we don't have the BF16 ABI at the present. We can't assume what are the physical registers the arguments been passed and returned before we have such a hardware. For example, ARM has soft FP ABI that supports FP arithmetic operations and passes and returns arguments by integer registers. When we enabling some ISA set whose type doesn't have ABI representation, e.g., F16C, we borrowed such conception. And as a trade off, we used integer rather than introducing a new IR type, since we don't need to support the arithmetic operations.

I don't see the point of the ARM soft-float comparison, given that X86 doesn't have the strict distinction between integer and floating point registers that ARM has, at least not for the XMM/YMM/ZMM registers. Consider the following code:

__m128bh foo(__m128 x) {
  return _mm_cvtneps_pbh(x);
}
__m128 bar(__m128bh x) {
  return _mm_cvtpbh_ps(x);
}

Currently, both clang and gcc will use XMM0 for the argument and return value in both functions. Is XMM0 an integer register or a floating point register? There is no such distinction. It's true that the x86_64 psABI does talk about the general purpose registers as integer registers, and both clang and gcc will use one of these registers for __bfloat16 values, but that's an implementation detail (and a dubious one, considering that nearly anything useful that you can do with a __bfloat16 will require moving it into an SSE register).

Also, you say we can't assume what registers will be used (in the eventual ABI?) but we are assuming exactly that. If the ABI is ever defined differently than what clang and gcc are currently doing, they will both be wrong.

But all of this only applies to the backend code generation. It has very little to do with the intrinsic definition in the header file or the IR generated by the front end. If we continue to define __bfloat16 as an unsigned short in the header file, the front end will treat it as an unsigned short and it will use its rules for unsigned short to generate IR. If the ABI is ever defined to treat BF16 differently than unsigned short, the front end won't be able to do anything about that because we've told the front end that the value is an unsigned short.

On the other hand, if we define the __bfloat16 type as the built-in __bf16 type, then the front end can apply whatever rules it has for that type, including adding whatever ABI handling is needed for BF16 values. If that ends up being the same as the rules for unsigned short, that's no problem. The front end can implement it that way. If it ends up being something different, the front end can apply rules for whatever the alternative is. The point is, by telling the front end that this is a BF16 value, we allow the front end to control the semantics for it. This would then result in the front end generating IR using bfloat as the type for BF16 values. Again, this is a correct and accurate description of the value. It allows the optimizer to reason about it correctly in any way it needs to.

I don't see why we would treat BF16 values as unsigned short and i16 throughout the compiler just to make the backend implementation easier when we already have types available for BF16.

m256bh should not have been a new type. It should have been an alias of m256i. We don't have load/store intrinsics for m256bh so if you can even get the m256bh type in and out of memory using load/store intrinsics, it is only because we allow lax vector conversion by default. -fno-lax-vector-conversions will probably break any code trying to load/store it using a load/store intrinsic. If __m256bh was made a struct as at one point proposed, this would have been broken.

If we want m256bh to be a unique type using bf16, we must define load, store, and cast intrinsics for it. We would probably want insert/extract element intrinsics as well.

craig.topper added a comment.EditedMar 3 2022, 4:29 PM

Good question! This is actually the scope of ABI. Unfortunately, we don't have the BF16 ABI at the present. We can't assume what are the physical registers the arguments been passed and returned before we have such a hardware. For example, ARM has soft FP ABI that supports FP arithmetic operations and passes and returns arguments by integer registers. When we enabling some ISA set whose type doesn't have ABI representation, e.g., F16C, we borrowed such conception. And as a trade off, we used integer rather than introducing a new IR type, since we don't need to support the arithmetic operations.

I don't see the point of the ARM soft-float comparison, given that X86 doesn't have the strict distinction between integer and floating point registers that ARM has, at least not for the XMM/YMM/ZMM registers. Consider the following code:

__m128bh foo(__m128 x) {
  return _mm_cvtneps_pbh(x);
}
__m128 bar(__m128bh x) {
  return _mm_cvtpbh_ps(x);
}

Currently, both clang and gcc will use XMM0 for the argument and return value in both functions. Is XMM0 an integer register or a floating point register? There is no such distinction. It's true that the x86_64 psABI does talk about the general purpose registers as integer registers, and both clang and gcc will use one of these registers for __bfloat16 values, but that's an implementation detail (and a dubious one, considering that nearly anything useful that you can do with a __bfloat16 will require moving it into an SSE register).

Also, you say we can't assume what registers will be used (in the eventual ABI?) but we are assuming exactly that. If the ABI is ever defined differently than what clang and gcc are currently doing, they will both be wrong.

But all of this only applies to the backend code generation. It has very little to do with the intrinsic definition in the header file or the IR generated by the front end. If we continue to define __bfloat16 as an unsigned short in the header file, the front end will treat it as an unsigned short and it will use its rules for unsigned short to generate IR. If the ABI is ever defined to treat BF16 differently than unsigned short, the front end won't be able to do anything about that because we've told the front end that the value is an unsigned short.

On the other hand, if we define the __bfloat16 type as the built-in __bf16 type, then the front end can apply whatever rules it has for that type, including adding whatever ABI handling is needed for BF16 values. If that ends up being the same as the rules for unsigned short, that's no problem. The front end can implement it that way. If it ends up being something different, the front end can apply rules for whatever the alternative is. The point is, by telling the front end that this is a BF16 value, we allow the front end to control the semantics for it. This would then result in the front end generating IR using bfloat as the type for BF16 values. Again, this is a correct and accurate description of the value. It allows the optimizer to reason about it correctly in any way it needs to.

I don't see why we would treat BF16 values as unsigned short and i16 throughout the compiler just to make the backend implementation easier when we already have types available for BF16.

m256bh should not have been a new type. It should have been an alias of m256i. We don't have load/store intrinsics for m256bh so if you can even get the m256bh type in and out of memory using load/store intrinsics, it is only because we allow lax vector conversion by default. -fno-lax-vector-conversions will probably break any code trying to load/store it using a load/store intrinsic. If __m256bh was made a struct as at one point proposed, this would have been broken.

If we want m256bh to be a unique type using bf16, we must define load, store, and cast intrinsics for it. We would probably want insert/extract element intrinsics as well.

-fno-lax-vector-conversions does indeed break the load/store intrinsics https://godbolt.org/z/b5WPj84Pa

You can fix it with C style cast, but that wouldn't work for a struct. icc and msvc both use structs for the vector types. And both alias m256bh to m256i.

if we define the __bfloat16 type as the built-in __bf16 type, then the front end can apply whatever rules it has for that type, including adding whatever ABI handling is needed for BF16 values.

I don't agree. Unlike __fp16, __bf16 is simple an ARM specific type. I don't see why we need to define with it. https://clang.llvm.org/docs/LanguageExtensions.html#half-precision-floating-point . Besides, the doc says it "is only available when supported in hardware". Since we only have vector instructions, I don't think we support the scalar type here.

If that ends up being the same as the rules for unsigned short, that's no problem. The front end can implement it that way.

Doesn't it mean front end still generate i16 in IR for it?

The point is, by telling the front end that this is a BF16 value, we allow the front end to control the semantics for it.

If you are saying we support a separate __bfloat16 type in front end and generate i16 in IR just for doing diagnose, I'm fine with it. The problem is we can't reuse the __bf16 's representation BFloat16Ty. Instead, we have to follow the way we are using for fp16, e.g., HalfTy for __fp16 (no ABI) and Float16Ty for _Float16 (has ABI). But it doesn't worth the effort to me.

Also, you say we can't assume what registers will be used (in the eventual ABI?) but we are assuming exactly that. If the ABI is ever defined differently than what clang and gcc are currently doing, they will both be wrong.

No new IR types, no ABI issues. The declaration of double underscore type are free without ABI. __fp16 is a good example: https://godbolt.org/z/6qKqGc6Gj

I don't see why we would treat BF16 values as unsigned short and i16 throughout the compiler just to make the backend implementation easier when we already have types available for BF16.

So, it is not a problem of easy or difficult. It's a wrong direction we can't go with it (introducing new IR type without clear ABI declarations). We made such mistake with the half type. https://godbolt.org/z/K55s5zqPG We shouldn't make it again.

m256bh should not have been a new type. It should have been an alias of m256i. We don't have load/store intrinsics for m256bh so if you can even get the m256bh type in and out of memory using load/store intrinsics, it is only because we allow lax vector conversion by default. -fno-lax-vector-conversions will probably break any code trying to load/store it using a load/store intrinsic. If __m256bh was made a struct as at one point proposed, this would have been broken.

If we want m256bh to be a unique type using bf16, we must define load, store, and cast intrinsics for it. We would probably want insert/extract element intrinsics as well.

From this, it sounds like our intrinsics support is incomplete for this type. Even if it is defined as an alias of some existing type (such as __m256i), I would have to do the mental gymnastics of mixing and matching intrinsics to get the behavior I want. I think this gets back to the question @scanon asked about the semantics of this type. What should I be able to do with it? Can I load a vector of these values from memory? Can I store them to memory? Can I assign one vector to another? Can I pass it as an argument? Can I use it as a return type? It looks the only things we have intrinsics for (for the vector types) are converting to and from single precision vectors and performing the dot product accumulate operation.

I was wondering about similar issues when I was looking at what we had for the __bfloat16 type. We have intrinsics to convert this type to and from single precision floating point values, but I can't do anything else with it. Nothing else at all, including inserting it into a vector of bf16 values.

So @pengfei is trying to press ahead with the backend implementation, but our front end support is incomplete. That might explain why Phoebe and I haven't been able to agree on what should be done here.

This patch is strictly a front end patch, but it's trying to just wedge definitions into header files to get the desired outcome in the code generation. From the user's perspective, it feels totally broken.

Consider this function.

__mm128 f(__bfloat16 *p1, __bfloat16 *p2) {
  // Load the vectors using the integer load intrinsics??
  __mm128i temp1 = _mm_loadu_epi32(p1);
  __mm128i temp2 = _mm_loadu_epi32(p2);

  // Zero-initialize the a base value vector
  __mm128 base = _mm_set_ps1(0.0f);

  // Perform the dot product
  return _mm_dpbf16_ps (base, temp1, temp2);
}

Is what you'd expect with the current definitions? It looks like it produces the instructions I expected, but with -fno-lax-vector-conversions I get an error unless I explicitly bitcast the arguments from __m128i to __m128bh.

I think that just brings me up to speed with what Craig was saying, right?

So at this point we have these options:

  1. Make the __m[128|256|512]bh types aliases of __m[128|256|512]i
  2. Deprecate the __m[128|256|512]bh types and replace them with __m[128|256|512]i
  3. Add load/store/insert/extract intrinsics for the __bfloat16 type

Of these, I'd prefer the third option because both of the first two require the an overloaded use of the vector-integer type. I already don't like that we use the same type for any size integer vector. Using it for BF16 vectors just seems wrong.

For the example above, I'd like to write code like this:

__mm128 f(__bfloat16 *p1, __bfloat16 *p2) {
  // Load the BF16 vectors
  __mm128bh v1 = _mm_load_pbh(p1);
  __mm128bh v2 = _mm_load_pbh(p2);

  // Zero-initialize the a base value vector
  __mm128 base = _mm_set_ps1(0.0f);

  // Perform the dot product
  return _mm_dpbf16_ps (base, v1, v2);
}

That's more work, but it has the merit of allowing me to use types that match what the program is doing.

The fact that you can't pass a m128i value to a function that is expecting m128bh is a good thing. We shouldn't be making changes that prevents this diagnostic.

I don't agree. Unlike __fp16, __bf16 is simple an ARM specific type.

Why is __bf16 an ARM-specific type? It's a type that describes a floating point value with a specific binary representation that is supported on some ARM and some Intel processors. Why should the type be ARM-specific? What the clang documentation says is just a description of the current implementation. We can change the documentation based on what the compiler supports.

So, it is not a problem of easy or difficult. It's a wrong direction we can't go with it (introducing new IR type without clear ABI declarations). We made such mistake with the half type. https://godbolt.org/z/K55s5zqPG We shouldn't make it again.

If people are writing code that uses variables of this type, we aren't really solving anything by pretending it's a different type because the ABI for the type they're using hasn't been defined. The solution here is to work with HJ and others to get the ABI defined for this type.

So at this point we have these options:

  1. Make the __m[128|256|512]bh types aliases of __m[128|256|512]i
  2. Deprecate the __m[128|256|512]bh types and replace them with __m[128|256|512]i
  3. Add load/store/insert/extract intrinsics for the __bfloat16 type

Of these, I'd prefer the third option because both of the first two require the an overloaded use of the vector-integer type. I already don't like that we use the same type for any size integer vector. Using it for BF16 vectors just seems wrong.

The third option also needs bitcast intrinsics to/from the same sized ps/pd/i type. We have similar casts between the other 3 types already.

skan added a comment.Mar 16 2022, 12:00 AM

So at this point we have these options:

  1. Make the __m[128|256|512]bh types aliases of __m[128|256|512]i
  2. Deprecate the __m[128|256|512]bh types and replace them with __m[128|256|512]i
  3. Add load/store/insert/extract intrinsics for the __bfloat16 type

Of these, I'd prefer the third option because both of the first two require the an overloaded use of the vector-integer type. I already don't like that we use the same type for any size integer vector. Using it for BF16 vectors just seems wrong.

The third option also needs bitcast intrinsics to/from the same sized ps/pd/i type. We have similar casts between the other 3 types already.

If we'd like to not conflict with future bf16 ABI, the second option is the right direction.

@pengfei Do we still need this?

pengfei abandoned this revision.Oct 20 2022, 11:59 PM

This is not needed anymore, thanks @RKSimon