The GCC folks decided to make the mangling of __bf16 the same as that of std::bfloat16_t.
We need to match GCC: https://godbolt.org/z/jjne5qxPa
Other targets may still use their own mangling.
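For concreteness, a minimal sketch of the two mangling schemes in play (the helper names are mine, not Clang's): Clang historically mangled __bf16 as the vendor-extended type u6__bf16, while GCC now emits DF16b, the form proposed for std::bfloat16_t.

```python
def mangle_vendor_type(name: str) -> str:
    # Itanium "u <source-name>": a vendor-extended builtin type.
    # Clang has historically mangled __bf16 this way on x86.
    return f"u{len(name)}{name}"

def mangle_fixed_fp(bits: int, suffix: str) -> str:
    # Itanium "DF <number> _" mangles _FloatN; GCC's proposal uses a
    # trailing "b" instead of "_" for std::bfloat16_t.
    return f"DF{bits}{suffix}"

print(mangle_vendor_type("__bf16"))  # u6__bf16 (old Clang mangling)
print(mangle_fixed_fp(16, "b"))      # DF16b    (GCC / proposed mangling)
print(mangle_fixed_fp(16, "_"))      # DF16_    (_Float16 per the Itanium ABI)
```

This also illustrates the demangling concern raised below: a demangler that understands DF16b will print std::bfloat16_t even for functions written against __bf16.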
What are the rules on this? Do we just handle this as an ABI breaking change and document it in the release notes - or do we need to provide any auto-upgrade path (with a warning?)?
TBH, I don't have a good idea. It seems to be an Itanium ABI problem, which I know little about. @ldionne WDYT?
I also posted an RFC on Discourse, but no response so far.
The GCC proposal suggests the same change for ARM/AArch64. They should have the same problem as us, since they have used u6__bf16 for several versions, but I don't know of any progress so far.
I want to get some feedback on this, and I can discuss with my GCC colleagues to see whether we can change it back in GCC if this is not a good direction. There are two concerns so far:
- It is an ABI-breaking change for LLVM and other targets;
- A mangled name of __bf16 will be demangled as std::bfloat16_t, which might confuse users.
Suggestions welcome~
Do you think this patch needs to be expanded to handle ARM/AArch64/NVPTX - all of which override getBFloat16Mangling in similar ways?
I thought of that. It would be great if we can reach consensus with the other backends here.
NVPTX has no current uses of this, so if the mangling needs to change, now would be a good time.
Adding clang-vendors because of the potential ABI break concerns and @rjmccall for the Itanium ABI questions specifically.
I don't know that we have hard and fast rules for this, we usually handle it on a case by case basis. If users are relying on the existing mangling, we may want to provide an ABI compat check so users can get the old mangling behavior. We definitely need to add a release note calling this change out, and we should probably also announce it on discourse once the changes land.
clang/lib/Basic/Targets/X86.h:413

This does not appear to match the Itanium ABI (https://itanium-cxx-abi.github.io/cxx-abi/abi.html):

    <builtin-type> ::= DF <number> _   # ISO/IEC TS 18661 binary floating point type _FloatN (N bits)

it's missing the underscore after the number of bits. Is there a proposal in front of the Itanium ABI group to add this form with b instead of an underscore?
We talked about this on the Itanium list, and as currently specified, it is absolutely not correct for __bf16 to have the same mangling as std::bfloat16_t, because __bf16 does not have the correct semantics for std::bfloat16_t and must be a distinct type. If GCC changed __bf16 to use the new mangling without also updating the semantics, it is a bug.
That discussion was here: https://github.com/itanium-cxx-abi/cxx-abi/pull/147
If we want to implement std::bfloat16_t in Clang, we need to make it a normal arithmetic type, and in practice it needs to guarantee excess-precision arithmetic, as I discussed on that thread. Coincidentally, we did recently implement excess-precision arithmetic in Clang for _Float16.
Jakub Jelinek has clarified that GCC did indeed change the semantics of __bf16 on i386 and x86_64 to be a proper extended floating point type.
We could change the mangling to match GCC, but I think it would be inappropriate to do that without also matching the semantics change. Since the mangling change is trivial to land, I think the semantics change should happen first.
Thanks @rjmccall for pointing to the Itanium list. I'm clear about what to do now. Yes, fully agreed, we should match the semantics at the same time.
Thanks @tra. Per @rjmccall's information, the mangling should be aligned with its semantics. It's fine to use u6__bf16 if a target doesn't want to support arithmetic operations.
And thanks @RKSimon, @tschuett and @aaron.ballman for your kind inputs!
As for Arm/AArch64, we're still assessing the amount of pain we'd cause with the name change, but we don't have an issue with letting go of storage-only.
Thanks @stuij. Indeed! ABI breakage is an annoying problem. I don't have a good solution for it; I tend to leave it as a breaking change. I think that's reasonable only if we change the semantics at the same time. And it does make sense to use a different mangling for a storage-only type.
The mangling should be aligned with its semantics. It's fine to use u6__bf16 if a target doesn't want to support arithmetic operations.
We (speaking for CUDA/NVPTX) do want to support math on bfloat, just didn't get to implementing it yet.
NVPTX will likely start supporting arithmetic on bfloat at some point in the future. Does that mean we'd need to change the mangling then? Or would I need to use a different type, with corresponding mangling, for bfloat-with-ops?
On the topic of supporting BF16 arithmetic, note my comment here: https://github.com/itanium-cxx-abi/cxx-abi/pull/147#issuecomment-1254078916. To summarize, according to Steve Canon, we really shouldn't implement arithmetic directly in the BF16 format, because the precision is simply too low to be useful for intermediate results. Instead, we need to guarantee excess-precision arithmetic so that we only truncate back to BF16 when the source code requires it. We can do that in the frontend using the same excess-precision logic we added for _Float16.
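The effect of excess-precision evaluation can be illustrated with a small sketch (my own helper code, not Clang's implementation; bf16 values are modeled as Python floats snapped to the bfloat16 grid):

```python
import struct

def to_bf16(x: float) -> float:
    """Round a float to the nearest bfloat16 value (round-to-nearest-even),
    returned as an ordinary Python float for convenience."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def sum_naive(xs):
    # Round back to bf16 after every operation.
    acc = 0.0
    for x in xs:
        acc = to_bf16(acc + to_bf16(x))
    return acc

def sum_excess(xs):
    # Excess precision: accumulate in float, truncate to bf16 only when
    # the source code stores the result back into a bf16 object.
    acc = 0.0
    for x in xs:
        acc += to_bf16(x)
    return to_bf16(acc)

xs = [1.0] + [0.001] * 1000
print(sum_naive(xs))   # stuck at 1.0: each 0.001 is below half a bf16 ulp
print(sum_excess(xs))  # close to 2.0
```

The naive loop loses every small addend because a bf16 ulp at 1.0 is 1/128, illustrating Steve Canon's point that rounding intermediates to bf16 is too lossy to be useful.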
If Steve's argument holds, it suggests that it would be a waste for NVPTX to support BF16 arithmetic directly, at least not in the same way it supports float or double. (Providing operations like x86's VDPBF16PS — https://en.wikichip.org/wiki/x86/avx512_bf16 — that start from BF16 operands but perform their arithmetic in float is a different story.)
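The VDPBF16PS-style pattern — BF16 inputs, float arithmetic — can be sketched like this (my own illustration of the idea, not the actual instruction semantics; the hardware additionally processes lanes in pairs):

```python
import struct

def bf16_trunc(x: float) -> float:
    # Truncate a float32 to bfloat16 by dropping the low 16 mantissa bits.
    # (The real conversions round to nearest even; truncation keeps the
    # sketch short.)
    bits = struct.unpack("<I", struct.pack("<f", x))[0] & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def dot_bf16_f32(xs, ys, acc=0.0):
    # BF16 operands, but the products and the running sum stay in single
    # precision; nothing is rounded back to bf16 mid-computation.
    for x, y in zip(xs, ys):
        acc += bf16_trunc(x) * bf16_trunc(y)
    return acc

print(dot_bf16_f32([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```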
On one hand I agree that bfloat's low precision makes it somewhat problematic to use as is, w/o doing actual math in single/double floats.
On the other hand, the primary use case for bfloat is machine learning, where the apps are often less concerned about precision but do care a lot about the magnitude of the numbers and the raw number crunching performance. Typical use pattern observed in the wild is to do as much as possible in bf16 (or fp16, depending on accelerator in use) and use higher precision only for the operations that really need it.
While heavyweight ops like dot-product, matmul, etc. will continue to consume the bulk of the GPU cycles, there is a practical need for simple ops, too. A lot of the performance benefit in ML apps comes from fusing multiple simple kernels into one, and that rarely maps onto those accelerated instructions. We need to be able to do plain old add/sub/mul/cmp. If bf16 has higher throughput than fp32, that will provide a tangible benefit for large classes of ML applications.
My bet is that bf16 math will be used extensively once NVIDIA's H100 GPUs become widely available. We have sufficient evidence that bf16 works well enough on multiple generations of Google's TPUs, and the trend is likely to continue with more platforms adopting it. Fun fact: NVIDIA is introducing an even lower-precision format, FP8, in its new GPU.
Okay. That raises the question of what the default semantics should be for std::bfloat16_t, i.e. whether we should semantically default the type to using excess-precision float arithmetic. If we did, we'd still be able to demote solitary float operations to BF16, but anything more complex than that would often force promotion without user intervention.
Of course, even if we do default the type to excess precision, we could still have a flag to disable that, and we could potentially have different default values on different targets.
FWIW, at Arm we decided to keep the old name mangling to minimise friction with existing code/libraries, but to allow more operations under this same name mangling. We also discussed this with Red Hat, and they were OK with it.