This is an archive of the discontinued LLVM Phabricator instance.

[Clang][BFloat16] Upgrade __bf16 to arithmetic type, change mangling, and extend excess precision support.
ClosedPublic

Authored by codemzs on May 18 2023, 3:27 PM.

Details

Summary

Pursuant to RFC discussions, this change enhances the handling of the __bf16 type in Clang.

  • Firstly, it upgrades __bf16 from a storage-only type to an arithmetic type.
  • Secondly, it changes the mangling of __bf16 to DF16b on all architectures except ARM. This change has been made in accordance with the finalization of the mangling for the std::bfloat16_t type, as discussed at https://github.com/itanium-cxx-abi/cxx-abi/pull/147.
  • Finally, this commit extends the existing excess precision support to the __bf16 type. This applies to hardware architectures that do not natively support bfloat16 arithmetic.

Appropriate tests have been added to verify the effects of these changes and ensure no regressions in other areas of the compiler.

Diff Detail

Event Timeline

codemzs created this revision. May 18 2023, 3:27 PM
Herald added a project: Restricted Project. May 18 2023, 3:27 PM
codemzs requested review of this revision. May 18 2023, 3:27 PM
Herald added a project: Restricted Project. May 18 2023, 3:27 PM

Misc style improvement.

clang/lib/AST/Type.cpp
2200

Remove the tab whitespace.

clang/lib/Sema/SemaOverload.cpp
2053–2054
clang/test/Sema/arm-bfloat.cpp
45–55

Remove newline.

Great work! Thanks for the patch!

clang/include/clang/AST/ASTContext.h
1102 ↗(On Diff #523582)

I haven't looked at ISO/IEC/IEEE 60559, but I suspect BF16 is still not an IEEE type for now.

clang/lib/Basic/Targets/AMDGPU.h
120–121

I think it's time to bring D139608 back with this patch :)

clang/lib/Basic/Targets/X86.cpp
362–363

This may not be needed.

390

I'm not sure I understand the meaning of HasFullBFloat16. If it is meant for targets that support arithmetic on __bf16, we should not use +fullbf16 but always enable it for SSE2, i.e., HasFullBFloat16 = SSELevel >= SSE2, because X86 GCC already supports arithmetic on __bf16.

If it is meant to be used in the same way as HasLegalHalfType, we should enable it once we have a full BF16 ISA on X86. fullbf16 doesn't make much sense to me.

1131

ditto.

clang/test/CodeGen/X86/bfloat16.cpp
3–4

The backend already supports lowering of bfloat, so I don't think it's necessary to do extra work in the FE except for excess precision.

clang/test/CodeGen/X86/fexcess-precision-bfloat16.c
8

The tests here make me guess you want to use fullbf16 the same as HasLegalHalfType.

rjmccall added inline comments. May 19 2023, 9:49 AM
clang/lib/Basic/Targets/X86.cpp
390

At the moment, we haven't done the work to emulate BFloat16 arithmetic in any of the three ways we can do that: Clang doesn't promote it in IRGen, LLVM doesn't promote it in legalization, and we don't have compiler-rt functions for it. If we emit these instructions, they'll just sail through LLVM and fail in the backend. So in the short term, we have to restrict this to targets that directly support BFloat16 arithmetic in hardware, which doesn't include x86.

Once we have that emulation support, I agree that the x86 targets should enable this whenever they would enable __bf16.

codemzs marked an inline comment as done. May 19 2023, 12:19 PM

I believe I had updated the __bf16 documentation in /llvm-project/clang/docs/LanguageExtensions.rst, but it appears to have been omitted in this patch. I assure you, I'll rectify this in the next iteration.

clang/include/clang/AST/ASTContext.h
1102 ↗(On Diff #523582)

You are correct: it isn't officially part of the ISO/IEEE standard, though I think it implements the properties specified by the standard. In any case, I will remove the comment, as it could be misleading.

clang/lib/Basic/Targets/AMDGPU.h
120–121

I'm inclined to establish a default value, overridden only for ARM, to avoid repetition. If there are no objections, I plan to implement this change in the next iteration.

clang/lib/Basic/Targets/X86.cpp
390

@rjmccall, I concur and just wanted to confirm this change indeed intends to provide BFloat16 emulation support, utilizing excess precision for promotion to float. The HasFullBFloat16 switch is designed to determine excess precision support automatically when the hardware does not natively support bfloat16 arithmetic.

pengfei added inline comments. May 19 2023, 11:37 PM
clang/lib/Basic/Targets/X86.cpp
390

LLVM doesn't promote it in legalization, and we don't have compiler-rt functions for it.

That's not true: https://godbolt.org/z/jxf5E83vG.

The HasFullBFloat16 switch is designed to determine excess precision support automatically when the hardware does not natively support bfloat16 arithmetic.

Makes sense to me.

zahiraam added inline comments. May 21 2023, 11:27 AM
clang/lib/Basic/Targets/X86.cpp
390

At the moment, we haven't done the work to emulate BFloat16 arithmetic in any of the three ways we can do that: Clang doesn't promote it in IRGen, LLVM doesn't promote it in legalization, and we don't have compiler-rt functions for it. If we emit these instructions, they'll just sail through LLVM and fail in the backend. So in the short term, we have to restrict this to targets that directly support BFloat16 arithmetic in hardware, which doesn't include x86.

Once we have that emulation support, I agree that the x86 targets should enable this whenever they would enable __bf16.

Would be nice to add a comment to clarify it.

clang/test/CodeGen/X86/bfloat16.cpp
3–4

The backend already supports lowering of bfloat, so I don't think it's necessary to do extra work in the FE except for excess precision.

+1.

clang/test/CodeGen/X86/fexcess-precision-bfloat16.c
361

Fix this.

codemzs updated this revision to Diff 524467. May 22 2023, 1:46 PM
codemzs marked 12 inline comments as done.

@pengfei, @zahiraam, I appreciate your feedback.

@pengfei, the HasFullBFloat16 flag is primarily for identifying hardware with native bfloat16 support to facilitate automatic excess precision support. I concur that since x86 possesses backend bfloat16 emulation (as noted in D126953), front-end emulation might not be necessary. The test's purpose was to provide coverage for this change. However, I am open to either removing it entirely or relocating it to a more suitable target as per your recommendation.

codemzs added inline comments. May 22 2023, 1:48 PM
clang/lib/Basic/Targets/X86.cpp
362–363

Clarified on the other thread but if you have questions please feel free to post here and I will address them.

390

@pengfei, you're right. As part of D126953, the x86 backend received bfloat16 emulation support. Also, I hope my explanation about the HasFullBFloat16 flag addressed your questions. Please let me know if further clarification/change is needed.

clang/test/CodeGen/X86/bfloat16.cpp
3–4

@pengfei @zahiraam I added this test to verify bfloat16 IR gen functionality, considering both scenarios: with and without native bfloat16 support. However, if you believe it's more beneficial to omit it, I'm open to doing so. Happy to also move this test to another target that doesn't have backend support for emulation.

clang/test/CodeGen/X86/fexcess-precision-bfloat16.c
8

Yes, that is correct. It is just to emulate the correct IR gen if x86 were to have native support. Happy to remove these tests if you feel that is better.

zahiraam added inline comments. May 22 2023, 1:55 PM
clang/test/CodeGen/X86/bfloat16.cpp
3–4

I think that's fine. You can leave it.

LGTM. Just a minor comment.

clang/include/clang/Basic/LangOptions.def
321

Maybe differentiate the description from the previous line?

Apologies for misunderstanding what this patch was doing. This all seems reasonable, and the code changes look good. I think the documentation needs significant reorganization; I've attached a draft. Please review for correctness.

clang/docs/LanguageExtensions.rst
869

Suggested rework:

Clang supports three half-precision (16-bit) floating point types: ``__fp16``,
``_Float16`` and ``__bf16``.  These types are supported in all language
modes, but not on all targets:

- ``__fp16`` is supported on every target.

- ``_Float16`` is currently supported on the following targets:
  * 32-bit ARM (natively on some architecture versions)
  * 64-bit ARM (AArch64) (natively on ARMv8.2a and above)
  * AMDGPU (natively)
  * SPIR (natively)
  * X86 (if SSE2 is available; natively if AVX512-FP16 is also available)

- ``__bf16`` is currently supported on the following targets:
  * 32-bit ARM
  * 64-bit ARM (AArch64)
  * X86 (when SSE2 is available)

(For X86, SSE2 is available on 64-bit and all recent 32-bit processors.)

``__fp16`` and ``_Float16`` both use the binary16 format from IEEE
754-2008, which provides a 5-bit exponent and an 11-bit significand
(counting the implicit leading 1).  ``__bf16`` uses the `bfloat16
<https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>`_ format,
which provides an 8-bit exponent and an 8-bit significand; this is the same
exponent range as ``float``, just with greatly reduced precision.

``_Float16`` and ``__bf16`` follow the usual rules for arithmetic
floating-point types.  Most importantly, this means that arithmetic operations
on operands of these types are formally performed in the type and produce
values of the type.  ``__fp16`` does not follow those rules: most operations
immediately promote operands of type ``__fp16`` to ``float``, and so
arithmetic operations are defined to be performed in ``float`` and so result in
a value of type ``float`` (unless further promoted because of other operands).
See below for more information on the exact specifications of these types.

Only some of the supported processors for ``__fp16`` and ``__bf16`` offer
native hardware support for arithmetic in their corresponding formats.
The exact conditions are described in the lists above.  When compiling for a
processor without native support, Clang will perform the arithmetic in
``float``, inserting extensions and truncations as necessary.  This can be
done in a way that exactly emulates the behavior of hardware support for
arithmetic, but it can require many extra operations.  By default, Clang takes
advantage of the C standard's allowances for excess precision in intermediate
operands in order to eliminate intermediate truncations within statements.
This is generally much faster but can generate different results from strict
operation-by-operation emulation.

The use of excess precision can be independently controlled for these two
types with the ``-ffloat16-excess-precision=`` and
``-fbfloat16-excess-precision=`` options.  Valid values include:

- ``none`` (meaning to perform strict operation-by-operation emulation)

- ``standard`` (meaning that excess precision is permitted under the rules
  described in the standard, i.e. never across explicit casts or statements)

- ``fast`` (meaning that excess precision is permitted whenever the
  optimizer sees an opportunity to avoid truncations; currently this has no
  effect beyond ``standard``)

The ``_Float16`` type is an interchange floating type specified in
ISO/IEC TS 18661-3:2015 ("Floating-point extensions for C").  It will
be supported on more targets as they define ABIs for it.

The ``__bf16`` type is a non-standard extension, but it generally follows
the rules for arithmetic interchange floating types from ISO/IEC TS
18661-3:2015.  In previous versions of Clang, it was a storage-only type
that forbade arithmetic operations.  It will be supported on more targets
as they define ABIs for it.

The ``__fp16`` type was originally an ARM extension and is specified
by the `ARM C Language Extensions <https://github.com/ARM-software/acle/releases>`_.
Clang uses the ``binary16`` format from IEEE 754-2008 for ``__fp16``,
not the ARM alternative format.  Operators that expect arithmetic operands
immediately promote ``__fp16`` operands to ``float``.

It is recommended that portable code use ``_Float16`` instead of ``__fp16``,
as it has been defined by the C standards committee and has behavior that is
more familiar to most programmers.

Because ``__fp16`` operands are always immediately promoted to ``float``, the
common real type of ``__fp16`` and ``_Float16`` for the purposes of the usual
arithmetic conversions is ``float``.

A literal can be given ``_Float16`` type using the suffix ``f16``. For example,
``3.14f16``.

Because default argument promotion only applies to the standard floating-point
types, ``_Float16`` values are not promoted to ``double`` when passed as variadic
or untyped arguments.  As a consequence, some caution must be taken when using
certain library facilities with ``_Float16``; for example, there is no ``printf`` format
specifier for ``_Float16``, and (unlike ``float``) it will not be implicitly promoted to
``double`` when passed to ``printf``, so the programmer must explicitly cast it to
``double`` before using it with an ``%f`` or similar specifier.

clang/lib/Sema/SemaOverload.cpp
1998–2002

pengfei added inline comments. May 24 2023, 5:25 AM
clang/docs/LanguageExtensions.rst
869
Only some of the supported processors for ``__fp16`` and ``__bf16`` offer
native hardware support for arithmetic in their corresponding formats.

Do you mean `_Float16`?

The exact conditions are described in the lists above.  When compiling for a
processor without native support, Clang will perform the arithmetic in
``float``, inserting extensions and truncations as necessary.

This conflicts a bit with "These types are supported in all language modes, but not on all targets."
Why do we need emulation for a type that isn't necessarily supported on all targets?

My understanding is that inserting extensions and truncations serves two purposes:

  1. Supporting a type that is designed to be available on all targets. For now, that is only __fp16.
  2. Supporting excess-precision=standard. This applies to both _Float16 and __bf16.

rjmccall added inline comments. May 24 2023, 11:07 AM
clang/docs/LanguageExtensions.rst
869

Do you mean _Float16?

Yes, thank you. I knew I'd screw that up somewhere.

Why do we need emulation for a type that isn't necessarily supported on all targets?

Would this be clearer?

Arithmetic on ``_Float16`` and ``__bf16`` is enabled on some targets that don't
provide native architectural support for arithmetic on these formats.  These
targets are noted in the lists of supported targets above.  On these targets,
Clang will perform the arithmetic in ``float``, inserting extensions and truncations
as necessary.

My understanding is that inserting extensions and truncations serves two purposes:

No, I believe we always insert extensions and truncations. The cases you're describing are places we insert extensions and truncations in the *frontend*, so that the backend doesn't see operations on half / bfloat at all. But when these operations do make it to the backend, and there's no direct architectural support for them on the target, the backend still just inserts extensions and truncations so it can do the arithmetic in float. This is clearest in the ARM codegen (https://godbolt.org/z/q9KoGEYqb) because the conversions are just instructions, but you can also see it in the X86 codegen (https://godbolt.org/z/ejdd4P65W): all the runtime functions are just extensions/truncations, and the actual arithmetic is done with mulss and addss. This frontend/backend distinction is not something that matters to users, so the documentation glosses over the difference.

I haven't done an exhaustive investigation, so it's possible that there are types and targets where we emit a compiler-rt call to do each operation instead, but those compiler-rt functions almost certainly just do an extension to float in the same way, so I don't think the documentation as written would be misleading for those targets, either.
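
The extend, compute, truncate pattern described above can be sketched in portable C++ by modeling a bfloat16 value as the upper 16 bits of a `float`. This is only an illustration under stated assumptions, not what Clang, LLVM, or compiler-rt actually emit: the names `bf16_bits`, `bf16_to_float`, `float_to_bf16`, and `emulated_mul` are hypothetical, round-to-nearest-even is assumed for the truncation, and NaN inputs are not handled.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a bfloat16 value: the upper 16 bits of an
// IEEE binary32 float (1 sign bit, 8 exponent bits, 7 stored fraction bits).
using bf16_bits = std::uint16_t;

// Extension: bfloat16 -> float is exact; append 16 zero fraction bits.
float bf16_to_float(bf16_bits h) {
    std::uint32_t u = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Truncation: float -> bfloat16, rounding to nearest with ties to even
// (an assumption of this sketch). NaN is rejected rather than handled.
bf16_bits float_to_bf16(float f) {
    assert(f == f && "NaN is not handled in this sketch");
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    std::uint32_t rounding_bias = 0x7FFFu + ((u >> 16) & 1u);
    return static_cast<bf16_bits>((u + rounding_bias) >> 16);
}

// One emulated bfloat16 multiply: extend both operands, multiply in
// float, then truncate the result back to bfloat16.
bf16_bits emulated_mul(bf16_bits a, bf16_bits b) {
    return float_to_bf16(bf16_to_float(a) * bf16_to_float(b));
}
```

For example, `bf16_to_float(emulated_mul(float_to_bf16(1.5f), float_to_bf16(3.0f)))` evaluates to `4.5f`, since all three values are exactly representable in bfloat16.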

codemzs updated this revision to Diff 525298. May 24 2023, 12:37 PM
codemzs marked 5 inline comments as done.
codemzs retitled this revision from [Clang][Bfloat16] Upgrade __bf16 to arithmetic type, change mangling, and extend excess precision support. to [Clang][BFloat16] Upgrade __bf16 to arithmetic type, change mangling, and extend excess precision support..
codemzs set the repository for this revision to rG LLVM Github Monorepo.

Incorporates suggestions provided by @rjmccall, @pengfei, and @zahiraam.

@rjmccall, your thorough restructuring of the floating-point types documentation is highly appreciated. Thank you.

pengfei added inline comments. May 25 2023, 12:19 AM
clang/docs/LanguageExtensions.rst
869

Thanks for the explanation! Sorry, I failed to distinguish between "support" and "natively support"; I guess users may be confused at first too.

I agree the documentation should explain the overall behavior of the compiler to users. I think there are three things we want to tell them:

  1. Whether a type is an arithmetic type, and whether it is (natively) supported by all targets or just a few;
  2. The result for a type may not be consistent across different targets and/or excess-precision values;
  3. The excess-precision control doesn't take effect if a type is natively supported by the target.

It would be clearer if we gave such a summary before the detailed explanation.

codemzs added inline comments. May 25 2023, 11:15 AM
clang/docs/LanguageExtensions.rst
869

Does adding the below to the top of the description make it more clear?

Half-Precision Floating Point

Clang supports three half-precision (16-bit) floating point types: `__fp16`, `_Float16` and `__bf16`. These types are supported in all language modes, but their support differs across targets. Here, it's important to understand the difference between "support" and "natively support":

  • A type is "supported" if the compiler can handle code using that type, which might involve translating operations into an equivalent code that the target hardware understands.
  • A type is "natively supported" if the hardware itself understands the type and can perform operations on it directly. This typically yields better performance and more accurate results.

Another crucial aspect to note is the consistency of the result of a type across different targets and excess-precision values. Different hardware (targets) might produce slightly different results due to the level of precision they support and how they handle excess-precision values. It means the same code can yield different results when compiled for different hardware.

Finally, note that the control of excess-precision does not take effect if a type is natively supported by targets. If the hardware supports the type directly, the compiler does not need to (and cannot) use excess precision to potentially speed up the operations.

Given these points, here is the detailed support for each type:

  • `__fp16` is supported on every target.
  • `_Float16` is currently supported on the following targets:
    • 32-bit ARM (natively on some architecture versions)
    • 64-bit ARM (AArch64) (natively on ARMv8.2a and above)
    • AMDGPU (natively)
    • SPIR (natively)
    • X86 (if SSE2 is available; natively if AVX512-FP16 is also available)
  • `__bf16` is currently supported on the following targets:
    • 32-bit ARM
    • 64-bit ARM (AArch64)
    • X86 (when SSE2 is available)

...
...

rjmccall added inline comments. May 25 2023, 12:05 PM
clang/docs/LanguageExtensions.rst
869

I think that's a good basic idea, but it's okay to leave some of the detail for later. How about this:

Clang supports three half-precision (16-bit) floating point types: ``__fp16``, ``_Float16`` and ``__bf16``. These types are supported in all language modes, but their support differs between targets.  A target is said to have "native support" for a type if the target processor offers instructions for directly performing basic arithmetic on that type.  In the absence of native support, a type can still be supported if the compiler can emulate arithmetic on the type by promoting to ``float``; see below for more information on this emulation.

* ``__fp16`` is supported on all targets.  The special semantics of this type mean that no arithmetic is ever performed directly on ``__fp16`` values; see below.

* ``_Float16`` is supported on the following targets: (...)

* ``__bf16`` is supported on the following targets (currently never natively): (...)

And then below we can adjust the paragraph about emulation:

When compiling arithmetic on ``_Float16`` and ``__bf16`` for a target without
native support, Clang will perform the arithmetic in ``float``, inserting extensions
and truncations as necessary.  This can be done in a way that exactly matches the
operation-by-operation behavior of native support, but that can require many
extra truncations and extensions.  By default, when emulating ``_Float16`` and
``__bf16`` arithmetic using ``float``, Clang does not truncate intermediate operands
back to their true type unless the operand is the result of an explicit cast or
assignment.  This is generally much faster but can generate different results from
strict operation-by-operation emulation.  (Usually the results are more precise.)
This is permitted by the C and C++ standards under the rules for excess precision
in intermediate operands; see the discussion of evaluation formats in the C
standard and [expr.pre] in the C++ standard.
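
The difference between strict operation-by-operation emulation and emulation with excess precision, as described in the paragraph above, can be illustrated with a small portable C++ model. This is a sketch under stated assumptions rather than Clang's actual code generation: bfloat16 is modeled as the upper 16 bits of a `float`, the names `bf16_bits`, `bf16_to_float`, `float_to_bf16`, `mul_add_strict`, and `mul_add_excess` are hypothetical, round-to-nearest-even is assumed for truncation, and NaN inputs are not handled.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical model of a bfloat16 value: the upper 16 bits of a float.
using bf16_bits = std::uint16_t;

// Exact extension from bfloat16 to float.
float bf16_to_float(bf16_bits h) {
    std::uint32_t u = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Truncation to bfloat16 with round-to-nearest-even (assumed here).
bf16_bits float_to_bf16(float f) {
    assert(f == f && "NaN is not handled in this sketch");
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    std::uint32_t rounding_bias = 0x7FFFu + ((u >> 16) & 1u);
    return static_cast<bf16_bits>((u + rounding_bias) >> 16);
}

// Strict emulation of a * b + c: the product is truncated back to
// bfloat16 before the addition, as native bfloat16 hardware would do.
bf16_bits mul_add_strict(bf16_bits a, bf16_bits b, bf16_bits c) {
    bf16_bits product = float_to_bf16(bf16_to_float(a) * bf16_to_float(b));
    return float_to_bf16(bf16_to_float(product) + bf16_to_float(c));
}

// Excess-precision emulation of a * b + c: the intermediate product
// stays in float and only the final result is truncated.
bf16_bits mul_add_excess(bf16_bits a, bf16_bits b, bf16_bits c) {
    return float_to_bf16(bf16_to_float(a) * bf16_to_float(b) +
                         bf16_to_float(c));
}
```

With `a = b = 1.0078125` (that is, 1 + 2^-7) and `c = -1.015625`, the exact value of `a * b + c` is 2^-14. The strict version rounds the product to 1.015625 first and returns 0, while the excess-precision version returns 2^-14, which matches the remark that the results are usually more precise.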

pengfei added inline comments. May 25 2023, 7:16 PM
clang/docs/LanguageExtensions.rst
869

This revision looks better. The contents are rather clear to me. Thanks!

codemzs updated this revision to Diff 525920. May 25 2023, 7:52 PM
codemzs marked 3 inline comments as done.

Addresses feedback on extended floating type documentation from @rjmccall and @pengfei

One slight miscommunication. Otherwise this LGTM, thank you.

clang/docs/LanguageExtensions.rst
825

You can drop this paragraph, it's no longer necessary. I should've been clearer that I was suggesting this, sorry.

codemzs updated this revision to Diff 526072. May 26 2023, 8:20 AM
codemzs marked an inline comment as done.
codemzs set the repository for this revision to rG LLVM Github Monorepo.

Addresses @rjmccall suggestions.

rjmccall accepted this revision. May 26 2023, 9:31 AM
This revision is now accepted and ready to land. May 26 2023, 9:31 AM

Hi @rjmccall, @pengfei, and @zahiraam,

Thank you for your valuable review and acceptance of my patch. As I lack commit access, could I kindly request one of you to perform the commit on my behalf? Please use the following command: git commit --amend --author="M. Zeeshan Siddiqui <mzs@microsoft.com>".

git commit message:

[Clang][BFloat16] Upgrade __bf16 to arithmetic type, change mangling,
and extend excess precision support

Pursuant to discussions at
https://discourse.llvm.org/t/rfc-c-23-p1467r9-extended-floating-point-types-and-standard-names/70033/22,
this commit enhances the handling of the __bf16 type in Clang.
- Firstly, it upgrades __bf16 from a storage-only type to an arithmetic
  type.
- Secondly, it changes the mangling of __bf16 to DF16b on all
  architectures except ARM. This change has been made in
  accordance with the finalization of the mangling for the
  std::bfloat16_t type, as discussed at
  https://github.com/itanium-cxx-abi/cxx-abi/pull/147.
- Finally, this commit extends the existing excess precision support to
  the __bf16 type. This applies to hardware architectures that do not
  natively support bfloat16 arithmetic.
Appropriate tests have been added to verify the effects of these
changes and ensure no regressions in other areas of the compiler.

Reviewed By: rjmccall, pengfei, zahiraam

Differential Revision: https://reviews.llvm.org/D150913

I would like to add that I have rebased this patch on LLVM main as of just now and also applied clang formatting to this patch. However, to maintain consistency and respect untouched lines, I opted to reverse certain clang-format changes, which might result in a clang format failure on Debian. Should you deem it necessary for me to apply clang formatting across all lines regardless, I am open to revising the format accordingly.

Your assistance is greatly appreciated.

This revision was landed with ongoing or failed builds. May 26 2023, 10:34 PM
This revision was automatically updated to reflect the committed changes.

I'm late to review and can no longer stamp an approval on this, but I'll note for the historical record that the changes look good to me.