Page MenuHomePhabricator

[llvm][clang] Create new intrinsic llvm.arithmetic.fence to control FP optimization at expression level
Needs ReviewPublic

Authored by mibintc on Mar 31 2021, 11:26 AM.

Details

Summary

This patch adds a new llvm intrinsic, llvm.arith.fence. The purpose is to provide fine control, at the expression level, over floating point optimization when -ffast-math (-ffp-model=fast) is enabled. We are also proposing a new clang builtin that provides access to this intrinsic, as well as a new clang command line option -fprotect-parens that will be implemented using this intrinsic.

This patch is authored by @pengfei

Rationale

Some expression transformations that are mathematically correct, such as reassociation and distribution, may be incorrect when dealing with finite precision floating point. For example, these two expressions,

(a + b) + c
a + (b + c)

are equivalent mathematically in integer arithmetic, but not in floating point. In some floating point (FP) models, the compiler is allowed to make these value-unsafe transformations for performance reasons, even when the programmer uses parentheses explicitly. But the compiler must always honor the parentheses implied by llvm.arith.fence, regardless of the FP model settings.

Under –ffp-model=fast, llvm.arith.fence provides a way to partially enforce ordering in an FP expression.

Original expressionTransformed expressionPermitted?
(a + b) + ca + (b + c)Yes!
llvm.arith.fence(a + b) + ca + (b + c)No!
NOTE: The llvm.arith.fence serves no purpose in value-safe FP modes like –ffp-model=precise: FP expressions are already strictly ordered.

The new llvm intrinsic also enables the implementation of the option -fprotect-parens which is available in gfortran as well as the Intel C++ and Fortran compilers: icc and ifort.

Proposed llvm IR changes

Requirements for llvm.arith.fence:

  • There is one operand. The input to the intrinsic is an llvm::Value and must be scalar floating point or vector floating point.
  • The return type is the same as the operand type.
  • The return value is equivalent to the operand.

Optimizing llvm.arith.fence

  • Constant folding may substitute the constant value of the llvm.arith.fence operand for the value of fence itself in the case where the operand is constant.
  • CSE Detection: No special changes needed: if E1 and E2 are CSE, then llvm.arith.fence(E1) and llvm.arith.fence(E2) are CSE.
  • FMA transformation should be enabled, at least in the -ffp-model=fast case.
    • The expression “llvm.arith.fence(a * b) + c” means that “a * b” must happen before “+ c” and FMA guarantees that, but to prevent later optimizations from unpacking the FMA the correct transformation needs to be:
llvm.arith.fence(a * b) + c  →  llvm.arith.fence(FMA(a, b, c))
  • In the ffp-model=fast case, FMA formation doesn’t happen until Isel, so we just need to add the llvm.arith.fence cases to ISel pattern matching.
  • There are some choices around the FMA optimization. For this example:
%t1 = fmul double %x, %y
%t2 = call double @llvm.arith.fence.f64(double %t1)
%t3 = fadd contract double %t2, %z
    1. FMA is allowed across an arith.fence if and only if the FMF contract flag is set for the llvm.arith.fence operand. After review discussion, we are convinced this choice doesn't work.
    2. FMA is not allowed across a fence We are recommending this choice
    3. The FMF contract flag should be set on the llvm.arith.fence intrinsic call if contraction should be enabled
  • Fast Math Optimization:
    • The result of a llvm.arith.fence can participate in fast math optimizations. For example:
// This transformation is legal:
w + llvm.arith.fence(x + y) + z   →   w + z + llvm.arith.fence(x + y)
  • The operand of a llvm.arith.fence can participate in fast math optimizations. For example:
// This transformation is legal:
llvm.arith.fence((x+y)+z) --> llvm.arith.fence(x+(y+z))
NOTE: We want fast-math optimization within the fence, but not across the fence.
  • MIR Optimization:
    • The use of a pseudo-operation in the MIR serves the same purpose as the intrinsic in the IR, since all the optimizations are based on patterns matching from known DAGs/MIs.
    • Backend simply respects the llvm.arith.fence intrinsic, builds llvm.arith.fence node during DAG/ISel and emits pseudo arithmetic_fence MI after it.
    • The pseudo arithmetic_fence MI turns into a comment when emitting assembly.

Other llvm changes needed -- utility functions

The ValueTracking utilities will need to be taught to handle the new intrinsic. For example, there are utility functions like isKnownNeverNaN() and CannotBeOrderedLessThanZero() that will need to “look through” the intrinsic.

A simple example

// llvm IR, llvm.arith.fence over addition.
 %5 = load double, double* %B, align 8
 %add1 = fadd fast double %4, %5
 %6 = call double @llvm.arith.fence.f64(double %add1)
 %7 = load double, double* %C, align 8
 %mul = fmul fast double %6, %7
 store double %mul, double* %A, align 8

Example, llvm.arith.fence over memory operand

Consider this similar example, which illustrates how ‘x’ can be optimized while ‘z’ is fenced. Notice ‘q’ is simplified to ‘b’ (q = a + b - a -> q = b), but ‘z’ isn’t simplified because of the fence.

// llvm IR
  define dso_local float @f(float %a, float %b) 
  local_unnamed_addr #0 {
  %x = fadd fast float %b, %a
  %tmp = call fast float @llvm.arith.fence.f32(float %x)
  %z = fsub fast float %tmp, %a
  %result = call fast float @llvm.maxnum.f32(float %z, float %b)
  ret float %result

Clang changes to take advantage of this intrinsic

  • Add new clang builtin __arithmetic_fence
    • Add builtin definition
      • There is one operand. Any kind of expression, including memory operand.
      • The return type is the same as the operand type. The result of the intrinsic is the value of its rvalue operand.
      • The operand type can be any scalar floating point type, complex, or vector with float or complex element type.
      • The invocation of __arithmetic_fence is not a C/C++ constant expression, even if the operands are constant.
    • Add semantic checks and test cases
    • Modify clang/codegen to generate the llvm.arith.fence intrinsic
  • Add support for a new command-line option -fprotect-parens which honors parentheses within a floating point expression, the default is -fno-protect-parens. For example,
// Compile with -ffast-math
double A,B,C;
A = __arithmetic_fence(A+B)*C;

// llvm IR
 %4 = load double, double* %A, align 8
 %5 = load double, double* %B, align 8
 %add1 = fadd fast double %4, %5
 %6 = call double @llvm.arith_fence.f64(double %add1)
 %7 = load double, double* %C, align 8
 %mul = fmul fast double %6, %7
 store double %mul, double* %A, align 8
  • Motivation: the new clang builtin provides clang compatibility with the Intel C++ compiler builtin __fence which has similar semantics, and likewise enables implementation of the option -fprotect-parens. The new builtin provides the clang programmer control over floating point optimizations at the expression level.

Pros & Cons

    1. Pros
  • Increases expressiveness and precise control over floating point calculations.
  • Provides a desirable compatibility feature from industrial compilers
    1. Cons
  • Intrinsic bloat.
  • Some of LLVM's optimizations need to understand the llvm.arith.fence semantics in order to retain optimization capabilities. This will require at least some engineering effort.
  • Any target that wants to support this has to make modifications to their back-end.

Diff Detail

Unit TestsFailed

TimeTest
50 msx64 debian > LLVM.CodeGen/Thumb::pr35836_2.ll
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/llc < /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/Thumb/pr35836_2.ll | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/Thumb/pr35836_2.ll
70 msx64 debian > LLVM.CodeGen/Thumb::umulo-128-legalisation-lowering.ll
Script: -- : 'RUN: at line 2'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/llc < /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/Thumb/umulo-128-legalisation-lowering.ll -mtriple=thumb-eabi -mattr=+v6 | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/Thumb/umulo-128-legalisation-lowering.ll --check-prefixes=THUMBV6
50 msx64 debian > LLVM.CodeGen/Thumb2::umulo-128-legalisation-lowering.ll
Script: -- : 'RUN: at line 2'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/llc < /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/Thumb2/umulo-128-legalisation-lowering.ll -mtriple=thumbv7-unknown-none-gnueabi | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/Thumb2/umulo-128-legalisation-lowering.ll --check-prefixes=THUMBV7
100 msx64 windows > LLVM.CodeGen/Thumb::pr35836_2.ll
Script: -- : 'RUN: at line 1'; c:\ws\w32-1\llvm-project\premerge-checks\build\bin\llc.exe < C:\ws\w32-1\llvm-project\premerge-checks\llvm\test\CodeGen\Thumb\pr35836_2.ll | c:\ws\w32-1\llvm-project\premerge-checks\build\bin\filecheck.exe C:\ws\w32-1\llvm-project\premerge-checks\llvm\test\CodeGen\Thumb\pr35836_2.ll
100 msx64 windows > LLVM.CodeGen/Thumb::umulo-128-legalisation-lowering.ll
Script: -- : 'RUN: at line 2'; c:\ws\w32-1\llvm-project\premerge-checks\build\bin\llc.exe < C:\ws\w32-1\llvm-project\premerge-checks\llvm\test\CodeGen\Thumb\umulo-128-legalisation-lowering.ll -mtriple=thumb-eabi -mattr=+v6 | c:\ws\w32-1\llvm-project\premerge-checks\build\bin\filecheck.exe C:\ws\w32-1\llvm-project\premerge-checks\llvm\test\CodeGen\Thumb\umulo-128-legalisation-lowering.ll --check-prefixes=THUMBV6
View Full Test Results (6 Failed)

Event Timeline

mibintc created this revision.Mar 31 2021, 11:26 AM
mibintc requested review of this revision.Mar 31 2021, 11:26 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 31 2021, 11:26 AM
Herald added a subscriber: jdoerfert. · View Herald Transcript
mibintc edited the summary of this revision. (Show Details)Mar 31 2021, 2:26 PM
mibintc edited the summary of this revision. (Show Details)Mar 31 2021, 2:29 PM
mibintc edited the summary of this revision. (Show Details)Mar 31 2021, 2:32 PM
spatel added a subscriber: spatel.Apr 6 2021, 8:33 AM

The expression “llvm.arith.fence(a * b) + c” means that “a * b” must happen before “+ c” and FMA guarantees that, but to prevent later optimizations from unpacking the FMA the correct transformation needs to be:

llvm.arith.fence(a * b) + c → llvm.arith.fence(FMA(a, b, c))

Does this actually block later transforms from unpacking the FMA? Maybe if the FMA isn't marked "fast"...


How is llvm.arith.fence() different from using "freeze" on a floating-point value? The goal isn't really the same, sure, but the effects seem similar at first glance.

The expression “llvm.arith.fence(a * b) + c” means that “a * b” must happen before “+ c” and FMA guarantees that, but to prevent later optimizations from unpacking the FMA the correct transformation needs to be:

llvm.arith.fence(a * b) + c → llvm.arith.fence(FMA(a, b, c))

Does this actually block later transforms from unpacking the FMA? Maybe if the FMA isn't marked "fast"...

I think we could define llvm.arith.fence to be such that this FMA contraction isn't legal/correct, or it could be left as is. In the implementation that was used for the Intel compiler FMA contraction did not occur across an an __fence boundary. It is unclear whether that was intended as the semantic, or if we just never bothered to implement that contraction.
Not allowing the FMA contraction across the llvm.arith.fence would make unpacking an FMA allowed under the same circumstances that LLVM currently allows that.


How is llvm.arith.fence() different from using "freeze" on a floating-point value? The goal isn't really the same, sure, but the effects seem similar at first glance.

They are similar. However, fence is a no-op if the operand can be proven not to be undef or poison, and in such circumstances could be removed by an optimizer. llvm.arith.fence cannot be removed by an optimizer, because doing so might allow instructions that were "outside" the fence from being reassociated/distrbuted with the instructions/operands that were inside the fence.

The expression “llvm.arith.fence(a * b) + c” means that “a * b” must happen before “+ c” and FMA guarantees that, but to prevent later optimizations from unpacking the FMA the correct transformation needs to be:

llvm.arith.fence(a * b) + c → llvm.arith.fence(FMA(a, b, c))

Does this actually block later transforms from unpacking the FMA? Maybe if the FMA isn't marked "fast"...

I'd like @pengfei to reply to this question. I think the overall idea is that many of the optimizations are pattern based, and the existing pattern wouldn't match the new intrinsic.


How is llvm.arith.fence() different from using "freeze" on a floating-point value? The goal isn't really the same, sure, but the effects seem similar at first glance.

Initially we thought the intrinsic "ssa.copy" could serve. However ssa.copy is for a different purpose and it gets optimized away. We want arith.fence to survive through codegen, that's one reason why we think a new intrinsic is needed.

The expression “llvm.arith.fence(a * b) + c” means that “a * b” must happen before “+ c” and FMA guarantees that, but to prevent later optimizations from unpacking the FMA the correct transformation needs to be:

llvm.arith.fence(a * b) + c → llvm.arith.fence(FMA(a, b, c))

Does this actually block later transforms from unpacking the FMA? Maybe if the FMA isn't marked "fast"...

Later transforms could unpack the FMA, but the result would be fenced. The intent isn't so much to prevent the FMA from being unpacked as to prevent losing the original fence semantics. That said, it doesn't quite work. For example, you might have this:

%mul = fmul fast float %a, %b
%fenced_mul = call float @llvm.arith.fence.f32(%mul)
%result = fadd fast float %fenced_mul, %c

If there are no other uses of %fenced_mul, that could become

%tmp = call fast float @llvm.fmuladd.f32(float %a, float %b, float %c)
%result = call float @llvm.arith.fence.f32(%tmp)

If a later optimization decided to unpack this, it would become this:

%mul = fmul fast float %a, %b
%tmp = fadd fast float %mul, %c
%result = call float @llvm.arith.fence.f32(%tmp)

I suggested this as a way of enabling the FMA optimization. It brings the fadd into the fence, but still protects the fmul from being reassociated or otherwise transformed with other operations outside the fence. In a purely practical sense, this would probably work. In a more strict sense, though, I now see that it has the problem that you could legally distribute the addition within the fence. I can't see a practical reason anyone would do that, but the semantics would allow it. The same ("legal but not practical") is true of forming the fmuladd intrinsic before codegen, I think.

So, no, I don't think this works the way it was intended.

That might push us back to Kevin's suggestion of just not allowing the FMA optimization across a fence.

How is llvm.arith.fence() different from using "freeze" on a floating-point value? The goal isn't really the same, sure, but the effects seem similar at first glance.

They are similar. However, freeze is a no-op if the operand can be proven not to be undef or poison, and in such circumstances could be removed by an optimizer. llvm.arith.fence cannot be removed by an optimizer, because doing so might allow instructions that were "outside" the fence from being reassociated/distrbuted with the instructions/operands that were inside the fence.

Okay. In practice, it's basically impossible for us to prove that the result of "fast" arithmetic isn't poison, given the way ninf/nnan are defined, but depending on that would be fragile.

mibintc updated this revision to Diff 337879.Apr 15 2021, 1:12 PM
mibintc edited the summary of this revision. (Show Details)

This is a minor update from @pengfei which allows simple tests cases to run end-to-end with clang.
Also I changed the "summary" to reflect the review discussion around the FMA optimization, to choose "FMA is not allowed across a fence".

mibintc updated this revision to Diff 338099.Apr 16 2021, 7:04 AM

I accidentally dropped the test case in previous commit. Just adding it back in -- under the llvm/test directory (previously it was in the wrong location).

kpn added a subscriber: kpn.Apr 16 2021, 11:30 AM

What changes are needed for a backend, and what happens if they aren't done?

In D99675#2695424, @kpn wrote:

What changes are needed for a backend, and what happens if they aren't done?

In the clang patch, I'm planning to add into TargetInfo a function like "does the target support __arithmetic_fence"?
In the llvm patch, the fallback implementation could be to merely ignore the call, and pass through the operand value. Is that adequate?

kpn added a comment.Apr 16 2021, 12:26 PM
In D99675#2695424, @kpn wrote:

What changes are needed for a backend, and what happens if they aren't done?

In the clang patch, I'm planning to add into TargetInfo a function like "does the target support __arithmetic_fence"?
In the llvm patch, the fallback implementation could be to merely ignore the call, and pass through the operand value. Is that adequate?

If clang is the only compiler to ever emit this new intrinsic then, yes, that's perfectly fine.

If a front-end other than clang uses the new fence then I'm nervous about having the fence just vanish. If the fence is used then it must be for correctness, right? Having something needed for correctness silently not work seems ... sub-optimal. It's the sort of thing that might not get caught in testing, and then you've got end-users running software that silently lacks something needed for correctness. That makes me nervous. I'd rather LLVM bomb instead of silently dropping this fence. Then developers know they have a problem before a product goes out the door.

But if I'm the only one that's nervous then that's OK and clang rejecting the compile would be sufficient.

Has this sort of issue come up in the past? How was it handled?

In D99675#2695424, @kpn wrote:

What changes are needed for a backend, and what happens if they aren't done?

As far as I understand it, backend does optimizations based on patterns of the known nodes and MIs. Inserting a new node/MI will block any optimizations across the fence. So it respects the semantics of the intrinsic without target special chenges.
I'm not sure if there's room for optimization cross the arithmetic.fence. If there is and no changes for it, backend may have some performance loss under these circumstances.

Having something needed for correctness silently not work seems ... sub-optimal.

I think backend is conservative for optimizations when use the intrinsic. It won't have correctness issue silently, but performance loss might.

kpn added a comment.Tue, May 18, 7:51 AM
In D99675#2695424, @kpn wrote:

What changes are needed for a backend, and what happens if they aren't done?

As far as I understand it, backend does optimizations based on patterns of the known nodes and MIs. Inserting a new node/MI will block any optimizations across the fence. So it respects the semantics of the intrinsic without target special chenges.
I'm not sure if there's room for optimization cross the arithmetic.fence. If there is and no changes for it, backend may have some performance loss under these circumstances.

Having something needed for correctness silently not work seems ... sub-optimal.

I think backend is conservative for optimizations when use the intrinsic. It won't have correctness issue silently, but performance loss might.

OK, that sounds fine, then.

Matt added a subscriber: Matt.Wed, May 19, 11:13 AM
mibintc updated this revision to Diff 348046.Wed, May 26, 11:58 AM
mibintc retitled this revision from RFC [llvm][clang] Create new intrinsic llvm.arith.fence to control FP optimization at expression level to [llvm][clang] Create new intrinsic llvm.arith.fence to control FP optimization at expression level.
mibintc edited the summary of this revision. (Show Details)

Rebased to ToT. It fixes the previous illegal type lowering problems. It also updates the tests to show the functionality in a better way as well as fixes a newly found problem.

Ready for your code review and +1

We think this patch provides basic functionality for the intrinsic, and enhancements can be added in future patches.

Thanks!

We may add description on the intrinsic in docs/LangRef.rst.

craig.topper added inline comments.Thu, Jun 3, 10:15 AM
llvm/include/llvm/IR/IRBuilder.h
911

Do you really need curly braces around DstType and Val? A single value should be implicitly convertible to ArrayRef.

llvm/include/llvm/IR/Intrinsics.td
1341

This comment got duplicated.

llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
1336

I think you should check isVerbose() before printing this.

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
3151

What about splitting a vector like v8f32 on SSE2?

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6296

There's already a variable called sdl that contains this. It's used in the surrounding cases.

6299

Why isn't this just Val.getValueType()?

mibintc updated this revision to Diff 350307.Mon, Jun 7, 8:26 AM

This patch addresses all of @craig.topper comments and adds documentation for the new intrinsic to the language reference as requested by @LuoYuanke nke

pengfei added inline comments.Mon, Jun 7, 10:28 PM
llvm/docs/LangRef.rst
21195

Should be equal to the text?

mibintc updated this revision to Diff 350700.Tue, Jun 8, 1:04 PM

I corrected error in LangRef documentation that @pengfei pointed out.

pengfei added inline comments.Wed, Jun 9, 7:34 AM
llvm/docs/LangRef.rst
21195

Yeah, a good catch. But I initially meant ^^^ should be equal to the title. :)

mibintc updated this revision to Diff 350912.Wed, Jun 9, 8:48 AM
mibintc retitled this revision from [llvm][clang] Create new intrinsic llvm.arith.fence to control FP optimization at expression level to [llvm][clang] Create new intrinsic llvm.arithmetic.fence to control FP optimization at expression level.

Correct small formatting issue in LangRef.rst thanks @pengfei