This is a proposal to add a new llvm intrinsic, llvm.arith.fence. The purpose is to provide fine control, at the expression level, over floating point optimization when -ffast-math (-ffp-model=fast) is enabled. We are also proposing a new clang builtin that provides access to this intrinsic, as well as a new clang command line option `-fprotect-parens` that will be implemented using this intrinsic.

This patch is authored by @pengfei

### Rationale

Some expression transformations that are mathematically correct, such as reassociation and distribution, may be incorrect when dealing with finite precision floating point. For example, these two expressions,

    (a + b) + c
    a + (b + c)

are equivalent mathematically and in integer arithmetic, but not in floating point. In some floating point (FP) models, the compiler is allowed to make these value-unsafe transformations for performance reasons, even when the programmer uses parentheses explicitly. The compiler must, however, always honor the parentheses implied by llvm.arith.fence, regardless of the FP model settings.
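To make the non-associativity concrete, here is a minimal C sketch. The function names `sum_left` and `sum_right` and the input values are illustrative choices, not part of the proposal. The result described below assumes IEEE 754 binary64 `double` arithmetic with round-to-nearest-even and no fast-math flags; the `volatile` temporaries are only there to force each intermediate sum to be rounded to `double`.

```c
/* (a + b) + c: the inner sum is rounded to double before c is added. */
double sum_left(double a, double b, double c) {
    volatile double t = a + b;
    return t + c;
}

/* a + (b + c): same operands, different grouping. */
double sum_right(double a, double b, double c) {
    volatile double t = b + c;
    return a + t;
}
```

With `a = 1e16`, `b = -1e16`, `c = 1.0`, `sum_left` returns 1.0, while `sum_right` returns 0.0: adjacent doubles near 1e16 are 2 apart, so `-1e16 + 1.0` rounds back to `-1e16`. A compiler that is allowed to reassociate may turn one grouping into the other.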

Under `-ffp-model=fast`, llvm.arith.fence provides a way to partially enforce ordering in an FP expression.

| Original expression | Transformed expression | Permitted? |
|---|---|---|
| (a + b) + c | a + (b + c) | Yes! |
| llvm.arith.fence(a + b) + c | a + (b + c) | No! |

Under `-ffp-model=precise`, FP expressions are already strictly ordered.

The new llvm intrinsic also enables the implementation of the option `-fprotect-parens` which is available in gfortran as well as the Intel C++ and Fortran compilers: icc and ifort.

### Proposed llvm IR changes

Requirements for llvm.arith.fence:

- There is one operand. The input to the intrinsic is an llvm::Value and must be scalar floating point or vector floating point.
- The return type is the same as the operand type.
- The return value is equivalent to the operand.

### Optimizing llvm.arith.fence

- Constant folding: when the operand of llvm.arith.fence is constant, the fence may be replaced by that constant value.
- CSE detection: no special changes are needed. If E1 and E2 are common subexpressions, then llvm.arith.fence(E1) and llvm.arith.fence(E2) are as well.
- FMA transformation should be enabled, at least in the `-ffp-model=fast` case.
  - The expression “llvm.arith.fence(a * b) + c” means that “a * b” must happen before “+ c”. FMA guarantees that, but to prevent later optimizations from unpacking the FMA, the correct transformation needs to be:

        llvm.arith.fence(a * b) + c → llvm.arith.fence(FMA(a, b, c))

  - In the `-ffp-model=fast` case, FMA formation doesn’t happen until ISel, so we just need to add the llvm.arith.fence cases to ISel pattern matching.
  - There are some choices around the FMA optimization. For this example:

        %t1 = fmul double %x, %y
        %t2 = call double @llvm.arith.fence.f64(double %t1)
        %t3 = fadd contract double %t2, %z

    - FMA is allowed across an arith.fence if and only if the FMF `contract` flag is set for the llvm.arith.fence operand. *After review discussion, we are convinced this choice doesn't work.*
    - FMA is not allowed across a fence. *We are recommending this choice.*
    - The FMF `contract` flag should be set on the llvm.arith.fence intrinsic call if contraction should be enabled.

- Fast Math Optimization:
  - The result of an llvm.arith.fence can participate in fast math optimizations. For example:

        // This transformation is legal:
        w + llvm.arith.fence(x + y) + z → w + z + llvm.arith.fence(x + y)

  - The operand of an llvm.arith.fence can participate in fast math optimizations. For example:

        // This transformation is legal:
        llvm.arith.fence((x+y)+z) → llvm.arith.fence(x+(y+z))

- MIR Optimization:
  - The use of a pseudo-operation in the MIR serves the same purpose as the intrinsic in the IR, since all the optimizations are based on pattern matching from known DAGs/MIs.
  - The backend simply respects the llvm.arith.fence intrinsic: it builds an llvm.arith.fence node during DAG/ISel and emits a pseudo arithmetic_fence MI afterwards.
  - The pseudo arithmetic_fence MI turns into a comment when emitting assembly.

### Other llvm changes needed -- utility functions

The ValueTracking utilities will need to be taught to handle the new intrinsic. For example, there are utility functions like `isKnownNeverNaN()` and `CannotBeOrderedLessThanZero()` that will need to “look through” the intrinsic.
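The “look through” idea can be sketched with a toy expression tree in C. This is not LLVM's ValueTracking code: the `Node` type, the `NodeKind` values, and `toy_is_known_never_nan` are hypothetical names invented for illustration. The sketch relies only on the property that the fence returns a value equivalent to its operand.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy node kinds (hypothetical, not LLVM types). */
typedef enum {
    NODE_FINITE_CONST, /* a constant known to be finite, hence never NaN */
    NODE_UNKNOWN,      /* anything the analysis cannot reason about */
    NODE_ARITH_FENCE   /* models llvm.arith.fence: value equals its operand */
} NodeKind;

typedef struct Node {
    NodeKind kind;
    const struct Node *operand; /* used only by NODE_ARITH_FENCE */
} Node;

/* Returns true only when the value is provably never NaN. */
bool toy_is_known_never_nan(const Node *n) {
    /* The fence does not change the value, so strip any chain of fences
     * and answer the query for the underlying operand instead. */
    while (n != NULL && n->kind == NODE_ARITH_FENCE)
        n = n->operand;
    return n != NULL && n->kind == NODE_FINITE_CONST;
}
```

Without the loop that strips fences, every fenced value would fall into the conservative "unknown" case, which is the loss of optimization capability this section is meant to avoid.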

### A simple example

    ; LLVM IR, llvm.arith.fence over addition.
    %4 = load double, double* %A, align 8
    %5 = load double, double* %B, align 8
    %add1 = fadd fast double %4, %5
    %6 = call double @llvm.arith.fence.f64(double %add1)
    %7 = load double, double* %C, align 8
    %mul = fmul fast double %6, %7
    store double %mul, double* %A, align 8

### Example, llvm.arith.fence over memory operand

Consider this similar example, which illustrates how ‘x’ can be optimized while ‘z’ is fenced. Notice ‘q’ is simplified to ‘b’ (q = a + b - a -> q = b), but ‘z’ isn’t simplified because of the fence.

    ; LLVM IR
    define dso_local float @f(float %a, float %b) local_unnamed_addr #0 {
      %x = fadd fast float %b, %a
      %tmp = call fast float @llvm.arith.fence.f32(float %x)
      %z = fsub fast float %tmp, %a
      %result = call fast float @llvm.maxnum.f32(float %z, float %b)
      ret float %result
    }

### Clang changes to take advantage of this intrinsic

- Add new clang builtin `__arithmetic_fence`
  - Add builtin definition
    - There is one operand. Any kind of expression, including memory operand.
    - The return type is the same as the operand type. The result of the intrinsic is the value of its rvalue operand.
    - The operand type can be any scalar floating point type, complex, or vector with float or complex element type.
    - The invocation of `__arithmetic_fence` is not a C/C++ constant expression, even if the operands are constant.
  - Add semantic checks and test cases
  - Modify clang/codegen to generate the llvm.arith.fence intrinsic

- Add support for a new command-line option `-fprotect-parens`, which honors parentheses within a floating point expression. The default is `-fno-protect-parens`. For example,

      // Compile with -ffast-math
      double A,B,C;
      A = __arithmetic_fence(A+B)*C;

      ; LLVM IR
      %4 = load double, double* %A, align 8
      %5 = load double, double* %B, align 8
      %add1 = fadd fast double %4, %5
      %6 = call double @llvm.arith.fence.f64(double %add1)
      %7 = load double, double* %C, align 8
      %mul = fmul fast double %6, %7
      store double %mul, double* %A, align 8

- Motivation: the new clang builtin provides clang compatibility with the Intel C++ compiler builtin `__fence`, which has similar semantics, and likewise enables implementation of the option `-fprotect-parens`. The new builtin provides the clang programmer control over floating point optimizations at the expression level.
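As an illustration of expression-level control, the sketch below uses compensated (Kahan) summation, a well-known case where reassociation under fast-math can cancel the correction term algebraically. It assumes a compiler that provides `__arithmetic_fence` as proposed here; the `ARITH_FENCE` macro, the conservative x86-only guard, and the function names are assumptions made for this example, and the fallback is a plain identity. Whether a particular compiler release preserves the compensation under `-ffast-math` has not been verified here.

```c
#include <stddef.h>

/* Use the builtin only where it can be detected; otherwise fall back to the
 * identity, which is still correct when building without fast-math.
 * The x86 check is a conservative assumption about target support. */
#if defined(__has_builtin)
#  if __has_builtin(__arithmetic_fence) && (defined(__x86_64__) || defined(__i386__))
#    define ARITH_FENCE(x) __arithmetic_fence(x)
#  endif
#endif
#ifndef ARITH_FENCE
#  define ARITH_FENCE(x) (x)
#endif

/* Plain left-to-right summation, for comparison. */
double naive_sum(const double *v, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += v[i];
    return sum;
}

/* Kahan summation: comp tracks the low-order bits lost by each addition. */
double kahan_sum(const double *v, size_t n) {
    double sum = 0.0;
    double comp = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double y = v[i] - comp;
        /* Fence the rounded sum so it is treated as an opaque value. */
        double t = ARITH_FENCE(sum + y);
        /* Algebraically (t - sum) - y is zero; the fence asks the compiler
         * to evaluate (t - sum) as written before subtracting y. */
        comp = ARITH_FENCE(t - sum) - y;
        sum = t;
    }
    return sum;
}
```

For an input of `1.0` followed by ten values of `1e-16`, `naive_sum` returns exactly 1.0 under IEEE double arithmetic because each small addend is rounded away, while `kahan_sum` returns a value greater than 1.0. Two fences are used because the operand of a fence may itself be optimized, so fencing only `t - sum` would not stop `t` from being substituted and simplified.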

### Pros & Cons

- Pros
  - Increases expressiveness and precise control over floating point calculations.
  - Provides a desirable compatibility feature from industrial compilers.
- Cons
  - Intrinsic bloat.
  - Some of LLVM's optimizations need to understand the llvm.arith.fence semantics in order to retain optimization capabilities. This will require at least some engineering effort.
  - Any target that wants to support this has to make modifications to their back-end.