This patch adds a new llvm intrinsic, llvm.arith.fence. The purpose is to provide fine control, at the expression level, over floating point optimization when -ffast-math (-ffp-model=fast) is enabled. We are also proposing a new clang builtin that provides access to this intrinsic, as well as a new clang command line option -fprotect-parens that will be implemented using this intrinsic.
This patch is authored by @pengfei
Rationale
Some expression transformations that are mathematically correct, such as reassociation and distribution, may be incorrect when dealing with finite precision floating point. For example, these two expressions,
    (a + b) + c
    a + (b + c)
are mathematically equivalent, and also equivalent in integer arithmetic, but not in floating point. In some floating point (FP) models, the compiler is allowed to make these value-unsafe transformations for performance reasons, even when the programmer uses parentheses explicitly. The compiler must, however, always honor the parentheses implied by llvm.arith.fence, regardless of the FP model settings.
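To make the loss of associativity concrete, here is a minimal C sketch. The function names and input values are chosen for illustration and are not part of the patch. It assumes the file is compiled without -ffast-math, so the compiler evaluates each expression as written.

```c
#include <stdio.h>

/* Evaluates (a + b) + c exactly as written. */
double sum_left(double a, double b, double c) {
    return (a + b) + c;
}

/* Evaluates a + (b + c), the reassociated form. */
double sum_right(double a, double b, double c) {
    return a + (b + c);
}

/* Hypothetical helper: prints both results for one set of inputs. */
void show_reassociation(void) {
    double a = 1e20, b = -1e20, c = 1.0;
    /* a + b cancels exactly, so the left form keeps c. */
    printf("(a + b) + c = %g\n", sum_left(a, b, c));
    /* b + c rounds back to b because c is far below b's precision,
     * so the right form loses c entirely. */
    printf("a + (b + c) = %g\n", sum_right(a, b, c));
}
```

Calling show_reassociation() from main shows the two forms disagreeing, which is exactly the kind of change a fast FP model is allowed to introduce.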
Under -ffp-model=fast, llvm.arith.fence provides a way to partially enforce ordering in an FP expression.
Original expression | Transformed expression | Permitted? |
---|---|---|
(a + b) + c | a + (b + c) | Yes! |
llvm.arith.fence(a + b) + c | a + (b + c) | No! |
The new llvm intrinsic also enables the implementation of the option -fprotect-parens which is available in gfortran as well as the Intel C++ and Fortran compilers: icc and ifort.
Proposed llvm IR changes
Requirements for llvm.arith.fence:
- There is one operand. The input to the intrinsic is an llvm::Value and must be scalar floating point or vector floating point.
- The return type is the same as the operand type.
- The return value is equivalent to the operand.
Optimizing llvm.arith.fence
- Constant folding may substitute the constant value of the llvm.arith.fence operand for the value of the fence itself when the operand is constant.
- CSE Detection: No special changes needed: if E1 and E2 are CSE, then llvm.arith.fence(E1) and llvm.arith.fence(E2) are CSE.
- FMA transformation should be enabled, at least in the -ffp-model=fast case.
- The expression “llvm.arith.fence(a * b) + c” means that “a * b” must happen before “+ c” and FMA guarantees that, but to prevent later optimizations from unpacking the FMA the correct transformation needs to be:
llvm.arith.fence(a * b) + c → llvm.arith.fence(FMA(a, b, c))
- In the -ffp-model=fast case, FMA formation doesn’t happen until ISel, so we just need to add the llvm.arith.fence cases to ISel pattern matching.
- There are some choices around the FMA optimization. For this example:
    %t1 = fmul double %x, %y
    %t2 = call double @llvm.arith.fence.f64(double %t1)
    %t3 = fadd contract double %t2, %z
- FMA is allowed across an arith.fence if and only if the FMF contract flag is set for the llvm.arith.fence operand. After review discussion, we are convinced this choice doesn't work.
- FMA is not allowed across a fence. We are recommending this choice.
- The FMF contract flag should be set on the llvm.arith.fence intrinsic call if contraction should be enabled
- Fast Math Optimization:
- The result of a llvm.arith.fence can participate in fast math optimizations. For example:
    // This transformation is legal:
    w + llvm.arith.fence(x + y) + z → w + z + llvm.arith.fence(x + y)
- The operand of a llvm.arith.fence can participate in fast math optimizations. For example:
    // This transformation is legal:
    llvm.arith.fence((x + y) + z) → llvm.arith.fence(x + (y + z))
- MIR Optimization:
- The use of a pseudo-operation in the MIR serves the same purpose as the intrinsic in the IR, since all the optimizations are based on pattern matching against known DAGs/MIs.
- The backend simply respects the llvm.arith.fence intrinsic: it builds an llvm.arith.fence node during DAG/ISel and then emits a pseudo arithmetic_fence MI.
- The pseudo arithmetic_fence MI turns into a comment when emitting assembly.
Other llvm changes needed -- utility functions
The ValueTracking utilities will need to be taught to handle the new intrinsic. For example, there are utility functions like isKnownNeverNaN() and CannotBeOrderedLessThanZero() that will need to “look through” the intrinsic.
A simple example
    // llvm IR, llvm.arith.fence over addition.
    %5 = load double, double* %B, align 8
    %add1 = fadd fast double %4, %5
    %6 = call double @llvm.arith.fence.f64(double %add1)
    %7 = load double, double* %C, align 8
    %mul = fmul fast double %6, %7
    store double %mul, double* %A, align 8
Example, llvm.arith.fence over memory operand
Consider this similar example, which illustrates how ‘x’ can be optimized while ‘z’ is fenced. Notice ‘q’ is simplified to ‘b’ (q = a + b - a -> q = b), but ‘z’ isn’t simplified because of the fence.
    // llvm IR
    define dso_local float @f(float %a, float %b) local_unnamed_addr #0 {
      %x = fadd fast float %b, %a
      %tmp = call fast float @llvm.arith.fence.f32(float %x)
      %z = fsub fast float %tmp, %a
      %result = call fast float @llvm.maxnum.f32(float %z, float %b)
      ret float %result
    }
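The difference between a fenced and an unfenced subexpression can also be sketched at the C level. This is a hypothetical illustration, not the source that produced the IR above: the macro ARITH_FENCE and both function names are assumptions made for this sketch, and it uses double rather than float. It further assumes that __arithmetic_fence is only usable on x86 targets, and falls back to the bare expression elsewhere, which gives no protection under -ffast-math.

```c
/* Use the proposed builtin when the compiler reports it; otherwise fall back
 * to the bare expression. The x86 check is an assumption about target support. */
#if defined(__has_builtin)
#  if __has_builtin(__arithmetic_fence) && (defined(__x86_64__) || defined(__i386__))
#    define ARITH_FENCE(x) __arithmetic_fence(x)
#  endif
#endif
#ifndef ARITH_FENCE
#  define ARITH_FENCE(x) (x) /* fallback: no fence, value unchanged */
#endif

/* Unfenced: under -ffast-math the compiler may simplify (a + b) - a to b. */
double unfenced_diff(double a, double b) {
    double x = a + b;
    return x - a;
}

/* Fenced: a + b must be computed and rounded before a is subtracted. */
double fenced_diff(double a, double b) {
    double x = a + b;
    return ARITH_FENCE(x) - a;
}
```

With a = 1e20 and b = 1.0, the sum a + b rounds to 1e20, so fenced_diff returns 0 when the fence is honored, whereas a compiler that simplifies unfenced_diff under -ffast-math would return b.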
Clang changes to take advantage of this intrinsic
- Add new clang builtin __arithmetic_fence
- Add builtin definition
- There is one operand. Any kind of expression, including memory operand.
- The return type is the same as the operand type. The result of the intrinsic is the value of its rvalue operand.
- The operand type can be any scalar floating point type, complex, or vector with float or complex element type.
- The invocation of __arithmetic_fence is not a C/C++ constant expression, even if the operand is constant.
- Add semantic checks and test cases
- Modify clang/codegen to generate the llvm.arith.fence intrinsic
- Add support for a new command-line option, -fprotect-parens, which honors parentheses within a floating point expression. The default is -fno-protect-parens. For example,
    // Compile with -ffast-math
    double A,B,C;
    A = __arithmetic_fence(A+B)*C;

    // llvm IR
    %4 = load double, double* %A, align 8
    %5 = load double, double* %B, align 8
    %add1 = fadd fast double %4, %5
    %6 = call double @llvm.arith.fence.f64(double %add1)
    %7 = load double, double* %C, align 8
    %mul = fmul fast double %6, %7
    store double %mul, double* %A, align 8
- Motivation: the new clang builtin provides clang compatibility with the Intel C++ compiler builtin __fence which has similar semantics, and likewise enables implementation of the option -fprotect-parens. The new builtin provides the clang programmer control over floating point optimizations at the expression level.
Pros & Cons
- Pros
- Increases expressiveness and precise control over floating point calculations.
- Provides a desirable compatibility feature from industrial compilers.
- Cons
- Intrinsic bloat.
- Some of LLVM's optimizations need to understand the llvm.arith.fence semantics in order to retain optimization capabilities. This will require at least some engineering effort.
- Any target that wants to support this has to make modifications to their back-end.