Page MenuHomePhabricator

Pass for translating math intrinsics to math library calls.
ClosedPublic

Authored by mmasten on Apr 26 2016, 10:56 AM.

Details

Summary

This is the first patch related to translating math intrinsics to math library calls. This set of changes relates specifically to translating vector math intrinsics to svml function calls and laying a foundation so that intrinsics can be translated to any library of choice regardless of target. The changes here correspond to the following RFC.

RFC: A proposal for vectorizing calls to math functions using the Intel short vector math library (SVML)

Overview

Very simply, SVML (short vector math library) functions are vector variants of scalar math functions that take vector arguments,
apply an operation to each element, and store the result in a vector register. These vector variants can be
generated by the compiler, based on precision requirements specified by the user, resulting in substantial performance gains.
This is an initial proposal to add a new LLVM IR transformation pass to translate scalar math calls to svml calls.

Problem Description

Currently, without the "#pragma clang loop vectorize(enable)", the loop vectorizer will not vectorize loops with math
calls due to cost model reasons. Additionally, When the loop pragma is used, the loop vectorizer will widen the math call
using an intrinsic, but the resulting code is inefficient because the intrinsic is replaced with scalarized function
calls. Please see the example below for a simple loop containing a sinf call. For demonstration purposes, the example was
compiled for an xmm target, thus VF = 4 given the float type.

Example sinf.c

#define N 1000

#pragma clang loop vectorize(enable)
for (i = 0; i < N; i++) {

array[i] = sinf((float)i);

}

Without the loop pragma the loop vectorizer's cost model rejects the loop.

clang -c -ffast-math -O2 -Rpass-analysis=loop-vectorize -Rpass-missed=loop-vectorize sinf.c

sinf.c:19:3: remark: the cost-model indicates that vectorization is not beneficial [-Rpass-analysis=loop-vectorize]

for (i = 0; i < N; i++) {
^

sinf.c:19:3: remark: the cost-model indicates that interleaving is not beneficial and is explicitly disabled or interleave
count is set to 1 [-Rpass-analysis=loop-vectorize]

When the the loop pragma is used, the loop is vectorized (i.e., @llvm.sin.v4f32 is generated), but the call is later scalarized with the
additional overhead of unpacking the scalar function arguments from a vector. This can be seen from inspection of the
resulting assembly code just below the LLVM IR.

vector.body: ; preds = %vector.body, %vector.ph

%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ], !dbg !6
%0 = trunc i64 %index to i32, !dbg !7
%broadcast.splatinsert6 = insertelement <4 x i32> undef, i32 %0, i32 0, !dbg !7
%broadcast.splat7 = shufflevector <4 x i32> %broadcast.splatinsert6, <4 x i32> undef, <4 x i32> zeroinitializer, !dbg !7
%induction8 = add <4 x i32> %broadcast.splat7, <i32 0, i32 1, i32 2, i32 3>, !dbg !7
%1 = sitofp <4 x i32> %induction8 to <4 x float>, !dbg !7
%2 = call <4 x float> @llvm.sin.v4f32(<4 x float> %1), !dbg !8
%3 = getelementptr inbounds float, float* %array, i64 %index, !dbg !9
%4 = bitcast float* %3 to <4 x float>*, !dbg !10
store <4 x float> %2, <4 x float>* %4, align 4, !dbg !10, !tbaa !11
%index.next = add i64 %index, 4, !dbg !6
%5 = icmp eq i64 %index.next, 1000, !dbg !6
br i1 %5, label %middle.block, label %vector.body, !dbg !6, !llvm.loop !15

.LBB0_1: # %vector.body

  1. =>This Inner Loop Header: Depth=1 movd %ebx, %xmm0 pshufd $0, %xmm0, %xmm0 # xmm0 = xmm0[0,0,0,0] paddd .LCPI0_0(%rip), %xmm0 cvtdq2ps %xmm0, %xmm0 movaps %xmm0, 16(%rsp) # 16-byte Spill shufps $231, %xmm0, %xmm0 # xmm0 = xmm0[3,1,2,3] callq sinf movaps %xmm0, (%rsp) # 16-byte Spill movaps 16(%rsp), %xmm0 # 16-byte Reload shufps $229, %xmm0, %xmm0 # xmm0 = xmm0[1,1,2,3] callq sinf unpcklps (%rsp), %xmm0 # 16-byte Folded Reload
  2. xmm0 = xmm0[0],mem[0],xmm0[1],mem[1] movaps %xmm0, (%rsp) # 16-byte Spill movaps 16(%rsp), %xmm0 # 16-byte Reload callq sinf movaps %xmm0, 32(%rsp) # 16-byte Spill movapd 16(%rsp), %xmm0 # 16-byte Reload shufpd $1, %xmm0, %xmm0 # xmm0 = xmm0[1,0] callq sinf movaps 32(%rsp), %xmm1 # 16-byte Reload unpcklps %xmm0, %xmm1 # xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1] unpcklps (%rsp), %xmm1 # 16-byte Folded Reload
  3. xmm1 = xmm1[0],mem[0],xmm1[1],mem[1] movups %xmm1, (%r14,%rbx,4) addq $4, %rbx cmpq $1000, %rbx # imm = 0x3E8 jne .LBB0_1

Proposed New Functionality

In order to take advantage of the performance benefits of the svml library, the proposed solution is to introduce a new
LLVM IR pass that is capable of translating the vector math intrinsics to svml calls. As an example, the LLVM IR above for
"vector.body", introduced in the Problem Description section, would serve as input to the proposed pass and be transformed
into the following LLVM IR. Special attention should be paid to the "__svml_sinf4_ha" call in the LLVM IR and resulting
assembly code snippet.

vector.body: ; preds = %vector.body, %entry

%index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ], !dbg !6
%0 = trunc i64 %index to i32, !dbg !7
%broadcast.splatinsert6 = insertelement <4 x i32> undef, i32 %0, i32 0, !dbg !7
%broadcast.splat7 = shufflevector <4 x i32> %broadcast.splatinsert6, <4 x i32> undef, <4 x i32> zeroinitializer, !dbg !7
%induction8 = add <4 x i32> %broadcast.splat7, <i32 0, i32 1, i32 2, i32 3>, !dbg !7
%1 = sitofp <4 x i32> %induction8 to <4 x float>, !dbg !7
%vcall = call <4 x float> @__svml_sinf4_ha(<4 x float> %1)
%2 = getelementptr inbounds float, float* %array, i64 %index, !dbg !8
%3 = bitcast float* %2 to <4 x float>*, !dbg !9
store <4 x float> %vcall, <4 x float>* %3, align 4, !dbg !9, !tbaa !10
%index.next = add i64 %index, 4, !dbg !6
%4 = icmp eq i64 %index.next, 1000, !dbg !6
br i1 %4, label %for.end, label %vector.body, !dbg !6, !llvm.loop !14

The resulting assembly would appear as:

.LBB0_1: # %vector.body

  1. =>This Inner Loop Header: Depth=1 movd %ebx, %xmm0 pshufd $0, %xmm0, %xmm0 # xmm0 = xmm0[0,0,0,0] paddd .LCPI0_0(%rip), %xmm0 cvtdq2ps %xmm0, %xmm0 callq __svml_sinf4_ha movups %xmm0, (%r14,%rbx,4) addq $4, %rbx cmpq $1000, %rbx # imm = 0x3E8 jne .LBB0_1

In order to perform the translation, several requirements must be met to guide code generation. Those include:

  1. In addition to the -ffast-math flag, support is needed from clang to allow the user to be able to specify the desired precision requirements. The additional flags needed include the following, where imf is shorthand for "Intel math function".

    -fimf-absolute-error=value[:funclist] define the maximum allowable absolute error for math library function results value - a positive, floating-point number conforming to the format [digits][.digits][{e|E}[sign]digits] funclist - optional comma separated list of one or more math library functions to which the attribute should be applied

    -fimf-accuracy-bits=bits[:funclist] define the relative error, measured by the number of correct bits, for math library function results bits - a positive, floating-point number funclist - optional comma separated list of one or more math library functions to which the attribute should be applied

    -fimf-arch-consistency=value[:funclist] ensures that the math library functions produce consistent results across different implementations of the same architecture value - true or false funclist - optional comma separated list of one or more math library functions to which the attribute should be applied

    -fimf-max-error=ulps[:funclist] defines the maximum allowable relative error, measured in ulps, for math library function results ulps - a positive, floating-point number conforming to the format [digits][.digits][{e|E}[sign]digits] funclist - optional comma separated list of one or more math library functions to which the attribute should be applied

    -fimf-precision=value[:funclist] defines the accuracy (precision) for math library functions value - defined as one of the following values high - equivalent to max-error = 0.6 medium - equivalent to max-error = 4 (DEFAULT) low - equivalent to accuracy-bits = 11 (single precision); accuracy-bits = 26 (double precision) funclist - optional comma separated list of one or more math library functions to which the attribute should be applied

    -fimf-domain-exclusion=classlist[:funclist] indicates the input arguments domain on which math functions must provide correct results. classlist - defined as one of the following values nans, infinities, denormals, zeros all, none, common funclist - optional list of one or more math library functions to which the attribute should be applied.

Information from the flags can then be encoded as function attributes at each call site. In the future, this
functionality will enable more fine-grained control over specifying precision for individual calls/regions,
instead of setting the precision requirements for all call instances of a function. Please note that the
example translation presented so far does not have the IMF attributes attached to the @llvm.sin.v4f32 call,
and as a result the default is set to an svml variant marked with "_ha", which is short for high accuracy.
Other supported variants will include low precision, enhanced performance, bitwise reproducible, and correctly
rounded. Please refer to the IEEE-754 standard for additional information regarding supported precisions. The
compiler will select the most appropriate variant based on the IMF attributes. See #2.

  1. An interface to query for the appropriate svml function variant based on the scalar function name and IMF attributes.
  1. For calls to math functions that store to memory (e.g., sincos), additional analysis of the pointer arguments is beneficial in order to generate the best performing store instructions.

GCC/ICC compatibility

The initial implementation will involve the translation of 6 svml functions, which include sin, cos, log, pow, exp, and
sincos (both single and double precision variants). Support for these functions matches the current capabilities of GCC and a
subset of ICC. As more functions become open-sourced, the plan is to support them as part of the final solution determined
from this proposal. The flags referenced in the Proposed New FUnctionality section are required to maintain icc compatibility.

Current Implementation

To evaluate the feasibility of this proposal, a prototype transform pass has been developed, which performs
the following:

  1. Searches for vector math intrinsics as candidates for translation to svml.
  2. Reads function attributes to obtain precision requirements for the call. If none, default to attributes that will force the selection of a high accuracy variant.
  3. Since the vector factor of the intrinsic can be wider than what is legally supported by the target, type legalization is performed so that the correct svml variant is selected. For example, if a call to @llvm.sin.v8f32(<8 x float> %1) is made for an xmm target, the pass will generate two __svml_sinf4 calls and will do the appropriate splitting of the vectors required as arguments to each call. The pass is also capable of handling less than full vector cases. E.g., @llvm.sin.v2f32.
  4. Special handling for sincos since the results are stored to a double wide vector and additional analysis is needed to optimize the stores to memory.
  5. Vector intrinsics that are not translated to svml are scalarized.
  6. The loop vectorizer has been taught to allow widening of sincos and additional utilities have been written to analyze arguments for sincos.

Feedback

I would appreciate those who are interested in this topic to review this proposal and provide feedback on the proposed
approach. Help is also welcome and much appreciated in the development process.

Diff Detail

Event Timeline

mmasten updated this revision to Diff 55036.Apr 26 2016, 10:56 AM
mmasten retitled this revision from to Pass for translating math intrinsics to math library calls..
mmasten updated this object.
mmasten added reviewers: hfinkel, mzolotukhin, spatel.
mmasten added a subscriber: llvm-commits.
hfinkel edited edge metadata.Apr 26 2016, 7:38 PM

Thanks for contributing this! I think we're going to need to work on splitting this into a smaller set of changes that can be integrated with our current infrastructure. I recommend that we start by adding support for the vector math library functions themselves.

In addition to the -ffast-math flag, support is needed from clang to allow the user to be able to specify the desired precision requirements.

Why? Is this just because the vector math calls don't set errno, or are even the "high accuracy" versions not as precise as the libm implementations?

Here's a simple recipe:

  1. Add a new enum value for IMF to VectorLibrary in include/llvm/Analysis/TargetLibraryInfo.h
  2. Add the required code to TargetLibraryInfoImpl::addVectorizableFunctionsFromVecLib -- it will look something like this:

(at first, only one variant, maybe you want to pick the _ha versions -- it would be reasonable to add a FastMath flag to addVectorizableFunctionsFromVecLib so that other variants can be selected)

case IMF: {
  const VecDesc VecFuncs[] = {
      {"sinf", "__svml_sinf4", 4},
      {"llvm.sin.f32", "__svml_sinf4", 4},
      {"cosf", "__svml_cosf4", 4},
      {"llvm.cos.f32", "__svml_cosf4", 4},
      {"powf", "__svml_powf4", 4},
      {"llvm.pow.f32", "__svml_powf4", 4},
      {"__powf_finite", "__svml_powf4", 4},
      {"logf", "__svml_logf4", 4},
      {"llvm.log.f32", "__svml_logf4", 4},
      {"__logf_finite", "__svml_logf4", 4},
      {"expf", "__svml_expf4", 4},
      {"llvm.exp.f32", "__svml_expf4", 4},
      {"__expf_finite", "__svml_expf4", 4},
  ...    
}

Then you can make the minimal Clang change (in a follow-up patch), and we'll be able to start:

  1. In include/clang/Frontend/CodeGenOptions.def, change ENUM_CODEGENOPT(VecLib, VectorLibrary, 1, NoLibrary) to ENUM_CODEGENOPT(VecLib, VectorLibrary, 2, NoLibrary) (i.e. increase the number of bits)
  2. In include/clang/Frontend/CodeGenOptions.h, add a corresponding enum to enum VectorLibrary
  3. In lib/CodeGen/BackendUtil.cpp, add corresponding code to call addVectorizableFunctionsFromVecLib in createTLII
  4. In lib/Frontend/CompilerInvocation.cpp, add corresponding code to ParseCodeGenArgs

Once we have that done, we can start working on improving the infrastructure understand more variants, generate better sincos calls, have system-independent constant folding for the various functions, etc. Some of these things should probably be directly integrated into the vectorizer(s) (via reusable utilities), so that the cost model can be directly reasoned about.

Once the implementations of the functions themselves have been contributed, we can look at setting a default vector math function library as well.

In practice, we're going to end up replaying all of your design decisions in slow motion as part of this process. Breaking this into smaller pieces and steps is the only way to do this. And that's the only way we'll be able to give sufficient attention to all of the details. Nevertheless, thanks for posting this entire patch. It does help to give context on where you'd like to go with this.

Thanks for the feedback, Hal. We need the clang support because we actually have several variants of each svml function, each of them having varying levels of precision. To be able to pick the right variant, not only do we need to know the variant name and vector length, we also need the additional precision information specified by the user via -imf flags from the command line. If we go the route of implementing the translation using the addVectorizableFunctionsFromVecLib(), it seems some additional information about precision requirements would be needed in the VecDesc struct. One thing I'm concerned about with this approach is that if the math calls are translated immediately to library calls, then any subsequent optimizations are probably in a less advantageous position of being able to optimize further. Thus, one of the design goals of this project was to keep the vectorized intrinsics as late as possible before translation. What are your thoughts on this?

I apologize for throwing such a large patch at you guys. I was afraid that might be the case. I'll try and keep the changes to a minimum going forward.

Thanks,

Matt

Thanks for the feedback, Hal. We need the clang support because we actually have several variants of each svml function, each of them having varying levels of precision. To be able to pick the right variant, not only do we need to know the variant name and vector length, we also need the additional precision information specified by the user via -imf flags from the command line.

I'm aware, however:

  1. We need to break this into smaller pieces. We can start with adding support for some subset of the functions using our current infrastructure, and then extend that infrastructure.
  2. We need to have a separate design discussion on the Clang flags and other aspects of the user interface (pragmas, attributes, etc.).

If we go the route of implementing the translation using the addVectorizableFunctionsFromVecLib(), it seems some additional information about precision requirements would be needed in the VecDesc struct.

Yes, or the tables become callback functions, or any number of other things.

One thing I'm concerned about with this approach is that if the math calls are translated immediately to library calls, then any subsequent optimizations are probably in a less advantageous position of being able to optimize further. Thus, one of the design goals of this project was to keep the vectorized intrinsics as late as possible before translation. What are your thoughts on this?

This is a valid concern. However, we currently vectorize very late in the pipeline, and so the only optimizations we're talking about are optimizations that would not have applied to the original scalar instructions, but might now apply to the vector instructions. I can't think of anything that falls into this category, but if you can, please elaborate.

I apologize for throwing such a large patch at you guys. I was afraid that might be the case. I'll try and keep the changes to a minimum going forward.

You don't need to apologize. We still need to break this into smaller bits. However, it is completely reasonable to post a large change first, and then get community feedback on how to split it up.

mmasten updated this revision to Diff 61110.Jun 17 2016, 10:33 AM
mmasten updated this revision to Diff 61111.
mmasten edited edge metadata.
mmasten added a subscriber: cfe-commits.

Hello all,

Just wanted to see if you guys have some time to review.

Thanks,

Matt

spatel edited edge metadata.Jul 14 2016, 11:09 AM

Hi Matt -

This looks like the right first step in the path that Hal suggested, except I think we need a test case for each function that you want to enable. Please see test/Transforms/LoopVectorize/X86/veclib-calls.ll as a reference for how to do that.

Hi Matt -

This looks like the right first step in the path that Hal suggested, except I think we need a test case for each function that you want to enable. Please see test/Transforms/LoopVectorize/X86/veclib-calls.ll as a reference for how to do that.

Agreed. Once this has regression tests it should be good to go. Thanks for continuing to work on this!

Thanks for reviewing. One concern I have going forward is the number of entries that will appear in the switch statement inside addVectorizableFunctionsFromVecLib(). I assume that the right thing to do is to replace this with something that is TableGen'd? Also, I just wanted to point out that some of these entries will result in svml calls that are not legal. E.g., __svml_sinf32 does not actually exist in the library, but can be legalized in case one explicitly sets a vector length of 32. Although these types of cases are probably not common, I wanted to bring this to your attention since the legalization piece of this work will be reviewed and committed separately. If needed, I can remove those entries until the legalization is in place.

Thanks for reviewing. One concern I have going forward is the number of entries that will appear in the switch statement inside addVectorizableFunctionsFromVecLib(). I assume that the right thing to do is to replace this with something that is TableGen'd?

I completely agree; this seems like a good candidate to be TableGen'd.

Also, I just wanted to point out that some of these entries will result in svml calls that are not legal. E.g., __svml_sinf32 does not actually exist in the library, but can be legalized in case one explicitly sets a vector length of 32. Although these types of cases are probably not common, I wanted to bring this to your attention since the legalization piece of this work will be reviewed and committed separately. If needed, I can remove those entries until the legalization is in place.

Yes, let's start only with the directly-legal calls.

In the process of writing test cases, I noticed that a loop with a call to llvm.log.f32 was not getting vectorized due to cost modeling. When forcing vectorization on the loop and throwing -fveclib=SVML, the loop was vectorized with a widened intrinsic instead of the svml call. Is this correct? I would have expected to get the svml call. In light of this, wouldn't it be better to represent the math calls with vector intrinsics and let CodeGenPrepare or the backends decide how to lower them?

Thanks,

Matt

spatel added a subscriber: davide.Jul 19 2016, 11:10 AM

In the process of writing test cases, I noticed that a loop with a call to llvm.log.f32 was not getting vectorized due to cost modeling. When forcing vectorization on the loop and throwing -fveclib=SVML, the loop was vectorized with a widened intrinsic instead of the svml call. Is this correct? I would have expected to get the svml call. In light of this, wouldn't it be better to represent the math calls with vector intrinsics and let CodeGenPrepare or the backends decide how to lower them?

I don't know the answer, but I'm curious about this too for an unrelated change in LibCallSimplifier (cc @davide).

The LangRef has this boilerplate for all target-independent math intrinsics:
"Not all targets support all types however."

Is that only intended for the weird types (x86_fp80, ppc_fp128, fp128?), or does it mean that we shouldn't create these intrinsics for vectors with standard FP types (eg, v4f32)?

mmasten updated this revision to Diff 64571.Jul 19 2016, 3:01 PM
mmasten edited edge metadata.

I think this is just saying that some of the weird types are not supported on all targets. For now, is it ok to proceed with checking this code in?

Thanks,

Matt

mzolotukhin accepted this revision.Jul 21 2016, 5:26 PM
mzolotukhin edited edge metadata.

LGTM with a small nit: could you please run opt -instnamer on your test (it'll replace %0, %1,... with %tmp0, %tmp1 etc, making it easier to modify test in future)?

Thanks,
Michael

This revision is now accepted and ready to land.Jul 21 2016, 5:26 PM
mmasten updated this revision to Diff 65152.Jul 22 2016, 2:10 PM
mmasten edited edge metadata.

Thanks Michael. The tests have been updated.

Matt

I think this is just saying that some of the weird types are not supported on all targets. For now, is it ok to proceed with checking this code in?

Correct.

Thanks Michael. The tests have been updated.

Matt

Do you need someone to commit this for you?

I was just recently given commit privileges, so I can do it. Thanks Hal.

This revision was automatically updated to reflect the committed changes.