This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorizer] Use an interleave count of 1 when using a vector library call
Needs Review · Public

Authored by rob.lougher on Jun 14 2018, 1:14 PM.

Details

Summary

Given the following test program:

#include <math.h>

void test(float *a, float *b, int n) {
  for(int i = 0; i < n; i++)
    b[i] = sinf(a[i]);
}

If we tell the compiler we have a vector-library available and compile it as follows:

$ clang -O2 --target=x86_64-unknown-linux -march=btver2 -mllvm -vector-library=SVML -S test.c

The loop will be vectorized with a vectorization factor of 8, and the call to sinf will be widened to a vector library call (__svml_sinf8):

.LBB0_6:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	vmovups	(%r12,%r13,4), %ymm0
	vmovups	32(%r12,%r13,4), %ymm1
	vmovups	64(%r12,%r13,4), %ymm3
	vmovups	96(%r12,%r13,4), %ymm2
	vmovups	%ymm1, (%rsp)           # 32-byte Spill
	vmovups	%ymm3, 32(%rsp)         # 32-byte Spill
	vmovups	%ymm2, 96(%rsp)         # 32-byte Spill
	callq	__svml_sinf8
	vmovups	%ymm0, 64(%rsp)         # 32-byte Spill
	vmovups	(%rsp), %ymm0           # 32-byte Reload
	callq	__svml_sinf8
	vmovups	%ymm0, (%rsp)           # 32-byte Spill
	vmovups	32(%rsp), %ymm0         # 32-byte Reload
	callq	__svml_sinf8
	vmovups	%ymm0, 32(%rsp)         # 32-byte Spill
	vmovups	96(%rsp), %ymm0         # 32-byte Reload
	callq	__svml_sinf8
	vmovups	64(%rsp), %ymm1         # 32-byte Reload
	vmovups	(%rsp), %ymm3           # 32-byte Reload
	vmovups	32(%rsp), %ymm2         # 32-byte Reload
	vmovups	%ymm1, (%r14,%r13,4)
	vmovups	%ymm3, 32(%r14,%r13,4)
	vmovups	%ymm2, 64(%r14,%r13,4)
	vmovups	%ymm0, 96(%r14,%r13,4)
	addq	$32, %r13
	cmpq	%r13, %rbx
	jne	.LBB0_6

However, as can be seen, the generated code is poor, containing a large number of spills and reloads. The reason for this is that the loop vectorizer has chosen an interleave count (aka unroll factor) of 4.

In general, the heuristic tries to create parallel instances of the loop to expose ILP without causing spilling. It bases this on the number of registers used in the loop and the number of registers available. However, due to the way instructions are interleaved, the vector call causes the registers for the other instances to be spilled (thus defeating the heuristic).

This patch changes the heuristic to use an interleave count of 1 when a call will be vectorized to a library call. The test above now generates:

.LBB0_6:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	vmovups	(%r12,%r13,4), %ymm0
	callq	__svml_sinf8
	vmovups	%ymm0, (%r14,%r13,4)
	addq	$8, %r13
	cmpq	%r13, %rbx
	jne	.LBB0_6

Diff Detail

Event Timeline

rob.lougher created this revision. Jun 14 2018, 1:14 PM

Hi Robert,

thanks for bringing this up! This approach is blindly setting the interleave factor to 1 when there are vector math function calls. I have the following questions/comments:

  1. Maybe I'm missing something but, wouldn't the same problem happen when the function calls are scalar or for any arbitrary function call (not necessarily math functions)? Why should we do this for vector math function calls only?
  2. I'm concerned about this change introducing performance regressions. For example, imagine a loop body where the total gain of interleaving overcomes the penalty of the register spilling caused by the function call. Wouldn't it be better to properly model this particular register-spilling penalty in the context of function calls instead of blindly disabling interleaving for those cases?

Thanks,
Diego

Hi Robert,

thanks for bringing this up! This approach is blindly setting the interleave factor to 1 when there are vector math function calls. I have the following questions/comments:

Thanks for responding. Your questions are ones which I considered while doing the patch, so I suspected I would be asked them...

  1. Maybe I'm missing something but, wouldn't the same problem happen when the function calls are scalar or for any arbitrary function call (not necessarily math functions)? Why should we do this for vector math function calls only?

Legalization doesn't allow arbitrary function calls (it must be an intrinsic or a library call). But yes, a vector call may be scalarized, or a widened intrinsic may be lowered back to scalar library calls. In those cases, though, we're still going to be spilling/reloading all over the place even with an IC of 1. The point of doing it for vector library calls is that they currently generate poor code, and we can fix it for them with a simple change.

  2. I'm concerned about this change introducing performance regressions. For example, imagine a loop body where the total gain of interleaving overcomes the penalty of the register spilling caused by the function call. Wouldn't it be better to properly model this particular register-spilling penalty in the context of function calls instead of blindly disabling interleaving for those cases?

From what I can see, the loop vectorizer is conservative, and its model of the target is very basic (see the register usage calculation, and assumptions such as using the number of load/store ports as the max interleave count). Trying to add ABI considerations and spill-cost calculation at the level of the loop vectorizer would be difficult. In the case of vector library calls we can clearly see a codegen issue, and setting IC=1 in this case is conservative.

Thanks,
Rob.

Hi Robert,

  2. I'm concerned about this change introducing performance regressions. For example, imagine a loop body where the total gain of interleaving overcomes the penalty of the register spilling caused by the function call. Wouldn't it be better to properly model this particular register-spilling penalty in the context of function calls instead of blindly disabling interleaving for those cases?

From what I can see, the loop vectorizer is conservative, and its model of the target is very basic (see the register usage calculation, and assumptions such as using the number of load/store ports as the max interleave count). Trying to add ABI considerations and spill-cost calculation at the level of the loop vectorizer would be difficult. In the case of vector library calls we can clearly see a codegen issue, and setting IC=1 in this case is conservative.

I forgot to say that in the case of loops without reductions, only small loops are interleaved (i.e. loops with a cost of less than 20). For reference, the loop above (if I remember correctly; I'm not at work now) has a cost of 12 (a call cost of 10, plus 1 each for the load and store). So it's unlikely that a small loop could overcome the cost of spilling, as there's not a lot of room left for extra instructions.
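In other words, using the constants quoted above (call cost 10, load/store cost 1 each, small-loop threshold 20 — quoted from memory, so treat them as illustrative rather than authoritative), the decision reduces to simple arithmetic:

```c
#include <stdbool.h>

/* Illustrative constants only -- the values quoted from memory above. */
enum { CallCost = 10, LoadCost = 1, StoreCost = 1, SmallLoopCost = 20 };

/* A loop without reductions is only considered for interleaving
 * when its total cost is below the small-loop threshold. */
static bool considered_for_interleaving(int loop_cost) {
    return loop_cost < SmallLoopCost;
}
```

For the sinf loop above the cost is 10 + 1 + 1 = 12, leaving only 8 units of headroom for any extra instructions before interleaving is ruled out anyway.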

Thanks again for the questions...

Thanks,
Rob.

hsaito added a subscriber: hsaito. Jun 14 2018, 3:58 PM

I see this as a register allocator problem. It's not like we are running out of registers so that we cannot use ymm0 as a "scratch register" for SVML call. We should show the ASM code to CG experts and get the problem fixed there.

Assuming that the register allocator can fix simple enough issues..... I don't think it's correct to model this as a VECLIB call problem. It can happen with any function call that uses too many registers that can't be shared among interleaved calls. Instead of looking at whether it's a VECLIB call or not, we should be checking how many registers are used for call/return, and how many of them cannot be shared among interleaved calls. We can then formulate this into a general register-pressure issue.

The register allocator is very constrained in its ability to rematerialize the loads to avoid the spill and reload. It would require the pointers for the address computation to be available on the other side of the call, which may require spills/reloads of that register.

I see this as a register allocator problem. It's not like we are running out of registers so that we cannot use ymm0 as a "scratch register" for SVML call. We should show the ASM code to CG experts and get the problem fixed there.

Assuming that the register allocator can fix simple enough issues..... I don't think it's correct to model this as a VECLIB call problem. It can happen with any function call that uses too many registers that can't be shared among interleaved calls. Instead of looking at whether it's a VECLIB call or not, we should be checking how many registers are used for call/return, and how many of them cannot be shared among interleaved calls. We can then formulate this into a general register-pressure issue.

The call uses one register but causes all other live values at the call location to be spilled (the other live values are caused by the interleaving). To model this we would need to know the calling-convention of the target (which registers are preserved), and also which registers the live values will end up in. This isn't known until after register allocation.

I see this as a register allocator problem. It's not like we are running out of registers so that we cannot use ymm0 as a "scratch register" for SVML call. We should show the ASM code to CG experts and get the problem fixed there.

Assuming that the register allocator can fix simple enough issues..... I don't think it's correct to model this as a VECLIB call problem. It can happen with any function call that uses too many registers that can't be shared among interleaved calls. Instead of looking at whether it's a VECLIB call or not, we should be checking how many registers are used for call/return, and how many of them cannot be shared among interleaved calls. We can then formulate this into a general register-pressure issue.

The call uses one register but causes all other live values at the call location to be spilled (the other live values are caused by the interleaving). To model this we would need to know the calling-convention of the target (which registers are preserved), and also which registers the live values will end up in. This isn't known until after register allocation.

Craig Topper told me that LLVM currently doesn't have any special knowledge of SVML's register usage. I just checked ICC's behavior: ICC uses reg-reg moves. So there appears to be room for improvement in that area.
I think it's better to look into that first. Doing something blindly like this looks easy, but I hate to see unnecessary restrictions imposed. Even if we don't have great accuracy here, we should still try to do some accounting, especially if the calls are to something better behaved than the worst case.

What's the value of Legal checking MayHaveVectorLibCall()? Cost model can tell whether the call is vector or scalar on its own for each VF, and that's already happened by the time selectInterleaveCount() is called.

I see this as a register allocator problem. It's not like we are running out of registers so that we cannot use ymm0 as a "scratch register" for SVML call. We should show the ASM code to CG experts and get the problem fixed there.

Assuming that the register allocator can fix simple enough issues..... I don't think it's correct to model this as a VECLIB call problem. It can happen with any function call that uses too many registers that can't be shared among interleaved calls. Instead of looking at whether it's a VECLIB call or not, we should be checking how many registers are used for call/return, and how many of them cannot be shared among interleaved calls. We can then formulate this into a general register-pressure issue.

The call uses one register but causes all other live values at the call location to be spilled (the other live values are caused by the interleaving). To model this we would need to know the calling-convention of the target (which registers are preserved), and also which registers the live values will end up in. This isn't known until after register allocation.

Craig Topper told me that LLVM currently doesn't have any special knowledge of SVML's register usage. I just checked ICC's behavior: ICC uses reg-reg moves. So there appears to be room for improvement in that area.
I think it's better to look into that first. Doing something blindly like this looks easy, but I hate to see unnecessary restrictions imposed. Even if we don't have great accuracy here, we should still try to do some accounting, especially if the calls are to something better behaved than the worst case.

What's the value of Legal checking MayHaveVectorLibCall()? Cost model can tell whether the call is vector or scalar on its own for each VF, and that's already happened by the time selectInterleaveCount() is called.

I'm not an expert on the loop vectorizer. From what I can see, it checks whether a vector lib call exists in 4 places:

  1. Legalization
  2. Planning stage
  3. Cost model
  4. Plan execution

Legalization just checks if we can vectorize the call (vector lib or intrinsic). We don't know at this stage if the call will use the vector lib as it depends on the vectorization factor.

The planning stage tries various permutations of vectorization factor ranges. Again, the planning decision is whether the call can be vectorized, which means it could be either an intrinsic or a vector lib call (within a VF range, one VF might use the intrinsic and another the vector lib, which makes it difficult to record).

The cost model is then used to calculate the expected loop cost for each VF within the VF range from planning.

At this stage the VF is then chosen (the cheapest cost).

Until this point we have been dealing with several possible VFs. It would be possible to store within the cost model whether the vector lib call was to be used against each queried VF, but the cost model looks relatively stateless (I may be wrong but that was my impression).

So I decided to add an extra pass after the VF is chosen that looks for calls within the loop that will be vectorized with a vector library call (given the chosen VF).

Although the pass is not particularly expensive, I added MayHaveVectorLibCall() to the legalization phase. This records whether a call is seen for which a vector library call is available. As stated above, at that point we can't tell whether the vector library call will actually be used; the flag simply guards the extra pass added after the VF is chosen. For most cases this is sufficient to avoid running the pass.
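A rough sketch of that two-stage arrangement (all names and data structures here are hypothetical stand-ins; the real logic lives in the vectorizer's legality and cost-model code):

```c
#include <stdbool.h>

/* Hypothetical stand-in: which VFs have a vector-library mapping
 * for a given call (e.g. sinf -> __svml_sinf8 at VF 8). */
struct CallInfo {
    bool has_veclib_mapping[17]; /* indexed by VF, 1..16 */
};

/* Stage 1, during legalization: a cheap "maybe" flag. Some VF has a
 * vector-library mapping, so the exact scan below is worth running.
 * We cannot know here whether the mapping will actually be used --
 * that depends on the VF chosen later. */
bool may_have_vector_lib_call(const struct CallInfo *calls, int ncalls) {
    for (int i = 0; i < ncalls; i++)
        for (int vf = 2; vf <= 16; vf *= 2)
            if (calls[i].has_veclib_mapping[vf])
                return true;
    return false;
}

/* Stage 2, after the VF is chosen: exact check for that VF. */
bool uses_vector_lib_call(const struct CallInfo *calls, int ncalls, int vf) {
    for (int i = 0; i < ncalls; i++)
        if (calls[i].has_veclib_mapping[vf])
            return true;
    return false;
}
```

The point of the split is that stage 1 can run once up front, while stage 2 is only meaningful after the cost model has committed to a single VF.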

This patch tries to improve the SVML calling convention: https://reviews.llvm.org/D47188. Maybe it will help this code?

This patch tries to improve the SVML calling convention: https://reviews.llvm.org/D47188. Maybe it will help this code?

Hi Craig,

Thanks for the link.

I had an idea for an alternative approach before sending the patch. This involved changing the register usage calculation to record the number of live values at the point of a call. Then, if we knew how many vector registers are preserved across the call, we could estimate register pressure, and potentially allow interleaving. For example, if there was 1 live value at the call, and 4 registers are preserved, we could allow an IC of 4 (1*4 means there would be 4 live registers after interleaving minus 1 for the value dead after the call).
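That back-of-the-envelope rule could be sketched as follows (entirely hypothetical — it assumes we could obtain the preserved-register count, which is exactly the missing piece discussed in the rest of this comment):

```c
/* Hypothetical sketch: bound the interleave count by how many of the
 * interleaved instances' live values can sit in registers that the
 * callee preserves. With 1 live value per instance and 4 preserved
 * registers, up to 4 instances fit without spilling. */
unsigned max_ic_for_call(unsigned live_across_call, unsigned preserved_regs) {
    if (live_across_call == 0)
        live_across_call = 1;       /* avoid division by zero */
    unsigned ic = preserved_regs / live_across_call;
    return ic > 0 ? ic : 1;         /* never go below 1 */
}
```

With several calls in the loop, the minimum over all call sites would presumably be the bound.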

The problem was finding out how many registers are preserved. The TargetTransformInfo pass exposes codegen information to IR-level passes; for example, it provides the loop vectorizer with the number of registers (scalar or vector). However, this is just a simple target-specific number. In contrast, the number of preserved registers depends on the calling convention, word size, instruction set, etc. A vector-library call could use any calling convention, and as far as I can see there's nothing to prevent anybody from providing an SVML-like library on, say, ARM, so this would also need to be implemented for all targets.

Of course, this sort of information is needed by the register allocator. TargetRegisterInfo provides an interface for querying information about the target registers, e.g. getCalleeSavedRegs() and getCallPreservedMask(). The TargetRegisterInfo is normally obtained via the subtarget attached to the machine function (this subtarget is created during codegen prepare). The TargetTransformInfo also has a subtarget object (as part of the TTIImpl), which means the TargetRegisterInfo could potentially be queried by the loop vectorizer. However, both getCalleeSavedRegs() and getCallPreservedMask() take a MachineFunction pointer (which obviously doesn't exist when the loop vectorizer is run). The functions are also much more low-level than we require (we would need to convert the return value into a number, based on register class, etc.).

D47188 is interesting for two reasons. Firstly, it provides an explicit calling convention for SVML. Secondly, only an X86 implementation of the calling convention is provided. So, if this were to land, implementing the number of preserved registers becomes trivial. However, on re-reading the previous comments, there's a reluctance to handle only vector-library calls (unfortunately, handling all calls is problematic, as an intrinsic call may end up as a sequence of instructions, or it may be lowered back into scalar calls to libm). Also, is it true that the SVML framework is X86-only?

So I'm still rather stuck as to what to do about the preserved registers. I don't want to go to the trouble of providing a full implementation for every target (duplicating the logic in getCalleeSavedRegs/getCallPreservedMask) if it isn't necessary. If people think the approach outlined above sounds promising, I could provide an initial patch that just handles the default CCs on X86?

Thanks,
Rob.

I've created a follow-up review: D50798.