This is an archive of the discontinued LLVM Phabricator instance.

[X86][SLP] Enable SLP vectorization for 128-bit horizontal X86 instructions (add, sub)
ClosedPublic

Authored by anton-afanasyev on Dec 26 2018, 4:14 AM.

Details

Summary
Try to use 64-bit SLP vectorization. In addition to horizontal instructions, this change triggers optimizations for partial vector operations (for instance, using the low halves of the 128-bit registers xmm0 and xmm1 to multiply <2 x float> by <2 x float>).

Fixes llvm.org/PR32433

Diff Detail

Event Timeline

General question: why is 256 bits set as the minimal vector register size if the architecture supports 128-bit vectors? TTI should report 128 as the min reg size, not 256. And we don't need all these new options, functions, etc.

anton-afanasyev added a comment.EditedDec 26 2018, 8:14 AM

General question: why is 256 bits set as the minimal vector register size if the architecture supports 128-bit vectors? TTI should report 128 as the min reg size, not 256. And we don't need all these new options, functions, etc.

Hi @ABataev, x86 TTI already reports a 128-bit vector (TTI->getMinVectorRegisterBitWidth() returns 128), but we actually need 64-bit vectors (the high and low parts of 128-bit registers) to be tried by the SLPVectorizer to support horizontal 128-bit adds and subs. Though making TTI->getMinVectorRegisterBitWidth() return 64 would fix this issue as well, we cannot merge these two notions (minimum vector register and minimum semi-vector), since getMinVectorRegisterBitWidth() is used in other places.

Maybe the confusion was caused by calling HADDPS a horizontal 128-bit addition (here http://llvm.org/PR32433); it should rather be called a horizontal 64-bit vector pair sum (<2 x float> + <2 x float> -> <2 x float>).


Then it is better to introduce another function in TTI - something like getMinVectOpWidth() - and at least use it in the SLP vectorizer, rather than adding something like a semi-vector size.

Then it is better to introduce another function in TTI - something like getMinVectOpWidth() - and at least use it in the SLP vectorizer, rather than adding something like a semi-vector size.

Thanks for your note! This requires changing the TTI API as well, but may be more flexible (one could change getMinVecOpWidth() to 1/4 of MinVectorRegisterBitWidth, for instance). Though it obscures the root cause of where these semi-vectors come from (special horizontal operations).

I'll change this patch appropriately. I'll also check the consequences of merging MinVecOpWidth and MinVectorRegisterBitWidth.

Instead of being so explicit, why can't we just partially use a known legal vector width, maybe limited to subvectors that fit into the legal type (float2/int2/char4 etc.), leaving the upper elements as undef/duplicates of the partial subvector? The cost model will likely be using the legal widths anyhow.

@craig.topper may have some thoughts on this patch's effects on D55251 - vector widening legalization

Instead of being so explicit, why can't we just partially use a known legal vector width, maybe limited to subvectors that fit into the legal type (float2/int2/char4 etc.), leaving the upper elements as undef/duplicates of the partial subvector? The cost model will likely be using the legal widths anyhow.

@craig.topper may have some thoughts on this patch's effects on D55251 - vector widening legalization

Hi Simon, yes, if I understand you correctly, that is exactly the conclusion I came to. I've investigated the uses of getMinVectorRegisterBitWidth(): its only user is the SLPVectorizer itself, and it makes sense to simply change MinVectorRegisterBitWidth to 64 for the x86 target. I've looked for cases where the cost model cannot handle this, but it works well. Several new cases open up with this change - for instructions like PSUB[B|W|D] working with MMX 64-bit registers. The cost model also allows partial subvector operations.

So I propose just changing MinVectorRegisterBitWidth to 64 for x86 (as for AArch64). All the subtle tuning should be done by the cost model.
I'll edit the revision accordingly.

anton-afanasyev edited the summary of this revision. (Show Details)

I'm concerned about integer types. Without -x86-experimental-vector-widening-legalization we end up promoting v2i32 to v2i64 during type legalization. An X86 specific DAG combine turns some v2i64 operations back to v4i32 based on the result being truncated, but it isn't always able to rearrange the shuffles well.

Changing semi-vec-reg-128bit.ll to use i32 instead of float results in this code instead of phaddd, even with -mcpu=btver2, which is needed to generate haddps for the float type in this test.

	vpshufd	$245, %xmm0, %xmm1      # xmm1 = xmm0[1,1,3,3]
	vpaddd	%xmm1, %xmm0, %xmm0
	vpshufd	$232, %xmm0, %xmm0      # xmm0 = xmm0[0,2,2,3]
	vmovq	%xmm0, (%rdi)
RKSimon added inline comments.Dec 27 2018, 11:07 AM
lib/Target/X86/X86TargetTransformInfo.cpp
150

Can we add a IsFloat bool argument here?

I'm concerned about integer types. Without -x86-experimental-vector-widening-legalization we end up promoting v2i32 to v2i64 during type legalization. An X86 specific DAG combine turns some v2i64 operations back to v4i32 based on the result being truncated, but it isn't always able to rearrange the shuffles well.

Changing semi-vec-reg-128bit.ll to use i32 instead of float results in this code instead of phaddd, even with -mcpu=btver2, which is needed to generate haddps for the float type in this test.

	vpshufd	$245, %xmm0, %xmm1      # xmm1 = xmm0[1,1,3,3]
	vpaddd	%xmm1, %xmm0, %xmm0
	vpshufd	$232, %xmm0, %xmm0      # xmm0 = xmm0[0,2,2,3]
	vmovq	%xmm0, (%rdi)

That is not related to this patch, since it does the same thing for either float or i32:

> ~/llvm/build_rel_exp/bin/opt -S -mcpu=btver2 -slp-vectorizer -instcombine semi-vec-reg-128bit-i32.ll
...
define void @add_pairs_128(<4 x i32>, i32* nocapture) #0 {
  %3 = shufflevector <4 x i32> %0, <4 x i32> undef, <2 x i32> <i32 0, i32 2>
  %4 = shufflevector <4 x i32> %0, <4 x i32> undef, <2 x i32> <i32 1, i32 3>
  %5 = add <2 x i32> %3, %4
  %6 = bitcast i32* %1 to <2 x i32>*
  store <2 x i32> %5, <2 x i32>* %6, align 4
  ret void
}

attributes #0 = { nounwind "target-cpu"="btver2" }

An issue here is with x86 ISel; I believe it should be fixed there (does -x86-experimental-vector-widening-legalization fix it?). Another candidate could be the InstCombiner, to handle the specific combination of extracts and inserts produced by the vectorizer.

anton-afanasyev marked an inline comment as done.Dec 27 2018, 1:46 PM
anton-afanasyev added inline comments.
lib/Target/X86/X86TargetTransformInfo.cpp
150

That is not possible: at this stage we can only operate in terms of vector register width, regardless of the scalar type. And why would we need it?

Added comments and changed semi-vec-reg-128bit.ll test (renamed to vec-reg-64bit.ll, added comments).

lebedev.ri added inline comments.
test/Transforms/SLPVectorizer/X86/vec-reg-64bit.ll
13–38

There are no run lines for these prefixes.

lebedev.ri added inline comments.Dec 28 2018, 2:39 AM
lib/Target/X86/X86TargetTransformInfo.cpp
150–154

I lack context, but will this also handle e.g. _mm_hadd_epi16() ?

anton-afanasyev marked 2 inline comments as done.Dec 28 2018, 3:18 AM
anton-afanasyev added inline comments.
lib/Target/X86/X86TargetTransformInfo.cpp
150–154

Yes, phaddw is also handled.

test/Transforms/SLPVectorizer/X86/vec-reg-64bit.ll
13–38

Oops, thanks, I forgot to remove these after the prefix renaming.

Removed unchecked test prefix.

Ping.
Changed patch according to notes.

Ping!

To clarify: I have changed the patch after the review notes, so it now just modifies the value returned by X86TTIImpl::getMinVectorRegisterBitWidth() (it was 128, now it is 64).

RKSimon added inline comments.Jan 16 2019, 3:05 AM
lib/Target/X86/X86TargetTransformInfo.cpp
150

Don't refer to MMX - it gives the wrong impression that we actually generate code for it (all we ever do is emit intrinsic code).

anton-afanasyev marked 2 inline comments as done.Jan 16 2019, 3:12 AM
anton-afanasyev added inline comments.
lib/Target/X86/X86TargetTransformInfo.cpp
150

OK, I'll remove this from the comment.

lebedev.ri added inline comments.Jan 16 2019, 3:22 AM
lib/Target/X86/X86TargetTransformInfo.cpp
150–154

Then I don't understand the function name. (Yes, I understand that it is a pre-existing hook.)
_mm_hadd_epi16() operates on i16, so how come 64 is the right pick here?

anton-afanasyev marked 3 inline comments as done.Jan 16 2019, 3:34 AM
anton-afanasyev added inline comments.
lib/Target/X86/X86TargetTransformInfo.cpp
150–154

64 is the size of the whole vector with 16-bit elements packed into it. Here is a piece of the manual:

PHADDW (with 64-bit operands):
mm1[15-0] = mm1[31-16] + mm1[15-0];
mm1[31-16] = mm1[63-48] + mm1[47-32];
mm1[47-32] = mm2/m64[31-16] + mm2/m64[15-0];
mm1[63-48] = mm2/m64[63-48] + mm2/m64[47-32];
ABataev added inline comments.Jan 16 2019, 6:50 AM
lib/Target/X86/X86TargetTransformInfo.cpp
150

Still refers to MMX

anton-afanasyev marked an inline comment as done.

Update comment, remove MMX reference.

anton-afanasyev marked 2 inline comments as done.Jan 16 2019, 7:38 AM
anton-afanasyev added inline comments.
lib/Target/X86/X86TargetTransformInfo.cpp
150

Do you mean reference in comment? Removed.

ABataev added inline comments.Jan 16 2019, 7:46 AM
lib/Target/X86/X86TargetTransformInfo.cpp
150

Yes, it was about the comment. Still, the comment does not match the function: it talks about 128-bit operations but the function returns 64.

anton-afanasyev marked 3 inline comments as done.
anton-afanasyev added inline comments.
lib/Target/X86/X86TargetTransformInfo.cpp
150

"horizontal 128-bit operations" means "operations on the high and low 64-bit parts". Updating the comment to be clearer.

There are a lot of test changes here that have nothing to do with the add/sub operations mentioned in the title. How do we know these changes are good? Was any benchmarking done with this patch?

There are a lot of test changes here that have nothing to do with the add/sub operations mentioned in the title. How do we know these changes are good? Was any benchmarking done with this patch?

Yes, I have done a lot of benchmarking using the test suite while investigating this issue.
I cannot attach the results right now (working on another issue), but can do it later.

The most suspicious was the size..text metric of the dijkstra test (https://gcc.godbolt.org/z/3E1AmW), but I have verified that this is not a bug, but legitimate work of the Loop Unrolling pass.

Are size..text and exec_time enough? Can you advise any other benchmarks?

There are a lot of test changes here that have nothing to do with the add/sub operations mentioned in the title. How do we know these changes are good? Was any benchmarking done with this patch?

As for the test changes not related to add/sub operations (horizontal ops), they are mostly caused by partial vectorization (like fmul <2 x float>). At first I tried to avoid this by adding an extra cost for all 64-bit ops not connected with horizontal ones, but that looks dirty and unnecessary, since partial vectorization is a fair optimization as well (if Cost < 0).

Here are size..text results for test-suite compiled with Os:

> ~/llvm/test-suite/utils/compare.py  --filter-short -m size..text results_rel_base_Os.json results_rel_exp_Os.json
Tests: 1160
Short Running: 243 (filtered out)
Remaining: 917
Metric: size..text

Program                                         results_rel_base_Os  results_rel_exp_Os diff  
                                                                                             
SingleSour...Regression-C++-pointer_method2   610.00               482.00              -21.0%
MultiSourc...rks/Prolangs-C++/shapes/shapes   2674.00              2802.00              4.8% 
MultiSource/Benchmarks/Bullet/bullet          586082.00            569010.00           -2.9% 
MultiSourc.../Benchmarks/McCat/15-trie/trie   1218.00              1250.00              2.6% 
MultiSourc...rsaBench/beamformer/beamformer   3826.00              3730.00             -2.5% 
SingleSour...e/UnitTests/2003-05-07-VarArgs   1346.00              1378.00              2.4% 
SingleSour.../Vector/SSE/Vector-sse.stepfft   2114.00              2066.00             -2.3% 
SingleSource/UnitTests/initp1                 1570.00              1538.00             -2.0% 
SingleSour...arks/Shootout/Shootout-objinst   802.00               818.00               2.0% 
SingleSour...e/UnitTests/C++11/stdthreadbug   802.00               786.00              -2.0% 
SingleSour...rks/Shootout/Shootout-methcall   818.00               834.00               2.0% 
SingleSour...UnitTests/2003-08-11-VaListArg   1682.00              1714.00              1.9% 
MultiSourc...s/Prolangs-C++/objects/objects   1762.00              1794.00              1.8% 
SingleSour...tout-C++/Shootout-C++-methcall   898.00               914.00               1.8% 
SingleSource/Benchmarks/Dhrystone/dry         1010.00              1026.00              1.6%
       results_rel_base_Os  results_rel_exp_Os        diff
count  9.170000e+02         9.170000e+02        917.000000
mean   3.171576e+04         3.172267e+04        0.000862  
std    2.406645e+05         2.405892e+05        0.007775  
min    3.700000e+02         3.700000e+02       -0.209836  
25%    1.090000e+03         1.090000e+03        0.000000  
50%    2.697800e+04         2.704200e+04        0.000000  
75%    2.713800e+04         2.720200e+04        0.002360  
max    7.149522e+06         7.148562e+06        0.047868

and exec_time results for test-suite compiled with O3:

> ~/llvm/test-suite/utils/compare.py  --filter-short -m exec_time results_rel_base.json results_rel_exp.json
Tests: 1160
Short Running: 762 (filtered out)
Remaining: 398
Metric: exec_time

Program                                         results_rel_base  results_rel_exp diff  
                                                                                         
MicroBench...XRayFDRMultiThreaded/threads:2   524.56             411.88            -21.5%
MicroBench...RayFDRMultiThreaded/threads:16   1057.94            875.81            -17.2%
MicroBench...t:BM_PRESSURE_CALC_LAMBDA/5001    13.03              15.23            16.9% 
MicroBench...XRayFDRMultiThreaded/threads:8   1087.77            912.91            -16.1%
MultiSourc...nch/beamformer/beamformer.test     0.98               0.86            -11.9%
MicroBench...mbda.test:BM_PIC_2D_LAMBDA/171     1.95               2.14             9.7% 
MultiSource/Applications/siod/siod.test         1.80               1.64            -8.7% 
MicroBench...XRayFDRMultiThreaded/threads:4   669.22             612.08            -8.5% 
MicroBench...da.test:BM_PIC_1D_LAMBDA/44217   681.44             731.22             7.3% 
SingleSour...ut-C++/Shootout-C++-hash2.test     2.41               2.58             7.0% 
SingleSour...bra/kernels/gemver/gemver.test     0.84               0.78            -6.7% 
MultiSourc...mbolics-dbl/Symbolics-dbl.test     2.99               2.80            -6.6% 
MicroBench...ambda.test:BM_EOS_LAMBDA/44217    95.19              89.23            -6.3% 
MicroBench...da.test:BM_HYDRO_2D_LAMBDA/171    16.66              17.69             6.2% 
MicroBench...RayFDRMultiThreaded/threads:32   1022.42            1084.72            6.1%
       results_rel_base  results_rel_exp        diff
count  397.000000         398.000000        397.000000
mean   2536.523923        2523.193837      -0.001760  
std    17779.494744       17680.005983      0.025555  
min    0.600000           0.600000         -0.214805  
25%    2.064000           2.065000         -0.003161    
50%    6.472000           6.296000          0.000000    
75%    148.025000         148.279000        0.001152    
max    227023.000000      226291.000000     0.169066

Rebased, updated the new tests. Does it have a chance for an lgtm?

This revision is now accepted and ready to land.Feb 12 2019, 3:23 AM