Try to use 64-bit SLP vectorization. In addition to horizontal instrs this change triggers optimizations for partial vector operations (for instance, using low halfs of 128-bit registers xmm0 and xmm1 to multiply <2 x float> by <2 x float>). Fixes llvm.org/PR32433
Details
Diff Detail
Event Timeline
General question: why 256bit sie of the vector is set is minimal vector register size if the architecture supports 128 bit vectors? TTI should report 128 as the min regsize, not 256. And we don't need all these new options, functions etc.
Hi @ABataev , x86 TTI has already been reporting 128 bit vector (TTI->getMinVectorRegisterBitWidth() returns 128), but we actually need 64-bit vectors (high and low parts of 128-bit registers) to be tried by SLPVectorizer to support horizontal 128-bit adds and subs. Though making TTI->getMinVectorRegisterBitWidth() returning 64 would fix this issue as well, we cannot merge these two notions (minimum vector register and minimum semi vector), since getMinVectorRegisterBitWidth() is used in other places.
Maybe the confusion was caused by naming HADDPS as horizontal 128-bit addings (here http://llvm.org/PR32433), and it should be called as horizontal 64-bit vector pair sum (<2 x float> + <2 x float> -> <2 x float>).
Thenit is better to introduce another function in TTI - something like getMinVectOpWidth() and use it in SLP vectorizer at least rather than iadding sometthing like semi vector size.
Thanks for your note! This requires TTI API changing as well but may be more flexible (one can change getMinVecOpWidth() to 1/4 of MinVectorRegisterBitWidth, for instance). Though it leads to less clear root cause of where this semi vectors come from (from special horizontal operations).
I'm to change this patch appropriately. I'm also to check what are the consequences of merging MinVecOpWidth and MinVectorRegisterBitWidth.
Instead of being so explicit, why can't we just partially use a known legal vector width, maybe limited to subvectors that fit into the legal type (float2/int2/char4 etc.) - leaving the upper elements as undef/duplicates of the partial subvector? The cost model will likely be using the legal widths anyhow..
@craig.topper may have some thoughts on this patch's effects on D55251 - vector widening legalization
Hi Simon, yes, if I understand you correctly, that is exactly the conclusion I came to. I've investigated getMinVectorRegisterBitWidth() uses, the only its user is SLPVectorizer itself and it makes sense just to change MinVectorRegisterBitWidth to 64 for x86 target. I've looked for cases when cost model cannot process this but it woks well. Several new cases open with this change -- for instrs like PSUB[B|W|D] working with MMX 64-bit registers. Also cost model allows partial subvector operatioins.
So I propose just changing MinVectorRegisterBitWidth to 64 for x86 (like for aarch64). All the subtle tuning should be done by cost model.
I'm to edit revision accordingly.
I'm concerned about integer types. Without -x86-experimental-vector-widening-legalization we end up promoting v2i32 to v2i64 during type legalization. An X86 specific DAG combine turns some v2i64 operations back to v4i32 based on the result being truncated, but it isn't always able to rearrange the shuffles well.
Changing semi-vec-reg-128bit.ll to use i32 instead of float results in this code instead of phaddd. Even with -mcpu=btver2 which is needed to generate haddps for the float type for this test.
vpshufd $245, %xmm0, %xmm1 # xmm1 = xmm0[1,1,3,3] vpaddd %xmm1, %xmm0, %xmm0 vpshufd $232, %xmm0, %xmm0 # xmm0 = xmm0[0,2,2,3] vmovq %xmm0, (%rdi)
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
150 | Can we add a IsFloat bool argument here? |
That is not related to this patch, since it is doing the same thing for either float or i32:
> ~/llvm/build_rel_exp/bin/opt -S -mcpu=btver2 -slp-vectorizer -instcombine semi-vec-reg-128bit-i32.ll ... define void @add_pairs_128(<4 x i32>, i32* nocapture) #0 { %3 = shufflevector <4 x i32> %0, <4 x i32> undef, <2 x i32> <i32 0, i32 2> %4 = shufflevector <4 x i32> %0, <4 x i32> undef, <2 x i32> <i32 1, i32 3> %5 = add <2 x i32> %3, %4 %6 = bitcast i32* %1 to <2 x i32>* store <2 x i32> %5, <2 x i32>* %6, align 4 ret void } attributes #0 = { nounwind "target-cpu"="btver2" }
An issue here is with x86 ISel, I believe it should be fixed there (does -x86-experimental-vector-widening-legalization fix it?). Another candidate could be InstCombiner to make specific combination of exctracts and inserts gotten from vectorizer.
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
150 | That is not possible: at this stage we can only operate by vector register width notion regardless of the scalars type. And what is the cause we can need it? |
Added comments and changed semi-vec-reg-128bit.ll test (renamed to vec-reg-64bit.ll, added comments).
test/Transforms/SLPVectorizer/X86/vec-reg-64bit.ll | ||
---|---|---|
13–38 | There are no run lines for these prefixes. |
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
150–154 | I lack context, but will this also handle e.g. _mm_hadd_epi16() ? |
Ping!
To clarify: I have changed the patch after review notes so it just modifies the value returned by the function X86TTIImpl::getMinVectorRegisterBitWidth() (was 128, now it is 64).
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
150 | Don't refer to MMX - its gives the wrong impression that we actually generate code for it (all we ever do is emit intrinsic code). |
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
150 | Ok, I'm to remove this from comment. |
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
150–154 | Then i don't understand the function name. (yes, i understand that it is a pre-existing hook) |
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
150–154 | 64 is the size of whole vector with 16-bit elements packed in it. Here is a piece of manual: PHADDW (with 64-bit operands): mm1[15-0] = mm1[31-16] + mm1[15-0]; mm1[31-16] = mm1[63-48] + mm1[47-32]; mm1[47-32] = mm2/m64[31-16] + mm2/m64[15-0]; mm1[63-48] = mm2/m64[63-48] + mm2/m64[47-32]; |
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
150 | Still refers to MMX |
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
150 | Do you mean reference in comment? Removed. |
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
150 | Yes, it was about the comment. Still, the comment does not match the function. It says about 128-bit operations but returns 64 as a result. |
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
150 | "horizontal 128-bit operations" means "operations with high and low 64-bit parts". Updating comment to be more clear. |
There are a lot of test changes here that have nothing to do with the add/sub operations mentioned in the title. How do we know these changes are good? Was any benchmarking done with this patch?
Yes, I have done a lot of benchmarking using test suite while issue investigation.
I cannot attach results right now (working on other issue), can do it later.
The most suspicios was size..text metrics of dijkstra test (https://gcc.godbolt.org/z/3E1AmW), but I have investigated that is not bug, but rightful work of Loop Unrolling Pass.
Is size..text and exec_time enough? Can you advise any other benchmarking?
As for the test changes not related to add/sub operations (horizontal ops), they are mostly caused by partial vectorization (like fmul <2 x float>). Firstly I tried to escape this by adding additional cost for all 64-bit ops not concerned with horizontal ones, but it looks dirty and unnecessary since partial vectorization are fair optimization as well (if Cost < 0).
Here are size..text results for test-suite compiled with Os:
> ~/llvm/test-suite/utils/compare.py --filter-short -m size..text results_rel_base_Os.json results_rel_exp_Os.json Tests: 1160 Short Running: 243 (filtered out) Remaining: 917 Metric: size..text Program results_rel_base_Os results_rel_exp_Os diff SingleSour...Regression-C++-pointer_method2 610.00 482.00 -21.0% MultiSourc...rks/Prolangs-C++/shapes/shapes 2674.00 2802.00 4.8% MultiSource/Benchmarks/Bullet/bullet 586082.00 569010.00 -2.9% MultiSourc.../Benchmarks/McCat/15-trie/trie 1218.00 1250.00 2.6% MultiSourc...rsaBench/beamformer/beamformer 3826.00 3730.00 -2.5% SingleSour...e/UnitTests/2003-05-07-VarArgs 1346.00 1378.00 2.4% SingleSour.../Vector/SSE/Vector-sse.stepfft 2114.00 2066.00 -2.3% SingleSource/UnitTests/initp1 1570.00 1538.00 -2.0% SingleSour...arks/Shootout/Shootout-objinst 802.00 818.00 2.0% SingleSour...e/UnitTests/C++11/stdthreadbug 802.00 786.00 -2.0% SingleSour...rks/Shootout/Shootout-methcall 818.00 834.00 2.0% SingleSour...UnitTests/2003-08-11-VaListArg 1682.00 1714.00 1.9% MultiSourc...s/Prolangs-C++/objects/objects 1762.00 1794.00 1.8% SingleSour...tout-C++/Shootout-C++-methcall 898.00 914.00 1.8% SingleSource/Benchmarks/Dhrystone/dry 1010.00 1026.00 1.6% results_rel_base_Os results_rel_exp_Os diff count 9.170000e+02 9.170000e+02 917.000000 mean 3.171576e+04 3.172267e+04 0.000862 std 2.406645e+05 2.405892e+05 0.007775 min 3.700000e+02 3.700000e+02 -0.209836 25% 1.090000e+03 1.090000e+03 0.000000 50% 2.697800e+04 2.704200e+04 0.000000 75% 2.713800e+04 2.720200e+04 0.002360 max 7.149522e+06 7.148562e+06 0.047868
and exec_time results for test-suite compiled with O3:
> ~/llvm/test-suite/utils/compare.py --filter-short -m exec_time results_rel_base.json results_rel_exp.json Tests: 1160 Short Running: 762 (filtered out) Remaining: 398 Metric: exec_time Program results_rel_base results_rel_exp diff MicroBench...XRayFDRMultiThreaded/threads:2 524.56 411.88 -21.5% MicroBench...RayFDRMultiThreaded/threads:16 1057.94 875.81 -17.2% MicroBench...t:BM_PRESSURE_CALC_LAMBDA/5001 13.03 15.23 16.9% MicroBench...XRayFDRMultiThreaded/threads:8 1087.77 912.91 -16.1% MultiSourc...nch/beamformer/beamformer.test 0.98 0.86 -11.9% MicroBench...mbda.test:BM_PIC_2D_LAMBDA/171 1.95 2.14 9.7% MultiSource/Applications/siod/siod.test 1.80 1.64 -8.7% MicroBench...XRayFDRMultiThreaded/threads:4 669.22 612.08 -8.5% MicroBench...da.test:BM_PIC_1D_LAMBDA/44217 681.44 731.22 7.3% SingleSour...ut-C++/Shootout-C++-hash2.test 2.41 2.58 7.0% SingleSour...bra/kernels/gemver/gemver.test 0.84 0.78 -6.7% MultiSourc...mbolics-dbl/Symbolics-dbl.test 2.99 2.80 -6.6% MicroBench...ambda.test:BM_EOS_LAMBDA/44217 95.19 89.23 -6.3% MicroBench...da.test:BM_HYDRO_2D_LAMBDA/171 16.66 17.69 6.2% MicroBench...RayFDRMultiThreaded/threads:32 1022.42 1084.72 6.1% results_rel_base results_rel_exp diff count 397.000000 398.000000 397.000000 mean 2536.523923 2523.193837 -0.001760 std 17779.494744 17680.005983 0.025555 min 0.600000 0.600000 -0.214805 25% 2.064000 2.065000 -0.003161 50% 6.472000 6.296000 0.000000 75% 148.025000 148.279000 0.001152 max 227023.000000 226291.000000 0.169066
Can we add a IsFloat bool argument here?