Page MenuHomePhabricator

[SLP] Initial support for the vectorization of the non-power-of-2 vectors.
Needs ReviewPublic

Authored by ABataev on Jan 22 2019, 8:30 AM.

Details

Summary

Possibly vectorized operations are extended to the power-of-2 number with UndefValues to allow to use regular vector operations.

Diff Detail

Unit TestsFailed

TimeTest
30 msClang.SemaOpenCL::numbered-address-space.cl
Script: -- : 'RUN: at line 2'; /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/bin/clang -cc1 -internal-isystem /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/lib/clang/10.0.0/include -nostdsysteminc -cl-std=CL2.0 -triple amdgcn-amd-amdhsa -verify -pedantic -fsyntax-only /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/clang/test/SemaOpenCL/numbered-address-space.cl

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Some initial comments - there's a lot going on (cutting it into smaller patches would be very useful).

The sdiv undefs test worries me, and I'm not certain if we have good masked load/store costs at the moment.

lib/Transforms/Vectorize/SLPVectorizer.cpp
1500 ↗(On Diff #192640)

Check for failure instead of just dereferencing?

2008 ↗(On Diff #192640)

drop the * ?

2138 ↗(On Diff #192640)

drop the * ?

2298 ↗(On Diff #192640)

Is it safe to use VL[0] like this - do we guarantee the undef types are correct?

2987 ↗(On Diff #192640)

This is going to cause problems for an incoming patch I'm working on - any chance that you can avoid using Instruction &I?

test/Transforms/SLPVectorizer/AArch64/PR38339.ll
16 ↗(On Diff #192640)

regression?

test/Transforms/SLPVectorizer/X86/addsub.ll
319 ↗(On Diff #192640)

regression?

test/Transforms/SLPVectorizer/X86/alternate-fp.ll
87 ↗(On Diff #192640)

regression?

test/Transforms/SLPVectorizer/X86/alternate-int.ll
590 ↗(On Diff #192640)

This test is checking that the sdiv didn't get vectorized, because if any elt is undef then the whole result is undef - but if its stays scalarized then on elts 0 and 4 become undef.

Passing-by question: will this *only* consider non-power-of-two vectors, or all vectors that are smaller than the vector length?
I.e. is this something that will help with e.g. https://godbolt.org/z/h64TuT ? (from IRC)

Passing-by question: will this *only* consider non-power-of-two vectors, or all vectors that are smaller than the vector length?
I.e. is this something that will help with e.g. https://godbolt.org/z/h64TuT ? (from IRC)

It should handle all vector lengths, 6 ops must be vectorized as 8 elems vector with 2 undefs

@ABataev Are you still looking at this or should we move to D66416? Now that X86 has enabled integer widening, we could more easily make use of this.

ABataev updated this revision to Diff 215884.Aug 19 2019, 6:56 AM

Updated tests

@ABataev Are you still looking at this or should we move to D66416? Now that X86 has enabled integer widening, we could more easily make use of this.

Sorry, just saw your reply on D66416!

@ABataev Are you still looking at this or should we move to D66416? Now that X86 has enabled integer widening, we could more easily make use of this.

I just don't have enough time for this. I referenced this patch in D66416. This patch is not X86 specific (if integer widening is supported only for X86) + it works with the existing cost model more correctly, I believe, since our cost model is not adapted to non-power-2 vectors and must be tweaked carefully. I suggested that someone else could take this patch and split it into several smaller patches, etc. What do you think?

RKSimon added inline comments.Sep 5 2019, 2:32 PM
lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
1174 ↗(On Diff #218923)

Should this be a separate patch?

ABataev marked an inline comment as done.Sep 5 2019, 2:38 PM
ABataev added inline comments.
lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
1174 ↗(On Diff #218923)

Oops, added it accidentally.

spatel added inline comments.Sep 6 2019, 11:08 AM
test/Transforms/SLPVectorizer/X86/sext.ll
602–619 ↗(On Diff #218923)

For reference, this is the test noted in PR42997:
https://bugs.llvm.org/show_bug.cgi?id=42997
...and this patch will produce the expected AVX codegen:

vpmovsxwq	(%rdi), %xmm0
vpmovsxwq	4(%rdi), %xmm1
vinsertf128	$1, %xmm1, %ymm0, %ymm0

As per https://bugs.llvm.org/show_bug.cgi?id=43844#c5, rebase this, please?
(also, maybe move to monorepo..)

rebase?

I have rebased version locally, will update it here on Monday when I get back to work.

ABataev updated this revision to Diff 227730.Nov 4 2019, 9:40 AM

Just rebase.

ABataev updated this revision to Diff 231261.Nov 27 2019, 7:35 AM

Just rebase

Build result: pass - 60338 tests passed, 0 failed and 732 were skipped.

Log files: console-log.txt, CMakeCache.txt

ABataev marked 3 inline comments as done.Nov 27 2019, 12:01 PM
ABataev added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
2298 ↗(On Diff #192640)

Yes, the types must be the same.

test/Transforms/SLPVectorizer/AArch64/PR38339.ll
16 ↗(On Diff #192640)

Cost model problem.

test/Transforms/SLPVectorizer/X86/alternate-int.ll
590 ↗(On Diff #192640)

Does it mean that we cannot replace the operands of sdiv with undefs and must replace them with, say, 1 instead?

ABataev updated this revision to Diff 231317.Nov 27 2019, 1:24 PM

Rebase + address comments.

Build result: fail - 60362 tests passed, 2 failed and 732 were skipped.

failed: LLVM.CodeGen/PowerPC/pr36292.ll
failed: LLVM.CodeGen/PowerPC/sms-cpy-1.ll

Log files: console-log.txt, CMakeCache.txt

vdmitrie added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3299–3302

What is reasoning for this min?
Imagine VL[0] and VL[1] are extracts of two subsequent elements from the same vector of size 2
and VL[2], VL[3] are extracts from another vector (which can even be of different size). NElts will be assigned 2 based on VL[0] while VL size is 4. The for loop at line 3300 will not visit 3th and 4th elements of the VL and final answer turns out "true" which is obviously incorrect as we must gather these extracts.

ABataev marked an inline comment as done.Dec 6 2019, 9:38 AM
ABataev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3299–3302

Good catch, thanks! It is required to handle the case where 2 other elements are actually UndefvVslues. Just need to add a check for this here.

ABataev updated this revision to Diff 232616.Dec 6 2019, 11:16 AM

Fixed bug + rebase.

Build result: pass - 60563 tests passed, 0 failed and 726 were skipped.

Log files: console-log.txt, CMakeCache.txt

Build result: pass - 60628 tests passed, 0 failed and 726 were skipped.

Log files: console-log.txt, CMakeCache.txt

There is still problem with extracts.
This case shows it:
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

%struct.S = type { [3 x float], [3 x float], [4 x float] }

define i32 @foo(i32 %0, i32* %1, float* %2) {

%t4 = alloca %struct.S, align 8
%t8 = getelementptr inbounds %struct.S, %struct.S* %t4, i64 0, i32 1
%t9 = getelementptr inbounds [3 x float], [3 x float]* %t8, i64 1, i64 0
%t14 = getelementptr inbounds %struct.S, %struct.S* %t4, i64 0, i32 1, i64 0
%t11 = getelementptr inbounds %struct.S, %struct.S* %t4, i64 0, i32 1, i64 1
%t15 = load float, float* %t14, align 4
%t16 = load float, float* %t11, align 4
br label %t37

t37:

%t18 = phi float [ %t16, %3 ], [ %x24, %t37 ]
%t19 = phi float [ %t15, %3 ], [ %x23, %t37 ]
%t20 = fdiv fast float 1.000000e+00, %t19
%t24 = fdiv fast float 1.000000e+00, %t18
%t21 = getelementptr inbounds %struct.S, %struct.S* %t4, i64 0, i32 2, i64 0
%t25 = getelementptr inbounds %struct.S, %struct.S* %t4, i64 0, i32 2, i64 1
%t31 = getelementptr inbounds %struct.S, %struct.S* %t4, i64 0, i32 2, i64 2
%t33 = getelementptr inbounds %struct.S, %struct.S* %t4, i64 0, i32 2, i64 3
store float %t20, float* %t21, align 4
store float %t24, float* %t25, align 4
store float %t24, float* %t31, align 4
store float %t24, float* %t33, align 4
%t88 = bitcast float* %t9 to <2 x float>*
%t89 = load <2 x float>, <2 x float>* %t88, align 4
%x23 = extractelement <2 x float> %t89, i32 0
%x24 = extractelement <2 x float> %t89, i32 1
br i1 undef, label %t37, label %t55

t55:

ret i32 0

}

There is still problem with extracts.
This case shows it:
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

%struct.S = type { [3 x float], [3 x float], [4 x float] }

define i32 @foo(i32 %0, i32* %1, float* %2) {

%t4 = alloca %struct.S, align 8
%t8 = getelementptr inbounds %struct.S, %struct.S* %t4, i64 0, i32 1
%t9 = getelementptr inbounds [3 x float], [3 x float]* %t8, i64 1, i64 0
%t14 = getelementptr inbounds %struct.S, %struct.S* %t4, i64 0, i32 1, i64 0
%t11 = getelementptr inbounds %struct.S, %struct.S* %t4, i64 0, i32 1, i64 1
%t15 = load float, float* %t14, align 4
%t16 = load float, float* %t11, align 4
br label %t37

t37:

%t18 = phi float [ %t16, %3 ], [ %x24, %t37 ]
%t19 = phi float [ %t15, %3 ], [ %x23, %t37 ]
%t20 = fdiv fast float 1.000000e+00, %t19
%t24 = fdiv fast float 1.000000e+00, %t18
%t21 = getelementptr inbounds %struct.S, %struct.S* %t4, i64 0, i32 2, i64 0
%t25 = getelementptr inbounds %struct.S, %struct.S* %t4, i64 0, i32 2, i64 1
%t31 = getelementptr inbounds %struct.S, %struct.S* %t4, i64 0, i32 2, i64 2
%t33 = getelementptr inbounds %struct.S, %struct.S* %t4, i64 0, i32 2, i64 3
store float %t20, float* %t21, align 4
store float %t24, float* %t25, align 4
store float %t24, float* %t31, align 4
store float %t24, float* %t33, align 4
%t88 = bitcast float* %t9 to <2 x float>*
%t89 = load <2 x float>, <2 x float>* %t88, align 4
%x23 = extractelement <2 x float> %t89, i32 0
%x24 = extractelement <2 x float> %t89, i32 1
br i1 undef, label %t37, label %t55

t55:

ret i32 0

}

Thanks for the reproducer. Does this one produce an incorrect result, crashes compiler or something else? What exactly?

Thanks for the reproducer. Does this one produce an incorrect result, crashes compiler or something else? What exactly?

Compiler dies with assertion on mismatching types while widening PHI.

Thanks for the reproducer. Does this one produce an incorrect result, crashes compiler or something else? What exactly?

Compiler dies with assertion on mismatching types while widening PHI.

Ok, thanks for the explanation, will investigate it.

ABataev updated this revision to Diff 234753.Dec 19 2019, 10:53 AM

Fixed the bug + rebase.

Unit tests: pass. 61035 tests passed, 0 failed and 728 were skipped.

clang-tidy: fail. Please fix clang-tidy findings.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

vdmitrie added inline comments.Thu, Dec 26, 12:01 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3046

"Non-consecutive" here is not the actual reason.

3059–3060

If we have for example this sequence:
store addr[2]
store addr[0]
store addr[1]
undef
then we bypass sorting pointers and end up vectorizing this store sequence with incorrect order.

vdmitrie added inline comments.Thu, Dec 26, 3:39 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5981–5982

PowerOf2Ceil(VF) < MinVF

ABataev updated this revision to Diff 235421.Fri, Dec 27, 8:11 AM

Rebase + address comments.

Unit tests: fail. 61127 tests passed, 1 failed and 728 were skipped.

failed: Clang.SemaOpenCL/numbered-address-space.cl

clang-tidy: fail. Please fix clang-tidy findings.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

vdmitrie added inline comments.Mon, Jan 13, 3:58 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2761

This check is not quite complete.
If we for example have following scalars set (VL)
0: load i32 from p[0]
1: load i32 from p[2]
3: undef i32
4: undef i32
(note that p[1] is not loaded)

Pointers difference is 8, number of instructions is 2 and VL size is 4:
thus 8 <= (4 -1)*4 is true but pointers actually not loaded consecutively (although It is vectorizeable via masked load+shuffle but support seems not implemented yet). Similar issue exists for store.

ABataev marked an inline comment as done.Wed, Jan 15, 4:10 PM
ABataev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2761

Hmm, see lines 4574-4600 (masked load + shuffle) and 4643-4678 (shuffle + masked store)

vdmitrie added inline comments.Wed, Jan 15, 5:17 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2761

Note that two is a power of two. Thus at 4569 it takes path that creates plain load and ends up with loading p[0] + p[1].
And even if we would go masked load+shuffle path that not correct either. Mask and shuffle there being built based on undefs rather than pointer analysis of scalar loads. In order to end up with loading p[0] and p2[] VL should look like:
0: load p[0]
1: undef
2: load p[2]
3: undef

ABataev marked an inline comment as done.Wed, Jan 15, 5:23 PM
ABataev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2761

Ah, yes. Will check this carefully.

vdmitrie added inline comments.Fri, Jan 17, 1:50 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6203

at 6187 already checked for VL size.

vdmitrie added inline comments.Fri, Jan 17, 4:02 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6277
  1. if -else bodies are exactly the same.
  2. With OpsWidth !=VF there is still possibility to bypass it depending on UserCost and AllowReorder values. It should be either assertion to ensure it never happens or "break".