Page MenuHomePhabricator

[SLP] Initial support for the vectorization of the non-power-of-2 vectors.
Needs ReviewPublic

Authored by ABataev on Jan 22 2019, 8:30 AM.

Details

Summary

Possibly vectorized operations are extended to the power-of-2 number with UndefValues to allow to use regular vector operations.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ABataev updated this revision to Diff 247269.Feb 28 2020, 8:04 AM

Rebase + fixes

RKSimon added inline comments.Feb 28 2020, 10:12 AM
llvm/test/Transforms/SLPVectorizer/X86/cse.ll
153

The test2 changes look superfluous (maybe precommit them?).

ABataev marked an inline comment as done.Feb 28 2020, 10:16 AM
ABataev added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/cse.ll
153

Yes, will do this later

RKSimon added inline comments.Feb 28 2020, 10:19 AM
llvm/test/Transforms/SLPVectorizer/X86/zext.ll
263 ↗(On Diff #247269)

Its really odd that SLM fails to use zext <8 x i8> to <8 x i16> like SSE2, I think the custom SLM extract/insert costs are affecting something unexpected?

ABataev marked an inline comment as done.Feb 28 2020, 11:00 AM
ABataev added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/zext.ll
263 ↗(On Diff #247269)

Yes, the cost of ExtractElement instructions affects vectorization, must be fixed in another patch for the vectorization of the building of aggregates.

dtemirbulatov added inline comments.Jul 21 2020, 2:52 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4864

hmm, it might be unsafe to try to obtain type here since any element of VL could be Undef?

dtemirbulatov added inline comments.Jul 21 2020, 3:02 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2713

what do you think about defining InstructionsOnly in InstructionsState?

dtemirbulatov added inline comments.Jul 21 2020, 3:27 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4864

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@e = dso_local local_unnamed_addr global i32 0, align 4
@f = dso_local local_unnamed_addr global i32 0, align 4

; Function Attrs: nofree norecurse nounwind uwtable
define dso_local i32 @g() local_unnamed_addr #0 {
entry:

%0 = load i32, i32* @e, align 4
%tobool.not19 = icmp eq i32 %0, 0
br i1 %tobool.not19, label %while.end, label %while.body

while.body: ; preds = %entry, %while.body.backedge

%c.022 = phi i32* [ %c.022.be, %while.body.backedge ], [ undef, %entry ]
%b.021 = phi i32* [ %b.021.be, %while.body.backedge ], [ undef, %entry ]
%a.020 = phi i32* [ %a.020.be, %while.body.backedge ], [ undef, %entry ]
%incdec.ptr = getelementptr inbounds i32, i32* %c.022, i64 1
%1 = ptrtoint i32* %c.022 to i64
%2 = trunc i64 %1 to i32
%incdec.ptr1 = getelementptr inbounds i32, i32* %a.020, i64 1
%incdec.ptr2 = getelementptr inbounds i32, i32* %b.021, i64 1
switch i32 %2, label %while.body.backedge [
  i32 2, label %sw.bb
  i32 4, label %sw.bb6
]

sw.bb: ; preds = %while.body

%incdec.ptr3 = getelementptr inbounds i32, i32* %b.021, i64 2
%3 = ptrtoint i32* %incdec.ptr2 to i64
%4 = trunc i64 %3 to i32
%incdec.ptr4 = getelementptr inbounds i32, i32* %a.020, i64 2
store i32 %4, i32* %incdec.ptr1, align 4
%incdec.ptr5 = getelementptr inbounds i32, i32* %c.022, i64 2
br label %while.body.backedge

sw.bb6: ; preds = %while.body

%incdec.ptr7 = getelementptr inbounds i32, i32* %a.020, i64 2
%incdec.ptr8 = getelementptr inbounds i32, i32* %c.022, i64 2
%5 = ptrtoint i32* %incdec.ptr to i64
%6 = trunc i64 %5 to i32
%incdec.ptr9 = getelementptr inbounds i32, i32* %b.021, i64 2
store i32 %6, i32* %incdec.ptr2, align 4
br label %while.body.backedge

while.body.backedge: ; preds = %sw.bb6, %while.body, %sw.bb

%c.022.be = phi i32* [ %incdec.ptr, %while.body ], [ %incdec.ptr8, %sw.bb6 ], [ %incdec.ptr5, %sw.bb ]
%b.021.be = phi i32* [ %incdec.ptr2, %while.body ], [ %incdec.ptr9, %sw.bb6 ], [ %incdec.ptr3, %sw.bb ]
%a.020.be = phi i32* [ %incdec.ptr1, %while.body ], [ %incdec.ptr7, %sw.bb6 ], [ %incdec.ptr4, %sw.bb ]
br label %while.body

while.end: ; preds = %entry

ret i32 undef

}

attributes #0 = { nofree norecurse nounwind uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="none" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+avx,+avx2,+cx8,+fxsr,+mmx,+popcnt,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87,+xsave" "unsafe-fp-math"="false" "use-soft-float"="false" }

ABataev marked 2 inline comments as done.Jul 22 2020, 5:43 AM
ABataev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2713

I don't think it is really required. InstructrionsOnly is just a range, not a container

4864

UndefValue also has an associated type, so it should be fine. Your reproducer crashes because of different reasons.

ABataev updated this revision to Diff 279794.Jul 22 2020, 6:16 AM

Rebase + fix

Matt added a subscriber: Matt.Wed, Sep 16, 1:49 PM
RKSimon added inline comments.Thu, Sep 17, 4:05 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
406

Worth using find_if ?

2645–2646

Can we use for (Value *V : make_filter_range(VL, Instruction::classof) ?

2658

for (Value *V : make_filter_range(VL, Instruction::classof) ?

2680

Should ReuseShuffleIndicies be SmallVector<int, 4> - and we then tag undefs with -1 (llvm::UndefMaskElem) ?

3514

can we use llvm::size(InstructionsOnly) ?

3936–3937

InstructionsOnly ?

4241

IgnoredIndices might be cheaper as a SparseBitVector ?

4750

Isn't UndefValue is a type of Constant? Maybe add a comment explaining what you're doing here as its not clear, at least to me.

6183

Would SmallBitVector be cheaper for UsedIndices ?

7296

SmallBitVector ?

ABataev marked 8 inline comments as done.Thu, Sep 17, 2:28 PM
ABataev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2680

No, it won't work, need to register actual positions in ReuseShuffleIndicies, -1 does not work here

3514

No, it does not work, llvm::size works only if it can be calculated in O(1). Here it is not, since InstructionsOnly may have "holes".

ABataev updated this revision to Diff 292621.Thu, Sep 17, 2:35 PM

Rebase + fixes

craig.topper added inline comments.Thu, Sep 17, 6:55 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3514

Would using std::distance directly be more clear? You'd have to explicitly write begin()/end() though?

ABataev updated this revision to Diff 292797.Fri, Sep 18, 8:17 AM

Rebase + use std::distance instead of count_if

spatel added inline comments.Tue, Sep 22, 6:41 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4361–4362

I did some clean-ups while trying to understand the behavior of this code, so this patch will need a (hopefully reduced diff) update:
rG7451bf0b0b6d
rG062276c69109

This one may also require rebase:
rGa44238cb443f

ABataev updated this revision to Diff 293480.Tue, Sep 22, 8:57 AM
ABataev marked an inline comment as done.

Rebase

some very minor style comments - a general comment would be to try and pre-commit the style/NFC refactor/cleanup changes so the size of this patch is smaller

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2745

Do these trivial style refactors separately now to reduce the size of the patch?

2761

Do these trivial style refactors separately now to reduce the size of the patch?

3574

duplicate cast

3603

duplicate cast

3617

duplicate cast

3671

duplicate cast

4244–4249

trivial style refactor - pull out of patch?

RKSimon added inline comments.Wed, Sep 23, 4:24 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
198–199

Are we going to have a problem if VL[0] is UndefValue?

ABataev added inline comments.Wed, Sep 23, 4:30 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
198–199

Yeah, will fix it.

ABataev updated this revision to Diff 293701.Wed, Sep 23, 5:30 AM

Rebase + fix

RKSimon added inline comments.Wed, Sep 23, 5:52 AM
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
16

These "feel" like regressions to me - any idea whats going on?

ABataev added inline comments.Wed, Sep 23, 5:55 AM
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
16

The cost model problem, if I recall it correctly. I investigated it before and found out that the cost model for AArch64 is not defined for long vectors in some cases and we fall back to the generic cost model evaluation which is not quite correct in many cases. Need to tweak the cost model for AArch64.

RKSimon added inline comments.Wed, Sep 23, 6:15 AM
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
16

Any instruction cost type (extract/shuffle/store?) in particular that needs better costs? It'd be good to at least raise a specific bug report to the aarch64 team

ABataev added inline comments.Wed, Sep 23, 6:31 AM
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
16

Do not remember already, need some time to investigate it again. Hope to do it by the end of this week.
PS. There was a question about this test already.

spatel added inline comments.Wed, Sep 23, 7:18 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7431

Is it necessary to copy these? If so, it would be better to name this function something like "getCopyOfExtraArgValues" to make that explicit.

If not, we can just make this a standard 'get' method:

const MapVector<Instruction *, Value *> &getExtraArgs() const {
  return ExtraArgs;
}

And then access the 'second' data in the user code?

ABataev added inline comments.Thu, Sep 24, 6:26 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7431

We don't need to expose the first element of the MapVector here, it is not good from the general design point of view. I'll rename the member function.

ABataev updated this revision to Diff 294040.Thu, Sep 24, 6:33 AM

Rebase + rename

ABataev added inline comments.Thu, Sep 24, 7:19 AM
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
16

Found the reason. It is the cost of shuffle of TTI::SK_PermuteSingleSrc kind. Before this patch, the test operated with the vector <2 x i16>, which is transformed to llvm::MVT::v2i32 by type legalization function and the cost of this shuffle is tweaked to be 1 (see llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp, AArch64TTIImpl::getShuffleCost). The cost of this operation is 1, per table.

With this patch, the original vector type is <4 x i16> which is transformed to llvm::MVT::v4i16 and there is no optimized value for TTI::SK_PermuteSingleSrc in the table for this type and the function falls back to the pessimistic cost model and returns 18.

There are several TODOs int the file already about fixing the cost model for different shuffle operations.