This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Scalarize binary ops following shuffles.
Abandoned · Public

Authored by FarhanaAleen on Apr 6 2018, 3:38 PM.

Details

Reviewers

spatel
RKSimon
craig.topper

Summary

This patch performs the following transformation:

%vector1 = shufflevector
%vector2 = shufflevector
%vector-res = vector-binaryop %vector1, %vector2
%scalar-res = extractelement <2 x half> %vector-res, i32 0
ret half %scalar-res

->

%scalar1 = extractelement <2 x half> %vector1, i32 0
%scalar2 = extractelement <2 x half> %vector2, i32 0
%scalar-res = scalar binaryop %scalar1, %scalar2
ret half %scalar-res
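
As an illustration only, the following C++ sketch models why the two forms agree for a lane-wise operation such as an integer add: reading lane 0 of the vector result gives the same value as adding lane 0 of each operand. The names `Vec2`, `shuffle`, `vectorThenExtract`, and `extractThenScalar` are hypothetical and do not come from the patch; `std::int16_t` stands in for a 16-bit element type.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Two-lane vector of 16-bit integers, standing in for <2 x i16>.
using Vec2 = std::array<std::int16_t, 2>;

// Model of `shufflevector A, B, <M0, M1>` for two-lane inputs:
// mask values 0..1 select a lane of A, mask values 2..3 select a lane of B.
Vec2 shuffle(const Vec2 &A, const Vec2 &B, const std::array<int, 2> &Mask) {
  Vec2 R{};
  for (std::size_t I = 0; I < 2; ++I)
    R[I] = Mask[I] < 2 ? A[Mask[I]] : B[Mask[I] - 2];
  return R;
}

// Original form: add every lane, then extract a single lane.
std::int16_t vectorThenExtract(const Vec2 &V1, const Vec2 &V2, std::size_t Lane) {
  Vec2 Sum{};
  for (std::size_t I = 0; I < 2; ++I)
    Sum[I] = static_cast<std::int16_t>(V1[I] + V2[I]);
  return Sum[Lane];
}

// Transformed form: extract the lane from each operand, then add as scalars.
std::int16_t extractThenScalar(const Vec2 &V1, const Vec2 &V2, std::size_t Lane) {
  return static_cast<std::int16_t>(V1[Lane] + V2[Lane]);
}
```

With the masks used in the patch's `scalarize_shuffles_2i16` test (`<i32 2, i32 1>` on both shuffles, operands swapped), lane 0 of the sum is the first element of `%y` plus the first element of `%x` in both forms.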

Event Timeline

FarhanaAleen created this revision. Apr 6 2018, 3:38 PM

The fact that we have to limit this transform to 16-bit is a good indicator that it's not suitable as a target-independent instcombine.
I.e., just because it shows an improvement for the target that you're looking at doesn't mean it will do the same for other targets.
We should either:
(1) fix the pattern-matching problems that occur in later passes, so we can do this transform universally or
(2) move this transform to the DAG gated by type and target checks, so it only happens when profitable.

Agreed, this really must take into account target costs to be of any use. It's true that, given vector code, we don't perform much in the way of scalarization (or re-vectorization to another width). Not sure if this is the kind of thing that SLP could be extended to perform without making it even more messy.

Thanks for your feedback and I agree with you guys.

Yes, there are two options, each with its pros and cons.

Initially, I considered (2): performing scalarization in the DAGCombiner, gated by type and target checks. The code for implementing it inside the DAGCombiner seemed a little messy, since I could not find a clean way to distinguish scalarizable operations, such as arithmetic and bitwise operations, from shuffle vector, build vector, etc. The only option I could think of was to hard-code the list of scalar operations.

On the other hand, this transformation can be implemented more cleanly with (1) (inside InstCombine), under the same consideration (target checks), with a simple extension. (1) also allows us to reuse the existing implementation. Another benefit of (1) is that it keeps all the relevant scalarization code in one place. The only con is that it requires additional support for certain targets: X86 at least, and maybe more.

In general, the transformation seemed to be a target-independent optimization, since it optimizes vector code that performs an effective operation on only a single element, which should instead be executed as scalar code. And most of the time, the performance of scalar code is better than that of vector code. Therefore, this optimization should benefit all the other targets. The reason it does not benefit X86 is a lack of proper support in instruction selection, which can be considered trivial and can be improved.

Do you think we can consider (1) and implement it in a few stages?

Stage 1: Support only 16-bit types, which works for all targets.

Stage 2: Support all types, with a hook into TTI, and enable the transformation for the targets that currently support this optimization.

Stage 3: Support the missing instruction selection patterns in X86 and possibly other targets.
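
A minimal sketch of how the first two stages might gate the transform, under stated assumptions: `TargetInfoSketch`, its `SupportsShuffleScalarization` field, `Stage`, and `shouldScalarizeShuffle` are all hypothetical names, and none of them is an existing TTI interface. Stage 3 is backend pattern work and does not change the gate, so it does not appear here.

```cpp
#include <cassert>

// Hypothetical stand-in for a per-target query. This is NOT an existing
// LLVM TargetTransformInfo interface; it only models the proposed hook.
struct TargetInfoSketch {
  bool SupportsShuffleScalarization; // assumed per-target answer
};

// The two gating policies described in the staged plan.
enum class Stage {
  SixteenBitOnly, // Stage 1: restrict to 16-bit element types on every target
  TargetHook      // Stage 2: any element type, if the target opts in
};

// Decide whether an extract of a shuffle may be scalarized under a policy.
bool shouldScalarizeShuffle(Stage S, unsigned ScalarBits,
                            const TargetInfoSketch &Target) {
  switch (S) {
  case Stage::SixteenBitOnly:
    return ScalarBits == 16;
  case Stage::TargetHook:
    return Target.SupportsShuffleScalarization;
  }
  return false;
}
```

Under the Stage 1 policy the target answer is ignored entirely, which is what makes the 16-bit restriction look target independent even though it was chosen for target reasons.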

In D45393#1062223, @FarhanaAleen wrote:

On the other hand, this transformation can be implemented more cleanly with (1) (inside InstCombine), under the same consideration (target checks), with a simple extension. (1) also allows us to reuse the existing implementation. Another benefit of (1) is that it keeps all the relevant scalarization code in one place. The only con is that it requires additional support for certain targets: X86 at least, and maybe more.

We shouldn't include any kind of target checks in InstCombine though. It's supposed to be used for universal simplifications/transforms. If there's any doubt, then there has to be an easy way for targets to reverse the transform.

In general, the transformation seemed to be a target-independent optimization, since it optimizes vector code that performs an effective operation on only a single element, which should instead be executed as scalar code. And most of the time, the performance of scalar code is better than that of vector code.

It's the "in general" and "most of the time" qualifiers that raise the red flag for me. I understand the appeal of making the transform in IR, and so I'd second Simon's suggestion about seeing if we can add this onto SLP vectorizer where you can use cost models to decide if it's profitable.

If you'd still like to pursue an instcombine solution, please post this to llvm-dev for more feedback. If there's consensus that we can do this here, I won't oppose it (it's obviously less work to do it here!).
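
To make the cost-model argument concrete, here is a rough sketch of the comparison a cost-driven pass could make for the pattern in the summary. Everything in it is an assumption: `CostTableSketch` and the function names are hypothetical, and real costs would come from the target's cost model rather than hand-written numbers.

```cpp
#include <cassert>

// Hypothetical per-target instruction costs (illustrative only).
struct CostTableSketch {
  int VectorShuffle; // cost of one shufflevector
  int VectorBinOp;   // cost of one vector binary op
  int ScalarBinOp;   // cost of one scalar binary op
  int Extract;       // cost of one extractelement
};

// Vector form: two shuffles, one vector binop, one extract.
int vectorFormCost(const CostTableSketch &C) {
  return 2 * C.VectorShuffle + C.VectorBinOp + C.Extract;
}

// Scalar form, assuming the shuffles become dead: two extracts, one scalar binop.
int scalarFormCost(const CostTableSketch &C) {
  return 2 * C.Extract + C.ScalarBinOp;
}

// Scalarize only when the scalar form is strictly cheaper on this target.
bool scalarizationProfitable(const CostTableSketch &C) {
  return scalarFormCost(C) < vectorFormCost(C);
}
```

With made-up tables, a target where every operation costs 1 prefers the scalar form (3 versus 4), while a target with free shuffles and expensive extracts does not, which is the kind of per-target difference the reviewers are pointing at.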

It's the "in general" and "most of the time" qualifiers that raise the red flag for me.

I can see that.

I understand the appeal of making the transform in IR, and so I'd second Simon's suggestion about seeing if we can add this onto SLP vectorizer where you can use cost models to decide if it's profitable.

Ok, I will try this first and see how it goes.

FarhanaAleen abandoned this revision. Apr 10 2018, 3:41 PM

hsaito mentioned this in D45834: [TTI] Add a hook to TTI for choosing scalarized shuffle-reduction sequence for reduction idiom. Apr 20 2018, 11:40 AM

Revision Contents

Path                                                      Size
lib/Transforms/InstCombine/InstCombineVectorOps.cpp       7 lines
test/Transforms/InstCombine/fast-math-scalarization.ll    76 lines

Diff 141440

lib/Transforms/InstCombine/InstCombineVectorOps.cpp

[63 lines not shown]
 static bool cheapToScalarize(Value *V, bool isConstant) {
   Instruction *I = dyn_cast<Instruction>(V);
   if (!I) return false;

   // Insert element gets simplified to the inserted element or is deleted if
   // this is constant idx extract element and its a constant idx insertelt.
   if (I->getOpcode() == Instruction::InsertElement && isConstant &&
       isa<ConstantInt>(I->getOperand(2)))
     return true;
-  if (I->getOpcode() == Instruction::Load && I->hasOneUse())
-    return true;
+  if (I->hasOneUse()) {
+    if (I->getOpcode() == Instruction::Load ||
+        // FIXME: Support other types?
+        (isConstant && isa<ShuffleVectorInst>(I) &&
+         V->getType()->getScalarSizeInBits() == 16))
+      return true;
+  }
   if (BinaryOperator *BO = dyn_cast<BinaryOperator>(I))
     if (BO->hasOneUse() &&
         (cheapToScalarize(BO->getOperand(0), isConstant) ||
          cheapToScalarize(BO->getOperand(1), isConstant)))
       return true;
   if (CmpInst *CI = dyn_cast<CmpInst>(I))
     if (CI->hasOneUse() &&
         (cheapToScalarize(CI->getOperand(0), isConstant) ||
[1,414 lines not shown]
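
A self-contained toy model of the patched predicate may make its control flow easier to follow. The `Opcode` enum, the `Node` struct, and `cheapToScalarizeSketch` are hypothetical stand-ins rather than LLVM classes, and the `CmpInst` case is left out because it is truncated in the diff above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-ins for LLVM values; not the real API.
enum class Opcode { Argument, Load, InsertElement, ShuffleVector, BinaryOp };

struct Node {
  Opcode Op;
  unsigned ScalarBits;      // element width of the vector type
  unsigned NumUses;         // number of users of this value
  bool ConstantInsertIndex; // only meaningful for InsertElement
  std::vector<const Node *> Operands;
};

// Mirrors the shape of the patched predicate: a single-use load, or a
// single-use shuffle of 16-bit elements read at a constant index, is
// treated as cheap; binary operators recurse into their operands.
bool cheapToScalarizeSketch(const Node *N, bool IsConstant) {
  if (!N || N->Op == Opcode::Argument)
    return false; // not an instruction
  if (N->Op == Opcode::InsertElement && IsConstant && N->ConstantInsertIndex)
    return true;
  if (N->NumUses == 1) {
    if (N->Op == Opcode::Load)
      return true;
    // Condition added by the patch, restricted to 16-bit elements.
    if (IsConstant && N->Op == Opcode::ShuffleVector && N->ScalarBits == 16)
      return true;
  }
  if (N->Op == Opcode::BinaryOp && N->NumUses == 1 && N->Operands.size() == 2)
    return cheapToScalarizeSketch(N->Operands[0], IsConstant) ||
           cheapToScalarizeSketch(N->Operands[1], IsConstant);
  return false;
}
```

In this model a single-use add of two single-use 16-bit shuffles is accepted, while the same shape with 32-bit elements is rejected, matching the `half` versus `float` split in the tests.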

test/Transforms/InstCombine/fast-math-scalarization.ll

[22 lines not shown]
 for.body:
   %mul = fmul fast <4 x float> %x.0, <float 0x4002A3D700000000, float 0x4002A3D700000000, float 0x4002A3D700000000, float 0x4002A3D700000000>
   %inc = add nsw i32 %i.0, 1
   br label %for.cond

 for.end:
   ret void
 }

-; CHECK-LABEL: test_extract_element_fastmath
+; CHECK-LABEL: @test_extract_element_fastmath
 ; CHECK: fadd fast float
 define float @test_extract_element_fastmath(<4 x float> %x) #0 {
 entry:
   %add = fadd fast <4 x float> %x, <float 0x4002A3D700000000, float 0x4002A3D700000000, float 0x4002A3D700000000, float 0x4002A3D700000000>
   %0 = extractelement <4 x float> %add, i32 2
   ret float %0
 }
+
+; CHECK-LABEL: @scalarize_shuffles_half2
+; CHECK: extractelement
+; CHECK-NEXT: extractelement
+; CHECK-NEXT: fadd fast half
+; CHECK-NEXT: ret
+define half @scalarize_shuffles_half2(<2 x half> %x, <2 x half> %y) {
+entry:
+  %shuff1 = shufflevector <2 x half> %x, <2 x half> %y, <2 x i32> <i32 2, i32 1>
+  %shuff2 = shufflevector <2 x half> %y, <2 x half> %x, <2 x i32> <i32 2, i32 1>
+  %vadd = fadd fast <2 x half> %shuff1, %shuff2
+  %scalar = extractelement <2 x half> %vadd, i32 0
+  ret half %scalar
+}
+
+; CHECK-LABEL: @scalarize_shuffles_2i16
+; CHECK: extractelement
+; CHECK-NEXT: extractelement
+; CHECK: add i16
+; CHECK-NEXT: ret
+define i16 @scalarize_shuffles_2i16(<2 x i16> %x, <2 x i16> %y) {
+entry:
+  %shuff1 = shufflevector <2 x i16> %x, <2 x i16> %y, <2 x i32> <i32 2, i32 1>
+  %shuff2 = shufflevector <2 x i16> %y, <2 x i16> %x, <2 x i32> <i32 2, i32 1>
+  %vadd = add <2 x i16> %shuff1, %shuff2
+  %scalar = extractelement <2 x i16> %vadd, i32 0
+  ret i16 %scalar
+}
+
+; CHECK-LABEL: @scalarize_multiplebinops
+; CHECK: extractelement
+; CHECK-NEXT: extractelement
+; CHECK-NEXT: extractelement
+; CHECK-NEXT: fadd fast half
+; CHECK-NEXT: fadd fast half
+; CHECK-NEXT: ret
+define half @scalarize_multiplebinops(<2 x half> %x, <2 x half> %y) {
+entry:
+  %x.1 = shufflevector <2 x half> %x, <2 x half> undef, <2 x i32> <i32 1, i32 undef>
+  %shuff1 = shufflevector <2 x half> %x, <2 x half> %y, <2 x i32> <i32 2, i32 1>
+  %shuff2 = shufflevector <2 x half> %y, <2 x half> %x, <2 x i32> <i32 2, i32 1>
+  %vadd1 = fadd fast <2 x half> %shuff1, %shuff2
+  %vadd2 = fadd fast <2 x half> %x.1, %vadd1
+  %scalar = extractelement <2 x half> %vadd2, i32 0
+  ret half %scalar
+}
+
+; CHECK-LABEL: scalarize_reduction_half2
+; CHECK: extractelement
+; CHECK-NEXT: extractelement
+; CHECK: fadd fast half
+; CHECK: ret
+define half @scalarize_reduction_half2(<2 x half> %x) {
+entry:
+  %rdx.shuf1 = shufflevector <2 x half> %x, <2 x half> undef, <2 x i32> <i32 1, i32 undef>
+  %add = fadd fast <2 x half> %x, %rdx.shuf1
+  %scalar = extractelement <2 x half> %add, i32 0
+  ret half %scalar
+}
+
+; CHECK-LABEL: @scalarize_reduction_float2
+; TODO: The following test should be scalarized also. cheapToScalarize() in InstCombineVectorOps.cpp is
+; restricted to type half only. This is because allowing it for other types might cause performance regression
+; on other targets(such as X86). When there is performance regression this is an artifact of pattern matching being
+; too restrictive, and should be fixed.
+; For example, X86(isHorizontalBinOp) relies on shuffle pattern to
+; detect horizontal binary operations. Absense of the shuffle causes isHorizontalBinOp to not match the horizontalbinop
+; pattern and produces vmovshdup+vaddss instead of vhaddps on avx2;
+define float @scalarize_reduction_float2(<2 x float> %x) {
+entry:
+  %rdx.shuf1 = shufflevector <2 x float> %x, <2 x float> undef, <2 x i32> <i32 1, i32 undef>
+  %add = fadd fast <2 x float> %x, %rdx.shuf1
+  %scalar = extractelement <2 x float> %add, i32 0
+  ret float %scalar
+}