This is an archive of the discontinued LLVM Phabricator instance.

Adjust the cost of vectorized SHL/SRL/SRA
Needs ReviewPublic

Authored by wmi on May 21 2015, 4:15 PM.

Details

Reviewers

nadav
aschwaighofer

Summary

This patch addresses the problem reported in https://llvm.org/bugs/show_bug.cgi?id=23582. It adjusts the cost of vectorized SHL/SRL/SRA and makes sure they are lowered to vectorized shift instructions.

A number of test cases need to be adjusted. I am sending the patch out first to see whether the approach is generally OK, and will update it with the adjusted tests.

Diff Detail

Repository: rL LLVM

Event Timeline

wmi updated this revision to Diff 26282. May 21 2015, 4:15 PM

wmi retitled this revision to Adjust the cost of vectorized SHL/SRL/SRA.

wmi updated this object.

wmi edited the test plan for this revision.

wmi added reviewers: nadav, aschwaighofer.

wmi set the repository for this revision to rL LLVM.

wmi added subscribers: Unknown Object (MLST), davidxl.

Looks good to me assuming that you write the cost model tests, and that you write tests for the ISel changes.

Thanks for looking at this - comments below. I should mention I have some similar work in progress in D9474 and D9645 that is trying to efficiently vectorize more integer shifts.

lib/Target/X86/X86ISelLowering.cpp
16421	I think this is out of date - Elena did some refactoring on this recently.
lib/Target/X86/X86TargetTransformInfo.cpp
255	Why is v4i64 declared here?
265	SSE only has fast variable shift support for uniform values - these cost values surely should reflect the likely cost of general shift. It would be better to update the TargetTransformInfo::OK_UniformConstantValue tables + code above this to support TargetTransformInfo::OK_UniformValue as well.
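The suggestion in the last inline comment can be sketched as a two-level lookup: consult a cheap table only when the shift amount is known to be uniform, and fall back to the worst-case table otherwise. This is a standalone illustration, not LLVM code. The enum names, the table names and the `shiftCost` helper are hypothetical; the cost numbers are copied from the two sides of the SSE2CostTable diff in this revision.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the operand kinds discussed in the review.
enum class OperandKind { AnyValue, UniformValue, UniformConstantValue };
enum class Opcode { SHL, SRL, SRA };
enum class VecTy { v16i8, v8i16, v4i32, v2i64 };

struct CostEntry {
  Opcode Op;
  VecTy Ty;
  unsigned Cost;
};

// Worst-case costs for a general (per-lane) shift amount, taken from the
// left-hand side of the diff.
constexpr CostEntry GeneralShiftCosts[] = {
    {Opcode::SHL, VecTy::v16i8, 30},      {Opcode::SHL, VecTy::v8i16, 8 * 10},
    {Opcode::SHL, VecTy::v4i32, 2 * 5},   {Opcode::SHL, VecTy::v2i64, 2 * 10},
    {Opcode::SRL, VecTy::v16i8, 16 * 10}, {Opcode::SRL, VecTy::v8i16, 8 * 10},
    {Opcode::SRL, VecTy::v4i32, 4 * 10},  {Opcode::SRL, VecTy::v2i64, 2 * 10},
    {Opcode::SRA, VecTy::v16i8, 16 * 10}, {Opcode::SRA, VecTy::v8i16, 8 * 10},
    {Opcode::SRA, VecTy::v4i32, 4 * 10},  {Opcode::SRA, VecTy::v2i64, 2 * 10},
};

// Cheap costs that are only valid when every lane shifts by the same amount.
// These are the entries the first version of the patch set to 1.
constexpr CostEntry UniformShiftCosts[] = {
    {Opcode::SHL, VecTy::v8i16, 1}, {Opcode::SHL, VecTy::v4i32, 1},
    {Opcode::SHL, VecTy::v2i64, 1}, {Opcode::SRL, VecTy::v8i16, 1},
    {Opcode::SRL, VecTy::v4i32, 1}, {Opcode::SRL, VecTy::v2i64, 1},
    {Opcode::SRA, VecTy::v8i16, 1}, {Opcode::SRA, VecTy::v4i32, 1},
};

// Linear search; returns nullptr when the table has no matching entry.
template <std::size_t N>
const CostEntry *lookup(const CostEntry (&Table)[N], Opcode Op, VecTy Ty) {
  for (const CostEntry &E : Table)
    if (E.Op == Op && E.Ty == Ty)
      return &E;
  return nullptr;
}

// Use the cheap table only for uniform shift amounts, else the worst case.
unsigned shiftCost(Opcode Op, VecTy Ty, OperandKind AmtKind) {
  if (AmtKind != OperandKind::AnyValue)
    if (const CostEntry *E = lookup(UniformShiftCosts, Op, Ty))
      return E->Cost;
  const CostEntry *E = lookup(GeneralShiftCosts, Op, Ty);
  assert(E && "general table covers every opcode/type pair listed above");
  return E->Cost;
}
```

With this split, `shiftCost(Opcode::SRA, VecTy::v4i32, OperandKind::AnyValue)` keeps the conservative scalarized cost, while the same query with `OperandKind::UniformValue` gets the cheap one.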

I share Simon's concerns. Please make sure that we still get a good estimate for kernels like the following (these are from the rdar mentioned in the commit):

#define TYPE char
#define OP >>
#define SIZE 1024
#define TYPE_ALIGN __attribute__((aligned(16)))

TYPE A1[SIZE] TYPE_ALIGN;
TYPE B1[SIZE] TYPE_ALIGN;
TYPE C1[SIZE] TYPE_ALIGN;

void kernel1() {
  for (int i = 0; i < SIZE; ++i) {
    A1[i] = B1[i] OP C1[i];
  }
}

or:

for(k=0, r=0; k<pos; k++)
  r += (MAX_UNSIGNED) 1 << k;

In reply to D9923#177012 (@aschwaighofer):

Thanks for sharing the testcase. For the first testcase:

Without the patch, the generated code for the kernel loop is:
.LBB0_1: # %for.body
# =>This Inner Loop Header: Depth=1
movsbl B1+1024(%rax), %edx
movb C1+1024(%rax), %cl
sarl %cl, %edx
movb %dl, A1+1024(%rax)
incq %rax
jne .LBB0_1

With the patch, the generated code for the kernel loop is:
.LBB0_1: # %vector.body
# =>This Inner Loop Header: Depth=1
movd B1+1024(%rax), %xmm1 # xmm1 = mem[0],zero,zero,zero
punpcklbw %xmm1, %xmm1 # xmm1 = xmm1[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
punpcklwd %xmm1, %xmm1 # xmm1 = xmm1[0,0,1,1,2,2,3,3]
psrad $24, %xmm1
movd C1+1024(%rax), %xmm2 # xmm2 = mem[0],zero,zero,zero
punpcklbw %xmm2, %xmm2 # xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
punpcklwd %xmm2, %xmm2 # xmm2 = xmm2[0,0,1,1,2,2,3,3]
psrad $24, %xmm2
psrad %xmm2, %xmm1
pand %xmm0, %xmm1
packuswb %xmm1, %xmm1
packuswb %xmm1, %xmm1
movd %xmm1, A1+1024(%rax)
addq $4, %rax
jne .LBB0_1

The vectorized version is slightly better than the scalar version, but the cost estimate used to choose VF is poor: it reports a cost of 8 for VF==1 and a cost of 2 for VF==4. The estimated costs of the vectorized sext and trunc are too low and do not match the real costs.

Another problem is that the vectorizer does not know that the char->int type promotion here is unnecessary.

Can you give me the whole version of the second testcase? I am not sure my tweaked version is the right one.

Sorry, the formatting of the assembly code was lost when I replied from
Phabricator. Reposting it here:

#define TYPE char
#define OP >>
#define SIZE 1024
#define TYPE_ALIGN __attribute__((aligned(16)))

TYPE A1[SIZE] TYPE_ALIGN;
TYPE B1[SIZE] TYPE_ALIGN;
TYPE C1[SIZE] TYPE_ALIGN;

void kernel1() {
  for (int i = 0; i < SIZE; ++i) {
    A1[i] = B1[i] OP C1[i];
  }
}

Without the patch, the kernel loop:
.LBB0_1: # %for.body

movsbl  B1+1024(%rax), %edx
movb    C1+1024(%rax), %cl
sarl    %cl, %edx
movb    %dl, A1+1024(%rax)
incq    %rax
jne     .LBB0_1

With the patch, the kernel loop:
.LBB0_1: # %vector.body

movd    B1+1024(%rax), %xmm1    # xmm1 = mem[0],zero,zero,zero
punpcklbw       %xmm1, %xmm1    # xmm1 = xmm1[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
punpcklwd       %xmm1, %xmm1    # xmm1 = xmm1[0,0,1,1,2,2,3,3]
psrad   $24, %xmm1
movd    C1+1024(%rax), %xmm2    # xmm2 = mem[0],zero,zero,zero
punpcklbw       %xmm2, %xmm2    # xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
punpcklwd       %xmm2, %xmm2    # xmm2 = xmm2[0,0,1,1,2,2,3,3]
psrad   $24, %xmm2
psrad   %xmm2, %xmm1
pand    %xmm0, %xmm1
packuswb        %xmm1, %xmm1
packuswb        %xmm1, %xmm1
movd    %xmm1, A1+1024(%rax)
addq    $4, %rax
jne     .LBB0_1

Although the vectorized version is slightly better, the cost
estimate is not precise: the vectorizer's cost estimation says VF==4
(cost 2) is much better than VF==1 (cost 8).

Wei.

Ah, the code generated by this patch is incorrect. As Simon said, SSE
only has fast variable shift support for uniform values. I
misunderstood the meaning of the vectorized shift instruction. Will
rewrite the patch.
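To make the failure concrete, here is a small portable model of the `psrad %xmm2, %xmm1` form used in the loop above. It is a sketch for illustration only: the helper names are made up, and the model covers just the documented behaviour that the shift count is read from the low 64 bits of the count register and applied to every lane, with counts above 31 filling each lane with its sign bit.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using V4i32 = std::array<int32_t, 4>;

// Arithmetic right shift of one 32-bit lane, written via unsigned arithmetic
// so it does not rely on how the compiler shifts negative values.
// Requires c < 32.
int32_t sra32(int32_t x, unsigned c) {
  assert(c < 32);
  uint32_t ux = static_cast<uint32_t>(x);
  uint32_t r = (x < 0) ? ~(~ux >> c) : (ux >> c);
  return static_cast<int32_t>(r);
}

// Model of psrad with a register count: lanes 0 and 1 of the count register
// together form ONE 64-bit shift amount that is applied to all four lanes.
V4i32 psradModel(const V4i32 &v, const V4i32 &countReg) {
  uint64_t count =
      static_cast<uint64_t>(static_cast<uint32_t>(countReg[0])) |
      (static_cast<uint64_t>(static_cast<uint32_t>(countReg[1])) << 32);
  unsigned c = count > 31 ? 31u : static_cast<unsigned>(count);
  V4i32 out{};
  for (std::size_t i = 0; i < 4; ++i)
    out[i] = sra32(v[i], c);
  return out;
}

// What the scalar loop computes: every lane uses its own shift amount
// (masked to 5 bits, as the scalar sarl instruction does).
V4i32 perLaneSra(const V4i32 &v, const V4i32 &amt) {
  V4i32 out{};
  for (std::size_t i = 0; i < 4; ++i)
    out[i] = sra32(v[i], static_cast<unsigned>(amt[i]) & 31u);
  return out;
}
```

Feeding a vector of distinct per-lane amounts such as {1, 2, 3, 4} into `psradModel` yields a 64-bit count far above 31, so every lane collapses to its sign bit instead of matching `perLaneSra`. The two only agree when the count register holds a single amount in its low 64 bits, which is the uniform case.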

I updated the patch.

The cost of a vectorized shift with a uniform scalar shift amount is adjusted. lowerShift already contains the logic to lower such shifts to X86ISD::VSHL/VSRL/VSRA properly.

It works for the motivating case in PR23582. The LLVM unit tests pass.
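For readers unfamiliar with the term, a loop with a uniform, non-constant shift amount looks like the sketch below. This is a hypothetical kernel written for illustration, not the reproducer from PR23582: the amount is unknown at compile time but is the same for every element, so a vectorized version can use one shift count for all lanes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical kernel: `amount` is loop-invariant (uniform) but not a
// compile-time constant. Unsigned elements make this a logical shift (SRL).
void shiftRightAll(uint32_t *dst, const uint32_t *src, std::size_t n,
                   unsigned amount) {
  assert(amount < 32 && "shifting by the full width or more is undefined");
  for (std::size_t i = 0; i < n; ++i)
    dst[i] = src[i] >> amount;
}
```

By contrast, the `A1[i] = B1[i] OP C1[i]` kernel earlier in the thread reads a different amount for each element, which is the general case that must keep the conservative cost.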

Thank you for the update. I don't know enough about the LoopVectorize code to review but the TTI cost model looks correct.

Please can you add uniform non-constant tests to:

test/Analysis/CostModel/X86/testshiftashr.ll
test/Analysis/CostModel/X86/testshiftlshr.ll
test/Analysis/CostModel/X86/testshiftshl.ll

Thanks for the review. I added uniform non-constant tests to those files. Another change in this patch is that the CostModelAnalysis pass now tries to mark the operand as UniformValue.

Revision Contents

Path	Size
lib/Target/X86/X86ISelLowering.cpp	17 lines
lib/Target/X86/X86TargetTransformInfo.cpp	16 lines

Diff 26282

lib/Target/X86/X86ISelLowering.cpp


 static SDValue LowerScalarVariableShift(SDValue Op, SelectionDAG &DAG,
                                         const X86Subtarget* Subtarget) {
   MVT VT = Op.getSimpleValueType();
   SDLoc dl(Op);
   SDValue R = Op.getOperand(0);
   SDValue Amt = Op.getOperand(1);

+  if (Subtarget->hasSSE2() &&
+      (VT == MVT::v8i16 || VT == MVT::v4i32 || VT == MVT::v2i16) &&
+      (VT != MVT::v2i64 || Op.getOpcode() != ISD::SRA)) {
+    assert((VT == R.getSimpleValueType() && VT == Amt.getSimpleValueType()) &&
+           "Unexpected operand type");
+    switch (Op.getOpcode()) {
+    default:
+      llvm_unreachable("Unknown shift opcode!");
+    case ISD::SHL:
+      return DAG.getNode(X86ISD::VSHL, dl, VT, R, Op.getOperand(1));
+    case ISD::SRL:
+      return DAG.getNode(X86ISD::VSRL, dl, VT, R, Op.getOperand(1));
+    case ISD::SRA:
+      return DAG.getNode(X86ISD::VSRA, dl, VT, R, Op.getOperand(1));
+    }
+  }

RKSimon: I think this is out of date - Elena did some refactoring on this recently.

   if ((VT == MVT::v2i64 && Op.getOpcode() != ISD::SRA) ||
       VT == MVT::v4i32 || VT == MVT::v8i16 ||
       (Subtarget->hasInt256() &&
        ((VT == MVT::v4i64 && Op.getOpcode() != ISD::SRA) ||
         VT == MVT::v8i32 || VT == MVT::v16i16)) ||
       (Subtarget->hasAVX512() && (VT == MVT::v8i64 || VT == MVT::v16i32))) {
     SDValue BaseShAmt;
     EVT EltVT = VT.getVectorElementType();

lib/Target/X86/X86TargetTransformInfo.cpp

 static const CostTblEntry<MVT::SimpleValueType> SSE2CostTable[] = {
   // custom.
   // For some cases, where the shift amount is a scalar we would be able
   // to generate better code. Unfortunately, when this is the case the value
   // (the splat) will get hoisted out of the loop, thereby making it invisible
   // to ISel. The cost model must return worst case assumptions because it is
   // used for vectorization and we don't want to make vectorized code worse
   // than scalar code.
   { ISD::SHL, MVT::v16i8, 30 }, // cmpeqb sequence.
-  { ISD::SHL, MVT::v8i16, 8*10 }, // Scalarized.
-  { ISD::SHL, MVT::v4i32, 2*5 }, // We optimized this using mul.
-  { ISD::SHL, MVT::v2i64, 2*10 }, // Scalarized.
+  { ISD::SHL, MVT::v8i16, 1 },
+  { ISD::SHL, MVT::v4i32, 1 },
+  { ISD::SHL, MVT::v2i64, 1 },
   { ISD::SHL, MVT::v4i64, 4*10 }, // Scalarized.

RKSimon: Why is v4i64 declared here?

   { ISD::SRL, MVT::v16i8, 16*10 }, // Scalarized.
-  { ISD::SRL, MVT::v8i16, 8*10 }, // Scalarized.
-  { ISD::SRL, MVT::v4i32, 4*10 }, // Scalarized.
-  { ISD::SRL, MVT::v2i64, 2*10 }, // Scalarized.
+  { ISD::SRL, MVT::v8i16, 1 },
+  { ISD::SRL, MVT::v4i32, 1 },
+  { ISD::SRL, MVT::v2i64, 1 },

   { ISD::SRA, MVT::v16i8, 16*10 }, // Scalarized.
-  { ISD::SRA, MVT::v8i16, 8*10 }, // Scalarized.
-  { ISD::SRA, MVT::v4i32, 4*10 }, // Scalarized.
+  { ISD::SRA, MVT::v8i16, 1 },
+  { ISD::SRA, MVT::v4i32, 1 },
   { ISD::SRA, MVT::v2i64, 2*10 }, // Scalarized.

RKSimon: SSE only has fast variable shift support for uniform values - these cost values surely should reflect the likely cost of general shift. It would be better to update the TargetTransformInfo::OK_UniformConstantValue tables + code above this to support TargetTransformInfo::OK_UniformValue as well.

   // It is not a good idea to vectorize division. We have to scalarize it and
   // in the process we will often end up having to spilling regular
   // registers. The overhead of division is going to dominate most kernels
   // anyways so try hard to prevent vectorization of division - it is
   // generally a bad idea. Assume somewhat arbitrarily that we have to be able
   // to hide "20 cycles" for each lane.
   { ISD::SDIV, MVT::v16i8, 16*20 },
   { ISD::SDIV, MVT::v8i16, 8*20 },