This is an archive of the discontinued LLVM Phabricator instance.

Differential D14743

[ARM] Don't pessimize i32 vselect.
Closed, Public

Authored by chatur01 on Nov 17 2015, 1:10 AM.

Details

Reviewers

rengolin
mzolotukhin
aschwaighofer
jmolloy
mcrosier

Commits

rG7968b981bf09: [ARM] Don't pessimize i32 vselect.
rL253349: [ARM] Don't pessimize i32 vselect.

Summary

The underlying issues surrounding codegen for 32-bit vselects have been resolved. The pessimistic costs for 64-bit vselects remain due to the bad
scalarization that is still happening there.

I tested this on A57 in T32, A32 and A64 modes. I saw no regressions, and some improvements.

From my benchmarks, I saw these improvements on A57 (T32):
spec.cpu2000.ref.177_mesa 5.95%
lnt.SingleSource/Benchmarks/Shootout/strcat 12.93%
lnt.MultiSource/Benchmarks/MiBench/telecomm-CRC32/telecomm-CRC32 11.89%

I also measured A57 A32 and A9 T32 and found no performance regressions. I see much bigger wins in third-party benchmarks with this change.

Diff Detail

Repository: rL LLVM

Event Timeline

chatur01 updated this revision to Diff 40292. Nov 17 2015, 1:10 AM

chatur01 retitled this revision to "[ARM] Don't pessimize i32 vselect."

chatur01 updated this object.

chatur01 added reviewers: mzolotukhin, aschwaighofer, jmolloy, mcrosier.

chatur01 set the repository for this revision to rL LLVM.

chatur01 added a subscriber: llvm-commits.

Herald added subscribers: rengolin, aemerson. Nov 17 2015, 1:10 AM

Hi Charlie,

The original commit mentions a 25% performance improvement on Cortex-A8 in PAQp8. It seems you've only tested on out-of-order cores; can you double-check on either A53 or A7/A8? It's probably a pattern we now recognise better, but it would be good to be on the safe side.

If you find no regressions there either, and I'm not expecting any, LGTM. Thanks!

cheers,
--renato

This revision is now accepted and ready to land. Nov 17 2015, 1:27 AM

Thanks Renato. I double-checked A53 T32 and found no regressions in SPEC 2000, 2006 or LNT.

Adobe-C++/loop_unroll was up 6.49%
ASC_Sequoia/AMGmk/AMGmk was up 5.04%

Several other improvements in LNT.

Closed by commit rL253349: [ARM] Don't pessimize i32 vselect. (authored by chatur01). Nov 17 2015, 9:28 AM

This revision was automatically updated to reflect the committed changes.

I missed some mechanical changes in test/Analysis/CostModel/ARM/select.ll in this review. I committed them together with what's in this revision, as I felt they were suitable for post-commit review if necessary. Sorry about that!

Revision Contents

Path                                                    Size
llvm/trunk/lib/Target/ARM/ARMTargetTransformInfo.cpp    3 lines
llvm/trunk/test/Analysis/CostModel/ARM/select.ll        6 lines
llvm/trunk/test/CodeGen/ARM/vselect_imax.ll             11 lines

Diff 40403

llvm/trunk/lib/Target/ARM/ARMTargetTransformInfo.cpp

[...]
 int ARMTTIImpl::getCmpSelInstrCost(unsigned Opcode, Type *ValTy, Type *CondTy) {

   int ISD = TLI->InstructionOpcodeToISD(Opcode);
   // On NEON a a vector select gets lowered to vbsl.
   if (ST->hasNEON() && ValTy->isVectorTy() && ISD == ISD::SELECT) {
     // Lowering of some vector selects is currently far from perfect.
     static const TypeConversionCostTblEntry NEONVectorSelectTbl[] = {
-      { ISD::SELECT, MVT::v16i1, MVT::v16i16, 2*16 + 1 + 3*1 + 4*1 },
-      { ISD::SELECT, MVT::v8i1, MVT::v8i32, 4*8 + 1*3 + 1*4 + 1*2 },
-      { ISD::SELECT, MVT::v16i1, MVT::v16i32, 4*16 + 1*6 + 1*8 + 1*4 },
       { ISD::SELECT, MVT::v4i1, MVT::v4i64, 4*4 + 1*2 + 1 },
       { ISD::SELECT, MVT::v8i1, MVT::v8i64, 50 },
       { ISD::SELECT, MVT::v16i1, MVT::v16i64, 100 }
     };

     EVT SelCondTy = TLI->getValueType(DL, CondTy);
     EVT SelValTy = TLI->getValueType(DL, ValTy);
     if (SelCondTy.isSimple() && SelValTy.isSimple()) {
[...]
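Reading the cost expressions in the three removed rows as sums of products (`2*16 + 1 + 3*1 + 4*1` and so on) reproduces the costs that the tests expected before this change, and the v4i64 row that stays still evaluates to 19. A minimal compile-time check of that arithmetic; the constant names are hypothetical, and only the expressions come from the table:

```cpp
// Cost expressions from NEONVectorSelectTbl; constant names are hypothetical.
constexpr int kOldV16I16 = 2*16 + 1 + 3*1 + 4*1;    // removed row
constexpr int kOldV8I32  = 4*8 + 1*3 + 1*4 + 1*2;   // removed row
constexpr int kOldV16I32 = 4*16 + 1*6 + 1*8 + 1*4;  // removed row
constexpr int kV4I64     = 4*4 + 1*2 + 1;           // row kept by this change

// These are the values the old CHECK/COST lines in the tests looked for.
static_assert(kOldV16I16 == 40, "<16 x i16> select");
static_assert(kOldV8I32 == 41, "<8 x i32> select");
static_assert(kOldV16I32 == 82, "<16 x i32> select");
static_assert(kV4I64 == 19, "<4 x i64> select");
```

The file compiles only if every expression matches its expected cost, so it doubles as a sanity check on the table.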

llvm/trunk/test/Analysis/CostModel/ARM/select.ll

 define void @casts() {
[...]
   %v10 = select <16 x i1> undef, <16 x i8> undef, <16 x i8> undef

   ; CHECK: cost of 1 {{.*}} select
   %v11 = select <2 x i1> undef, <2 x i16> undef, <2 x i16> undef
   ; CHECK: cost of 1 {{.*}} select
   %v12 = select <4 x i1> undef, <4 x i16> undef, <4 x i16> undef
   ; CHECK: cost of 1 {{.*}} select
   %v13 = select <8 x i1> undef, <8 x i16> undef, <8 x i16> undef
-  ; CHECK: cost of 40 {{.*}} select
+  ; CHECK: cost of 2 {{.*}} select
   %v13b = select <16 x i1> undef, <16 x i16> undef, <16 x i16> undef

   ; CHECK: cost of 1 {{.*}} select
   %v14 = select <2 x i1> undef, <2 x i32> undef, <2 x i32> undef
   ; CHECK: cost of 1 {{.*}} select
   %v15 = select <4 x i1> undef, <4 x i32> undef, <4 x i32> undef
-  ; CHECK: cost of 41 {{.*}} select
+  ; CHECK: cost of 2 {{.*}} select
   %v15b = select <8 x i1> undef, <8 x i32> undef, <8 x i32> undef
-  ; CHECK: cost of 82 {{.*}} select
+  ; CHECK: cost of 4 {{.*}} select
   %v15c = select <16 x i1> undef, <16 x i32> undef, <16 x i32> undef

   ; CHECK: cost of 1 {{.*}} select
   %v16 = select <2 x i1> undef, <2 x i64> undef, <2 x i64> undef
   ; CHECK: cost of 19 {{.*}} select
   %v16a = select <4 x i1> undef, <4 x i64> undef, <4 x i64> undef
   ; CHECK: cost of 50 {{.*}} select
   %v16b = select <8 x i1> undef, <8 x i64> undef, <8 x i64> undef
[...]
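The new expectations follow from the i16 and i32 selects no longer matching a row in the cost table. Below is a minimal model of that lookup. The fallback on a table miss is an assumption of this sketch (one select per 128-bit register), chosen because it matches the new CHECK lines; the excerpt above does not show the actual fallback code. All names are hypothetical, and the sketch needs C++14 or later.

```cpp
#include <cstddef>

// Simplified model of the NEON vector-select cost lookup. All names are
// hypothetical; only the costs and expected values come from the review.
struct SelectCostEntry {
  int Lanes;     // number of vector elements
  int LaneBits;  // width of each element in bits
  int Cost;
};

// Rows before the change (the first three are the ones it removes).
constexpr SelectCostEntry kOldTable[] = {
    {16, 16, 40}, {8, 32, 41}, {16, 32, 82},
    {4, 64, 19},  {8, 64, 50}, {16, 64, 100}};

// Rows after the change: only the 64-bit element cases stay pessimized.
constexpr SelectCostEntry kNewTable[] = {
    {4, 64, 19}, {8, 64, 50}, {16, 64, 100}};

// ASSUMED fallback on a table miss: one select per 128-bit register.
constexpr int registersNeeded(int Lanes, int LaneBits) {
  return (Lanes * LaneBits + 127) / 128;
}

// Return the table cost if a row matches, otherwise the assumed fallback.
template <std::size_t N>
constexpr int selectCost(const SelectCostEntry (&Table)[N], int Lanes,
                         int LaneBits) {
  for (std::size_t I = 0; I < N; ++I)
    if (Table[I].Lanes == Lanes && Table[I].LaneBits == LaneBits)
      return Table[I].Cost;
  return registersNeeded(Lanes, LaneBits);
}

// <16 x i32>: 82 before the change, 4 after it.
static_assert(selectCost(kOldTable, 16, 32) == 82, "old cost");
static_assert(selectCost(kNewTable, 16, 32) == 4, "new cost");
// <4 x i64> is still served from the table.
static_assert(selectCost(kNewTable, 4, 64) == 19, "unchanged cost");
```

Under this model, `<16 x i16>` and `<8 x i32>` both drop to 2 because each occupies two 128-bit registers.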

llvm/trunk/test/CodeGen/ARM/vselect_imax.ll

 ; RUN: opt < %s -cost-model -analyze -mtriple=thumbv7-apple-ios6.0.0 -march=arm -mcpu=cortex-a8 | FileCheck %s --check-prefix=COST
 ; RUN: llc -mtriple=arm-eabi -mattr=+neon %s -o - | FileCheck %s
 ; Make sure that ARM backend with NEON handles vselect.

 define void @vmax_v4i32(<4 x i32>* %m, <4 x i32> %a, <4 x i32> %b) {
 ; CHECK: vmax.s32 {{q[0-9]+}}, {{q[0-9]+}}, {{q[0-9]+}}
   %cmpres = icmp sgt <4 x i32> %a, %b
   %maxres = select <4 x i1> %cmpres, <4 x i32> %a, <4 x i32> %b
   store <4 x i32> %maxres, <4 x i32>* %m
   ret void
 }

-; We adjusted the cost model of the following selects. When we improve code
-; lowering we also need to adjust the cost.
 %T0_10 = type <16 x i16>
 %T1_10 = type <16 x i1>
 ; CHECK-LABEL: func_blend10:
 define void @func_blend10(%T0_10* %loadaddr, %T0_10* %loadaddr2,
                           %T1_10* %blend, %T0_10* %storeaddr) {
   %v0 = load %T0_10, %T0_10* %loadaddr
   %v1 = load %T0_10, %T0_10* %loadaddr2
   %c = icmp slt %T0_10 %v0, %v1
 ; CHECK: vmin.s16
 ; CHECK: vmin.s16
 ; COST: func_blend10
-; COST: cost of 40 {{.*}} select
+; COST: cost of 2 {{.*}} select
   %r = select %T1_10 %c, %T0_10 %v0, %T0_10 %v1
   store %T0_10 %r, %T0_10* %storeaddr
   ret void
 }
 %T0_14 = type <8 x i32>
 %T1_14 = type <8 x i1>
 ; CHECK-LABEL: func_blend14:
 define void @func_blend14(%T0_14* %loadaddr, %T0_14* %loadaddr2,
                           %T1_14* %blend, %T0_14* %storeaddr) {
   %v0 = load %T0_14, %T0_14* %loadaddr
   %v1 = load %T0_14, %T0_14* %loadaddr2
   %c = icmp slt %T0_14 %v0, %v1
 ; CHECK: vmin.s32
 ; CHECK: vmin.s32
 ; COST: func_blend14
-; COST: cost of 41 {{.*}} select
+; COST: cost of 2 {{.*}} select
   %r = select %T1_14 %c, %T0_14 %v0, %T0_14 %v1
   store %T0_14 %r, %T0_14* %storeaddr
   ret void
 }
 %T0_15 = type <16 x i32>
 %T1_15 = type <16 x i1>
 ; CHECK-LABEL: func_blend15:
 define void @func_blend15(%T0_15* %loadaddr, %T0_15* %loadaddr2,
                           %T1_15* %blend, %T0_15* %storeaddr) {
 ; CHECK: vmin.s32
 ; CHECK: vmin.s32
   %v0 = load %T0_15, %T0_15* %loadaddr
   %v1 = load %T0_15, %T0_15* %loadaddr2
   %c = icmp slt %T0_15 %v0, %v1
 ; COST: func_blend15
-; COST: cost of 82 {{.*}} select
+; COST: cost of 4 {{.*}} select
   %r = select %T1_15 %c, %T0_15 %v0, %T0_15 %v1
   store %T0_15 %r, %T0_15* %storeaddr
   ret void
 }
+
+; We adjusted the cost model of the following selects. When we improve code
+; lowering we also need to adjust the cost.
 %T0_18 = type <4 x i64>
 %T1_18 = type <4 x i1>
 ; CHECK-LABEL: func_blend18:
 define void @func_blend18(%T0_18* %loadaddr, %T0_18* %loadaddr2,
                           %T1_18* %blend, %T0_18* %storeaddr) {
 ; CHECK: vbsl
 ; CHECK: vbsl
   %v0 = load %T0_18, %T0_18* %loadaddr
[...]
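The func_blend10, func_blend14 and func_blend15 tests share one shape: an `icmp slt` whose result selects between the same two operands. That is a lane-wise signed minimum, which is why their CHECK lines expect `vmin`. A source-level sketch of the pattern follows; the function names are hypothetical, and the suggestion that a loop like this is what produces such selects after vectorization is an assumption, not something the review states.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar form of `icmp slt` followed by `select`: pick A when A < B, else B.
constexpr std::int32_t selectMin(std::int32_t A, std::int32_t B) {
  return A < B ? A : B;
}

// Element-wise blend over two arrays; each iteration is one compare + select.
void blendMin(const std::int32_t *A, const std::int32_t *B, std::int32_t *Out,
              std::size_t N) {
  for (std::size_t I = 0; I < N; ++I)
    Out[I] = selectMin(A[I], B[I]);
}
```

With the old costs of 41 and 82 for the 8-lane and 16-lane i32 selects, a cost-driven transform would see this select as very expensive even though the tests show it lowering to `vmin`.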
