This is an archive of the discontinued LLVM Phabricator instance.

Differential D145614

[AARCH64] Enable STORE of v4i8 to help more vectorization opportunities
ClosedPublic

Authored by Carrot on Mar 8 2023, 2:02 PM.

Details

Reviewers

dmgreen

Commits

rGa72162cc529c: [AARCH64] Enable STORE of v4i8 to help more vectorization opportunities

Summary

For the attached test case, LLVM currently generates instructions that load, OR, and store the bytes one at a time. Although NEON doesn't support v4i8 natively, we can combine v8i8 ALU instructions with 32-bit load/store instructions to handle it, just as the X86 backend does. This patch therefore does the same thing as X86: it marks STORE of v4i8 as Custom, after which the test case is vectorized.

Diff Detail

Event Timeline

Carrot created this revision. Mar 8 2023, 2:02 PM

Herald added a project: Restricted Project. Mar 8 2023, 2:02 PM

Herald added subscribers: pengfei, hiraditya, kristof.beyls.

Carrot requested review of this revision. Mar 8 2023, 2:02 PM

Herald added subscribers: llvm-commits, pcwang-thead.

Harbormaster completed remote builds in B218207: Diff 503505. Mar 8 2023, 2:47 PM

Hello. This looks like a nice idea. We have done some work to make v4i8 better in the recent past. I hadn't realized that the slp vectorizer wasn't making use of that for stores.

From what I can tell, this is trying to use setOperationAction(ISD::STORE, MVT::v4i8, Custom) as a way to get TargetTransformInfo::getStoreMinimumVF to allow starting vectorization at v4i8 stores. We already make the truncstore custom in setTruncStoreAction(MVT::v4i16, MVT::v4i8, Custom) further down the file. Can we use that in getStoreMinimumVF instead? Either have it use isTruncStoreLegalOrCustom as opposed to isTruncStoreLegal, or override it in an AArch64 version of the method?

I tried isTruncStoreLegalOrCustom and it works; it generates the same code as my patch. The final code is:

ldr s0, [x1]
ldr s1, [x0]
ushll   v0.8h, v0.8b, #0       // *
ushll   v1.8h, v1.8b, #0       // *
orr v0.8b, v1.8b, v0.8b
xtn v0.8b, v0.8h               // *
str s0, [x0]
ret

But I still prefer my patch. The final code contains some redundant instructions (marked with *). They are generated by SelectionDAG, which promotes v4i8 to v4i16, processes the data, and truncates it back to v4i8 before the store. If we widened v4i8 to v8i8 instead, all the extra instructions could be deleted. I will work on that in a separate patch.

ping

Hi - sorry I must have misunderstood your last comment. I was under the impression that you were going to go and look into the other optimizations.

I don't think it is a good idea to change SDAG code generation (setOperationAction) in order to change the SLP cost model (getStoreMinimumVF). The setOperationAction isn't really valid on its own, considering we don't have a custom way of storing v4i8. It would be better to modify the costmodel directly through getStoreMinimumVF, either through the generic version or an AArch64 override.

Allen added a subscriber: Allen. Mar 22 2023, 4:34 AM

The X86 backend also sets STORE of v4i8 to Custom. We have a similar capability for v4i8.

I can also add a real custom lowering for the store of v4i8, as follows. It's simpler and more natural than storing v4i16.

t2: i32 = bitcast t1:v4i8
t3: ch = store<store (s32) into %somewhere> t10, t2, address

But v4i8 is not a legal type, so we never see a store v4i8 DAG node, so it does not seem necessary.
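
The two DAG nodes above reinterpret the four i8 lanes as a single i32 and emit one 32-bit store. As a rough analogue in plain C++ (not LLVM code; the function name `storeV4I8AsI32` is made up for illustration), the idea can be sketched as:

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Illustrative only: models "bitcast v4i8 -> i32" followed by one 32-bit
// store. The name and signature are assumptions, not part of the patch.
void storeV4I8AsI32(const std::array<std::uint8_t, 4> &lanes, void *dst) {
  std::uint32_t word;
  // "bitcast": reinterpret the four bytes as one 32-bit value.
  std::memcpy(&word, lanes.data(), sizeof(word));
  // Single 32-bit store to the destination address.
  std::memcpy(dst, &word, sizeof(word));
}
```

Because both steps copy the same four bytes in memory order, the bytes written to `dst` match the lane order regardless of host endianness.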

I believe that X86 treats vector lanes differently from Arm/AArch64. On X86, a smaller type is widened by adding more elements (v4i8 [a,b,c,d] becomes v8i8 [a,b,c,d,u,u,u,u]), whereas on AArch64 it is promoted to larger elements (v4i8 is promoted to v4i16, with the top half of each lane unused). Both approaches have their advantages and disadvantages. With SVE having t/b instructions, promotion can make more sense, and it is good to keep SVE and NEON in line. The same is true under MVE, which only has 128-bit vectors, so more types are promoted, but this plays nicely into how the t/b instructions operate.

I think it is worth separating the cost model controls for the SLP vectorizer and the codegen issues of the produced code.
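
The difference between widening and promotion described above can be illustrated with a small C++ model. This is only a sketch of the two layouts: the `Lane` struct and both function names are hypothetical, and promotion is modeled with zero-extension although the upper bits are simply unused in the description above.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// One i8 lane that may be undefined ('u' in the notation above).
struct Lane {
  std::uint8_t value;
  bool defined;
};

// Widening: same element type, more lanes (v4i8 -> v8i8).
std::array<Lane, 8> widenV4I8(const std::array<std::uint8_t, 4> &v) {
  std::array<Lane, 8> out{}; // every lane starts as undefined
  for (std::size_t i = 0; i < 4; ++i)
    out[i] = Lane{v[i], true};
  return out;
}

// Promotion: same lane count, wider elements (v4i8 -> v4i16).
std::array<std::uint16_t, 4> promoteV4I8(const std::array<std::uint8_t, 4> &v) {
  std::array<std::uint16_t, 4> out{};
  for (std::size_t i = 0; i < 4; ++i)
    out[i] = v[i]; // upper 8 bits of each lane carry no data
  return out;
}
```

In the widened form the four original bytes stay contiguous, which is why a plain 32-bit store can write them out; in the promoted form each byte sits in its own 16-bit lane and has to be narrowed before the store.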

Override getStoreMinimumVF to support vectorization of store v4i8 for AArch64.
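
As a rough illustration of what such an override changes, the following self-contained C++ sketch models a minimum-VF search. It is not LLVM's implementation: `StoreSupportedFn`, `genericStoreMinimumVF`, and `aarch64StoreMinimumVF` are hypothetical names, and the real hook consults the target's legality tables rather than a callback.

```cpp
#include <functional>

// Hypothetical stand-in for a target legality query: can a vector with
// `lanes` elements of `elemBits` bits each be stored directly?
using StoreSupportedFn = std::function<bool(unsigned elemBits, unsigned lanes)>;

static bool isPowerOf2(unsigned x) { return x != 0 && (x & (x - 1)) == 0; }

// Simplified generic search: keep halving VF while the half-sized store is
// still supported, never going below 2.
unsigned genericStoreMinimumVF(unsigned vf, unsigned elemBits,
                               const StoreSupportedFn &supported) {
  while (vf > 2 && supported(elemBits, vf / 2))
    vf /= 2;
  return vf;
}

// Sketch of a target override: i8 stores may start at VF 4 even though the
// legality query does not report a 4 x i8 store as supported.
unsigned aarch64StoreMinimumVF(unsigned vf, unsigned elemBits,
                               const StoreSupportedFn &supported) {
  if (elemBits == 8 && vf >= 4 && isPowerOf2(vf))
    return 4;
  return genericStoreMinimumVF(vf, elemBits, supported);
}
```

With a query that accepts only 64-bit and 128-bit vectors, the generic search stops at VF 8 for i8 elements, while the override lets the search start at VF 4. This keeps the change in the cost-model layer and leaves SelectionDAG lowering untouched.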

What are t/b instructions?

Harbormaster completed remote builds in B222101: Diff 508778. Mar 27 2023, 2:52 PM

Thanks. LGTM

What are t/b instructions?

They are top/bottom instructions. They can do things like truncate and insert into the bottom half of each lane, or extend the top half into a full lane.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
Line 402 (on Diff #508778): Slight formatting on the `VF >=4` -> `VF >= 4`, but maybe just use `VF == 4`, as I believe all other power-of-2 sizes will already be handled?

This revision is now accepted and ready to land. Mar 28 2023, 2:04 AM

Matt added a subscriber: Matt. Mar 28 2023, 12:30 PM

Thanks for the review! Will commit this version.

Harbormaster completed remote builds in B222596: Diff 509477. Mar 29 2023, 3:29 PM

Closed by commit rGa72162cc529c: [AARCH64] Enable STORE of v4i8 to help more vectorization opportunities (authored by Carrot). Mar 31 2023, 10:03 AM

This revision was automatically updated to reflect the committed changes.

Carrot added a commit: rGa72162cc529c: [AARCH64] Enable STORE of v4i8 to help more vectorization opportunities.

Why does promoting v4i8 -> v4i16 make more sense than widening v4i8 -> v8i8 when SVE has t/b instructions?

In my understanding, widening v4i8 -> v8i8 is usually a nop, but promoting v4i8 -> v4i16 needs a real instruction.

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp	3 lines
llvm/test/Transforms/SLPVectorizer/AArch64/32-bit.ll	56 lines
llvm/test/Transforms/SLPVectorizer/AArch64/vectorizable-selects-uniform-cmps.ll	128 lines

Diff 503505

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

[369 lines not shown; context: if (Subtarget->hasNEON()) {]
addQRTypeForNEON(MVT::v2f64);		addQRTypeForNEON(MVT::v2f64);
addQRTypeForNEON(MVT::v16i8);		addQRTypeForNEON(MVT::v16i8);
addQRTypeForNEON(MVT::v8i16);		addQRTypeForNEON(MVT::v8i16);
addQRTypeForNEON(MVT::v4i32);		addQRTypeForNEON(MVT::v4i32);
addQRTypeForNEON(MVT::v2i64);		addQRTypeForNEON(MVT::v2i64);
addQRTypeForNEON(MVT::v8f16);		addQRTypeForNEON(MVT::v8f16);
if (Subtarget->hasBF16())		if (Subtarget->hasBF16())
addQRTypeForNEON(MVT::v8bf16);		addQRTypeForNEON(MVT::v8bf16);

		// Add 32-bit vector store to help vectorization opportunities.
		setOperationAction(ISD::STORE, MVT::v4i8, Custom);
}		}

if (Subtarget->hasSVEorSME()) {		if (Subtarget->hasSVEorSME()) {
// Add legal sve predicate types		// Add legal sve predicate types
addRegisterClass(MVT::nxv1i1, &AArch64::PPRRegClass);		addRegisterClass(MVT::nxv1i1, &AArch64::PPRRegClass);
addRegisterClass(MVT::nxv2i1, &AArch64::PPRRegClass);		addRegisterClass(MVT::nxv2i1, &AArch64::PPRRegClass);
addRegisterClass(MVT::nxv4i1, &AArch64::PPRRegClass);		addRegisterClass(MVT::nxv4i1, &AArch64::PPRRegClass);
addRegisterClass(MVT::nxv8i1, &AArch64::PPRRegClass);		addRegisterClass(MVT::nxv8i1, &AArch64::PPRRegClass);
[24,121 lines not shown]

llvm/test/Transforms/SLPVectorizer/AArch64/32-bit.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 2
				; RUN: opt -passes=slp-vectorizer -S < %s | FileCheck %s

				target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
				target triple = "aarch64-unknown-linux-gnu"

				%S = type { i8, i8, i8, i8 }

				define ptr @foo(ptr %this, ptr %rhs) {
				; CHECK-LABEL: define ptr @foo
				; CHECK-SAME: (ptr [[THIS:%.*]], ptr [[RHS:%.*]]) {
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = load <4 x i8>, ptr [[RHS]], align 1, !tbaa [[TBAA0:![0-9]+]]
				; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i8>, ptr [[THIS]], align 1, !tbaa [[TBAA0]]
				; CHECK-NEXT: [[TMP2:%.*]] = or <4 x i8> [[TMP0]], [[TMP1]]
				; CHECK-NEXT: store <4 x i8> [[TMP2]], ptr [[THIS]], align 1, !tbaa [[TBAA0]]
				; CHECK-NEXT: ret ptr [[THIS]]
				;
				entry:
				%right1 = load i8, ptr %rhs, align 1, !tbaa !6, !range !11, !noundef !12
				%left1 = load i8, ptr %this, align 1, !tbaa !6, !range !11, !noundef !12
				%res1 = or i8 %right1, %left1
				store i8 %res1, ptr %this, align 1, !tbaa !6
				%b = getelementptr inbounds %S, ptr %rhs, i64 0, i32 1
				%right2 = load i8, ptr %b, align 1, !tbaa !13, !range !11, !noundef !12
				%b8 = getelementptr inbounds %S, ptr %this, i64 0, i32 1
				%left2 = load i8, ptr %b8, align 1, !tbaa !13, !range !11, !noundef !12
				%res2 = or i8 %right2, %left2
				store i8 %res2, ptr %b8, align 1, !tbaa !13
				%c = getelementptr inbounds %S, ptr %rhs, i64 0, i32 2
				%right3 = load i8, ptr %c, align 1, !tbaa !14, !range !11, !noundef !12
				%c16 = getelementptr inbounds %S, ptr %this, i64 0, i32 2
				%left3 = load i8, ptr %c16, align 1, !tbaa !14, !range !11, !noundef !12
				%res3 = or i8 %right3, %left3
				store i8 %res3, ptr %c16, align 1, !tbaa !14
				%d = getelementptr inbounds %S, ptr %rhs, i64 0, i32 3
				%right4 = load i8, ptr %d, align 1, !tbaa !15, !range !11, !noundef !12
				%d24 = getelementptr inbounds %S, ptr %this, i64 0, i32 3
				%left4 = load i8, ptr %d24, align 1, !tbaa !15, !range !11, !noundef !12
				%res4 = or i8 %right4, %left4
				store i8 %res4, ptr %d24, align 1, !tbaa !15
				ret ptr %this
				}


				!6 = !{!7, !8, i64 0}
				!7 = !{!"S", !8, i64 0, !8, i64 1, !8, i64 2, !8, i64 3}
				!8 = !{!"bool", !9, i64 0}
				!9 = !{!"omnipotent char", !10, i64 0}
				!10 = !{!"Simple C++ TBAA"}
				!11 = !{i8 0, i8 2}
				!12 = !{}
				!13 = !{!7, !8, i64 1}
				!14 = !{!7, !8, i64 2}
				!15 = !{!7, !8, i64 3}
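
The TBAA metadata in this test names a struct "S" with four "bool" fields, and each load carries a 0..2 range. The IR therefore plausibly comes from C++ along the following lines. This is a guess for illustration only: the member names and the use of `operator|=` are assumptions, and the test itself does not show its source.

```cpp
// Hypothetical reconstruction: only the struct name and the four bool-like
// fields are suggested by the metadata in the test.
struct S {
  bool a, b, c, d;

  // OR each flag of rhs into this object and return *this.
  S &operator|=(const S &rhs) {
    a |= rhs.a;
    b |= rhs.b;
    c |= rhs.c;
    d |= rhs.d;
    return *this;
  }
};
```

Each of the four statements becomes one load/or/store triple on adjacent bytes, which is the pattern the SLP vectorizer folds into a single <4 x i8> load, or, and store in the CHECK lines.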

llvm/test/Transforms/SLPVectorizer/AArch64/vectorizable-selects-uniform-cmps.ll

[100 lines not shown; context: entry:]
%s.7 = select i1 %cmp.7, i16 %l.7, i16 %x		%s.7 = select i1 %cmp.7, i16 %l.7, i16 %x
store i16 %s.7, ptr %gep.7, align 2		store i16 %s.7, ptr %gep.7, align 2
ret void		ret void
}		}

define void @select_uniform_ugt_7xi8(ptr %ptr, i8 %x) {		define void @select_uniform_ugt_7xi8(ptr %ptr, i8 %x) {
; CHECK-LABEL: @select_uniform_ugt_7xi8(		; CHECK-LABEL: @select_uniform_ugt_7xi8(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[L_0:%.*]] = load i8, ptr [[PTR:%.*]], align 1		; CHECK-NEXT: [[TMP0:%.*]] = load <4 x i8>, ptr [[PTR:%.*]], align 1
; CHECK-NEXT: [[CMP_0:%.*]] = icmp ugt i8 [[L_0]], -1		; CHECK-NEXT: [[TMP1:%.*]] = icmp ugt <4 x i8> [[TMP0]], <i8 -1, i8 -1, i8 -1, i8 -1>
; CHECK-NEXT: [[S_0:%.*]] = select i1 [[CMP_0]], i8 [[L_0]], i8 [[X:%.*]]		; CHECK-NEXT: [[TMP2:%.*]] = insertelement <4 x i8> poison, i8 [[X:%.*]], i32 0
; CHECK-NEXT: store i8 [[S_0]], ptr [[PTR]], align 2		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i8> [[TMP2]], <4 x i8> poison, <4 x i32> zeroinitializer
; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i8, ptr [[PTR]], i8 1		; CHECK-NEXT: [[TMP4:%.*]] = select <4 x i1> [[TMP1]], <4 x i8> [[TMP0]], <4 x i8> [[TMP3]]
; CHECK-NEXT: [[L_1:%.*]] = load i8, ptr [[GEP_1]], align 1		; CHECK-NEXT: store <4 x i8> [[TMP4]], ptr [[PTR]], align 2
; CHECK-NEXT: [[CMP_1:%.*]] = icmp ugt i8 [[L_1]], -1
; CHECK-NEXT: [[S_1:%.*]] = select i1 [[CMP_1]], i8 [[L_1]], i8 [[X]]
; CHECK-NEXT: store i8 [[S_1]], ptr [[GEP_1]], align 2
; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i8, ptr [[PTR]], i8 2
; CHECK-NEXT: [[L_2:%.*]] = load i8, ptr [[GEP_2]], align 1
; CHECK-NEXT: [[CMP_2:%.*]] = icmp ugt i8 [[L_2]], -1
; CHECK-NEXT: [[S_2:%.*]] = select i1 [[CMP_2]], i8 [[L_2]], i8 [[X]]
; CHECK-NEXT: store i8 [[S_2]], ptr [[GEP_2]], align 2
; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i8, ptr [[PTR]], i8 3
; CHECK-NEXT: [[L_3:%.*]] = load i8, ptr [[GEP_3]], align 1
; CHECK-NEXT: [[CMP_3:%.*]] = icmp ugt i8 [[L_3]], -1
; CHECK-NEXT: [[S_3:%.*]] = select i1 [[CMP_3]], i8 [[L_3]], i8 [[X]]
; CHECK-NEXT: store i8 [[S_3]], ptr [[GEP_3]], align 2
; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i8, ptr [[PTR]], i8 4		; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i8, ptr [[PTR]], i8 4
; CHECK-NEXT: [[L_4:%.*]] = load i8, ptr [[GEP_4]], align 1		; CHECK-NEXT: [[L_4:%.*]] = load i8, ptr [[GEP_4]], align 1
; CHECK-NEXT: [[CMP_4:%.*]] = icmp ugt i8 [[L_4]], -1		; CHECK-NEXT: [[CMP_4:%.*]] = icmp ugt i8 [[L_4]], -1
; CHECK-NEXT: [[S_4:%.*]] = select i1 [[CMP_4]], i8 [[L_4]], i8 [[X]]		; CHECK-NEXT: [[S_4:%.*]] = select i1 [[CMP_4]], i8 [[L_4]], i8 [[X]]
; CHECK-NEXT: store i8 [[S_4]], ptr [[GEP_4]], align 2		; CHECK-NEXT: store i8 [[S_4]], ptr [[GEP_4]], align 2
; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i8, ptr [[PTR]], i8 5		; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i8, ptr [[PTR]], i8 5
; CHECK-NEXT: [[L_5:%.*]] = load i8, ptr [[GEP_5]], align 1		; CHECK-NEXT: [[L_5:%.*]] = load i8, ptr [[GEP_5]], align 1
; CHECK-NEXT: [[CMP_5:%.*]] = icmp ugt i8 [[L_5]], -1		; CHECK-NEXT: [[CMP_5:%.*]] = icmp ugt i8 [[L_5]], -1
[52 lines not shown]
}		}


; Positive tests.		; Positive tests.

define void @select_uniform_ugt_8xi8(ptr %ptr, i8 %x) {		define void @select_uniform_ugt_8xi8(ptr %ptr, i8 %x) {
; CHECK-LABEL: @select_uniform_ugt_8xi8(		; CHECK-LABEL: @select_uniform_ugt_8xi8(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[TMP1:%.*]] = load <8 x i8>, ptr [[PTR:%.*]], align 1		; CHECK-NEXT: [[TMP0:%.*]] = load <8 x i8>, ptr [[PTR:%.*]], align 1
; CHECK-NEXT: [[TMP2:%.*]] = icmp ugt <8 x i8> [[TMP1]], <i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>		; CHECK-NEXT: [[TMP1:%.*]] = icmp ugt <8 x i8> [[TMP0]], <i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>
; CHECK-NEXT: [[TMP3:%.*]] = insertelement <8 x i8> poison, i8 [[X:%.*]], i32 0		; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x i8> poison, i8 [[X:%.*]], i32 0
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x i8> [[TMP3]], <8 x i8> poison, <8 x i32> zeroinitializer		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <8 x i8> [[TMP2]], <8 x i8> poison, <8 x i32> zeroinitializer
; CHECK-NEXT: [[TMP4:%.*]] = select <8 x i1> [[TMP2]], <8 x i8> [[TMP1]], <8 x i8> [[SHUFFLE]]		; CHECK-NEXT: [[TMP4:%.*]] = select <8 x i1> [[TMP1]], <8 x i8> [[TMP0]], <8 x i8> [[TMP3]]
; CHECK-NEXT: store <8 x i8> [[TMP4]], ptr [[PTR]], align 2		; CHECK-NEXT: store <8 x i8> [[TMP4]], ptr [[PTR]], align 2
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
entry:		entry:
%l.0 = load i8, ptr %ptr		%l.0 = load i8, ptr %ptr
%cmp.0 = icmp ugt i8 %l.0, 16383		%cmp.0 = icmp ugt i8 %l.0, 16383
%s.0 = select i1 %cmp.0, i8 %l.0, i8 %x		%s.0 = select i1 %cmp.0, i8 %l.0, i8 %x
store i8 %s.0, ptr %ptr, align 2		store i8 %s.0, ptr %ptr, align 2
[40 lines not shown; context: entry:]
%s.7 = select i1 %cmp.7, i8 %l.7, i8 %x		%s.7 = select i1 %cmp.7, i8 %l.7, i8 %x
store i8 %s.7, ptr %gep.7, align 2		store i8 %s.7, ptr %gep.7, align 2
ret void		ret void
}		}

define void @select_uniform_ugt_16xi8(ptr %ptr, i8 %x) {		define void @select_uniform_ugt_16xi8(ptr %ptr, i8 %x) {
; CHECK-LABEL: @select_uniform_ugt_16xi8(		; CHECK-LABEL: @select_uniform_ugt_16xi8(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[TMP1:%.*]] = load <8 x i8>, ptr [[PTR:%.*]], align 1		; CHECK-NEXT: [[TMP0:%.*]] = load <8 x i8>, ptr [[PTR:%.*]], align 1
; CHECK-NEXT: [[TMP2:%.*]] = icmp ugt <8 x i8> [[TMP1]], <i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>		; CHECK-NEXT: [[TMP1:%.*]] = icmp ugt <8 x i8> [[TMP0]], <i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>
; CHECK-NEXT: [[TMP3:%.*]] = insertelement <8 x i8> poison, i8 [[X:%.*]], i32 0		; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x i8> poison, i8 [[X:%.*]], i32 0
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x i8> [[TMP3]], <8 x i8> poison, <8 x i32> zeroinitializer		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <8 x i8> [[TMP2]], <8 x i8> poison, <8 x i32> zeroinitializer
; CHECK-NEXT: [[TMP4:%.*]] = select <8 x i1> [[TMP2]], <8 x i8> [[TMP1]], <8 x i8> [[SHUFFLE]]		; CHECK-NEXT: [[TMP4:%.*]] = select <8 x i1> [[TMP1]], <8 x i8> [[TMP0]], <8 x i8> [[TMP3]]
; CHECK-NEXT: store <8 x i8> [[TMP4]], ptr [[PTR]], align 2		; CHECK-NEXT: store <8 x i8> [[TMP4]], ptr [[PTR]], align 2
; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i8, ptr [[PTR]], i8 8		; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i8, ptr [[PTR]], i8 8
; CHECK-NEXT: [[L_8:%.*]] = load i8, ptr [[GEP_8]], align 1		; CHECK-NEXT: [[L_8:%.*]] = load i8, ptr [[GEP_8]], align 1
; CHECK-NEXT: [[CMP_8:%.*]] = icmp ugt i8 [[L_8]], -1		; CHECK-NEXT: [[CMP_8:%.*]] = icmp ugt i8 [[L_8]], -1
; CHECK-NEXT: [[TMP6:%.*]] = extractelement <8 x i8> [[TMP1]], i32 0		; CHECK-NEXT: [[TMP5:%.*]] = extractelement <8 x i8> [[TMP0]], i32 0
; CHECK-NEXT: [[S_8:%.*]] = select i1 [[CMP_8]], i8 [[TMP6]], i8 [[X]]		; CHECK-NEXT: [[S_8:%.*]] = select i1 [[CMP_8]], i8 [[TMP5]], i8 [[X]]
; CHECK-NEXT: [[TMP7:%.*]] = extractelement <8 x i8> [[TMP4]], i32 0		; CHECK-NEXT: [[TMP6:%.*]] = extractelement <8 x i8> [[TMP4]], i32 0
; CHECK-NEXT: store i8 [[TMP7]], ptr [[GEP_8]], align 2		; CHECK-NEXT: store i8 [[TMP6]], ptr [[GEP_8]], align 2
; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i8, ptr [[PTR]], i8 9		; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i8, ptr [[PTR]], i8 9
; CHECK-NEXT: [[L_9:%.*]] = load i8, ptr [[GEP_9]], align 1		; CHECK-NEXT: [[TMP7:%.*]] = load <4 x i8>, ptr [[GEP_9]], align 1
; CHECK-NEXT: [[CMP_9:%.*]] = icmp ugt i8 [[L_9]], -1		; CHECK-NEXT: [[TMP8:%.*]] = icmp ugt <4 x i8> [[TMP7]], <i8 -1, i8 -1, i8 -1, i8 -1>
; CHECK-NEXT: [[S_9:%.*]] = select i1 [[CMP_9]], i8 [[L_9]], i8 [[X]]		; CHECK-NEXT: [[TMP9:%.*]] = insertelement <4 x i8> poison, i8 [[X]], i32 0
; CHECK-NEXT: store i8 [[S_9]], ptr [[GEP_9]], align 2		; CHECK-NEXT: [[TMP10:%.*]] = shufflevector <4 x i8> [[TMP9]], <4 x i8> poison, <4 x i32> zeroinitializer
; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i8, ptr [[PTR]], i8 10		; CHECK-NEXT: [[TMP11:%.*]] = select <4 x i1> [[TMP8]], <4 x i8> [[TMP7]], <4 x i8> [[TMP10]]
; CHECK-NEXT: [[L_10:%.*]] = load i8, ptr [[GEP_10]], align 1		; CHECK-NEXT: store <4 x i8> [[TMP11]], ptr [[GEP_9]], align 2
; CHECK-NEXT: [[CMP_10:%.*]] = icmp ugt i8 [[L_10]], -1
; CHECK-NEXT: [[S_10:%.*]] = select i1 [[CMP_10]], i8 [[L_10]], i8 [[X]]
; CHECK-NEXT: store i8 [[S_10]], ptr [[GEP_10]], align 2
; CHECK-NEXT: [[GEP_11:%.*]] = getelementptr inbounds i8, ptr [[PTR]], i8 11
; CHECK-NEXT: [[L_11:%.*]] = load i8, ptr [[GEP_11]], align 1
; CHECK-NEXT: [[CMP_11:%.*]] = icmp ugt i8 [[L_11]], -1
; CHECK-NEXT: [[S_11:%.*]] = select i1 [[CMP_11]], i8 [[L_11]], i8 [[X]]
; CHECK-NEXT: store i8 [[S_11]], ptr [[GEP_11]], align 2
; CHECK-NEXT: [[GEP_12:%.*]] = getelementptr inbounds i8, ptr [[PTR]], i8 12
; CHECK-NEXT: [[L_12:%.*]] = load i8, ptr [[GEP_12]], align 1
; CHECK-NEXT: [[CMP_12:%.*]] = icmp ugt i8 [[L_12]], -1
; CHECK-NEXT: [[S_12:%.*]] = select i1 [[CMP_12]], i8 [[L_12]], i8 [[X]]
; CHECK-NEXT: store i8 [[S_12]], ptr [[GEP_12]], align 2
; CHECK-NEXT: [[GEP_13:%.*]] = getelementptr inbounds i8, ptr [[PTR]], i8 13		; CHECK-NEXT: [[GEP_13:%.*]] = getelementptr inbounds i8, ptr [[PTR]], i8 13
; CHECK-NEXT: [[L_13:%.*]] = load i8, ptr [[GEP_13]], align 1		; CHECK-NEXT: [[L_13:%.*]] = load i8, ptr [[GEP_13]], align 1
; CHECK-NEXT: [[CMP_13:%.*]] = icmp ugt i8 [[L_13]], -1		; CHECK-NEXT: [[CMP_13:%.*]] = icmp ugt i8 [[L_13]], -1
; CHECK-NEXT: [[S_13:%.*]] = select i1 [[CMP_13]], i8 [[L_13]], i8 [[X]]		; CHECK-NEXT: [[S_13:%.*]] = select i1 [[CMP_13]], i8 [[L_13]], i8 [[X]]
; CHECK-NEXT: store i8 [[S_13]], ptr [[GEP_13]], align 2		; CHECK-NEXT: store i8 [[S_13]], ptr [[GEP_13]], align 2
; CHECK-NEXT: [[GEP_14:%.*]] = getelementptr inbounds i8, ptr [[PTR]], i8 14		; CHECK-NEXT: [[GEP_14:%.*]] = getelementptr inbounds i8, ptr [[PTR]], i8 14
; CHECK-NEXT: [[L_14:%.*]] = load i8, ptr [[GEP_14]], align 1		; CHECK-NEXT: [[L_14:%.*]] = load i8, ptr [[GEP_14]], align 1
; CHECK-NEXT: [[CMP_14:%.*]] = icmp ugt i8 [[L_14]], -1		; CHECK-NEXT: [[CMP_14:%.*]] = icmp ugt i8 [[L_14]], -1
[104 lines not shown; context: entry:]

ret void		ret void
}		}


define void @select_uniform_ugt_4xi16(ptr %ptr, i16 %x) {		define void @select_uniform_ugt_4xi16(ptr %ptr, i16 %x) {
; CHECK-LABEL: @select_uniform_ugt_4xi16(		; CHECK-LABEL: @select_uniform_ugt_4xi16(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i16>, ptr [[PTR:%.*]], align 2		; CHECK-NEXT: [[TMP0:%.*]] = load <4 x i16>, ptr [[PTR:%.*]], align 2
; CHECK-NEXT: [[TMP2:%.*]] = icmp ugt <4 x i16> [[TMP1]], <i16 16383, i16 16383, i16 16383, i16 16383>		; CHECK-NEXT: [[TMP1:%.*]] = icmp ugt <4 x i16> [[TMP0]], <i16 16383, i16 16383, i16 16383, i16 16383>
; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x i16> poison, i16 [[X:%.*]], i32 0		; CHECK-NEXT: [[TMP2:%.*]] = insertelement <4 x i16> poison, i16 [[X:%.*]], i32 0
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i16> [[TMP3]], <4 x i16> poison, <4 x i32> zeroinitializer		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i16> [[TMP2]], <4 x i16> poison, <4 x i32> zeroinitializer
; CHECK-NEXT: [[TMP4:%.*]] = select <4 x i1> [[TMP2]], <4 x i16> [[TMP1]], <4 x i16> [[SHUFFLE]]		; CHECK-NEXT: [[TMP4:%.*]] = select <4 x i1> [[TMP1]], <4 x i16> [[TMP0]], <4 x i16> [[TMP3]]
; CHECK-NEXT: store <4 x i16> [[TMP4]], ptr [[PTR]], align 2		; CHECK-NEXT: store <4 x i16> [[TMP4]], ptr [[PTR]], align 2
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
entry:		entry:
%l.0 = load i16, ptr %ptr		%l.0 = load i16, ptr %ptr
%cmp.0 = icmp ugt i16 %l.0, 16383		%cmp.0 = icmp ugt i16 %l.0, 16383
%s.0 = select i1 %cmp.0, i16 %l.0, i16 %x		%s.0 = select i1 %cmp.0, i16 %l.0, i16 %x
store i16 %s.0, ptr %ptr, align 2		store i16 %s.0, ptr %ptr, align 2
[17 lines not shown; context: entry:]
store i16 %s.3, ptr %gep.3, align 2		store i16 %s.3, ptr %gep.3, align 2

ret void		ret void
}		}

define void @select_uniform_ult_8xi16(ptr %ptr, i16 %x) {		define void @select_uniform_ult_8xi16(ptr %ptr, i16 %x) {
; CHECK-LABEL: @select_uniform_ult_8xi16(		; CHECK-LABEL: @select_uniform_ult_8xi16(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[TMP1:%.*]] = load <8 x i16>, ptr [[PTR:%.*]], align 2		; CHECK-NEXT: [[TMP0:%.*]] = load <8 x i16>, ptr [[PTR:%.*]], align 2
; CHECK-NEXT: [[TMP2:%.*]] = icmp ult <8 x i16> [[TMP1]], <i16 16383, i16 16383, i16 16383, i16 16383, i16 16383, i16 16383, i16 16383, i16 16383>		; CHECK-NEXT: [[TMP1:%.*]] = icmp ult <8 x i16> [[TMP0]], <i16 16383, i16 16383, i16 16383, i16 16383, i16 16383, i16 16383, i16 16383, i16 16383>
; CHECK-NEXT: [[TMP3:%.*]] = insertelement <8 x i16> poison, i16 [[X:%.*]], i32 0		; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x i16> poison, i16 [[X:%.*]], i32 0
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x i16> [[TMP3]], <8 x i16> poison, <8 x i32> zeroinitializer		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <8 x i16> [[TMP2]], <8 x i16> poison, <8 x i32> zeroinitializer
; CHECK-NEXT: [[TMP4:%.*]] = select <8 x i1> [[TMP2]], <8 x i16> [[TMP1]], <8 x i16> [[SHUFFLE]]		; CHECK-NEXT: [[TMP4:%.*]] = select <8 x i1> [[TMP1]], <8 x i16> [[TMP0]], <8 x i16> [[TMP3]]
; CHECK-NEXT: store <8 x i16> [[TMP4]], ptr [[PTR]], align 2		; CHECK-NEXT: store <8 x i16> [[TMP4]], ptr [[PTR]], align 2
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
entry:		entry:
%l.0 = load i16, ptr %ptr		%l.0 = load i16, ptr %ptr
%cmp.0 = icmp ult i16 %l.0, 16383		%cmp.0 = icmp ult i16 %l.0, 16383
%s.0 = select i1 %cmp.0, i16 %l.0, i16 %x		%s.0 = select i1 %cmp.0, i16 %l.0, i16 %x
store i16 %s.0, ptr %ptr, align 2		store i16 %s.0, ptr %ptr, align 2
[40 lines not shown; context: entry:]
%s.7 = select i1 %cmp.7, i16 %l.7, i16 %x		%s.7 = select i1 %cmp.7, i16 %l.7, i16 %x
store i16 %s.7, ptr %gep.7, align 2		store i16 %s.7, ptr %gep.7, align 2
ret void		ret void
}		}

define void @select_uniform_eq_2xi32(ptr %ptr, i32 %x) {		define void @select_uniform_eq_2xi32(ptr %ptr, i32 %x) {
; CHECK-LABEL: @select_uniform_eq_2xi32(		; CHECK-LABEL: @select_uniform_eq_2xi32(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i32>, ptr [[PTR:%.*]], align 4		; CHECK-NEXT: [[TMP0:%.*]] = load <2 x i32>, ptr [[PTR:%.*]], align 4
; CHECK-NEXT: [[TMP2:%.*]] = icmp eq <2 x i32> [[TMP1]], <i32 16383, i32 16383>		; CHECK-NEXT: [[TMP1:%.*]] = icmp eq <2 x i32> [[TMP0]], <i32 16383, i32 16383>
; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x i32> poison, i32 [[X:%.*]], i32 0		; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x i32> poison, i32 [[X:%.*]], i32 0
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP3]], <2 x i32> poison, <2 x i32> zeroinitializer		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x i32> [[TMP2]], <2 x i32> poison, <2 x i32> zeroinitializer
; CHECK-NEXT: [[TMP4:%.*]] = select <2 x i1> [[TMP2]], <2 x i32> [[TMP1]], <2 x i32> [[SHUFFLE]]		; CHECK-NEXT: [[TMP4:%.*]] = select <2 x i1> [[TMP1]], <2 x i32> [[TMP0]], <2 x i32> [[TMP3]]
; CHECK-NEXT: store <2 x i32> [[TMP4]], ptr [[PTR]], align 2		; CHECK-NEXT: store <2 x i32> [[TMP4]], ptr [[PTR]], align 2
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
entry:		entry:
%l.0 = load i32, ptr %ptr		%l.0 = load i32, ptr %ptr
%cmp.0 = icmp eq i32 %l.0, 16383		%cmp.0 = icmp eq i32 %l.0, 16383
%s.0 = select i1 %cmp.0, i32 %l.0, i32 %x		%s.0 = select i1 %cmp.0, i32 %l.0, i32 %x
store i32 %s.0, ptr %ptr, align 2		store i32 %s.0, ptr %ptr, align 2

%gep.1 = getelementptr inbounds i32, ptr %ptr, i32 1		%gep.1 = getelementptr inbounds i32, ptr %ptr, i32 1
%l.1 = load i32, ptr %gep.1		%l.1 = load i32, ptr %gep.1
%cmp.1 = icmp eq i32 %l.1, 16383		%cmp.1 = icmp eq i32 %l.1, 16383
%s.1 = select i1 %cmp.1, i32 %l.1, i32 %x		%s.1 = select i1 %cmp.1, i32 %l.1, i32 %x
store i32 %s.1, ptr %gep.1, align 2		store i32 %s.1, ptr %gep.1, align 2

ret void		ret void
}		}

define void @select_uniform_eq_4xi32(ptr %ptr, i32 %x) {		define void @select_uniform_eq_4xi32(ptr %ptr, i32 %x) {
; CHECK-LABEL: @select_uniform_eq_4xi32(		; CHECK-LABEL: @select_uniform_eq_4xi32(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, ptr [[PTR:%.*]], align 4		; CHECK-NEXT: [[TMP0:%.*]] = load <4 x i32>, ptr [[PTR:%.*]], align 4
; CHECK-NEXT: [[TMP2:%.*]] = icmp eq <4 x i32> [[TMP1]], <i32 16383, i32 16383, i32 16383, i32 16383>		; CHECK-NEXT: [[TMP1:%.*]] = icmp eq <4 x i32> [[TMP0]], <i32 16383, i32 16383, i32 16383, i32 16383>
; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x i32> poison, i32 [[X:%.*]], i32 0		; CHECK-NEXT: [[TMP2:%.*]] = insertelement <4 x i32> poison, i32 [[X:%.*]], i32 0
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> poison, <4 x i32> zeroinitializer		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> poison, <4 x i32> zeroinitializer
; CHECK-NEXT: [[TMP4:%.*]] = select <4 x i1> [[TMP2]], <4 x i32> [[TMP1]], <4 x i32> [[SHUFFLE]]		; CHECK-NEXT: [[TMP4:%.*]] = select <4 x i1> [[TMP1]], <4 x i32> [[TMP0]], <4 x i32> [[TMP3]]
; CHECK-NEXT: store <4 x i32> [[TMP4]], ptr [[PTR]], align 2		; CHECK-NEXT: store <4 x i32> [[TMP4]], ptr [[PTR]], align 2
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
entry:		entry:
%l.0 = load i32, ptr %ptr		%l.0 = load i32, ptr %ptr
%cmp.0 = icmp eq i32 %l.0, 16383		%cmp.0 = icmp eq i32 %l.0, 16383
%s.0 = select i1 %cmp.0, i32 %l.0, i32 %x		%s.0 = select i1 %cmp.0, i32 %l.0, i32 %x
store i32 %s.0, ptr %ptr, align 2		store i32 %s.0, ptr %ptr, align 2
[16 lines not shown; context: entry:]
%s.3 = select i1 %cmp.3, i32 %l.3, i32 %x		%s.3 = select i1 %cmp.3, i32 %l.3, i32 %x
store i32 %s.3, ptr %gep.3, align 2		store i32 %s.3, ptr %gep.3, align 2
ret void		ret void
}		}

define void @select_uniform_ne_2xi64(ptr %ptr, i64 %x) {		define void @select_uniform_ne_2xi64(ptr %ptr, i64 %x) {
; CHECK-LABEL: @select_uniform_ne_2xi64(		; CHECK-LABEL: @select_uniform_ne_2xi64(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i64>, ptr [[PTR:%.*]], align 8		; CHECK-NEXT: [[TMP0:%.*]] = load <2 x i64>, ptr [[PTR:%.*]], align 8
; CHECK-NEXT: [[TMP2:%.*]] = icmp ne <2 x i64> [[TMP1]], <i64 16383, i64 16383>		; CHECK-NEXT: [[TMP1:%.*]] = icmp ne <2 x i64> [[TMP0]], <i64 16383, i64 16383>
; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x i64> poison, i64 [[X:%.*]], i32 0		; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x i64> poison, i64 [[X:%.*]], i32 0
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i64> [[TMP3]], <2 x i64> poison, <2 x i32> zeroinitializer		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x i64> [[TMP2]], <2 x i64> poison, <2 x i32> zeroinitializer
; CHECK-NEXT: [[TMP4:%.*]] = select <2 x i1> [[TMP2]], <2 x i64> [[TMP1]], <2 x i64> [[SHUFFLE]]		; CHECK-NEXT: [[TMP4:%.*]] = select <2 x i1> [[TMP1]], <2 x i64> [[TMP0]], <2 x i64> [[TMP3]]
; CHECK-NEXT: store <2 x i64> [[TMP4]], ptr [[PTR]], align 2		; CHECK-NEXT: store <2 x i64> [[TMP4]], ptr [[PTR]], align 2
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
entry:		entry:
%l.0 = load i64, ptr %ptr		%l.0 = load i64, ptr %ptr
%cmp.0 = icmp ne i64 %l.0, 16383		%cmp.0 = icmp ne i64 %l.0, 16383
%s.0 = select i1 %cmp.0, i64 %l.0, i64 %x		%s.0 = select i1 %cmp.0, i64 %l.0, i64 %x
store i64 %s.0, ptr %ptr, align 2		store i64 %s.0, ptr %ptr, align 2
[9 lines not shown]