This is an archive of the discontinued LLVM Phabricator instance.

InstCombine: Simplify vector load based on demanded elements
Abandoned, Public

Authored by ruiling on Nov 16 2022, 10:44 PM.

Details

Reviewers

arsenm
foad
spatel
piotr
RKSimon

Summary

This change tries to load fewer vector elements based on the demanded elements.
It might not improve code generation for targets that eventually scalarize
the load, but it helps a lot on targets like AMDGPU, which map the IR load
to a native vector load.

The motivating case is the following pattern, which we observe in
compute workloads:

%a = load <4 x i32>
%b = load <4 x i32>

use(%a.012)

Because the last element of %a is not used, the register allocator reuses the
physical register of that unused element for the second load, which causes an
unnecessary s_waitcnt to be inserted between the two loads:

$v0_v1_v2_v3 = load <4 x i32>
s_waitcnt
$v3_v4_v5_v6 = load <4 x i32>

This change helps avoid that case in the backend, and in general it
should also reduce memory traffic.
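
To make the narrowing concrete, here is a minimal standalone sketch of the element bookkeeping, modeled on the Load case in the diff further down. It deliberately avoids LLVM headers: the demanded elements are a plain 64-bit mask instead of an APInt, and `NarrowedLoad`, `activeBits` and `narrowLoad` are hypothetical names used only for illustration. Returning early when nothing is demanded is an assumption of the sketch, not something the patch states.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model type; the patch itself works on llvm::APInt and IR types.
struct NarrowedLoad {
  unsigned NewNumElts = 0; // number of elements the narrowed load reads
  std::vector<int> Mask;   // shuffle mask that rebuilds the original width
};

// Index of the highest set bit plus one; 0 when no bit is set.
inline unsigned activeBits(uint64_t Demanded) {
  unsigned N = 0;
  for (; Demanded != 0; Demanded >>= 1)
    ++N;
  return N;
}

// Returns true and fills Out when trailing elements can be dropped.
// Bit I of Demanded is set when element I of the loaded vector is used.
inline bool narrowLoad(uint64_t Demanded, unsigned OldNumElts,
                       NarrowedLoad &Out) {
  assert(OldNumElts > 0 && OldNumElts <= 64 && "model supports up to 64 lanes");
  unsigned ActiveWidth = activeBits(Demanded);
  // Assumption of this sketch: a load with no demanded element is left alone.
  if (ActiveWidth == 0 || ActiveWidth >= OldNumElts)
    return false;

  Out.NewNumElts = ActiveWidth;
  Out.Mask.clear();
  for (unsigned I = 0; I < OldNumElts; ++I) {
    bool IsDemanded = ((Demanded >> I) & 1) != 0;
    // Index NewNumElts selects lane 0 of the second (poison) shuffle operand.
    Out.Mask.push_back(IsDemanded ? static_cast<int>(I)
                                  : static_cast<int>(Out.NewNumElts));
  }
  return true;
}
```

For a 4-element load whose first three elements are demanded (mask 0x7), the sketch reports a 3-element load. When only the last element is demanded (mask 0x8), nothing is narrowed, which corresponds to the TODO about leading elements in the patch.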

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

Time	Test
520 ms	x64 debian > Clang.CodeGen/SystemZ::builtins-systemz-zvector-constrained.c
3,810 ms	x64 debian > Clang.CodeGen/SystemZ::builtins-systemz-zvector.c
350 ms	x64 debian > Clang.CodeGen/SystemZ::builtins-systemz-zvector2-constrained.c
470 ms	x64 debian > Clang.CodeGen/SystemZ::builtins-systemz-zvector2.c
730 ms	x64 debian > Clang.CodeGen/SystemZ::builtins-systemz-zvector3.c
15 tests failed in total.

Event Timeline

ruiling created this revision. Nov 16 2022, 10:44 PM

Herald added a project: Restricted Project. Nov 16 2022, 10:44 PM

Herald added subscribers: kosarev, kerbowa, hiraditya and 2 others.

ruiling requested review of this revision. Nov 16 2022, 10:44 PM

Herald added subscribers: llvm-commits, wdng.

Harbormaster completed remote builds in B198133: Diff 476015. Nov 16 2022, 11:17 PM

piotr added inline comments. Nov 17 2022, 12:39 AM

llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp
1716	Should the original load be replaced rather than kept around?

I've always thought we didn't have such a transformation in the generic instcombine, because it may hurt some targets, but I don't immediately see how. (We do have a similar one for intrinsic loads in the AMDGPU-specific file).

In D138186#3933180, @piotr wrote:

I've always thought we didn't have such a transformation in the generic instcombine, because it may hurt some targets, but I don't immediately see how. (We do have a similar one for intrinsic loads in the AMDGPU-specific file).

On X86 the vector load will be scalarized, so it is easy to optimize during instruction selection or later on MachineIR. I guess this may be part of the reason we were not doing this optimization before. I don't see why it would hurt other targets.

llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp
1716	The original load becomes dead and will be deleted afterwards in this pass.

spatel added a reviewer: RKSimon. Nov 17 2022, 6:06 AM

In D138186#3933180, @piotr wrote:

I've always thought we didn't have such a transformation in the generic instcombine, because it may hurt some targets, but I don't immediately see how. (We do have a similar one for intrinsic loads in the AMDGPU-specific file).

This is correct - this is not universally beneficial/recoverable.

Looking at the diffs here for example, we can see that a <4 x float> load (legal on x86) is replaced by a <3 x float> load (not legal on x86). There is no way to recover the wider load at any later point in optimization because there is nothing in the IR that says it is safe to load the extra bytes. A single 128-bit load reduced to 96-bit load could then have to be split into a 64-bit load and 32-bit load, and that's likely worse for perf.
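
The cost of an odd-sized load can be illustrated with a small arithmetic sketch. It assumes, purely for illustration, a target that only has power-of-two load widths up to some maximum; real legalization is target- and type-specific and is not modeled here. `splitIntoPow2Loads` is a hypothetical helper, not an LLVM function.

```cpp
#include <cassert>
#include <vector>

// Greedily covers WidthBits with power-of-two chunks no wider than
// MaxLegalBits and returns the chunk widths in order.
// Assumption: MaxLegalBits is a nonzero power of two.
inline std::vector<unsigned> splitIntoPow2Loads(unsigned WidthBits,
                                                unsigned MaxLegalBits) {
  assert(MaxLegalBits != 0 && (MaxLegalBits & (MaxLegalBits - 1)) == 0 &&
         "MaxLegalBits must be a power of two");
  std::vector<unsigned> Chunks;
  unsigned Chunk = MaxLegalBits;
  while (WidthBits != 0) {
    // Shrink the chunk until it fits in what is left to load.
    while (Chunk > WidthBits)
      Chunk /= 2;
    Chunks.push_back(Chunk);
    WidthBits -= Chunk;
  }
  return Chunks;
}
```

Under that assumption a 128-bit load stays a single access, while the narrowed 96-bit load becomes a 64-bit access plus a 32-bit access, which is the split described above.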

We've been trying to solve bugs like that (the source is written in a way that is hard to optimize) for a long time, so this would make things worse. Example:
https://github.com/llvm/llvm-project/issues/17113

I recommend trying this in VectorCombine or AggressiveInstCombine with TTI legality checks (assuming it needs to happen early in IR to unlock other transforms).
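
As a rough sketch of what such a legality gate could look like, the following standalone C++ takes the target query as a callback. Everything here is hypothetical: `LegalityFn`, `shouldNarrowLoad` and the two toy target predicates stand in for whatever TTI-based check a real pass would use, and none of them are LLVM APIs.

```cpp
#include <functional>

// Hypothetical stand-in for a target legality query; not an LLVM API.
using LegalityFn = std::function<bool(unsigned NumElts, unsigned EltBits)>;

// Narrow only when fewer elements are needed and the target reports the
// narrower vector type as directly loadable.
inline bool shouldNarrowLoad(unsigned OldNumElts, unsigned NewNumElts,
                             unsigned EltBits, const LegalityFn &IsLegal) {
  if (NewNumElts == 0 || NewNumElts >= OldNumElts)
    return false;
  return IsLegal(NewNumElts, EltBits);
}

// Toy target: only power-of-two total widths up to 128 bits are legal.
inline bool pow2OnlyTarget(unsigned NumElts, unsigned EltBits) {
  unsigned Bits = NumElts * EltBits;
  return Bits != 0 && (Bits & (Bits - 1)) == 0 && Bits <= 128;
}

// Toy target: any total width up to 128 bits is legal.
inline bool anyWidthTarget(unsigned NumElts, unsigned EltBits) {
  return NumElts * EltBits <= 128;
}
```

With these toy predicates, narrowing a <4 x float> load to <3 x float> is rejected for the power-of-two target and accepted for the any-width target, so the transform would only fire where the narrower load is cheap.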

This revision now requires changes to proceed. Nov 17 2022, 6:18 AM

Agreed - as it stands, this patch puts even more strain on the DAG to load combine back to legality.

We don't even seem to have a way to track the original dereferenceable ranges :(

We also wouldn't want to do this if the load is a scalar load candidate, since there are no 96-bit scalar loads

I think the best solution for the motivating problem is in the backend, where the load instructions have reached their final form, so it could help more cases. A load in LLVM IR is still subject to being split or combined.

In D138186#3969988, @ruiling wrote:

I think the best solution for the motivating problem is in the backend, where the load instructions have reached their final form, so it could help more cases. A load in LLVM IR is still subject to being split or combined.

I think the opposite. The backend load passes have to deal with way more patterns and addressing modes

In D138186#3970779, @arsenm wrote:

In D138186#3969988, @ruiling wrote:

I think the best solution for the motivating problem is in the backend, where the load instructions have reached their final form, so it could help more cases. A load in LLVM IR is still subject to being split or combined.

I think the opposite. The backend load passes have to deal with way more patterns and addressing modes

It could be done in the LLVM IR part of the code generation pipeline?

In D138186#3973184, @nhaehnle wrote:

In D138186#3970779, @arsenm wrote:

In D138186#3969988, @ruiling wrote:

I think the best solution for the motivating problem is in the backend, where the load instructions have reached their final form, so it could help more cases. A load in LLVM IR is still subject to being split or combined.

I think the opposite. The backend load passes have to deal with way more patterns and addressing modes

It could be done in the LLVM IR part of the code generation pipeline?

Don't we already do that in VectorCombine?

In D138186#3974937, @lebedev.ri wrote:

Don't we already do that in VectorCombine?

We have load scalarization, but there's no subvector-narrowing kind of transform AFAIK:
https://github.com/llvm/llvm-project/blob/ee31a4a7029f2f6fda5f416e7eb67ca3907d9e36/llvm/lib/Transforms/Vectorize/VectorCombine.cpp#L1118

Revision Contents

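Path	Size
llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp	51 lines
llvm/test/Transforms/InstCombine/AMDGPU/simplify-load.ll	16 lines
llvm/test/Transforms/InstCombine/obfuscated_splat-inseltpoison.ll	2 lines
llvm/test/Transforms/InstCombine/obfuscated_splat.ll	2 lines
llvm/test/Transforms/InstCombine/scalarization-inseltpoison.ll	17 lines
llvm/test/Transforms/InstCombine/scalarization.ll	17 lines
llvm/test/Transforms/InstCombine/shuffle-binop.ll	10 lines
llvm/test/Transforms/InstCombine/sink-into-catchswitch.ll	5 lines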

Diff 476015

llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp

[1,659 lines not shown]
       if (VWidth == InVWidth) {
         llvm_unreachable("Unimp");
       }
       break;
     }
   case Instruction::FPTrunc:
   case Instruction::FPExt:
     simplifyAndSetOp(I, 0, DemandedElts, UndefElts);
     break;
+  case Instruction::Load: {
+    // See if we can reduce the number of elements being loaded.
+    auto *LI = cast<LoadInst>(I);
+    unsigned UsedBegin = DemandedElts.countTrailingZeros();
+    unsigned ActiveWidth = DemandedElts.getActiveBits();
+    auto *OldTy = cast<FixedVectorType>(LI->getType());
+    unsigned OldNumElts = OldTy->getNumElements();
+
+    // TODO: Optimize leading elements, which needs adjusting pointers and
+    // alignment.
+    if (ActiveWidth >= OldNumElts)
+      return nullptr;
+
+    unsigned NewNumElts = ActiveWidth;
+    Type *EltTy = OldTy->getElementType();
+    Type *NewTy =
+        (NewNumElts == 1) ? EltTy : FixedVectorType::get(EltTy, NewNumElts);
+
+    auto *OldPtr = LI->getPointerOperand();
+    Instruction *NewPtr = nullptr;
+    if (!cast<PointerType>(OldPtr->getType())->isOpaque()) {
+      NewPtr = CastInst::CreatePointerCast(
+          OldPtr, PointerType::get(NewTy, LI->getPointerAddressSpace()));
+    }
+
+    auto *NewLI = new LoadInst(NewTy, NewPtr ? NewPtr : OldPtr, LI->getName(),
+                               LI->isVolatile(), LI->getAlign(),
+                               LI->getOrdering(), LI->getSyncScopeID());
+
+    Instruction *NewI = nullptr;
+    if (NewNumElts == 1) {
+      NewI = InsertElementInst::Create(
+          PoisonValue::get(OldTy), NewLI,
+          ConstantInt::get(Type::getInt32Ty(EltTy->getContext()), UsedBegin));
+    } else {
+      SmallVector<int> Mask;
+      for (unsigned I = 0; I < OldNumElts; ++I) {
+        // TODO: This needs to be adjusted when optimizing leading elements.
+        if (!!DemandedElts[I])
+          Mask.push_back(I);
+        else
+          Mask.push_back(NewNumElts);
+      }
+      NewI = new ShuffleVectorInst(NewLI, PoisonValue::get(NewTy), Mask);
+    }
+    copyMetadataForLoad(NewLI, LI);
+    if (NewPtr)
+      InsertNewInstWith(NewPtr, *LI);
+    InsertNewInstWith(NewLI, *LI);
+    InsertNewInstWith(NewI, *LI);
+    return NewI;
+  }
   case Instruction::Call: {
     IntrinsicInst *II = dyn_cast<IntrinsicInst>(I);
     if (!II) break;
     switch (II->getIntrinsicID()) {
     case Intrinsic::masked_gather: // fallthrough
     case Intrinsic::masked_load: {
       // Subtlety: If we load from a pointer, the pointer must be valid
       // regardless of whether the element is demanded. Doing otherwise risks
[53 lines not shown]

llvm/test/Transforms/InstCombine/AMDGPU/simplify-load.ll

This file was added.

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt -S -passes=instcombine -mtriple=amdgcn-amd-amdhsa %s | FileCheck %s

; Make sure the load simplification works for typed pointers.
define amdgpu_kernel void @simplify_load_vector(<2 x i32> addrspace(1)* %out, <4 x i32> addrspace(1)* %in0, <4 x i32> addrspace(1)* %in1) {
; CHECK-LABEL: @simplify_load_vector(
; CHECK-NEXT: [[TMP1:%.*]] = bitcast <4 x i32> addrspace(1)* [[IN0:%.*]] to <2 x i32> addrspace(1)*
; CHECK-NEXT: [[A1:%.*]] = load <2 x i32>, <2 x i32> addrspace(1)* [[TMP1]], align 16
; CHECK-NEXT: store <2 x i32> [[A1]], <2 x i32> addrspace(1)* [[OUT:%.*]], align 8
; CHECK-NEXT: ret void
;
  %a = load <4 x i32>, <4 x i32> addrspace(1)* %in0, align 16
  %r = shufflevector <4 x i32> %a, <4 x i32> undef, <2 x i32> <i32 0, i32 1>
  store <2 x i32> %r, <2 x i32> addrspace(1)* %out, align 8
  ret void
}

llvm/test/Transforms/InstCombine/obfuscated_splat-inseltpoison.ll

 ; RUN: opt -passes=instcombine -S < %s | FileCheck %s

 define void @test(ptr %in_ptr, ptr %out_ptr) {
   %A = load <4 x float>, ptr %in_ptr, align 16
   %B = shufflevector <4 x float> %A, <4 x float> poison, <4 x i32> <i32 0, i32 0, i32 undef, i32 undef>
   %C = shufflevector <4 x float> %B, <4 x float> %A, <4 x i32> <i32 0, i32 1, i32 4, i32 undef>
   %D = shufflevector <4 x float> %C, <4 x float> %A, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
-; CHECK: %D = shufflevector <4 x float> %A, <4 x float> poison, <4 x i32> zeroinitializer
+; CHECK: %D = shufflevector <4 x float> {{%.*}}, <4 x float> poison, <4 x i32> zeroinitializer
   store <4 x float> %D, ptr %out_ptr
   ret void
 }

llvm/test/Transforms/InstCombine/obfuscated_splat.ll

 ; RUN: opt -passes=instcombine -S < %s | FileCheck %s

 define void @test(ptr %in_ptr, ptr %out_ptr) {
   %A = load <4 x float>, ptr %in_ptr, align 16
   %B = shufflevector <4 x float> %A, <4 x float> undef, <4 x i32> <i32 0, i32 0, i32 undef, i32 undef>
   %C = shufflevector <4 x float> %B, <4 x float> %A, <4 x i32> <i32 0, i32 1, i32 4, i32 undef>
   %D = shufflevector <4 x float> %C, <4 x float> %A, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
-; CHECK: %D = shufflevector <4 x float> %A, <4 x float> poison, <4 x i32> zeroinitializer
+; CHECK: %D = shufflevector <4 x float> {{%.*}}, <4 x float> poison, <4 x i32> zeroinitializer
   store <4 x float> %D, ptr %out_ptr
   ret void
 }

llvm/test/Transforms/InstCombine/scalarization-inseltpoison.ll

 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt -passes=instcombine -S < %s | FileCheck %s

 define i32 @extract_load(ptr %p) {
 ;
 ; CHECK-LABEL: @extract_load(
-; CHECK-NEXT: [[X:%.*]] = load <4 x i32>, ptr [[P:%.*]], align 4
-; CHECK-NEXT: [[EXT:%.*]] = extractelement <4 x i32> [[X]], i64 1
+; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i32>, ptr [[P:%.*]], align 4
+; CHECK-NEXT: [[EXT:%.*]] = extractelement <2 x i32> [[TMP1]], i64 1
 ; CHECK-NEXT: ret i32 [[EXT]]
 ;
   %x = load <4 x i32>, ptr %p, align 4
   %ext = extractelement <4 x i32> %x, i32 1
   ret i32 %ext
 }

 define double @extract_load_fp(ptr %p) {
 ;
 ; CHECK-LABEL: @extract_load_fp(
 ; CHECK-NEXT: [[X:%.*]] = load <4 x double>, ptr [[P:%.*]], align 32
 ; CHECK-NEXT: [[EXT:%.*]] = extractelement <4 x double> [[X]], i64 3
 ; CHECK-NEXT: ret double [[EXT]]
 ;
   %x = load <4 x double>, ptr %p, align 32
   %ext = extractelement <4 x double> %x, i32 3
   ret double %ext
 }

 define double @extract_load_volatile(ptr %p) {
 ;
 ; CHECK-LABEL: @extract_load_volatile(
-; CHECK-NEXT: [[X:%.*]] = load volatile <4 x double>, ptr [[P:%.*]], align 32
-; CHECK-NEXT: [[EXT:%.*]] = extractelement <4 x double> [[X]], i64 2
+; CHECK-NEXT: [[TMP1:%.*]] = load volatile <3 x double>, ptr [[P:%.*]], align 32
+; CHECK-NEXT: [[X:%.*]] = load volatile <4 x double>, ptr [[P]], align 32
+; CHECK-NEXT: [[EXT:%.*]] = extractelement <3 x double> [[TMP1]], i64 2
 ; CHECK-NEXT: ret double [[EXT]]
 ;
   %x = load volatile <4 x double>, ptr %p
   %ext = extractelement <4 x double> %x, i32 2
   ret double %ext
 }

 define double @extract_load_extra_use(ptr %p, ptr %p2) {
[137 lines not shown]
 ;
   %b = lshr <4 x i8> %x, <i8 4, i8 3, i8 undef, i8 2>
   %r = extractelement <4 x i8> %b, i32 %y
   ret i8 %r
 }

 define float @extract_element_load(<4 x float> %x, ptr %ptr) {
 ;
 ; CHECK-LABEL: @extract_element_load(
-; CHECK-NEXT: [[LOAD:%.*]] = load <4 x float>, ptr [[PTR:%.*]], align 16
-; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[LOAD]], i64 2
-; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x float> [[X:%.*]], i64 2
-; CHECK-NEXT: [[R:%.*]] = fadd float [[TMP1]], [[TMP2]]
+; CHECK-NEXT: [[TMP1:%.*]] = load <3 x float>, ptr [[PTR:%.*]], align 16
+; CHECK-NEXT: [[TMP2:%.*]] = extractelement <3 x float> [[TMP1]], i64 2
+; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x float> [[X:%.*]], i64 2
+; CHECK-NEXT: [[R:%.*]] = fadd float [[TMP2]], [[TMP3]]
 ; CHECK-NEXT: ret float [[R]]
 ;
   %load = load <4 x float>, ptr %ptr
   %add = fadd <4 x float> %x, %load
   %r = extractelement <4 x float> %add, i32 2
   ret float %r
 }

[163 lines not shown]

llvm/test/Transforms/InstCombine/scalarization.ll

 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt -passes=instcombine -S < %s | FileCheck %s

 define i32 @extract_load(ptr %p) {
 ;
 ; CHECK-LABEL: @extract_load(
-; CHECK-NEXT: [[X:%.*]] = load <4 x i32>, ptr [[P:%.*]], align 4
-; CHECK-NEXT: [[EXT:%.*]] = extractelement <4 x i32> [[X]], i64 1
+; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i32>, ptr [[P:%.*]], align 4
+; CHECK-NEXT: [[EXT:%.*]] = extractelement <2 x i32> [[TMP1]], i64 1
 ; CHECK-NEXT: ret i32 [[EXT]]
 ;
   %x = load <4 x i32>, ptr %p, align 4
   %ext = extractelement <4 x i32> %x, i32 1
   ret i32 %ext
 }

 define double @extract_load_fp(ptr %p) {
 ;
 ; CHECK-LABEL: @extract_load_fp(
 ; CHECK-NEXT: [[X:%.*]] = load <4 x double>, ptr [[P:%.*]], align 32
 ; CHECK-NEXT: [[EXT:%.*]] = extractelement <4 x double> [[X]], i64 3
 ; CHECK-NEXT: ret double [[EXT]]
 ;
   %x = load <4 x double>, ptr %p, align 32
   %ext = extractelement <4 x double> %x, i32 3
   ret double %ext
 }

 define double @extract_load_volatile(ptr %p) {
 ;
 ; CHECK-LABEL: @extract_load_volatile(
-; CHECK-NEXT: [[X:%.*]] = load volatile <4 x double>, ptr [[P:%.*]], align 32
-; CHECK-NEXT: [[EXT:%.*]] = extractelement <4 x double> [[X]], i64 2
+; CHECK-NEXT: [[TMP1:%.*]] = load volatile <3 x double>, ptr [[P:%.*]], align 32
+; CHECK-NEXT: [[X:%.*]] = load volatile <4 x double>, ptr [[P]], align 32
+; CHECK-NEXT: [[EXT:%.*]] = extractelement <3 x double> [[TMP1]], i64 2
 ; CHECK-NEXT: ret double [[EXT]]
 ;
   %x = load volatile <4 x double>, ptr %p
   %ext = extractelement <4 x double> %x, i32 2
   ret double %ext
 }

 define double @extract_load_extra_use(ptr %p, ptr %p2) {
[137 lines not shown]
 ;
   %b = lshr <4 x i8> %x, <i8 4, i8 3, i8 undef, i8 2>
   %r = extractelement <4 x i8> %b, i32 %y
   ret i8 %r
 }

 define float @extract_element_load(<4 x float> %x, ptr %ptr) {
 ;
 ; CHECK-LABEL: @extract_element_load(
-; CHECK-NEXT: [[LOAD:%.*]] = load <4 x float>, ptr [[PTR:%.*]], align 16
-; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[LOAD]], i64 2
-; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x float> [[X:%.*]], i64 2
-; CHECK-NEXT: [[R:%.*]] = fadd float [[TMP1]], [[TMP2]]
+; CHECK-NEXT: [[TMP1:%.*]] = load <3 x float>, ptr [[PTR:%.*]], align 16
+; CHECK-NEXT: [[TMP2:%.*]] = extractelement <3 x float> [[TMP1]], i64 2
+; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x float> [[X:%.*]], i64 2
+; CHECK-NEXT: [[R:%.*]] = fadd float [[TMP2]], [[TMP3]]
 ; CHECK-NEXT: ret float [[R]]
 ;
   %load = load <4 x float>, ptr %ptr
   %add = fadd <4 x float> %x, %load
   %r = extractelement <4 x float> %add, i32 2
   ret float %r
 }

[161 lines not shown]

llvm/test/Transforms/InstCombine/shuffle-binop.ll

[182 lines not shown]
 ;
   %b = lshr <vscale x 4 x i32> %xsplat, %ysplat
   %bsplat = shufflevector <vscale x 4 x i32> %b, <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
   ret <vscale x 4 x i32> %bsplat
 }

 define <2 x double> @shuffle_op2_0th_element_mask(ptr %a, ptr %b) {
 ;%0 = load <2 x double>, ptr @d, align 16
 ; CHECK-LABEL: @shuffle_op2_0th_element_mask(
-; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, ptr [[A:%.*]], align 16
-; CHECK-NEXT: [[TMP2:%.*]] = load <2 x double>, ptr [[B:%.*]], align 16
-; CHECK-NEXT: [[TMP3:%.*]] = fsub <2 x double> [[TMP2]], [[TMP1]]
-; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x double> [[TMP3]], <2 x double> poison, <2 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP1:%.*]] = load double, ptr [[A:%.*]], align 16
+; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> poison, double [[TMP1]], i64 0
+; CHECK-NEXT: [[TMP3:%.*]] = load double, ptr [[B:%.*]], align 16
+; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x double> poison, double [[TMP3]], i64 0
+; CHECK-NEXT: [[TMP5:%.*]] = fsub <2 x double> [[TMP4]], [[TMP2]]
+; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x double> [[TMP5]], <2 x double> poison, <2 x i32> zeroinitializer
 ; CHECK-NEXT: ret <2 x double> [[SHUFFLE]]
 ;
   %1 = load <2 x double>, ptr %a, align 16
   %2 = shufflevector <2 x double> %1, <2 x double> poison, <2 x i32> zeroinitializer
   %3 = load <2 x double>, ptr %b, align 16
   %sub = fsub <2 x double> %3, %2
   %shuffle = shufflevector <2 x double> %sub, <2 x double> %sub, <2 x i32> <i32 2, i32 2>
   ret <2 x double> %shuffle
 }

 declare void @use(<4 x i8>)
 declare void @use_v(<vscale x 4 x i32>)

llvm/test/Transforms/InstCombine/sink-into-catchswitch.ll

 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt -passes=instcombine -S < %s | FileCheck %s

 target datalayout = "e-m:w-i64:64-f80:128-n8:16:32:64-S128"
 target triple = "x86_64-pc-windows-msvc18.0.0"

 %struct.B = type { i64, i64 }

 define void @test1(ptr %p) personality ptr @__CxxFrameHandler3 {
 ; CHECK-LABEL: @test1(
 ; CHECK-NEXT: invoke.cont:
-; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i64>, ptr [[P:%.*]], align 8
-; CHECK-NEXT: [[TMP2:%.*]] = extractelement <2 x i64> [[TMP1]], i64 0
+; CHECK-NEXT: [[TMP0:%.*]] = load i64, ptr [[P:%.*]], align 8
 ; CHECK-NEXT: invoke void @throw()
 ; CHECK-NEXT: to label [[UNREACHABLE:%.*]] unwind label [[CATCH_DISPATCH:%.*]]
 ; CHECK: catch.dispatch:
 ; CHECK-NEXT: [[CS:%.*]] = catchswitch within none [label %invoke.cont1] unwind label [[EHCLEANUP:%.*]]
 ; CHECK: invoke.cont1:
 ; CHECK-NEXT: [[CATCH:%.*]] = catchpad within [[CS]] [ptr null, i32 64, ptr null]
 ; CHECK-NEXT: invoke void @throw() [ "funclet"(token [[CATCH]]) ]
 ; CHECK-NEXT: to label [[UNREACHABLE]] unwind label [[EHCLEANUP]]
 ; CHECK: ehcleanup:
-; CHECK-NEXT: [[PHI:%.*]] = phi i64 [ [[TMP2]], [[CATCH_DISPATCH]] ], [ 9, [[INVOKE_CONT1:%.*]] ]
+; CHECK-NEXT: [[PHI:%.*]] = phi i64 [ [[TMP0]], [[CATCH_DISPATCH]] ], [ 9, [[INVOKE_CONT1:%.*]] ]
 ; CHECK-NEXT: [[CLEANUP:%.*]] = cleanuppad within none []
 ; CHECK-NEXT: call void @release(i64 [[PHI]]) [ "funclet"(token [[CLEANUP]]) ]
 ; CHECK-NEXT: cleanupret from [[CLEANUP]] unwind to caller
 ; CHECK: unreachable:
 ; CHECK-NEXT: unreachable
 ;
 invoke.cont:
   %0 = load <2 x i64>, ptr %p, align 8
[25 lines not shown]