This is an archive of the discontinued LLVM Phabricator instance.

[CostModel] Use isExtLoad in BasicTTI
Needs Review · Public

Authored by samparker on Apr 27 2020, 8:57 AM.

Details

Summary

Compared against D78922, this uses isExtLoad instead of isLoadExtLegal, which causes some changes around vectorization for X86.

Diff Detail

Event Timeline

samparker created this revision. · Apr 27 2020, 8:57 AM
Herald added a project: Restricted Project. · Apr 27 2020, 8:57 AM
RKSimon added inline comments. · May 7 2020, 11:11 AM
llvm/include/llvm/CodeGen/BasicTTIImpl.h
700

Maybe keep this separate from the int extends?

samparker updated this revision to Diff 265455. · May 21 2020, 1:29 AM
samparker added reviewers: Ayal, jsji, hfinkel, gilr.
samparker added a subscriber: dmgreen.

Now rebased against master. This is making some (surprising to me) changes to vector code, which I can only assume are due to the concerns that @dmgreen raised in D78922. I tried a heavy-handed approach of not using TLI calls for vector casts, but that didn't look like a suitable option. So it would be really good to hear from the X86 and PPC guys on whether the vector code changes look sane or whether TLI is just causing more confusion.

This would conflict with D79162. That's not the prettiest patch in the world, but it solves some real problems we have in the vectorizer and moving further away from it seems like a mistake.

I get that given an (accurate) context instruction I, it's good to make use of it to produce better costs. But ideally the costmodel should always be costing hypothetical instructions, not relying on real ones.

On a concrete level, this might now be using the type of the actual load (say i32), and not the type the costmodel has provided (v4i32, for example)?

ideally the costmodel should always be costing hypothetical instructions, not relying on real ones.

This probably isn't true for most optimisations, though the vectorizer is of course an important (but small) user of this. I would suspect that the vectorizer should only really be passing the instruction when calculating the scalar cost of the loop. But I'll revisit this once the context stuff is in.

samparker updated this revision to Diff 271627. · Jun 18 2020, 2:50 AM

I've tried to address @dmgreen's comments about TLI being queried for an instruction which doesn't match the queried types, so this is now checked. But this wasn't good for X86, as a lot of SLP tests remained scalar, so I've added the TLI check into the X86 backend too. This also reverts the previous test change for PPC.

I was trying yesterday to rebase D79162 on top of some fp16 work I have been doing lately, but that is not ready yet. I do think it makes sense to try and get those problems fixed first, and I don't believe the changes here are really a good thing. With how we have things set up at the moment, it is dangerous for anything to look at the _type_ of the context instruction. It is wrong in too many places. Ideally you would only ever look at the opcode of surrounding instructions, and even that is sometimes dubious.

If we do really want to do this, we would probably need to remove the context from any place where it wasn't the actual instruction being costed. That would have to be done without causing regressions anywhere else too, and so might need two code paths: the existing one that went through isLoadExtLegal, and the new call to isExtLoad.

As I've said elsewhere, my idea in the long run is to somehow create a better framework for costing multiple instructions at the same time, which would probably be related to how VPlan cost modelling turns out. This will likely not happen any time soon though.

samparker added a comment (edited). · Jun 18 2020, 5:07 AM

It is dangerous with how we have things set up at the moment for anything to look at the _type_ of the context instruction

But now we're checking that the queried type matches the type of the context instruction too, so I'm not really sure what other conclusion to draw than that the user is asking: what is the cost of this instruction? Given the currently limited API, in what scenario do you think this would not generally be true?

But now we're checking that the queried type matches the type of the context instruction too, so I'm not really sure what other conclusion to draw than that the user is asking: what is the cost of this instruction? Given the currently limited API, in what scenario do you think this would not generally be true?

But you removed the code that would check whether the extend is free when they do not match up. So the vectorizer calling getCastCost with vector types, but with the original scalar instruction as context, would give different results? I would imagine this might be where a lot of the X86 changes are coming from.

This function has always been, in the past, "give me the cost of a hypothetical instruction, using I as context". It has never been "give me the cost of I". I'm not really saying we shouldn't change that; I really just want to fix some of the regressions we have already seen and improve how we can handle masked loads for tail predication. But I do think that the first makes for a better design if we can possibly stick to it. We might well want to change it to "give me the cost of this hypothetical instruction, or of this given instruction I", but that means we would need more joined-up thinking for the places calling this (vectorizer, SLP, etc) to not pass I if the instructions do not match up. And that on its own loses an amount of useful information.

At the moment this looks like a purely X86 patch, and the BasicTTIImpl.h diff is superfluous. I think the arm\cast.ll test changes go away on rebase.