This is an archive of the discontinued LLVM Phabricator instance.

Differential D122850

[AMDGPU] Fix regression with vectorization limiting
Closed · Public

Authored by rampitec on Mar 31 2022, 2:47 PM.

Details

Reviewers

arsenm
dfukalov
foad

Commits

rGfced87d457d3: [AMDGPU] Fix regression with vectorization limiting

Summary

D67148 removed TTI::getNumberOfRegisters(bool Vector) and made
LoopVectorize call TTI::getNumberOfRegisters(unsigned ClassID)
instead. This resulted in unrestricted vectorization on AMDGPU,
blowing up register pressure.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

rampitec created this revision. Mar 31 2022, 2:47 PM

Herald added a project: Restricted Project. Mar 31 2022, 2:47 PM

Herald added subscribers: hsmhsm, foad, kerbowa and 8 others.

rampitec requested review of this revision. Mar 31 2022, 2:47 PM

Herald added a subscriber: wdng.

arsenm added inline comments. Mar 31 2022, 2:49 PM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
Line 310: 4 seems really small

rampitec added inline comments. Mar 31 2022, 2:51 PM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
Line 310: It is enough to allow vectorization, which is all we really need. Giving more immediately explodes RP because of the interleaving. It may be possible to increase this, but then interleaving would have to be limited much more.

rampitec added inline comments. Mar 31 2022, 3:12 PM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
Line 310: Here is the loop that triggered the investigation:

  for (int i = rowStart; i < rowEnd; i++) {
    gq += temp[i];
  }

gq and temp are float. The whole kernel without loop-vectorize uses 9 VGPRs; with the vectorizer as it is now, it uses 78. With this change it goes down to 38, which is still higher than wanted. If I allow 8 registers, the final budget is 78 VGPRs again, and to bring it back down to 38 I have to disable interleaving. Even an interleave factor of 2 plus 8 registers reported here results in 46 VGPRs.
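For reference, the reduction loop quoted above can be written as a self-contained function. This is only a sketch: the name `rowSum` and its signature are assumptions, since the kernel source is not part of the review; only the loop body and the float element type come from the comment.

```cpp
#include <cassert>

// Hypothetical wrapper around the loop from the comment. Only the loop body
// and the float element type are taken from the review.
float rowSum(const float *temp, int rowStart, int rowEnd) {
  assert(rowStart <= rowEnd && "expected a forward or empty range");
  float gq = 0.0f;
  // A plain float reduction. When reassociation is allowed (the tests in this
  // revision use `fadd fast`), the loop vectorizer can widen the loads and
  // keep one partial accumulator per interleaved copy.
  for (int i = rowStart; i < rowEnd; i++) {
    gq += temp[i];
  }
  return gq;
}
```

Each interleaved copy carries its own accumulator and its own loaded vector, which is consistent with the VGPR growth reported in the comment as the register budget and interleave factor go up.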

I have realized that the RCID passed into getNumberOfRegisters(unsigned RCID) is in fact not an RCID, but a boolean for vector/scalar registers. We could implement getRegisterClassForType() to change that, but we cannot reasonably distinguish between VGPRs and SGPRs anyway. In the end the result is clamped to just 4, so it is easier to remove all of these calculations and simply return 4.

Note that 4 was the most common return value before the regression. Optimistically we assume max occupancy, which means 24 VGPRs on most targets. The return value was max available VGPRs / 8, i.e. 4. The exceptions were an implicit request of the maximum occupancy, a rare thing, and Navi with its different VGPR-to-occupancy mappings.
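The arithmetic discussed above can be sketched with plain integers. This is a simplified model, not the LLVM code: the function names are hypothetical, the subtarget and register-info queries are replaced by integer parameters, and the input values mentioned in the note below are illustrative assumptions rather than figures from the review.

```cpp
#include <cassert>

// Model of the removed getNumberOfRegisters(unsigned RCID): the class size is
// rounded up to whole 32-bit VGPRs and the VGPR budget is divided by that.
unsigned removedRegisterCount(unsigned maxVGPRs, unsigned regSizeInBits) {
  assert(regSizeInBits > 0 && "register class must have a size");
  unsigned numVGPRs = (regSizeInBits + 31) / 32;
  return maxVGPRs / numVGPRs;
}

// Model of the removed getNumberOfRegisters(bool): one eighth of the budget.
unsigned oldVectorizerBudget(unsigned maxVGPRs) { return maxVGPRs >> 3; }

// Model of the fix: a constant, independent of the subtarget.
unsigned fixedVectorizerBudget() { return 4; }
```

With an assumed budget of 256 VGPRs and a 32-bit class, the first function reports 256 registers, so nothing restrains the vectorizer. The shift-by-3 rule gives 4 for an assumed 32-register budget, and the fix always reports 4.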

rampitec added reviewers: dfukalov, foad. Mar 31 2022, 5:29 PM

This is a gross regression and I want more eyes on this. The only reason we didn't spot it immediately is that we do not get many perf reports from gfx90a, where we have packed f32.

Harbormaster completed remote builds in B157284: Diff 419586. Mar 31 2022, 6:19 PM

It seems to me Cost::RateFormula() from LSR is the only user that can be affected by the change. Would you please double-check the use case?

In D122850#3430133, @dfukalov wrote:

It seems to me Cost::RateFormula() from LSR is the only user that can be affected by the change. Would you please double-check the use case?

Yes, I believe unbounded LSR also hits us with RP a lot. Limiting it to just 4 'free' pointers is a good thing IMO.

Ping. Performance testing was done on gfx90a for the tests of interest; the change gives a roughly 10% increase in all subtests.

dfukalov accepted this revision. Apr 8 2022, 5:00 PM

This revision is now accepted and ready to land. Apr 8 2022, 5:00 PM

This revision was landed with ongoing or failed builds. Apr 8 2022, 5:47 PM

Closed by commit rGfced87d457d3: [AMDGPU] Fix regression with vectorization limiting (authored by rampitec).

This revision was automatically updated to reflect the committed changes.

rampitec added a commit: rGfced87d457d3: [AMDGPU] Fix regression with vectorization limiting.

alex-t mentioned this in D149281: Don't disable loop unroll for vectorized loops on AMDGPU target. May 8 2023, 6:14 PM

Revision Contents

Path                                                        Size
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h          3 lines
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp        26 lines
llvm/test/Transforms/LoopVectorize/AMDGPU/packed-fp32.ll    24 lines
llvm/test/Transforms/LoopVectorize/AMDGPU/packed-math.ll    116 lines

Diff 421655

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h

(62 lines not shown; context: class GCNTTIImpl final : public BasicTTIImplBase<GCNTTIImpl> {)
   friend BaseT;

   const GCNSubtarget *ST;
   const SITargetLowering *TLI;
   AMDGPUTTIImpl CommonTTI;
   bool IsGraphics;
   bool HasFP32Denormals;
   bool HasFP64FP16Denormals;
-  unsigned MaxVGPRs;

   static const FeatureBitset InlineFeatureIgnoreList;

   const GCNSubtarget *getST() const { return ST; }
   const SITargetLowering *getTLI() const { return TLI; }

   static inline int getFullRateInstrCost() {
     return TargetTransformInfo::TCC_Basic;
(28 lines not shown; context: public:)
   void getPeelingPreferences(Loop *L, ScalarEvolution &SE,
                              TTI::PeelingPreferences &PP);

   TTI::PopcntSupportKind getPopcntSupport(unsigned TyWidth) {
     assert(isPowerOf2_32(TyWidth) && "Ty width must be power of 2");
     return TTI::PSK_FastHardware;
   }

-  unsigned getHardwareNumberOfRegisters(bool Vector) const;
-  unsigned getNumberOfRegisters(bool Vector) const;
   unsigned getNumberOfRegisters(unsigned RCID) const;
   TypeSize getRegisterBitWidth(TargetTransformInfo::RegisterKind Vector) const;
   unsigned getMinVectorRegisterBitWidth() const;
   unsigned getMaximumVF(unsigned ElemWidth, unsigned Opcode) const;
   unsigned getLoadVectorFactor(unsigned VF, unsigned LoadSize,
                                unsigned ChainSizeInBytes,
                                VectorType *VecTy) const;
   unsigned getStoreVectorFactor(unsigned VF, unsigned StoreSize,
(102 lines not shown)

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp

(282 lines not shown; context: const FeatureBitset GCNTTIImpl::InlineFeatureIgnoreList = {)

     // Perf-tuning features
     AMDGPU::FeatureFastFMAF32, AMDGPU::HalfRate64Ops};

 GCNTTIImpl::GCNTTIImpl(const AMDGPUTargetMachine *TM, const Function &F)
     : BaseT(TM, F.getParent()->getDataLayout()),
       ST(static_cast<const GCNSubtarget *>(TM->getSubtargetImpl(F))),
       TLI(ST->getTargetLowering()), CommonTTI(TM, F),
-      IsGraphics(AMDGPU::isGraphics(F.getCallingConv())),
-      MaxVGPRs(ST->getMaxNumVGPRs(
-          std::max(ST->getWavesPerEU(F).first,
-                   ST->getWavesPerEUForWorkGroup(
-                       ST->getFlatWorkGroupSizes(F).second)))) {
+      IsGraphics(AMDGPU::isGraphics(F.getCallingConv())) {
   AMDGPU::SIModeRegisterDefaults Mode(F);
   HasFP32Denormals = Mode.allFP32Denormals();
   HasFP64FP16Denormals = Mode.allFP64FP16Denormals();
 }

-unsigned GCNTTIImpl::getHardwareNumberOfRegisters(bool Vec) const {
-  // The concept of vector registers doesn't really exist. Some packed vector
-  // operations operate on the normal 32-bit registers.
-  return MaxVGPRs;
-}
+unsigned GCNTTIImpl::getNumberOfRegisters(unsigned RCID) const {
+  // NB: RCID is not an RCID. In fact it is 0 or 1 for scalar or vector
+  // registers. See getRegisterClassForType for the implementation.
+  // In this case vector registers are not vector in terms of
+  // VGPRs, but those which can hold multiple values.

-unsigned GCNTTIImpl::getNumberOfRegisters(bool Vec) const {
   // This is really the number of registers to fill when vectorizing /
   // interleaving loops, so we lie to avoid trying to use all registers.
-  return getHardwareNumberOfRegisters(Vec) >> 3;
-}
-
-unsigned GCNTTIImpl::getNumberOfRegisters(unsigned RCID) const {
-  const SIRegisterInfo *TRI = ST->getRegisterInfo();
-  const TargetRegisterClass *RC = TRI->getRegClass(RCID);
-  unsigned NumVGPRs = (TRI->getRegSizeInBits(*RC) + 31) / 32;
-  return getHardwareNumberOfRegisters(false) / NumVGPRs;
+  return 4;
 }

 TypeSize
 GCNTTIImpl::getRegisterBitWidth(TargetTransformInfo::RegisterKind K) const {
   switch (K) {
   case TargetTransformInfo::RGK_Scalar:
     return TypeSize::getFixed(32);
   case TargetTransformInfo::RGK_FixedWidthVector:
     return TypeSize::getFixed(ST->hasPackedFP32Ops() ? 64 : 32);
   case TargetTransformInfo::RGK_ScalableVector:
     return TypeSize::getScalable(0);
   }
   llvm_unreachable("Unsupported register kind");
(832 lines not shown)

llvm/test/Transforms/LoopVectorize/AMDGPU/packed-fp32.ll

This file was added.

; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx90a < %s -loop-vectorize -S | FileCheck -check-prefix=GFX90A %s

; GFX90A-LABEL: @vectorize_v2f32_loop(
; GFX90A-COUNT-2: load <2 x float>
; GFX90A-COUNT-2: fadd fast <2 x float>

define float @vectorize_v2f32_loop(float addrspace(1)* noalias %s) {
entry:
  br label %for.body

for.body:
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
  %q.04 = phi float [ 0.0, %entry ], [ %add, %for.body ]
  %arrayidx = getelementptr inbounds float, float addrspace(1)* %s, i64 %indvars.iv
  %load = load float, float addrspace(1)* %arrayidx, align 4
  %add = fadd fast float %q.04, %load
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp eq i64 %indvars.iv.next, 256
  br i1 %exitcond, label %for.end, label %for.body

for.end:
  %add.lcssa = phi float [ %add, %for.body ]
  ret float %add.lcssa
}

llvm/test/Transforms/LoopVectorize/AMDGPU/packed-math.ll

 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 < %s -loop-vectorize -dce -instcombine -S | FileCheck -check-prefix=GFX9 %s
 ; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=fiji < %s -loop-vectorize -dce -instcombine -S | FileCheck -check-prefix=VI %s
 ; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=hawaii < %s -loop-vectorize -dce -instcombine -S | FileCheck -check-prefix=CI %s

 define half @vectorize_v2f16_loop(half addrspace(1)* noalias %s) {
 ; GFX9-LABEL: @vectorize_v2f16_loop(
 ; GFX9-NEXT: entry:
 ; GFX9-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
 ; GFX9: vector.ph:
 ; GFX9-NEXT: br label [[VECTOR_BODY:%.*]]
 ; GFX9: vector.body:
 ; GFX9-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; GFX9-NEXT: [[VEC_PHI:%.*]] = phi <2 x half> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP16:%.*]], [[VECTOR_BODY]] ]
-; GFX9-NEXT: [[VEC_PHI1:%.*]] = phi <2 x half> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP17:%.*]], [[VECTOR_BODY]] ]
-; GFX9-NEXT: [[VEC_PHI2:%.*]] = phi <2 x half> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP18:%.*]], [[VECTOR_BODY]] ]
-; GFX9-NEXT: [[VEC_PHI3:%.*]] = phi <2 x half> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP19:%.*]], [[VECTOR_BODY]] ]
-; GFX9-NEXT: [[VEC_PHI4:%.*]] = phi <2 x half> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP20:%.*]], [[VECTOR_BODY]] ]
-; GFX9-NEXT: [[VEC_PHI5:%.*]] = phi <2 x half> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP21:%.*]], [[VECTOR_BODY]] ]
-; GFX9-NEXT: [[VEC_PHI6:%.*]] = phi <2 x half> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP22:%.*]], [[VECTOR_BODY]] ]
-; GFX9-NEXT: [[VEC_PHI7:%.*]] = phi <2 x half> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP23:%.*]], [[VECTOR_BODY]] ]
+; GFX9-NEXT: [[VEC_PHI:%.*]] = phi <2 x half> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP4:%.*]], [[VECTOR_BODY]] ]
+; GFX9-NEXT: [[VEC_PHI1:%.*]] = phi <2 x half> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP5:%.*]], [[VECTOR_BODY]] ]
 ; GFX9-NEXT: [[TMP0:%.*]] = getelementptr inbounds half, half addrspace(1)* [[S:%.*]], i64 [[INDEX]]
 ; GFX9-NEXT: [[TMP1:%.*]] = bitcast half addrspace(1)* [[TMP0]] to <2 x half> addrspace(1)*
 ; GFX9-NEXT: [[WIDE_LOAD:%.*]] = load <2 x half>, <2 x half> addrspace(1)* [[TMP1]], align 2
 ; GFX9-NEXT: [[TMP2:%.*]] = getelementptr inbounds half, half addrspace(1)* [[TMP0]], i64 2
 ; GFX9-NEXT: [[TMP3:%.*]] = bitcast half addrspace(1)* [[TMP2]] to <2 x half> addrspace(1)*
-; GFX9-NEXT: [[WIDE_LOAD8:%.*]] = load <2 x half>, <2 x half> addrspace(1)* [[TMP3]], align 2
-; GFX9-NEXT: [[TMP4:%.*]] = getelementptr inbounds half, half addrspace(1)* [[TMP0]], i64 4
-; GFX9-NEXT: [[TMP5:%.*]] = bitcast half addrspace(1)* [[TMP4]] to <2 x half> addrspace(1)*
-; GFX9-NEXT: [[WIDE_LOAD9:%.*]] = load <2 x half>, <2 x half> addrspace(1)* [[TMP5]], align 2
-; GFX9-NEXT: [[TMP6:%.*]] = getelementptr inbounds half, half addrspace(1)* [[TMP0]], i64 6
-; GFX9-NEXT: [[TMP7:%.*]] = bitcast half addrspace(1)* [[TMP6]] to <2 x half> addrspace(1)*
-; GFX9-NEXT: [[WIDE_LOAD10:%.*]] = load <2 x half>, <2 x half> addrspace(1)* [[TMP7]], align 2
-; GFX9-NEXT: [[TMP8:%.*]] = getelementptr inbounds half, half addrspace(1)* [[TMP0]], i64 8
-; GFX9-NEXT: [[TMP9:%.*]] = bitcast half addrspace(1)* [[TMP8]] to <2 x half> addrspace(1)*
-; GFX9-NEXT: [[WIDE_LOAD11:%.*]] = load <2 x half>, <2 x half> addrspace(1)* [[TMP9]], align 2
-; GFX9-NEXT: [[TMP10:%.*]] = getelementptr inbounds half, half addrspace(1)* [[TMP0]], i64 10
-; GFX9-NEXT: [[TMP11:%.*]] = bitcast half addrspace(1)* [[TMP10]] to <2 x half> addrspace(1)*
-; GFX9-NEXT: [[WIDE_LOAD12:%.*]] = load <2 x half>, <2 x half> addrspace(1)* [[TMP11]], align 2
-; GFX9-NEXT: [[TMP12:%.*]] = getelementptr inbounds half, half addrspace(1)* [[TMP0]], i64 12
-; GFX9-NEXT: [[TMP13:%.*]] = bitcast half addrspace(1)* [[TMP12]] to <2 x half> addrspace(1)*
-; GFX9-NEXT: [[WIDE_LOAD13:%.*]] = load <2 x half>, <2 x half> addrspace(1)* [[TMP13]], align 2
-; GFX9-NEXT: [[TMP14:%.*]] = getelementptr inbounds half, half addrspace(1)* [[TMP0]], i64 14
-; GFX9-NEXT: [[TMP15:%.*]] = bitcast half addrspace(1)* [[TMP14]] to <2 x half> addrspace(1)*
-; GFX9-NEXT: [[WIDE_LOAD14:%.*]] = load <2 x half>, <2 x half> addrspace(1)* [[TMP15]], align 2
-; GFX9-NEXT: [[TMP16]] = fadd fast <2 x half> [[VEC_PHI]], [[WIDE_LOAD]]
-; GFX9-NEXT: [[TMP17]] = fadd fast <2 x half> [[VEC_PHI1]], [[WIDE_LOAD8]]
-; GFX9-NEXT: [[TMP18]] = fadd fast <2 x half> [[VEC_PHI2]], [[WIDE_LOAD9]]
-; GFX9-NEXT: [[TMP19]] = fadd fast <2 x half> [[VEC_PHI3]], [[WIDE_LOAD10]]
-; GFX9-NEXT: [[TMP20]] = fadd fast <2 x half> [[VEC_PHI4]], [[WIDE_LOAD11]]
-; GFX9-NEXT: [[TMP21]] = fadd fast <2 x half> [[VEC_PHI5]], [[WIDE_LOAD12]]
-; GFX9-NEXT: [[TMP22]] = fadd fast <2 x half> [[VEC_PHI6]], [[WIDE_LOAD13]]
-; GFX9-NEXT: [[TMP23]] = fadd fast <2 x half> [[VEC_PHI7]], [[WIDE_LOAD14]]
-; GFX9-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16
-; GFX9-NEXT: [[TMP24:%.*]] = icmp eq i64 [[INDEX_NEXT]], 256
-; GFX9-NEXT: br i1 [[TMP24]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; GFX9-NEXT: [[WIDE_LOAD2:%.*]] = load <2 x half>, <2 x half> addrspace(1)* [[TMP3]], align 2
+; GFX9-NEXT: [[TMP4]] = fadd fast <2 x half> [[VEC_PHI]], [[WIDE_LOAD]]
+; GFX9-NEXT: [[TMP5]] = fadd fast <2 x half> [[VEC_PHI1]], [[WIDE_LOAD2]]
+; GFX9-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
+; GFX9-NEXT: [[TMP6:%.*]] = icmp eq i64 [[INDEX_NEXT]], 256
+; GFX9-NEXT: br i1 [[TMP6]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
 ; GFX9: middle.block:
-; GFX9-NEXT: [[BIN_RDX:%.*]] = fadd fast <2 x half> [[TMP17]], [[TMP16]]
-; GFX9-NEXT: [[BIN_RDX15:%.*]] = fadd fast <2 x half> [[TMP18]], [[BIN_RDX]]
-; GFX9-NEXT: [[BIN_RDX16:%.*]] = fadd fast <2 x half> [[TMP19]], [[BIN_RDX15]]
-; GFX9-NEXT: [[BIN_RDX17:%.*]] = fadd fast <2 x half> [[TMP20]], [[BIN_RDX16]]
-; GFX9-NEXT: [[BIN_RDX18:%.*]] = fadd fast <2 x half> [[TMP21]], [[BIN_RDX17]]
-; GFX9-NEXT: [[BIN_RDX19:%.*]] = fadd fast <2 x half> [[TMP22]], [[BIN_RDX18]]
-; GFX9-NEXT: [[BIN_RDX20:%.*]] = fadd fast <2 x half> [[TMP23]], [[BIN_RDX19]]
-; GFX9-NEXT: [[TMP25:%.*]] = call fast half @llvm.vector.reduce.fadd.v2f16(half 0xH8000, <2 x half> [[BIN_RDX20]])
+; GFX9-NEXT: [[BIN_RDX:%.*]] = fadd fast <2 x half> [[TMP5]], [[TMP4]]
+; GFX9-NEXT: [[TMP7:%.*]] = call fast half @llvm.vector.reduce.fadd.v2f16(half 0xH8000, <2 x half> [[BIN_RDX]])
 ; GFX9-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
 ; GFX9: scalar.ph:
 ; GFX9-NEXT: br label [[FOR_BODY:%.*]]
 ; GFX9: for.body:
 ; GFX9-NEXT: br i1 undef, label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP2:![0-9]+]]
 ; GFX9: for.end:
-; GFX9-NEXT: [[ADD_LCSSA:%.*]] = phi half [ undef, [[FOR_BODY]] ], [ [[TMP25]], [[MIDDLE_BLOCK]] ]
+; GFX9-NEXT: [[ADD_LCSSA:%.*]] = phi half [ undef, [[FOR_BODY]] ], [ [[TMP7]], [[MIDDLE_BLOCK]] ]
 ; GFX9-NEXT: ret half [[ADD_LCSSA]]
 ;
 ; VI-LABEL: @vectorize_v2f16_loop(
 ; VI-NEXT: entry:
 ; VI-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
 ; VI: vector.ph:
 ; VI-NEXT: br label [[VECTOR_BODY:%.*]]
 ; VI: vector.body:
 ; VI-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; VI-NEXT: [[VEC_PHI:%.*]] = phi <2 x half> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP16:%.*]], [[VECTOR_BODY]] ]
-; VI-NEXT: [[VEC_PHI1:%.*]] = phi <2 x half> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP17:%.*]], [[VECTOR_BODY]] ]
-; VI-NEXT: [[VEC_PHI2:%.*]] = phi <2 x half> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP18:%.*]], [[VECTOR_BODY]] ]
-; VI-NEXT: [[VEC_PHI3:%.*]] = phi <2 x half> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP19:%.*]], [[VECTOR_BODY]] ]
-; VI-NEXT: [[VEC_PHI4:%.*]] = phi <2 x half> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP20:%.*]], [[VECTOR_BODY]] ]
-; VI-NEXT: [[VEC_PHI5:%.*]] = phi <2 x half> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP21:%.*]], [[VECTOR_BODY]] ]
-; VI-NEXT: [[VEC_PHI6:%.*]] = phi <2 x half> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP22:%.*]], [[VECTOR_BODY]] ]
-; VI-NEXT: [[VEC_PHI7:%.*]] = phi <2 x half> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP23:%.*]], [[VECTOR_BODY]] ]
+; VI-NEXT: [[VEC_PHI:%.*]] = phi <2 x half> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP4:%.*]], [[VECTOR_BODY]] ]
+; VI-NEXT: [[VEC_PHI1:%.*]] = phi <2 x half> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP5:%.*]], [[VECTOR_BODY]] ]
 ; VI-NEXT: [[TMP0:%.*]] = getelementptr inbounds half, half addrspace(1)* [[S:%.*]], i64 [[INDEX]]
 ; VI-NEXT: [[TMP1:%.*]] = bitcast half addrspace(1)* [[TMP0]] to <2 x half> addrspace(1)*
 ; VI-NEXT: [[WIDE_LOAD:%.*]] = load <2 x half>, <2 x half> addrspace(1)* [[TMP1]], align 2
 ; VI-NEXT: [[TMP2:%.*]] = getelementptr inbounds half, half addrspace(1)* [[TMP0]], i64 2
 ; VI-NEXT: [[TMP3:%.*]] = bitcast half addrspace(1)* [[TMP2]] to <2 x half> addrspace(1)*
-; VI-NEXT: [[WIDE_LOAD8:%.*]] = load <2 x half>, <2 x half> addrspace(1)* [[TMP3]], align 2
-; VI-NEXT: [[TMP4:%.*]] = getelementptr inbounds half, half addrspace(1)* [[TMP0]], i64 4
-; VI-NEXT: [[TMP5:%.*]] = bitcast half addrspace(1)* [[TMP4]] to <2 x half> addrspace(1)*
-; VI-NEXT: [[WIDE_LOAD9:%.*]] = load <2 x half>, <2 x half> addrspace(1)* [[TMP5]], align 2
-; VI-NEXT: [[TMP6:%.*]] = getelementptr inbounds half, half addrspace(1)* [[TMP0]], i64 6
-; VI-NEXT: [[TMP7:%.*]] = bitcast half addrspace(1)* [[TMP6]] to <2 x half> addrspace(1)*
-; VI-NEXT: [[WIDE_LOAD10:%.*]] = load <2 x half>, <2 x half> addrspace(1)* [[TMP7]], align 2
-; VI-NEXT: [[TMP8:%.*]] = getelementptr inbounds half, half addrspace(1)* [[TMP0]], i64 8
-; VI-NEXT: [[TMP9:%.*]] = bitcast half addrspace(1)* [[TMP8]] to <2 x half> addrspace(1)*
-; VI-NEXT: [[WIDE_LOAD11:%.*]] = load <2 x half>, <2 x half> addrspace(1)* [[TMP9]], align 2
-; VI-NEXT: [[TMP10:%.*]] = getelementptr inbounds half, half addrspace(1)* [[TMP0]], i64 10
-; VI-NEXT: [[TMP11:%.*]] = bitcast half addrspace(1)* [[TMP10]] to <2 x half> addrspace(1)*
-; VI-NEXT: [[WIDE_LOAD12:%.*]] = load <2 x half>, <2 x half> addrspace(1)* [[TMP11]], align 2
-; VI-NEXT: [[TMP12:%.*]] = getelementptr inbounds half, half addrspace(1)* [[TMP0]], i64 12
-; VI-NEXT: [[TMP13:%.*]] = bitcast half addrspace(1)* [[TMP12]] to <2 x half> addrspace(1)*
-; VI-NEXT: [[WIDE_LOAD13:%.*]] = load <2 x half>, <2 x half> addrspace(1)* [[TMP13]], align 2
-; VI-NEXT: [[TMP14:%.*]] = getelementptr inbounds half, half addrspace(1)* [[TMP0]], i64 14
-; VI-NEXT: [[TMP15:%.*]] = bitcast half addrspace(1)* [[TMP14]] to <2 x half> addrspace(1)*
-; VI-NEXT: [[WIDE_LOAD14:%.*]] = load <2 x half>, <2 x half> addrspace(1)* [[TMP15]], align 2
-; VI-NEXT: [[TMP16]] = fadd fast <2 x half> [[VEC_PHI]], [[WIDE_LOAD]]
-; VI-NEXT: [[TMP17]] = fadd fast <2 x half> [[VEC_PHI1]], [[WIDE_LOAD8]]
-; VI-NEXT: [[TMP18]] = fadd fast <2 x half> [[VEC_PHI2]], [[WIDE_LOAD9]]
-; VI-NEXT: [[TMP19]] = fadd fast <2 x half> [[VEC_PHI3]], [[WIDE_LOAD10]]
-; VI-NEXT: [[TMP20]] = fadd fast <2 x half> [[VEC_PHI4]], [[WIDE_LOAD11]]
-; VI-NEXT: [[TMP21]] = fadd fast <2 x half> [[VEC_PHI5]], [[WIDE_LOAD12]]
-; VI-NEXT: [[TMP22]] = fadd fast <2 x half> [[VEC_PHI6]], [[WIDE_LOAD13]]
-; VI-NEXT: [[TMP23]] = fadd fast <2 x half> [[VEC_PHI7]], [[WIDE_LOAD14]]
-; VI-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16
-; VI-NEXT: [[TMP24:%.*]] = icmp eq i64 [[INDEX_NEXT]], 256
-; VI-NEXT: br i1 [[TMP24]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; VI-NEXT: [[WIDE_LOAD2:%.*]] = load <2 x half>, <2 x half> addrspace(1)* [[TMP3]], align 2
+; VI-NEXT: [[TMP4]] = fadd fast <2 x half> [[VEC_PHI]], [[WIDE_LOAD]]
+; VI-NEXT: [[TMP5]] = fadd fast <2 x half> [[VEC_PHI1]], [[WIDE_LOAD2]]
+; VI-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
+; VI-NEXT: [[TMP6:%.*]] = icmp eq i64 [[INDEX_NEXT]], 256
+; VI-NEXT: br i1 [[TMP6]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
 ; VI: middle.block:
-; VI-NEXT: [[BIN_RDX:%.*]] = fadd fast <2 x half> [[TMP17]], [[TMP16]]
-; VI-NEXT: [[BIN_RDX15:%.*]] = fadd fast <2 x half> [[TMP18]], [[BIN_RDX]]
-; VI-NEXT: [[BIN_RDX16:%.*]] = fadd fast <2 x half> [[TMP19]], [[BIN_RDX15]]
-; VI-NEXT: [[BIN_RDX17:%.*]] = fadd fast <2 x half> [[TMP20]], [[BIN_RDX16]]
-; VI-NEXT: [[BIN_RDX18:%.*]] = fadd fast <2 x half> [[TMP21]], [[BIN_RDX17]]
-; VI-NEXT: [[BIN_RDX19:%.*]] = fadd fast <2 x half> [[TMP22]], [[BIN_RDX18]]
-; VI-NEXT: [[BIN_RDX20:%.*]] = fadd fast <2 x half> [[TMP23]], [[BIN_RDX19]]
-; VI-NEXT: [[TMP25:%.*]] = call fast half @llvm.vector.reduce.fadd.v2f16(half 0xH8000, <2 x half> [[BIN_RDX20]])
+; VI-NEXT: [[BIN_RDX:%.*]] = fadd fast <2 x half> [[TMP5]], [[TMP4]]
+; VI-NEXT: [[TMP7:%.*]] = call fast half @llvm.vector.reduce.fadd.v2f16(half 0xH8000, <2 x half> [[BIN_RDX]])
 ; VI-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
 ; VI: scalar.ph:
 ; VI-NEXT: br label [[FOR_BODY:%.*]]
 ; VI: for.body:
 ; VI-NEXT: br i1 undef, label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP2:![0-9]+]]
 ; VI: for.end:
-; VI-NEXT: [[ADD_LCSSA:%.*]] = phi half [ undef, [[FOR_BODY]] ], [ [[TMP25]], [[MIDDLE_BLOCK]] ]
+; VI-NEXT: [[ADD_LCSSA:%.*]] = phi half [ undef, [[FOR_BODY]] ], [ [[TMP7]], [[MIDDLE_BLOCK]] ]
 ; VI-NEXT: ret half [[ADD_LCSSA]]
 ;
 ; CI-LABEL: @vectorize_v2f16_loop(
 ; CI-NEXT: entry:
 ; CI-NEXT: br label [[FOR_BODY:%.*]]
 ; CI: for.body:
 ; CI-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
 ; CI-NEXT: [[Q_04:%.*]] = phi half [ 0xH0000, [[ENTRY]] ], [ [[ADD:%.*]], [[FOR_BODY]] ]
(26 lines not shown)