This is an archive of the discontinued LLVM Phabricator instance.


Differential D152706

[AMDGPU] Use SSAUpdater in PromoteAlloca
ClosedPublic

Authored by Pierre-vh on Jun 12 2023, 6:06 AM.


Details

Reviewers

arsenm
foad
rampitec

Group Reviewers

Restricted Project

Commits

rG3890a3b11398: [AMDGPU] Use SSAUpdater in PromoteAlloca
rG091bfa76db64: [AMDGPU] Use SSAUpdater in PromoteAlloca

Summary

This removes PromoteAlloca's reliance on a second SROA run to eliminate the alloca completely. The pass now does the full transformation directly.

Note that PromoteAlloca is still reliant on SROA running first to canonicalize the IR. For instance, PromoteAlloca will no longer handle aggregate types because those should be simplified by SROA before reaching the pass.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

Pierre-vh created this revision. Jun 12 2023, 6:06 AM

Herald added a project: Restricted Project. Jun 12 2023, 6:06 AM

Herald added subscribers: StephenFan, kerbowa, zzheng and 7 others.

Pierre-vh requested review of this revision. Jun 12 2023, 6:06 AM

Herald added a project: Restricted Project. Jun 12 2023, 6:06 AM

Herald added subscribers: llvm-commits, wdng.

Harbormaster completed remote builds in B238170: Diff 530483. Jun 12 2023, 6:44 AM

A follow-up patch should remove the SROA run from the pass pipeline.

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
134	I think this is an aggressive statement. Why not just preserve DominatorTree? (Also, I'd expect that to be implied by setPreservesCFG.)
770–772	I don't understand why you would need to do this. You aren't moving any instructions around, so the dominance properties should be implied by the original values.
785	use_empty

Pierre-vh marked 2 inline comments as done. Jun 13 2023, 12:18 AM

Pierre-vh added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
770–772	If I don't sort by dominance, I have multiple crashes in `promote-alloca-array-aggregate.ll`. It seems mostly related to memcpy though, which is why I put a TODO saying the lowering may be incorrect. I'm still not sure why. I agree that requiring the DT for this is annoying, I'll try to think of ways to avoid it.

Address some comments

Harbormaster completed remote builds in B238406: Diff 530791. Jun 13 2023, 1:06 AM

Remove domtree dependency

Harbormaster completed remote builds in B238416: Diff 530807. Jun 13 2023, 2:26 AM

ping

I just came across this when trying to figure out what your issue is:

/// Helper class for promoting a collection of loads and stores into SSA
/// Form using the SSAUpdater.
///
/// This handles complexities that SSAUpdater doesn't, such as multiple loads
/// and stores in one block.
///
/// Clients of this class are expected to subclass this and implement the
/// virtual methods.
class LoadAndStorePromoter {

If you move the vector insert/extract details into an implementation of this, do the dominance issues disappear?

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
341–358	This whole thing feels wrong to me. You shouldn't need to figure out any ordering, much less by using comesBefore repeatedly

Pierre-vh added inline comments. Jun 19 2023, 4:58 AM

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
341–358	I think that the issue was introduced because I'm looking through things such as bitcasts, which means that we can have IR like this:

bb1:
  %x = alloca ...
  %cast.x = cast %x to ...
bb2:
  store ... %x
  store ... %cast.x

Then, when I iterate the users of the alloca and build the worklist, the worklist may have the store to %cast.x before the store to %x, so I need to sort the worklist to use SSAUpdater. I'm not sure how `LoadAndStorePromoter` would help; I've looked at it but it seems to be for a different purpose.

arsenm added inline comments. Jun 19 2023, 5:46 AM

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
341–358	I applied your patch and not that many tests fail. The one I'm looking at is promote_memcpy_inline_aggr. This has the multiple loads and stores in the same block case that the LoadAndStorePromoter comment says the base SSAUpdater does not handle. I think this is probably the correct tool; you just have the added complexity that we're coercing the type of the accesses to vector instead of reusing the original.

arsenm added inline comments. Jun 19 2023, 6:01 AM

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
775	I think this is subtly wrong and should actually be undef. Alloca memory is currently initialized with undef, not poison.

Address some comments

Herald added a subscriber: nlopes. Jun 20 2023, 2:25 AM

Pierre-vh added inline comments. Jun 20 2023, 2:25 AM

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
341–358	`LoadAndStorePromoter` would need some modifications to work here, or I would need to first rewrite all of the load/stores to use vectors and then run promote alloca later to do the promotion. For instance, `LoadAndStorePromoter` initializes SSAUpdater with the type of the first load/store. I don't mind using it if you think it's better to first rewrite all the load/stores and then run `LoadAndStorePromoter`, but IMO it's really not necessary: we'd make things more complex (and less direct) just to avoid a 5-line function (I heavily simplified it).

Harbormaster completed remote builds in B239946: Diff 532835. Jun 20 2023, 4:18 AM

arsenm added inline comments. Jun 21 2023, 5:19 PM

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
341–358	I think the real question is how does LoadAndStorePromoter deal with multiple loads and stores in the same block? I doubt it's using a worklist sort. Can you use whatever technique it's using?

Pierre-vh added inline comments. Jun 22 2023, 12:20 AM

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
341–358	It uses a system of buckets:

// First step: bucket up uses of the alloca by the block they occur in.
// This is important because we have to handle multiple defs/uses in a block
// ourselves: SSAUpdater is purely for cross-block references.
DenseMap<BasicBlock *, TinyPtrVector<Instruction *>> UsesByBlock;

And if, when promoting an inst, it sees there's more than one inst in the bucket, it does a linear scan of the basic block to find & promote all users. I used something similar at first but I found it more complicated, when in this case a simple sorted worklist does the trick. The only important part really is that it sees users in a given basic block in program order. There are multiple ways to achieve that. If you're more comfortable with that approach then I can use it too, and just simplify it as much as possible for our use case.
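
For readers following the thread, here is a minimal, self-contained sketch of the bucketing idea described in the comment above. It uses toy types rather than LLVM's classes: `ToyInst`, `ToyFunction` and `orderUsers` are hypothetical names that exist neither in LLVM nor in this patch. As a simplification, the sketch always scans the block, whereas the comment describes scanning only when a bucket holds more than one instruction.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for an instruction; not an LLVM type.
struct ToyInst {
  std::string Name;
  int Block; // id of the parent basic block
};

// Toy function: block id -> instructions in program order.
using ToyFunction = std::map<int, std::vector<ToyInst>>;

// Returns the names of the alloca users so that users sharing a block come
// out in program order, whatever the order of the incoming worklist was.
std::vector<std::string> orderUsers(const ToyFunction &F,
                                    const std::vector<ToyInst> &WorkList) {
  // First step: bucket up the users by the block they occur in.
  std::map<int, std::set<std::string>> UsesByBlock;
  for (const ToyInst &I : WorkList)
    UsesByBlock[I.Block].insert(I.Name);

  std::vector<std::string> Ordered;
  for (const auto &Bucket : UsesByBlock) {
    // Second step: one linear scan of the block visits its users in program
    // order; instructions that are not in the bucket are skipped.
    for (const ToyInst &I : F.at(Bucket.first))
      if (Bucket.second.count(I.Name))
        Ordered.push_back(I.Name);
  }
  return Ordered;
}
```

With the IR shape from the earlier comment (a store through `%x` followed by a store through `%cast.x` in the same block), a worklist that lists the two stores in the wrong order still comes out in program order, without any sorting or dominance queries.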

arsenm requested changes to this revision. Jun 26 2023, 11:12 AM

arsenm added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
341–358	I think that makes more sense; sorting isn't ideal.

This revision now requires changes to proceed. Jun 26 2023, 11:12 AM

Use logic similar to LoadAndStorePromoter

Harbormaster completed remote builds in B241397: Diff 534867. Jun 27 2023, 2:33 AM

arsenm accepted this revision. Jun 27 2023, 3:30 PM

arsenm added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
375	Capitalize

This revision is now accepted and ready to land. Jun 27 2023, 3:30 PM

Closed by commit rG091bfa76db64: [AMDGPU] Use SSAUpdater in PromoteAlloca (authored by Pierre-vh). Jun 27 2023, 11:12 PM

This revision was automatically updated to reflect the committed changes.

Pierre-vh marked an inline comment as done.

Pierre-vh added a commit: rG091bfa76db64: [AMDGPU] Use SSAUpdater in PromoteAlloca.

Hey Pierre, I believe this broke the AMDGPU OpenMP buildbot https://lab.llvm.org/buildbot/#/builders/193

In D152706#4455229, @jplehr wrote:

Hey Pierre, I believe this broke the AMDGPU OpenMP buildbot https://lab.llvm.org/buildbot/#/builders/193

Looking into it

Pierre-vh added a reverting change: rG7007b9934001: Revert "[AMDGPU] Use SSAUpdater in PromoteAlloca". Jun 28 2023, 2:14 AM

complex_reduction.cpp fails on GFX9 in openmp tests: https://lab.llvm.org/buildbot/#/builders/193/builds/33759

This revision is now accepted and ready to land. Jun 28 2023, 2:14 AM

Pierre-vh planned changes to this revision. Jun 28 2023, 2:15 AM

I think I figured out why the failure was happening: it's due to improper handling of load/store promotion.
However, LoadAndStorePromoter can't save us here because we also handle partial load/stores, i.e. we can load/store a single element of the vector. That complicates things quite a bit: we can't get away with deferring loads/stores every time, and we'll need to keep track of lanes and make decisions based on which lane is being hit.

I'm working on a fix but it might take a bit of time to get right.

Fix openmp tests

In the end I didn't need to do anything complicated with lanes and such. I just had to use the same system as LoadAndStorePromoter, which is to defer the lowering of live-in loads to a second pass.
Now, some dummy loads are also inserted on the first pass for instructions that must be promoted on the first pass but also need to know the current vector value.

It leads to less folding than I'd like, but it's a lot more correct.
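
A rough, self-contained sketch of the two-pass scheme described above, under loudly stated assumptions: the types and functions (`ToyOp`, `ToyBlock`, `makeStore`, `makeLoad`, `promote`) are hypothetical, blocks form a straight-line chain so the live-in value is simply the nearest earlier live-out, and `Initial` is an arbitrary stand-in for the uninitialized alloca contents. The actual patch gets live-in values from SSAUpdater and also handles element stores through dummy loads, which this toy omits.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec4 = std::array<int, 4>;

// Toy access to the promoted alloca; not an LLVM type.
struct ToyOp {
  enum Kind { StoreFull, LoadElt };
  Kind K = StoreFull;
  Vec4 Stored{};        // value written by a StoreFull
  unsigned Idx = 0;     // lane read by a LoadElt
  bool Lowered = false; // set once a LoadElt has been given a value
  int Result = 0;       // value the LoadElt was replaced with
};

ToyOp makeStore(const Vec4 &V) {
  ToyOp Op;
  Op.Stored = V;
  return Op;
}

ToyOp makeLoad(unsigned Idx) {
  ToyOp Op;
  Op.K = ToyOp::LoadElt;
  Op.Idx = Idx;
  return Op;
}

using ToyBlock = std::vector<ToyOp>;

// Assumption: block i falls through to block i + 1 (no branches, no phis).
// Returns how many loads had to wait for the second pass.
std::size_t promote(std::vector<ToyBlock> &Blocks, const Vec4 &Initial) {
  struct Deferred {
    std::size_t Block, Index;
  };
  std::vector<Deferred> DeferredLoads;
  std::vector<Vec4> LiveOut(Blocks.size());
  std::vector<bool> HasLiveOut(Blocks.size(), false);

  // First pass: walk each block in program order. A load is lowered right
  // away only if an earlier store in the same block made the value known.
  for (std::size_t B = 0; B < Blocks.size(); ++B) {
    bool Known = false;
    Vec4 CurVal{};
    for (std::size_t I = 0; I < Blocks[B].size(); ++I) {
      ToyOp &Op = Blocks[B][I];
      if (Op.K == ToyOp::StoreFull) {
        CurVal = Op.Stored;
        Known = true;
      } else if (Known) {
        Op.Result = CurVal[Op.Idx];
        Op.Lowered = true;
      } else {
        DeferredLoads.push_back({B, I}); // live-in load: value not known yet
      }
    }
    LiveOut[B] = CurVal;
    HasLiveOut[B] = Known;
  }

  // Second pass: every block's live-out is recorded now, so the deferred
  // loads can be resolved from the value flowing into their block.
  for (const Deferred &D : DeferredLoads) {
    Vec4 LiveIn = Initial;
    for (std::size_t B = D.Block; B-- > 0;) {
      if (HasLiveOut[B]) {
        LiveIn = LiveOut[B];
        break;
      }
    }
    ToyOp &Op = Blocks[D.Block][D.Index];
    Op.Result = LiveIn[Op.Idx];
    Op.Lowered = true;
  }
  return DeferredLoads.size();
}
```

In this model a load that follows a store in its own block is lowered immediately, while a load at the top of a block waits until the second pass, when the value stored by an earlier block is available.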

This revision is now accepted and ready to land. Jun 29 2023, 12:29 AM

Pierre-vh requested review of this revision. Jun 29 2023, 12:30 AM

Harbormaster completed remote builds in B241992: Diff 535656. Jun 29 2023, 1:28 AM

There is still one more failure to debug in some app: an assertion can sometimes trip in castIsValid. I'm trying to get a testcase.

In D152706#4458582, @Pierre-vh wrote:

It leads to less folding than I'd like, but it's a lot more correct.

Do you have an example of this?

In D152706#4459459, @arsenm wrote:

In D152706#4458582, @Pierre-vh wrote:

It leads to less folding than I'd like, but it's a lot more correct.

Do you have an example of this?

Look at the first testcase in promote-alloca-loadstores.ll: one of the insertelement instructions is useless (it gets overwritten right away).
These are very minor things, so I don't think it's a big issue.

Fix ptr load/stores issues + add test

Herald added a subscriber: wangpc. Jul 11 2023, 2:57 AM

arsenm added inline comments. Jul 11 2023, 5:21 AM

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
391	Doesn't consider vector of pointers
426	Specifically use PtrToInt?
llvm/test/CodeGen/AMDGPU/promote-alloca-loadstores.ll
90	Add some load/store vector of pointer cases. Also mix different pointer sizes

Harbormaster completed remote builds in B244394: Diff 538990. Jul 11 2023, 5:48 AM

Add 32 bit pointer test + mixed test

Pierre-vh marked an inline comment as done. Jul 13 2023, 5:39 AM

Pierre-vh added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
391	What case do you have in mind? This code path is just for loading the full vector. I initially had a test for a `<1 x ptr>` vector but we don't vectorize under 2 elements so it never worked.

arsenm added inline comments. Jul 13 2023, 6:45 AM

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
391	`store <2 x ptr>`, `load <4 x ptr addrspace(3)>`, things like that

Good catch with the ptr vector type. I wasn't able to come up with a test for it earlier so I thought it was fine, but it was indeed crashing.
Now it's fixed and I added a testcase for it.
Now it's fixed and I added a testcase for it.

arsenm added inline comments. Jul 13 2023, 8:34 AM

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
463	no else after return
llvm/test/CodeGen/AMDGPU/fix-frame-reg-in-custom-csr-spills.ll
67	Why did this change? It only uses volatile accesses
llvm/test/CodeGen/AMDGPU/promote-alloca-loadstores.ll
128	There was a recent bug filed that amounts to not handling this (it didn't use pointers, but just different sized vectors)

Harbormaster completed remote builds in B245121: Diff 540034. Jul 13 2023, 9:40 AM

Pierre-vh marked 3 inline comments as done. Jul 14 2023, 12:10 AM

Pierre-vh added inline comments.

llvm/test/CodeGen/AMDGPU/fix-frame-reg-in-custom-csr-spills.ll
67	We support non-simple accesses of the whole vector; it's volatile accesses of a single element that we don't support.
llvm/test/CodeGen/AMDGPU/promote-alloca-loadstores.ll
128	Yes, I just saw it. I'd rather fix this in a separate patch; this patch is already quite large, and if I do too much in it I'm afraid it'll make potential issues harder to track down. I think we just need to use something other than `isBitOrNoopPointerCastable`. It's too limited because it doesn't take into account that we can use an intermediate cast (like the cast to int for ptr -> vec).
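
The intermediate cast mentioned above can be illustrated with the size arithmetic used by the `CreateTempPtrIntCast` lambda in the diff for this revision. The sketch below only models that arithmetic in plain C++; `TempIntType` and `getTempIntType` are hypothetical names introduced for illustration and are not part of LLVM or of the patch.

```cpp
#include <cassert>

// Hypothetical description of the integer type used as the intermediate step.
struct TempIntType {
  unsigned NumElts; // 1 means a scalar integer
  unsigned EltBits; // width of each integer element in bits
};

// VecStoreSize: store size of the promoted vector in bytes.
// NumPtrElts: number of pointers in the accessed value (1 for a scalar ptr).
// The pointer value is first cast to an integer (vector) of the same total
// width, which can then be bitcast to the promoted vector type.
TempIntType getTempIntType(unsigned VecStoreSize, unsigned NumPtrElts) {
  const unsigned TempIntSize = VecStoreSize * 8;
  assert(NumPtrElts != 0 && TempIntSize % NumPtrElts == 0 &&
         "Vector size not divisible");
  return {NumPtrElts, TempIntSize / NumPtrElts};
}
```

For example, accessing a `<2 x ptr>` (64-bit pointers) on an alloca promoted to `<4 x i32>` (16 bytes) goes through `<2 x i64>`, which matches the case named in the lambda's own comment.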

Remove elses after returns

Harbormaster completed remote builds in B245300: Diff 540292. Jul 14 2023, 1:02 AM

Pierre-vh added a child revision: D155699: [AMDGPU] Allow vector access types in PromoteAllocaToVector. Jul 19 2023, 6:33 AM

Pierre-vh added inline comments. Jul 20 2023, 7:08 AM

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
609	This will crash if we load/store a value at an offset, and the access type is the same size as the alloca. Should we check that the index is zero as well here?

arsenm added inline comments. Jul 20 2023, 7:47 AM

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
609	Probably, need to not break on out of bounds access

Pierre-vh marked an inline comment as done. Jul 24 2023, 12:14 AM

Pierre-vh added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
609	Oops, wrong place: it already does it (`&Ptr == &Alloca`). It's something in D155699.

arsenm added inline comments. Jul 24 2023, 6:14 AM

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
367	Use poison for the filler
449	You already have the cast<MemTransferInst> as MTI
479	Dead break

Comments

arsenm accepted this revision. Jul 24 2023, 6:26 AM

This revision is now accepted and ready to land. Jul 24 2023, 6:26 AM

Harbormaster completed remote builds in B247651: Diff 543512. Jul 24 2023, 10:14 AM

This revision was landed with ongoing or failed builds. Jul 24 2023, 10:45 PM

Closed by commit rG3890a3b11398: [AMDGPU] Use SSAUpdater in PromoteAlloca (authored by Pierre-vh).

This revision was automatically updated to reflect the committed changes.

Pierre-vh added a commit: rG3890a3b11398: [AMDGPU] Use SSAUpdater in PromoteAlloca.

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp	384 lines
llvm/test/CodeGen/AMDGPU/fix-frame-reg-in-custom-csr-spills.ll	31 lines
llvm/test/CodeGen/AMDGPU/promote-alloca-array-aggregate.ll	197 lines
llvm/test/CodeGen/AMDGPU/promote-alloca-globals.ll	2 lines
llvm/test/CodeGen/AMDGPU/promote-alloca-loadstores.ll	161 lines
llvm/test/CodeGen/AMDGPU/promote-alloca-memset.ll	35 lines
llvm/test/CodeGen/AMDGPU/promote-alloca-pointer-array.ll	13 lines
llvm/test/CodeGen/AMDGPU/promote-alloca-vector-to-vector.ll	50 lines
llvm/test/CodeGen/AMDGPU/sroa-before-unroll.ll	2 lines
llvm/test/CodeGen/AMDGPU/vector-alloca-bitcast.ll	36 lines

Diff 543824

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp

Show All 22 Lines
// Note that both of them exist for the old and new PMs. The new PM passes are		// Note that both of them exist for the old and new PMs. The new PM passes are
// declared in AMDGPU.h and the legacy PM ones are declared here.s		// declared in AMDGPU.h and the legacy PM ones are declared here.s
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "AMDGPU.h"		#include "AMDGPU.h"
#include "GCNSubtarget.h"		#include "GCNSubtarget.h"
#include "Utils/AMDGPUBaseInfo.h"		#include "Utils/AMDGPUBaseInfo.h"
		#include "llvm/ADT/STLExtras.h"
#include "llvm/Analysis/CaptureTracking.h"		#include "llvm/Analysis/CaptureTracking.h"
		#include "llvm/Analysis/InstSimplifyFolder.h"
		#include "llvm/Analysis/InstructionSimplify.h"
#include "llvm/Analysis/ValueTracking.h"		#include "llvm/Analysis/ValueTracking.h"
#include "llvm/CodeGen/TargetPassConfig.h"		#include "llvm/CodeGen/TargetPassConfig.h"
#include "llvm/IR/IRBuilder.h"		#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/IntrinsicInst.h"		#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/IntrinsicsAMDGPU.h"		#include "llvm/IR/IntrinsicsAMDGPU.h"
#include "llvm/IR/IntrinsicsR600.h"		#include "llvm/IR/IntrinsicsR600.h"
#include "llvm/IR/PatternMatch.h"		#include "llvm/IR/PatternMatch.h"
#include "llvm/Pass.h"		#include "llvm/Pass.h"
#include "llvm/Target/TargetMachine.h"		#include "llvm/Target/TargetMachine.h"
		#include "llvm/Transforms/Utils/SSAUpdater.h"

#define DEBUG_TYPE "amdgpu-promote-alloca"		#define DEBUG_TYPE "amdgpu-promote-alloca"

using namespace llvm;		using namespace llvm;

namespace {		namespace {

static cl::opt<bool> DisablePromoteAllocaToVector(		static cl::opt<bool>
"disable-promote-alloca-to-vector",		DisablePromoteAllocaToVector("disable-promote-alloca-to-vector",
cl::desc("Disable promote alloca to vector"),		cl::desc("Disable promote alloca to vector"),
cl::init(false));		cl::init(false));

static cl::opt<bool> DisablePromoteAllocaToLDS(		static cl::opt<bool>
"disable-promote-alloca-to-lds",		DisablePromoteAllocaToLDS("disable-promote-alloca-to-lds",
cl::desc("Disable promote alloca to LDS"),		cl::desc("Disable promote alloca to LDS"),
cl::init(false));		cl::init(false));

static cl::opt<unsigned> PromoteAllocaToVectorLimit(		static cl::opt<unsigned> PromoteAllocaToVectorLimit(
"amdgpu-promote-alloca-to-vector-limit",		"amdgpu-promote-alloca-to-vector-limit",
cl::desc("Maximum byte size to consider promote alloca to vector"),		cl::desc("Maximum byte size to consider promote alloca to vector"),
cl::init(0));		cl::init(0));

// Shared implementation which can do both promotion to vector and to LDS.		// Shared implementation which can do both promotion to vector and to LDS.
class AMDGPUPromoteAllocaImpl {		class AMDGPUPromoteAllocaImpl {
private:		private:
const TargetMachine &TM;		const TargetMachine &TM;
Module *Mod = nullptr;		Module *Mod = nullptr;
const DataLayout *DL = nullptr;		const DataLayout *DL = nullptr;

// FIXME: This should be per-kernel.		// FIXME: This should be per-kernel.
uint32_t LocalMemLimit = 0;		uint32_t LocalMemLimit = 0;
uint32_t CurrentLocalMemUsage = 0;		uint32_t CurrentLocalMemUsage = 0;
unsigned MaxVGPRs;		unsigned MaxVGPRs;

bool IsAMDGCN = false;		bool IsAMDGCN = false;
bool IsAMDHSA = false;		bool IsAMDHSA = false;

std::pair<Value *, Value *> getLocalSizeYZ(IRBuilder<> &Builder);		std::pair<Value *, Value *> getLocalSizeYZ(IRBuilder<> &Builder);
Value *getWorkitemID(IRBuilder<> &Builder, unsigned N);		Value *getWorkitemID(IRBuilder<> &Builder, unsigned N);

/// BaseAlloca is the alloca root the search started from.		/// BaseAlloca is the alloca root the search started from.
/// Val may be that alloca or a recursive user of it.		/// Val may be that alloca or a recursive user of it.
bool collectUsesWithPtrTypes(Value *BaseAlloca,		bool collectUsesWithPtrTypes(Value *BaseAlloca, Value *Val,
Value *Val,
std::vector<Value*> &WorkList) const;		std::vector<Value *> &WorkList) const;

/// Val is a derived pointer from Alloca. OpIdx0/OpIdx1 are the operand		/// Val is a derived pointer from Alloca. OpIdx0/OpIdx1 are the operand
/// indices to an instruction with 2 pointer inputs (e.g. select, icmp).		/// indices to an instruction with 2 pointer inputs (e.g. select, icmp).
/// Returns true if both operands are derived from the same alloca. Val should		/// Returns true if both operands are derived from the same alloca. Val should
/// be the same value as one of the input operands of UseInst.		/// be the same value as one of the input operands of UseInst.
bool binaryOpIsDerivedFromSameAlloca(Value *Alloca, Value *Val,		bool binaryOpIsDerivedFromSameAlloca(Value *Alloca, Value *Val,
Instruction *UseInst,		Instruction *UseInst, int OpIdx0,
int OpIdx0, int OpIdx1) const;		int OpIdx1) const;

/// Check whether we have enough local memory for promotion.		/// Check whether we have enough local memory for promotion.
bool hasSufficientLocalMem(const Function &F);		bool hasSufficientLocalMem(const Function &F);

bool tryPromoteAllocaToVector(AllocaInst &I);		bool tryPromoteAllocaToVector(AllocaInst &I);
bool tryPromoteAllocaToLDS(AllocaInst &I, bool SufficientLDS);		bool tryPromoteAllocaToLDS(AllocaInst &I, bool SufficientLDS);

public:		public:
Show All 21 Lines	if (auto *TPC = getAnalysisIfAvailable<TargetPassConfig>())
.run(F, /*PromoteToLDS*/ true);		.run(F, /*PromoteToLDS*/ true);
return false;		return false;
}		}

StringRef getPassName() const override { return "AMDGPU Promote Alloca"; }		StringRef getPassName() const override { return "AMDGPU Promote Alloca"; }

void getAnalysisUsage(AnalysisUsage &AU) const override {		void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.setPreservesCFG();		AU.setPreservesCFG();
FunctionPass::getAnalysisUsage(AU);		FunctionPass::getAnalysisUsage(AU);
}		}
};		};

class AMDGPUPromoteAllocaToVector : public FunctionPass {		class AMDGPUPromoteAllocaToVector : public FunctionPass {
public:		public:
static char ID;		static char ID;

AMDGPUPromoteAllocaToVector() : FunctionPass(ID) {}		AMDGPUPromoteAllocaToVector() : FunctionPass(ID) {}
▲ Show 20 Lines • Show All 108 Lines • ▼ Show 20 Lines	bool AMDGPUPromoteAllocaImpl::run(Function &F, bool PromoteToLDS) {
bool Changed = false;		bool Changed = false;
for (AllocaInst *AI : Allocas) {		for (AllocaInst *AI : Allocas) {
if (tryPromoteAllocaToVector(*AI))		if (tryPromoteAllocaToVector(*AI))
Changed = true;		Changed = true;
else if (PromoteToLDS && tryPromoteAllocaToLDS(*AI, SufficientLDS))		else if (PromoteToLDS && tryPromoteAllocaToLDS(*AI, SufficientLDS))
Changed = true;		Changed = true;
}		}

		// NOTE: tryPromoteAllocaToVector removes the alloca, so Allocas contains
		// dangling pointers. If we want to reuse it past this point, the loop above
		// would need to be updated to remove successfully promoted allocas.

return Changed;		return Changed;
}		}

struct MemTransferInfo {		struct MemTransferInfo {
ConstantInt *SrcIndex = nullptr;		ConstantInt *SrcIndex = nullptr;
ConstantInt *DestIndex = nullptr;		ConstantInt *DestIndex = nullptr;
};		};

// Checks if the instruction I is a memset user of the alloca AI that we can		// Checks if the instruction I is a memset user of the alloca AI that we can
// deal with. Currently, only non-volatile memsets that affect the whole alloca		// deal with. Currently, only non-volatile memsets that affect the whole alloca
// are handled.		// are handled.
static bool isSupportedMemset(MemSetInst *I, AllocaInst *AI,		static bool isSupportedMemset(MemSetInst *I, AllocaInst *AI,
const DataLayout &DL) {		const DataLayout &DL) {
using namespace PatternMatch;		using namespace PatternMatch;
// For now we only care about non-volatile memsets that affect the whole type		// For now we only care about non-volatile memsets that affect the whole type
// (start at index 0 and fill the whole alloca).		// (start at index 0 and fill the whole alloca).
		//
		// TODO: Now that we moved to PromoteAlloca we could handle any memsets
		// (except maybe volatile ones?) - we just need to use shufflevector if it
		// only affects a subset of the vector.
const unsigned Size = DL.getTypeStoreSize(AI->getAllocatedType());		const unsigned Size = DL.getTypeStoreSize(AI->getAllocatedType());
return I->getOperand(0) == AI &&		return I->getOperand(0) == AI &&
match(I->getOperand(2), m_SpecificInt(Size)) && !I->isVolatile();		match(I->getOperand(2), m_SpecificInt(Size)) && !I->isVolatile();
}		}

static Value *		static Value *
calculateVectorIndex(Value *Ptr,		calculateVectorIndex(Value *Ptr,
const std::map<GetElementPtrInst *, Value *> &GEPIdx) {		const std::map<GetElementPtrInst *, Value *> &GEPIdx) {
Show All 34 Lines	static Value *GEPToVectorIndex(GetElementPtrInst *GEP, AllocaInst *Alloca,
uint64_t Rem;		uint64_t Rem;
APInt::udivrem(ConstOffset, VecElemSize, Quot, Rem);		APInt::udivrem(ConstOffset, VecElemSize, Quot, Rem);
if (Rem != 0)		if (Rem != 0)
return nullptr;		return nullptr;

return ConstantInt::get(GEP->getContext(), Quot);		return ConstantInt::get(GEP->getContext(), Quot);
}		}

		/// Promotes a single user of the alloca to a vector form.
		///
		/// \param Inst Instruction to be promoted.
		/// \param DL Module Data Layout.
		/// \param VectorTy Vectorized Type.
		/// \param VecStoreSize Size of \p VectorTy in bytes.
		/// \param ElementSize Size of \p VectorTy element type in bytes.
		/// \param TransferInfo MemTransferInst info map.
		/// \param GEPVectorIdx GEP -> VectorIdx cache.
		/// \param CurVal Current value of the vector (e.g. last stored value)
		/// \param[out] DeferredLoads \p Inst is added to this vector if it can't
		/// be promoted now. This happens when promoting requires \p
		/// CurVal, but \p CurVal is nullptr.
		/// \return the stored value if \p Inst would have written to the alloca, or
		/// nullptr otherwise.
		static Value *promoteAllocaUserToVector(
		Instruction *Inst, const DataLayout &DL, FixedVectorType *VectorTy,
		unsigned VecStoreSize, unsigned ElementSize,
		DenseMap<MemTransferInst *, MemTransferInfo> &TransferInfo,
		std::map<GetElementPtrInst *, Value *> &GEPVectorIdx, Value *CurVal,
		SmallVectorImpl<LoadInst *> &DeferredLoads) {
		// Note: we use InstSimplifyFolder because it can leverage the DataLayout
		// to do more folding, especially in the case of vector splats.
		IRBuilder<InstSimplifyFolder> Builder(Inst->getContext(),
		InstSimplifyFolder(DL));
		Builder.SetInsertPoint(Inst);

		const auto GetOrLoadCurrentVectorValue = [&]() -> Value * {
		if (CurVal)
		return CurVal;

		// If the current value is not known, insert a dummy load and lower it on
		// the second pass.
		LoadInst *Dummy =
		Builder.CreateLoad(VectorTy, PoisonValue::get(Builder.getPtrTy()),
		"promotealloca.dummyload");
		DeferredLoads.push_back(Dummy);
		return Dummy;
		};

		const auto CreateTempPtrIntCast =
		[&Builder, VecStoreSize](Value *Val, Type *PtrTy) -> Value * {
		const unsigned TempIntSize = (VecStoreSize * 8);
		if (!PtrTy->isVectorTy())
		return Builder.CreateBitOrPointerCast(Val,
		Builder.getIntNTy(TempIntSize));
		const unsigned NumPtrElts = cast<FixedVectorType>(PtrTy)->getNumElements();
		// If we want to cast to cast, e.g. a <2 x ptr> into a <4 x i32>, we need to
		// first cast the ptr vector to <2 x i64>.
		assert(alignTo(TempIntSize, NumPtrElts) == TempIntSize &&
		"Vector size not divisble");
		Type *EltTy = Builder.getIntNTy(TempIntSize / NumPtrElts);
		return Builder.CreateBitOrPointerCast(
		Val, FixedVectorType::get(EltTy, NumPtrElts));
		};

		Type *VecEltTy = VectorTy->getElementType();
		switch (Inst->getOpcode()) {
		case Instruction::Load: {
		// Loads can only be lowered if the value is known.
		if (!CurVal) {
		DeferredLoads.push_back(cast<LoadInst>(Inst));
		return nullptr;
		}

		Value *Index = calculateVectorIndex(
		cast<LoadInst>(Inst)->getPointerOperand(), GEPVectorIdx);

		// We're loading the full vector.
		if (DL.getTypeStoreSize(Inst->getType()) == VecStoreSize) {
		assert(cast<Constant>(Index)->isZeroValue());
		Type *InstTy = Inst->getType();
		if (InstTy->isPtrOrPtrVectorTy())
		CurVal = CreateTempPtrIntCast(CurVal, InstTy);
		Value *NewVal = Builder.CreateBitOrPointerCast(CurVal, InstTy);
		Inst->replaceAllUsesWith(NewVal);
		return nullptr;
		}

		// We're loading one element.
		Value *ExtractElement = Builder.CreateExtractElement(CurVal, Index);
		if (Inst->getType() != VecEltTy)
		ExtractElement =
		Builder.CreateBitOrPointerCast(ExtractElement, Inst->getType());

		Inst->replaceAllUsesWith(ExtractElement);
		return nullptr;
		}
		case Instruction::Store: {
		// For stores, it's a bit trickier and it depends on whether we're storing
		// the full vector or not. If we're storing the full vector, we don't need
		// to know the current value. If this is a store of a single element, we
		// need to know the value.
		StoreInst *SI = cast<StoreInst>(Inst);
		arsenm: Specifically use PtrToInt?
		Value *Index = calculateVectorIndex(SI->getPointerOperand(), GEPVectorIdx);
		Value *Val = SI->getValueOperand();

		// We're storing the full vector, we can handle this without knowing CurVal.
		if (DL.getTypeStoreSize(Val->getType()) == VecStoreSize) {
		assert(cast<Constant>(Index)->isZeroValue());
		Type *SrcTy = Val->getType();
		if (SrcTy->isPtrOrPtrVectorTy())
		Val = CreateTempPtrIntCast(Val, SrcTy);
		return Builder.CreateBitOrPointerCast(Val, VectorTy);
		}

		if (Val->getType() != VecEltTy)
		Val = Builder.CreateBitOrPointerCast(Val, VecEltTy);
		return Builder.CreateInsertElement(GetOrLoadCurrentVectorValue(), Val,
		Index);
		}
		case Instruction::Call: {
		if (auto *MTI = dyn_cast<MemTransferInst>(Inst)) {
		// For memcpy, we need to know curval.
		ConstantInt *Length = cast<ConstantInt>(MTI->getLength());
		unsigned NumCopied = Length->getZExtValue() / ElementSize;
		MemTransferInfo *TI = &TransferInfo[MTI];
		arsenm: You already have the cast<MemTransferInst> as MTI
		unsigned SrcBegin = TI->SrcIndex->getZExtValue();
		unsigned DestBegin = TI->DestIndex->getZExtValue();

		SmallVector<int> Mask;
		for (unsigned Idx = 0; Idx < VectorTy->getNumElements(); ++Idx) {
		if (Idx >= DestBegin && Idx < DestBegin + NumCopied) {
		Mask.push_back(SrcBegin++);
		} else {
		Mask.push_back(Idx);
		}
		}

		return Builder.CreateShuffleVector(GetOrLoadCurrentVectorValue(), Mask);
		}
		arsenm: no else after return

		if (auto *MSI = dyn_cast<MemSetInst>(Inst)) {
		// For memset, we don't need to know the previous value because we
		// currently only allow memsets that cover the whole alloca.
		Value *Elt = MSI->getOperand(1);
		if (DL.getTypeStoreSize(VecEltTy) > 1) {
		Value *EltBytes =
		Builder.CreateVectorSplat(DL.getTypeStoreSize(VecEltTy), Elt);
		Elt = Builder.CreateBitCast(EltBytes, VecEltTy);
		}

		return Builder.CreateVectorSplat(VectorTy->getElementCount(), Elt);
		}

		llvm_unreachable("Unsupported call when promoting alloca to vector");
		}
		arsenm: Dead break

		default:
		llvm_unreachable("Inconsistency in instructions promotable to vector");
		}

		llvm_unreachable("Did not return after promoting instruction!");
		}
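
The memcpy/memmove case above lowers an intra-alloca copy to a single shufflevector of the current vector value. As a rough standalone illustration of how that mask is built (no LLVM dependency; `buildMemTransferMask` is a hypothetical name, and the two assertions are assumptions added here, not checks made by the patch), the loop can be sketched as:

```cpp
#include <cassert>
#include <vector>

// Hypothetical standalone helper, not part of the patch. It mirrors the mask
// loop in the MemTransferInst case: lanes in [DestBegin, DestBegin + NumCopied)
// read consecutive source lanes starting at SrcBegin, and every other lane
// keeps its own index (an identity shuffle for that lane).
std::vector<int> buildMemTransferMask(unsigned NumElts, unsigned DestBegin,
                                      unsigned SrcBegin, unsigned NumCopied) {
  // Assumption for this sketch: the copy stays inside the vector.
  assert(DestBegin + NumCopied <= NumElts && "destination range out of bounds");
  assert(SrcBegin + NumCopied <= NumElts && "source range out of bounds");

  std::vector<int> Mask;
  Mask.reserve(NumElts);
  for (unsigned Idx = 0; Idx < NumElts; ++Idx) {
    if (Idx >= DestBegin && Idx < DestBegin + NumCopied)
      Mask.push_back(static_cast<int>(SrcBegin++));
    else
      Mask.push_back(static_cast<int>(Idx));
  }
  return Mask;
}
```

For a 4-element vector, copying 2 elements from index 0 to index 2 gives the mask {0, 1, 0, 1}: the low half is kept and also duplicated into the high half.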

		/// Iterates over an instruction worklist that may contain multiple instructions
		/// from the same basic block, but in a different order.
		template <typename InstContainer>
		static void forEachWorkListItem(const InstContainer &WorkList,
		std::function<void(Instruction *)> Fn) {
		// Bucket up uses of the alloca by the block they occur in.
		// This is important because we have to handle multiple defs/uses in a block
		// ourselves: SSAUpdater is purely for cross-block references.
		DenseMap<BasicBlock *, SmallDenseSet<Instruction *>> UsesByBlock;
		for (Instruction *User : WorkList)
		UsesByBlock[User->getParent()].insert(User);

		for (Instruction *User : WorkList) {
		BasicBlock *BB = User->getParent();
		auto &BlockUses = UsesByBlock[BB];

		// Already processed, skip.
		if (BlockUses.empty())
		continue;

		// Only user in the block, directly process it.
		if (BlockUses.size() == 1) {
		Fn(User);
		continue;
		}

		// Multiple users in the block, do a linear scan to see users in order.
		for (Instruction &Inst : *BB) {
		if (!BlockUses.contains(&Inst))
		continue;

		Fn(&Inst);
		}

		// Clear the block so we know it's been processed.
		BlockUses.clear();
		}
		}
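
The ordering guarantee of forEachWorkListItem can be modelled without LLVM. In the sketch below, `ToyInst` and `forEachInProgramOrder` are hypothetical stand-ins, and an ordered `std::set` of positions replaces the linear scan over the basic block; the point is only to show that worklist entries sharing a block are visited once each, in program order, regardless of worklist order.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <set>
#include <vector>

// Hypothetical stand-in for llvm::Instruction: a block id plus the
// instruction's position inside that block.
struct ToyInst {
  int Block;
  int Position;
};

// Visits every worklist entry exactly once. Entries that share a block are
// visited in increasing position, even if the worklist lists them out of order.
void forEachInProgramOrder(const std::vector<ToyInst> &WorkList,
                           const std::function<void(const ToyInst &)> &Fn) {
  // Bucket the uses by block. The ordered set plays the role of the
  // linear scan over the block in the real helper.
  std::map<int, std::set<int>> UsesByBlock;
  for (const ToyInst &I : WorkList)
    UsesByBlock[I.Block].insert(I.Position);

  for (const ToyInst &I : WorkList) {
    std::set<int> &BlockUses = UsesByBlock[I.Block];
    if (BlockUses.empty())
      continue; // This block was already processed.

    for (int Pos : BlockUses)
      Fn(ToyInst{I.Block, Pos});

    BlockUses.clear(); // Mark the block as processed.
  }
}
```

This matters for the promotion because each store defines a new vector value that later loads in the same block must observe, and SSAUpdater only resolves values across blocks.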

// FIXME: Should try to pick the most likely to be profitable allocas first.		// FIXME: Should try to pick the most likely to be profitable allocas first.
bool AMDGPUPromoteAllocaImpl::tryPromoteAllocaToVector(AllocaInst &Alloca) {		bool AMDGPUPromoteAllocaImpl::tryPromoteAllocaToVector(AllocaInst &Alloca) {
LLVM_DEBUG(dbgs() << "Trying to promote to vector: " << Alloca << '\n');		LLVM_DEBUG(dbgs() << "Trying to promote to vector: " << Alloca << '\n');

if (DisablePromoteAllocaToVector) {		if (DisablePromoteAllocaToVector) {
LLVM_DEBUG(dbgs() << " Promote alloca to vector is disabled\n");		LLVM_DEBUG(dbgs() << " Promote alloca to vector is disabled\n");
return false;		return false;
}		}
[... 30 lines not shown ...]	bool AMDGPUPromoteAllocaImpl::tryPromoteAllocaToVector(AllocaInst &Alloca) {
if (VectorTy->getNumElements() > 16 || VectorTy->getNumElements() < 2) {		if (VectorTy->getNumElements() > 16 || VectorTy->getNumElements() < 2) {
LLVM_DEBUG(dbgs() << " " << *VectorTy		LLVM_DEBUG(dbgs() << " " << *VectorTy
<< " has an unsupported number of elements\n");		<< " has an unsupported number of elements\n");
return false;		return false;
}		}

std::map<GetElementPtrInst *, Value *> GEPVectorIdx;		std::map<GetElementPtrInst *, Value *> GEPVectorIdx;
SmallVector<Instruction *> WorkList;		SmallVector<Instruction *> WorkList;
		SmallVector<Instruction *> UsersToRemove;
SmallVector<Instruction *> DeferredInsts;		SmallVector<Instruction *> DeferredInsts;
SmallVector<Use *, 8> Uses;		SmallVector<Use *, 8> Uses;
DenseMap<MemTransferInst *, MemTransferInfo> TransferInfo;		DenseMap<MemTransferInst *, MemTransferInfo> TransferInfo;

const auto RejectUser = [&](Instruction *Inst, Twine Msg) {		const auto RejectUser = [&](Instruction *Inst, Twine Msg) {
LLVM_DEBUG(dbgs() << " Cannot promote alloca to vector: " << Msg << "\n"		LLVM_DEBUG(dbgs() << " Cannot promote alloca to vector: " << Msg << "\n"
<< " " << *Inst << "\n");		<< " " << *Inst << "\n");
return false;		return false;
[... 12 lines not shown ...]	while (!Uses.empty()) {

if (Value *Ptr = getLoadStorePointerOperand(Inst)) {		if (Value *Ptr = getLoadStorePointerOperand(Inst)) {
// This is a store of the pointer, not to the pointer.		// This is a store of the pointer, not to the pointer.
if (isa<StoreInst>(Inst) &&		if (isa<StoreInst>(Inst) &&
U->getOperandNo() != StoreInst::getPointerOperandIndex())		U->getOperandNo() != StoreInst::getPointerOperandIndex())
return RejectUser(Inst, "pointer is being stored");		return RejectUser(Inst, "pointer is being stored");

Type *AccessTy = getLoadStoreType(Inst);		Type *AccessTy = getLoadStoreType(Inst);
		if (AccessTy->isAggregateType())
		return RejectUser(Inst, "unsupported load/store as aggregate");
		assert(!AccessTy->isAggregateType() || AccessTy->isArrayTy());

Ptr = Ptr->stripPointerCasts();		Ptr = Ptr->stripPointerCasts();

// Alloca already accessed as vector, leave alone.		// Alloca already accessed as vector.
if (Ptr == &Alloca && DL->getTypeStoreSize(Alloca.getAllocatedType()) ==		if (Ptr == &Alloca && DL->getTypeStoreSize(Alloca.getAllocatedType()) ==
		Pierre-vh (author): This will crash if we load/store a value at an offset, and the access type is the same size as the alloca. Should we check that the index is zero as well here?
		arsenm: Probably, need to not break on out of bounds access
		Pierre-vh (author): Oops, wrong place, it already does it (`&Ptr == &Alloca`), it's something in D155699
DL->getTypeStoreSize(AccessTy))		DL->getTypeStoreSize(AccessTy)) {
		WorkList.push_back(Inst);
continue;		continue;
		}

// Check that this is a simple access of a vector element.		// Check that this is a simple access of a vector element.
bool IsSimple = isa<LoadInst>(Inst) ? cast<LoadInst>(Inst)->isSimple()		bool IsSimple = isa<LoadInst>(Inst) ? cast<LoadInst>(Inst)->isSimple()
: cast<StoreInst>(Inst)->isSimple();		: cast<StoreInst>(Inst)->isSimple();
if (!IsSimple ||		if (!IsSimple ||
!CastInst::isBitOrNoopPointerCastable(VecEltTy, AccessTy, *DL))		!CastInst::isBitOrNoopPointerCastable(VecEltTy, AccessTy, *DL))
return RejectUser(Inst, "not simple and/or vector element type not "		return RejectUser(Inst, "not simple and/or vector element type not "
"castable to access type");		"castable to access type");

WorkList.push_back(Inst);		WorkList.push_back(Inst);
continue;		continue;
}		}

if (isa<BitCastInst>(Inst)) {		if (isa<BitCastInst>(Inst)) {
// Look through bitcasts.		// Look through bitcasts.
for (Use &U : Inst->uses())		for (Use &U : Inst->uses())
Uses.push_back(&U);		Uses.push_back(&U);
		UsersToRemove.push_back(Inst);
continue;		continue;
}		}

if (auto *GEP = dyn_cast<GetElementPtrInst>(Inst)) {		if (auto *GEP = dyn_cast<GetElementPtrInst>(Inst)) {
// If we can't compute a vector index from this GEP, then we can't		// If we can't compute a vector index from this GEP, then we can't
// promote this alloca to vector.		// promote this alloca to vector.
Value *Index = GEPToVectorIndex(GEP, &Alloca, VecEltTy, DL);		Value *Index = GEPToVectorIndex(GEP, &Alloca, VecEltTy, DL);
if (!Index)		if (!Index)
return RejectUser(Inst, "cannot compute vector index for GEP");		return RejectUser(Inst, "cannot compute vector index for GEP");

GEPVectorIdx[GEP] = Index;		GEPVectorIdx[GEP] = Index;
for (Use &U : Inst->uses())		for (Use &U : Inst->uses())
Uses.push_back(&U);		Uses.push_back(&U);
		UsersToRemove.push_back(Inst);
continue;		continue;
}		}

if (MemSetInst *MSI = dyn_cast<MemSetInst>(Inst);		if (MemSetInst *MSI = dyn_cast<MemSetInst>(Inst);
MSI && isSupportedMemset(MSI, &Alloca, *DL)) {		MSI && isSupportedMemset(MSI, &Alloca, *DL)) {
WorkList.push_back(Inst);		WorkList.push_back(Inst);
continue;		continue;
}		}
[... 36 lines not shown ...]	if (MemTransferInst *TransferInst = dyn_cast<MemTransferInst>(Inst)) {
if (!Index)		if (!Index)
return RejectUser(Inst, "could not calculate constant src index");		return RejectUser(Inst, "could not calculate constant src index");
TI->SrcIndex = Index;		TI->SrcIndex = Index;
}		}
continue;		continue;
}		}

// Ignore assume-like intrinsics and comparisons used in assumes.		// Ignore assume-like intrinsics and comparisons used in assumes.
if (isAssumeLikeIntrinsic(Inst))		if (isAssumeLikeIntrinsic(Inst)) {
		UsersToRemove.push_back(Inst);
continue;		continue;
		}

if (isa<ICmpInst>(Inst) && all_of(Inst->users(), [](User *U) {		if (isa<ICmpInst>(Inst) && all_of(Inst->users(), [](User *U) {
return isAssumeLikeIntrinsic(cast<Instruction>(U));		return isAssumeLikeIntrinsic(cast<Instruction>(U));
}))		})) {
		UsersToRemove.push_back(Inst);
continue;		continue;
		}

return RejectUser(Inst, "unhandled alloca user");		return RejectUser(Inst, "unhandled alloca user");
}		}

while (!DeferredInsts.empty()) {		while (!DeferredInsts.empty()) {
Instruction *Inst = DeferredInsts.pop_back_val();		Instruction *Inst = DeferredInsts.pop_back_val();
MemTransferInst *TransferInst = cast<MemTransferInst>(Inst);		MemTransferInst *TransferInst = cast<MemTransferInst>(Inst);
// TODO: Support the case if the pointers are from different alloca or		// TODO: Support the case if the pointers are from different alloca or
// from different address spaces.		// from different address spaces.
MemTransferInfo &Info = TransferInfo[TransferInst];		MemTransferInfo &Info = TransferInfo[TransferInst];
if (!Info.SrcIndex || !Info.DestIndex)		if (!Info.SrcIndex || !Info.DestIndex)
return RejectUser(		return RejectUser(
Inst, "mem transfer inst is missing constant src and/or dst index");		Inst, "mem transfer inst is missing constant src and/or dst index");
}		}

LLVM_DEBUG(dbgs() << " Converting alloca to vector " << *AllocaTy << " -> "		LLVM_DEBUG(dbgs() << " Converting alloca to vector " << *AllocaTy << " -> "
<< *VectorTy << '\n');		<< *VectorTy << '\n');
		const unsigned VecStoreSize = DL->getTypeStoreSize(VectorTy);

for (Instruction *Inst : WorkList) {		// Alloca is uninitialized memory. Imitate that by making the first value
IRBuilder<> Builder(Inst);		// undef.
switch (Inst->getOpcode()) {		SSAUpdater Updater;
case Instruction::Load: {		Updater.Initialize(VectorTy, "promotealloca");
Value *Ptr = cast<LoadInst>(Inst)->getPointerOperand();		Updater.AddAvailableValue(Alloca.getParent(), UndefValue::get(VectorTy));
Value *Index = calculateVectorIndex(Ptr, GEPVectorIdx);
Value *VecValue =		// First handle the initial worklist.
Builder.CreateAlignedLoad(VectorTy, &Alloca, Alloca.getAlign());		SmallVector<LoadInst *, 4> DeferredLoads;
Value *ExtractElement = Builder.CreateExtractElement(VecValue, Index);		forEachWorkListItem(WorkList, [&](Instruction *I) {
if (Inst->getType() != VecEltTy)		BasicBlock *BB = I->getParent();
ExtractElement =		// On the first pass, we only take values that are trivially known, i.e.
Builder.CreateBitOrPointerCast(ExtractElement, Inst->getType());		// where AddAvailableValue was already called in this block.
Inst->replaceAllUsesWith(ExtractElement);		Value *Result = promoteAllocaUserToVector(
Inst->eraseFromParent();		I, *DL, VectorTy, VecStoreSize, ElementSize, TransferInfo, GEPVectorIdx,
break;		Updater.FindValueForBlock(BB), DeferredLoads);
}		if (Result)
case Instruction::Store: {		Updater.AddAvailableValue(BB, Result);
StoreInst *SI = cast<StoreInst>(Inst);		});
Value *Ptr = SI->getPointerOperand();
Value *Index = calculateVectorIndex(Ptr, GEPVectorIdx);		// Then handle deferred loads.
Value *VecValue =		forEachWorkListItem(DeferredLoads, [&](Instruction *I) {
Builder.CreateAlignedLoad(VectorTy, &Alloca, Alloca.getAlign());		SmallVector<LoadInst *, 0> NewDLs;
Value *Elt = SI->getValueOperand();		BasicBlock *BB = I->getParent();
if (Elt->getType() != VecEltTy)		// On the second pass, we use GetValueInMiddleOfBlock to guarantee we always
Elt = Builder.CreateBitOrPointerCast(Elt, VecEltTy);		// get a value, inserting PHIs as needed.
Value *NewVecValue = Builder.CreateInsertElement(VecValue, Elt, Index);		Value *Result = promoteAllocaUserToVector(
Builder.CreateAlignedStore(NewVecValue, &Alloca, Alloca.getAlign());		I, *DL, VectorTy, VecStoreSize, ElementSize, TransferInfo, GEPVectorIdx,
Inst->eraseFromParent();		Updater.GetValueInMiddleOfBlock(I->getParent()), NewDLs);
break;		if (Result)
}		Updater.AddAvailableValue(BB, Result);
case Instruction::Call: {		assert(NewDLs.empty() && "No more deferred loads should be queued!");
if (const MemTransferInst *MTI = dyn_cast<MemTransferInst>(Inst)) {		});
ConstantInt *Length = cast<ConstantInt>(MTI->getLength());
unsigned NumCopied = Length->getZExtValue() / ElementSize;		// Delete all instructions. On the first pass, new dummy loads may have been
MemTransferInfo *TI = &TransferInfo[cast<MemTransferInst>(Inst)];		// added so we need to collect them too.
unsigned SrcBegin = TI->SrcIndex->getZExtValue();		DenseSet<Instruction *> InstsToDelete(WorkList.begin(), WorkList.end());
unsigned DestBegin = TI->DestIndex->getZExtValue();		InstsToDelete.insert(DeferredLoads.begin(), DeferredLoads.end());
		for (Instruction *I : InstsToDelete) {
SmallVector<int> Mask;		assert(I->use_empty());
for (unsigned Idx = 0; Idx < VectorTy->getNumElements(); ++Idx) {		I->eraseFromParent();
if (Idx >= DestBegin && Idx < DestBegin + NumCopied) {		}
Mask.push_back(SrcBegin++);
} else {		// Delete all the users that are known to be removeable.
Mask.push_back(Idx);		for (Instruction *I : reverse(UsersToRemove)) {
}		I->dropDroppableUses();
		arsenm: I don't understand why you would need to do this. You aren't moving any instructions around, so the dominance properties should be implied by the original values.
		Pierre-vh (author): If I don't sort by dominance, I have multiple crashes in `promote-alloca-array-aggregate.ll`. It seems mostly related to memcpy though, hence why I put a TODO saying the lowering may be incorrect. I'm still not sure why. I agree that requiring the DT for this is annoying, I'll try to think of ways to avoid it.
}		assert(I->use_empty());
Value *VecValue =		I->eraseFromParent();
Builder.CreateAlignedLoad(VectorTy, &Alloca, Alloca.getAlign());		}
		arsenm: I think this is subtly wrong and should actually be undef. Alloca memory is currently initialized with undef, not poison.
Value *NewVecValue = Builder.CreateShuffleVector(VecValue, Mask);
Builder.CreateAlignedStore(NewVecValue, &Alloca, Alloca.getAlign());		// Alloca should now be dead too.
		assert(Alloca.use_empty());
Inst->eraseFromParent();		Alloca.eraseFromParent();
} else if (MemSetInst *MSI = dyn_cast<MemSetInst>(Inst)) {
// Ensure the length parameter of the memsets matches the new vector
// type's. In general, the type size shouldn't change so this is a
// no-op, but it's better to be safe.
MSI->setOperand(2, Builder.getInt64(DL->getTypeStoreSize(VectorTy)));
} else {
llvm_unreachable("Unsupported call when promoting alloca to vector");
}
break;
}

default:
llvm_unreachable("Inconsistency in instructions promotable to vector");
}
}

return true;		return true;
}		}
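
The accept/reject rule that tryPromoteAllocaToVector applies to loads and stores can be summarised in a small standalone model. This is only a sketch of the checks visible in the diff: `ToyAccess`, `AccessDecision`, and `classifyAccess` are hypothetical names, and the earlier "pointer is being stored" rejection is left out. It shows why a volatile access to the whole vector is still promoted while a volatile access to a single element is not.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for this sketch.
enum class AccessDecision { PromoteWholeVector, PromoteElement, Reject };

// Hypothetical summary of one load/store of the alloca.
struct ToyAccess {
  bool IsAggregate;           // access type is a struct/array type
  bool PointerIsAllocaItself; // pointer, after stripping casts, is the alloca
  bool IsSimple;              // neither volatile nor atomic
  bool ElementCastable;       // vector element type castable to access type
  uint64_t AccessStoreSize;   // store size of the access type, in bytes
};

AccessDecision classifyAccess(const ToyAccess &A, uint64_t AllocaStoreSize) {
  // Aggregate loads/stores are rejected; SROA is expected to split them first.
  if (A.IsAggregate)
    return AccessDecision::Reject;

  // Whole-vector access directly on the alloca: accepted without looking at
  // isSimple(), so volatile whole-vector accesses are promoted too.
  if (A.PointerIsAllocaItself && A.AccessStoreSize == AllocaStoreSize)
    return AccessDecision::PromoteWholeVector;

  // Element access: must be simple and castable to the element type.
  if (!A.IsSimple || !A.ElementCastable)
    return AccessDecision::Reject;

  return AccessDecision::PromoteElement;
}
```

Under this model, a `store volatile <8 x i32>` to a 32-byte `<8 x i32>` alloca classifies as PromoteWholeVector, which is consistent with the changed output in fix-frame-reg-in-custom-csr-spills.ll.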

std::pair<Value *, Value *>		std::pair<Value *, Value *>
AMDGPUPromoteAllocaImpl::getLocalSizeYZ(IRBuilder<> &Builder) {		AMDGPUPromoteAllocaImpl::getLocalSizeYZ(IRBuilder<> &Builder) {
Function &F = *Builder.GetInsertBlock()->getParent();		Function &F = *Builder.GetInsertBlock()->getParent();
		arsenm: use_empty
const AMDGPUSubtarget &ST = AMDGPUSubtarget::get(TM, F);		const AMDGPUSubtarget &ST = AMDGPUSubtarget::get(TM, F);

if (!IsAMDHSA) {		if (!IsAMDHSA) {
Function *LocalSizeYFn =		Function *LocalSizeYFn =
Intrinsic::getDeclaration(Mod, Intrinsic::r600_read_local_size_y);		Intrinsic::getDeclaration(Mod, Intrinsic::r600_read_local_size_y);
Function *LocalSizeZFn =		Function *LocalSizeZFn =
Intrinsic::getDeclaration(Mod, Intrinsic::r600_read_local_size_z);		Intrinsic::getDeclaration(Mod, Intrinsic::r600_read_local_size_z);

[... 465 lines not shown ...]	bool AMDGPUPromoteAllocaImpl::tryPromoteAllocaToLDS(AllocaInst &I,
if (NewSize > LocalMemLimit) {		if (NewSize > LocalMemLimit) {
LLVM_DEBUG(dbgs() << " " << AllocSize		LLVM_DEBUG(dbgs() << " " << AllocSize
<< " bytes of local memory not available to promote\n");		<< " bytes of local memory not available to promote\n");
return false;		return false;
}		}

CurrentLocalMemUsage = NewSize;		CurrentLocalMemUsage = NewSize;

std::vector<Value*> WorkList;		std::vector<Value *> WorkList;

if (!collectUsesWithPtrTypes(&I, &I, WorkList)) {		if (!collectUsesWithPtrTypes(&I, &I, WorkList)) {
LLVM_DEBUG(dbgs() << " Do not know how to convert all uses\n");		LLVM_DEBUG(dbgs() << " Do not know how to convert all uses\n");
return false;		return false;
}		}

LLVM_DEBUG(dbgs() << "Promoting alloca to local memory\n");		LLVM_DEBUG(dbgs() << "Promoting alloca to local memory\n");

[... 126 lines not shown ...]	bool AMDGPUPromoteAllocaImpl::tryPromoteAllocaToLDS(AllocaInst &I,
}		}

for (IntrinsicInst *Intr : DeferredIntrs) {		for (IntrinsicInst *Intr : DeferredIntrs) {
Builder.SetInsertPoint(Intr);		Builder.SetInsertPoint(Intr);
Intrinsic::ID ID = Intr->getIntrinsicID();		Intrinsic::ID ID = Intr->getIntrinsicID();
assert(ID == Intrinsic::memcpy || ID == Intrinsic::memmove);		assert(ID == Intrinsic::memcpy || ID == Intrinsic::memmove);

MemTransferInst *MI = cast<MemTransferInst>(Intr);		MemTransferInst *MI = cast<MemTransferInst>(Intr);
auto *B =		auto *B = Builder.CreateMemTransferInst(
Builder.CreateMemTransferInst(ID, MI->getRawDest(), MI->getDestAlign(),		ID, MI->getRawDest(), MI->getDestAlign(), MI->getRawSource(),
MI->getRawSource(), MI->getSourceAlign(),		MI->getSourceAlign(), MI->getLength(), MI->isVolatile());
MI->getLength(), MI->isVolatile());

for (unsigned I = 0; I != 2; ++I) {		for (unsigned I = 0; I != 2; ++I) {
if (uint64_t Bytes = Intr->getParamDereferenceableBytes(I)) {		if (uint64_t Bytes = Intr->getParamDereferenceableBytes(I)) {
B->addDereferenceableParamAttr(I, Bytes);		B->addDereferenceableParamAttr(I, Bytes);
}		}
}		}

Intr->eraseFromParent();		Intr->eraseFromParent();
}		}

return true;		return true;
}		}

llvm/test/CodeGen/AMDGPU/fix-frame-reg-in-custom-csr-spills.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx906 -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s			; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx906 -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s

	; The custom CSR spills inserted during the frame lowering was earlier using SP as the frame base.			; The custom CSR spills inserted during the frame lowering was earlier using SP as the frame base.
	; The offsets allocated for the CS objects go wrong when any local stack object has a higher			; The offsets allocated for the CS objects go wrong when any local stack object has a higher
	; alignment requirement than the default stack alignment for AMDGPU (either 4 or 16). The offsets			; alignment requirement than the default stack alignment for AMDGPU (either 4 or 16). The offsets
	; in such cases should be from the newly aligned FP. Even to adjust the offset from the SP value			; in such cases should be from the newly aligned FP. Even to adjust the offset from the SP value
	; at function entry, the FP-SP can't be statically determined with dynamic stack realignment. To			; at function entry, the FP-SP can't be statically determined with dynamic stack realignment. To
	; fix the problem, use FP as the frame base in the spills whenever the function has FP.			; fix the problem, use FP as the frame base in the spills whenever the function has FP.

	define void @test_stack_realign(<8 x i32> %val, i32 %idx) #0 {			define void @test_stack_realign(<8 x i32> %val, i32 %idx) #0 {
	; GCN-LABEL: test_stack_realign:			; GCN-LABEL: test_stack_realign:
	; GCN: ; %bb.0:			; GCN: ; %bb.0:
	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GCN-NEXT: s_mov_b32 s16, s33			; GCN-NEXT: s_mov_b32 s16, s33
	; GCN-NEXT: s_add_i32 s33, s32, 0xfc0			; GCN-NEXT: s_mov_b32 s33, s32
	; GCN-NEXT: s_and_b32 s33, s33, 0xfffff000
	; GCN-NEXT: s_or_saveexec_b64 s[18:19], -1			; GCN-NEXT: s_or_saveexec_b64 s[18:19], -1
	; GCN-NEXT: buffer_store_dword v42, off, s[0:3], s33 offset:96 ; 4-byte Folded Spill			; GCN-NEXT: buffer_store_dword v42, off, s[0:3], s33 offset:8 ; 4-byte Folded Spill
	; GCN-NEXT: s_mov_b64 exec, s[18:19]			; GCN-NEXT: s_mov_b64 exec, s[18:19]
	; GCN-NEXT: s_addk_i32 s32, 0x3000			; GCN-NEXT: s_addk_i32 s32, 0x400
	; GCN-NEXT: v_writelane_b32 v42, s16, 2			; GCN-NEXT: v_writelane_b32 v42, s16, 2
	; GCN-NEXT: s_getpc_b64 s[16:17]			; GCN-NEXT: s_getpc_b64 s[16:17]
	; GCN-NEXT: s_add_u32 s16, s16, extern_func@gotpcrel32@lo+4			; GCN-NEXT: s_add_u32 s16, s16, extern_func@gotpcrel32@lo+4
	; GCN-NEXT: s_addc_u32 s17, s17, extern_func@gotpcrel32@hi+12			; GCN-NEXT: s_addc_u32 s17, s17, extern_func@gotpcrel32@hi+12
	; GCN-NEXT: s_load_dwordx2 s[16:17], s[16:17], 0x0			; GCN-NEXT: s_load_dwordx2 s[16:17], s[16:17], 0x0
	; GCN-NEXT: buffer_store_dword v40, off, s[0:3], s33 offset:4 ; 4-byte Folded Spill
	; GCN-NEXT: buffer_store_dword v41, off, s[0:3], s33 ; 4-byte Folded Spill
	; GCN-NEXT: v_writelane_b32 v42, s30, 0			; GCN-NEXT: v_writelane_b32 v42, s30, 0
	; GCN-NEXT: buffer_store_dword v7, off, s[0:3], s33 offset:92
	; GCN-NEXT: s_waitcnt vmcnt(0)
	; GCN-NEXT: buffer_store_dword v6, off, s[0:3], s33 offset:88
	; GCN-NEXT: s_waitcnt vmcnt(0)
	; GCN-NEXT: buffer_store_dword v5, off, s[0:3], s33 offset:84
	; GCN-NEXT: s_waitcnt vmcnt(0)
	; GCN-NEXT: buffer_store_dword v4, off, s[0:3], s33 offset:80
	; GCN-NEXT: s_waitcnt vmcnt(0)
	; GCN-NEXT: buffer_store_dword v3, off, s[0:3], s33 offset:76
	; GCN-NEXT: s_waitcnt vmcnt(0)
	; GCN-NEXT: buffer_store_dword v2, off, s[0:3], s33 offset:72
	; GCN-NEXT: s_waitcnt vmcnt(0)
	; GCN-NEXT: buffer_store_dword v1, off, s[0:3], s33 offset:68
	; GCN-NEXT: s_waitcnt vmcnt(0)
	; GCN-NEXT: buffer_store_dword v0, off, s[0:3], s33 offset:64
	; GCN-NEXT: s_waitcnt vmcnt(0)
	; GCN-NEXT: v_mov_b32_e32 v0, v8			; GCN-NEXT: v_mov_b32_e32 v0, v8
				; GCN-NEXT: buffer_store_dword v40, off, s[0:3], s33 offset:4 ; 4-byte Folded Spill
				; GCN-NEXT: buffer_store_dword v41, off, s[0:3], s33 ; 4-byte Folded Spill
	; GCN-NEXT: v_writelane_b32 v42, s31, 1			; GCN-NEXT: v_writelane_b32 v42, s31, 1
	; GCN-NEXT: ;;#ASMSTART			; GCN-NEXT: ;;#ASMSTART
	; GCN-NEXT: ;;#ASMEND			; GCN-NEXT: ;;#ASMEND
	; GCN-NEXT: ;;#ASMSTART			; GCN-NEXT: ;;#ASMSTART
	; GCN-NEXT: ;;#ASMEND			; GCN-NEXT: ;;#ASMEND
	; GCN-NEXT: s_waitcnt lgkmcnt(0)			; GCN-NEXT: s_waitcnt lgkmcnt(0)
	; GCN-NEXT: s_swappc_b64 s[30:31], s[16:17]			; GCN-NEXT: s_swappc_b64 s[30:31], s[16:17]
	; GCN-NEXT: buffer_load_dword v41, off, s[0:3], s33 ; 4-byte Folded Reload			; GCN-NEXT: buffer_load_dword v41, off, s[0:3], s33 ; 4-byte Folded Reload
	; GCN-NEXT: buffer_load_dword v40, off, s[0:3], s33 offset:4 ; 4-byte Folded Reload			; GCN-NEXT: buffer_load_dword v40, off, s[0:3], s33 offset:4 ; 4-byte Folded Reload
	; GCN-NEXT: v_readlane_b32 s31, v42, 1			; GCN-NEXT: v_readlane_b32 s31, v42, 1
	; GCN-NEXT: v_readlane_b32 s30, v42, 0			; GCN-NEXT: v_readlane_b32 s30, v42, 0
	; GCN-NEXT: v_readlane_b32 s4, v42, 2			; GCN-NEXT: v_readlane_b32 s4, v42, 2
	; GCN-NEXT: s_or_saveexec_b64 s[6:7], -1			; GCN-NEXT: s_or_saveexec_b64 s[6:7], -1
	; GCN-NEXT: buffer_load_dword v42, off, s[0:3], s33 offset:96 ; 4-byte Folded Reload			; GCN-NEXT: buffer_load_dword v42, off, s[0:3], s33 offset:8 ; 4-byte Folded Reload
	; GCN-NEXT: s_mov_b64 exec, s[6:7]			; GCN-NEXT: s_mov_b64 exec, s[6:7]
	; GCN-NEXT: s_addk_i32 s32, 0xd000			; GCN-NEXT: s_addk_i32 s32, 0xfc00
	; GCN-NEXT: s_mov_b32 s33, s4			; GCN-NEXT: s_mov_b32 s33, s4
	; GCN-NEXT: s_waitcnt vmcnt(0)			; GCN-NEXT: s_waitcnt vmcnt(0)
	; GCN-NEXT: s_setpc_b64 s[30:31]			; GCN-NEXT: s_setpc_b64 s[30:31]
	%alloca.val = alloca <8 x i32>, align 64, addrspace(5)			%alloca.val = alloca <8 x i32>, align 64, addrspace(5)
	store volatile <8 x i32> %val, ptr addrspace(5) %alloca.val, align 64			store volatile <8 x i32> %val, ptr addrspace(5) %alloca.val, align 64
	arsenm: Why did this change? It only uses volatile accesses
	Pierre-vh (author): We support non-simple accesses of the whole vector, it's volatile accesses of a single element that we don't support
	call void asm sideeffect "", "~{v40}" ()			call void asm sideeffect "", "~{v40}" ()
	call void asm sideeffect "", "~{v41}" ()			call void asm sideeffect "", "~{v41}" ()
	call void @extern_func(i32 %idx)			call void @extern_func(i32 %idx)
	ret void			ret void
	}			}

	declare void @extern_func(i32) #0			declare void @extern_func(i32) #0

	attributes #0 = { noinline nounwind }			attributes #0 = { noinline nounwind }

llvm/test/CodeGen/AMDGPU/promote-alloca-array-aggregate.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -passes=amdgpu-promote-alloca < %s | FileCheck %s		; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -passes=sroa,amdgpu-promote-alloca < %s | FileCheck %s

; Make sure that array alloca loaded and stored as multi-element aggregates are handled correctly		; Make sure that array alloca loaded and stored as multi-element aggregates are handled correctly
; Strictly the promote-alloca pass shouldn't have to deal with this case as it is non-canonical, but		; Strictly the promote-alloca pass shouldn't have to deal with this case as it is non-canonical, but
; the pass should handle it gracefully if it is		; the pass should handle it gracefully if it is
; The checks look for lines that previously caused issues in PromoteAlloca (non-canonical). Opt		; The checks look for lines that previously caused issues in PromoteAlloca (non-canonical). Opt
; should now leave these unchanged		; should now leave these unchanged

%Block = type { [1 x float], i32 }		%Block = type { [1 x float], i32 }
%gl_PerVertex = type { <4 x float>, float, [1 x float], [1 x float] }		%gl_PerVertex = type { <4 x float>, float, [1 x float], [1 x float] }
%struct = type { i32, i32 }		%struct = type { i32, i32 }

@block = external addrspace(1) global %Block		@block = external addrspace(1) global %Block
@pv = external addrspace(1) global %gl_PerVertex		@pv = external addrspace(1) global %gl_PerVertex

define amdgpu_vs void @promote_1d_aggr() #0 {		define amdgpu_vs void @promote_1d_aggr() #0 {
; CHECK-LABEL: @promote_1d_aggr(		; CHECK-LABEL: @promote_1d_aggr(
; CHECK-NEXT: [[I:%.*]] = alloca i32, align 4, addrspace(5)
; CHECK-NEXT: [[F1:%.*]] = alloca [1 x float], align 4, addrspace(5)		; CHECK-NEXT: [[F1:%.*]] = alloca [1 x float], align 4, addrspace(5)
; CHECK-NEXT: [[FOO:%.*]] = getelementptr [[BLOCK:%.*]], ptr addrspace(1) @block, i32 0, i32 1		; CHECK-NEXT: [[FOO:%.*]] = getelementptr [[BLOCK:%.*]], ptr addrspace(1) @block, i32 0, i32 1
; CHECK-NEXT: [[FOO1:%.*]] = load i32, ptr addrspace(1) [[FOO]], align 4		; CHECK-NEXT: [[FOO1:%.*]] = load i32, ptr addrspace(1) [[FOO]], align 4
; CHECK-NEXT: store i32 [[FOO1]], ptr addrspace(5) [[I]], align 4
; CHECK-NEXT: [[FOO3:%.*]] = load [1 x float], ptr addrspace(1) @block, align 4		; CHECK-NEXT: [[FOO3:%.*]] = load [1 x float], ptr addrspace(1) @block, align 4
; CHECK-NEXT: store [1 x float] [[FOO3]], ptr addrspace(5) [[F1]], align 4		; CHECK-NEXT: [[FOO3_FCA_0_EXTRACT:%.*]] = extractvalue [1 x float] [[FOO3]], 0
; CHECK-NEXT: [[FOO4:%.*]] = load i32, ptr addrspace(5) [[I]], align 4		; CHECK-NEXT: [[FOO3_FCA_0_GEP:%.*]] = getelementptr inbounds [1 x float], ptr addrspace(5) [[F1]], i32 0, i32 0
; CHECK-NEXT: [[FOO5:%.*]] = getelementptr [1 x float], ptr addrspace(5) [[F1]], i32 0, i32 [[FOO4]]		; CHECK-NEXT: store float [[FOO3_FCA_0_EXTRACT]], ptr addrspace(5) [[FOO3_FCA_0_GEP]], align 4
		; CHECK-NEXT: [[FOO5:%.*]] = getelementptr [1 x float], ptr addrspace(5) [[F1]], i32 0, i32 [[FOO1]]
; CHECK-NEXT: [[FOO6:%.*]] = load float, ptr addrspace(5) [[FOO5]], align 4		; CHECK-NEXT: [[FOO6:%.*]] = load float, ptr addrspace(5) [[FOO5]], align 4
; CHECK-NEXT: [[FOO7:%.*]] = alloca <4 x float>, align 16, addrspace(5)		; CHECK-NEXT: [[FOO9:%.*]] = insertelement <4 x float> undef, float [[FOO6]], i32 0
; CHECK-NEXT: [[FOO8:%.*]] = load <4 x float>, ptr addrspace(5) [[FOO7]], align 16
; CHECK-NEXT: [[FOO9:%.*]] = insertelement <4 x float> [[FOO8]], float [[FOO6]], i32 0
; CHECK-NEXT: [[FOO10:%.*]] = insertelement <4 x float> [[FOO9]], float [[FOO6]], i32 1		; CHECK-NEXT: [[FOO10:%.*]] = insertelement <4 x float> [[FOO9]], float [[FOO6]], i32 1
; CHECK-NEXT: [[FOO11:%.*]] = insertelement <4 x float> [[FOO10]], float [[FOO6]], i32 2		; CHECK-NEXT: [[FOO11:%.*]] = insertelement <4 x float> [[FOO10]], float [[FOO6]], i32 2
; CHECK-NEXT: [[FOO12:%.*]] = insertelement <4 x float> [[FOO11]], float [[FOO6]], i32 3		; CHECK-NEXT: [[FOO12:%.*]] = insertelement <4 x float> [[FOO11]], float [[FOO6]], i32 3
; CHECK-NEXT: store <4 x float> [[FOO12]], ptr addrspace(1) @pv, align 16		; CHECK-NEXT: store <4 x float> [[FOO12]], ptr addrspace(1) @pv, align 16
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
%i = alloca i32, addrspace(5)		%i = alloca i32, addrspace(5)
%f1 = alloca [1 x float], addrspace(5)		%f1 = alloca [1 x float], addrspace(5)
; ... (15 unchanged lines collapsed in the diff view) ...
ret void		ret void
}		}

%Block2 = type { i32, [2 x float] }		%Block2 = type { i32, [2 x float] }
@block2 = external addrspace(1) global %Block2		@block2 = external addrspace(1) global %Block2

define amdgpu_vs void @promote_store_aggr() #0 {		define amdgpu_vs void @promote_store_aggr() #0 {
; CHECK-LABEL: @promote_store_aggr(		; CHECK-LABEL: @promote_store_aggr(
; CHECK-NEXT: [[I:%.*]] = alloca i32, align 4, addrspace(5)
; CHECK-NEXT: [[F1:%.*]] = alloca [2 x float], align 4, addrspace(5)
; CHECK-NEXT: [[FOO1:%.*]] = load i32, ptr addrspace(1) @block2, align 4		; CHECK-NEXT: [[FOO1:%.*]] = load i32, ptr addrspace(1) @block2, align 4
; CHECK-NEXT: store i32 [[FOO1]], ptr addrspace(5) [[I]], align 4		; CHECK-NEXT: [[FOO3:%.*]] = sitofp i32 [[FOO1]] to float
; CHECK-NEXT: [[FOO2:%.*]] = load i32, ptr addrspace(5) [[I]], align 4		; CHECK-NEXT: [[FOO6_FCA_0_INSERT:%.*]] = insertvalue [2 x float] poison, float [[FOO3]], 0
; CHECK-NEXT: [[FOO3:%.*]] = sitofp i32 [[FOO2]] to float		; CHECK-NEXT: [[FOO6_FCA_1_INSERT:%.*]] = insertvalue [2 x float] [[FOO6_FCA_0_INSERT]], float 2.000000e+00, 1
; CHECK-NEXT: [[TMP1:%.*]] = load <2 x float>, ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x float> [[TMP1]], float [[FOO3]], i32 0
; CHECK-NEXT: store <2 x float> [[TMP2]], ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[FOO5:%.*]] = getelementptr [2 x float], ptr addrspace(5) [[F1]], i32 0, i32 1
; CHECK-NEXT: [[TMP3:%.*]] = load <2 x float>, ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x float> [[TMP3]], float 2.000000e+00, i64 1
; CHECK-NEXT: store <2 x float> [[TMP4]], ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[FOO6:%.*]] = load [2 x float], ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[FOO7:%.*]] = getelementptr [[BLOCK2:%.*]], ptr addrspace(1) @block2, i32 0, i32 1		; CHECK-NEXT: [[FOO7:%.*]] = getelementptr [[BLOCK2:%.*]], ptr addrspace(1) @block2, i32 0, i32 1
; CHECK-NEXT: store [2 x float] [[FOO6]], ptr addrspace(1) [[FOO7]], align 4		; CHECK-NEXT: store [2 x float] [[FOO6_FCA_1_INSERT]], ptr addrspace(1) [[FOO7]], align 4
; CHECK-NEXT: store <4 x float> <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>, ptr addrspace(1) @pv, align 16		; CHECK-NEXT: store <4 x float> <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>, ptr addrspace(1) @pv, align 16
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
%i = alloca i32, addrspace(5)		%i = alloca i32, addrspace(5)
%f1 = alloca [2 x float], addrspace(5)		%f1 = alloca [2 x float], addrspace(5)
%foo1 = load i32, ptr addrspace(1) @block2		%foo1 = load i32, ptr addrspace(1) @block2
store i32 %foo1, ptr addrspace(5) %i		store i32 %foo1, ptr addrspace(5) %i
%foo2 = load i32, ptr addrspace(5) %i		%foo2 = load i32, ptr addrspace(5) %i
%foo3 = sitofp i32 %foo2 to float		%foo3 = sitofp i32 %foo2 to float
store float %foo3, ptr addrspace(5) %f1		store float %foo3, ptr addrspace(5) %f1
%foo5 = getelementptr [2 x float], ptr addrspace(5) %f1, i32 0, i32 1		%foo5 = getelementptr [2 x float], ptr addrspace(5) %f1, i32 0, i32 1
store float 2.000000e+00, ptr addrspace(5) %foo5		store float 2.000000e+00, ptr addrspace(5) %foo5
%foo6 = load [2 x float], ptr addrspace(5) %f1		%foo6 = load [2 x float], ptr addrspace(5) %f1
%foo7 = getelementptr %Block2, ptr addrspace(1) @block2, i32 0, i32 1		%foo7 = getelementptr %Block2, ptr addrspace(1) @block2, i32 0, i32 1
store [2 x float] %foo6, ptr addrspace(1) %foo7		store [2 x float] %foo6, ptr addrspace(1) %foo7
store <4 x float> <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>, ptr addrspace(1) @pv		store <4 x float> <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>, ptr addrspace(1) @pv
ret void		ret void
}		}

%Block3 = type { [2 x float], i32 }		%Block3 = type { [2 x float], i32 }
@block3 = external addrspace(1) global %Block3		@block3 = external addrspace(1) global %Block3

define amdgpu_vs void @promote_load_from_store_aggr() #0 {		define amdgpu_vs void @promote_load_from_store_aggr() #0 {
; CHECK-LABEL: @promote_load_from_store_aggr(		; CHECK-LABEL: @promote_load_from_store_aggr(
; CHECK-NEXT: [[I:%.*]] = alloca i32, align 4, addrspace(5)
; CHECK-NEXT: [[F1:%.*]] = alloca [2 x float], align 4, addrspace(5)
; CHECK-NEXT: [[FOO:%.*]] = getelementptr [[BLOCK3:%.*]], ptr addrspace(1) @block3, i32 0, i32 1		; CHECK-NEXT: [[FOO:%.*]] = getelementptr [[BLOCK3:%.*]], ptr addrspace(1) @block3, i32 0, i32 1
; CHECK-NEXT: [[FOO1:%.*]] = load i32, ptr addrspace(1) [[FOO]], align 4		; CHECK-NEXT: [[FOO1:%.*]] = load i32, ptr addrspace(1) [[FOO]], align 4
; CHECK-NEXT: store i32 [[FOO1]], ptr addrspace(5) [[I]], align 4
; CHECK-NEXT: [[FOO3:%.*]] = load [2 x float], ptr addrspace(1) @block3, align 4		; CHECK-NEXT: [[FOO3:%.*]] = load [2 x float], ptr addrspace(1) @block3, align 4
; CHECK-NEXT: store [2 x float] [[FOO3]], ptr addrspace(5) [[F1]], align 4		; CHECK-NEXT: [[FOO3_FCA_0_EXTRACT:%.*]] = extractvalue [2 x float] [[FOO3]], 0
; CHECK-NEXT: [[FOO4:%.*]] = load i32, ptr addrspace(5) [[I]], align 4		; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x float> undef, float [[FOO3_FCA_0_EXTRACT]], i32 0
; CHECK-NEXT: [[FOO5:%.*]] = getelementptr [2 x float], ptr addrspace(5) [[F1]], i32 0, i32 [[FOO4]]		; CHECK-NEXT: [[FOO3_FCA_1_EXTRACT:%.*]] = extractvalue [2 x float] [[FOO3]], 1
; CHECK-NEXT: [[TMP1:%.*]] = load <2 x float>, ptr addrspace(5) [[F1]], align 4		; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x float> [[TMP1]], float [[FOO3_FCA_1_EXTRACT]], i64 1
; CHECK-NEXT: [[TMP2:%.*]] = extractelement <2 x float> [[TMP1]], i32 [[FOO4]]		; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x float> [[TMP2]], i32 [[FOO1]]
; CHECK-NEXT: [[FOO7:%.*]] = alloca <4 x float>, align 16, addrspace(5)		; CHECK-NEXT: [[FOO9:%.*]] = insertelement <4 x float> undef, float [[TMP3]], i32 0
; CHECK-NEXT: [[FOO8:%.*]] = load <4 x float>, ptr addrspace(5) [[FOO7]], align 16		; CHECK-NEXT: [[FOO10:%.*]] = insertelement <4 x float> [[FOO9]], float [[TMP3]], i32 1
; CHECK-NEXT: [[FOO9:%.*]] = insertelement <4 x float> [[FOO8]], float [[TMP2]], i32 0		; CHECK-NEXT: [[FOO11:%.*]] = insertelement <4 x float> [[FOO10]], float [[TMP3]], i32 2
; CHECK-NEXT: [[FOO10:%.*]] = insertelement <4 x float> [[FOO9]], float [[TMP2]], i32 1		; CHECK-NEXT: [[FOO12:%.*]] = insertelement <4 x float> [[FOO11]], float [[TMP3]], i32 3
; CHECK-NEXT: [[FOO11:%.*]] = insertelement <4 x float> [[FOO10]], float [[TMP2]], i32 2
; CHECK-NEXT: [[FOO12:%.*]] = insertelement <4 x float> [[FOO11]], float [[TMP2]], i32 3
; CHECK-NEXT: store <4 x float> [[FOO12]], ptr addrspace(1) @pv, align 16		; CHECK-NEXT: store <4 x float> [[FOO12]], ptr addrspace(1) @pv, align 16
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
%i = alloca i32, addrspace(5)		%i = alloca i32, addrspace(5)
%f1 = alloca [2 x float], addrspace(5)		%f1 = alloca [2 x float], addrspace(5)
%foo = getelementptr %Block3, ptr addrspace(1) @block3, i32 0, i32 1		%foo = getelementptr %Block3, ptr addrspace(1) @block3, i32 0, i32 1
%foo1 = load i32, ptr addrspace(1) %foo		%foo1 = load i32, ptr addrspace(1) %foo
store i32 %foo1, ptr addrspace(5) %i		store i32 %foo1, ptr addrspace(5) %i
; ... (9 unchanged lines collapsed in the diff view) ...
%foo11 = insertelement <4 x float> %foo10, float %foo6, i32 2		%foo11 = insertelement <4 x float> %foo10, float %foo6, i32 2
%foo12 = insertelement <4 x float> %foo11, float %foo6, i32 3		%foo12 = insertelement <4 x float> %foo11, float %foo6, i32 3
store <4 x float> %foo12, ptr addrspace(1) @pv		store <4 x float> %foo12, ptr addrspace(1) @pv
ret void		ret void
}		}

define amdgpu_vs void @promote_memmove_aggr() #0 {		define amdgpu_vs void @promote_memmove_aggr() #0 {
; CHECK-LABEL: @promote_memmove_aggr(		; CHECK-LABEL: @promote_memmove_aggr(
; CHECK-NEXT: [[F1:%.*]] = alloca [5 x float], align 4, addrspace(5)		; CHECK-NEXT: store float 1.000000e+00, ptr addrspace(1) @pv, align 4
; CHECK-NEXT: store [5 x float] zeroinitializer, ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[FOO1:%.*]] = getelementptr [5 x float], ptr addrspace(5) [[F1]], i32 0, i32 1
; CHECK-NEXT: [[TMP1:%.*]] = load <5 x float>, ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[TMP2:%.*]] = insertelement <5 x float> [[TMP1]], float 1.000000e+00, i64 1
; CHECK-NEXT: store <5 x float> [[TMP2]], ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[FOO2:%.*]] = getelementptr [5 x float], ptr addrspace(5) [[F1]], i32 0, i32 3
; CHECK-NEXT: [[TMP3:%.*]] = load <5 x float>, ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[TMP4:%.*]] = insertelement <5 x float> [[TMP3]], float 2.000000e+00, i64 3
; CHECK-NEXT: store <5 x float> [[TMP4]], ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[TMP5:%.*]] = load <5 x float>, ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <5 x float> [[TMP5]], <5 x float> poison, <5 x i32> <i32 1, i32 2, i32 3, i32 4, i32 4>
; CHECK-NEXT: store <5 x float> [[TMP6]], ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[TMP7:%.*]] = load <5 x float>, ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[TMP8:%.*]] = extractelement <5 x float> [[TMP7]], i32 0
; CHECK-NEXT: store float [[TMP8]], ptr addrspace(1) @pv, align 4
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
%f1 = alloca [5 x float], addrspace(5)		%f1 = alloca [5 x float], addrspace(5)
store [5 x float] zeroinitializer, ptr addrspace(5) %f1		store [5 x float] zeroinitializer, ptr addrspace(5) %f1
%foo1 = getelementptr [5 x float], ptr addrspace(5) %f1, i32 0, i32 1		%foo1 = getelementptr [5 x float], ptr addrspace(5) %f1, i32 0, i32 1
store float 1.0, ptr addrspace(5) %foo1		store float 1.0, ptr addrspace(5) %foo1
%foo2 = getelementptr [5 x float], ptr addrspace(5) %f1, i32 0, i32 3		%foo2 = getelementptr [5 x float], ptr addrspace(5) %f1, i32 0, i32 3
store float 2.0, ptr addrspace(5) %foo2		store float 2.0, ptr addrspace(5) %foo2
call void @llvm.memmove.p5i8.p5i8.i32(ptr addrspace(5) align 4 %f1, ptr addrspace(5) align 4 %foo1, i32 16, i1 false)		call void @llvm.memmove.p5i8.p5i8.i32(ptr addrspace(5) align 4 %f1, ptr addrspace(5) align 4 %foo1, i32 16, i1 false)
%foo3 = load float, ptr addrspace(5) %f1		%foo3 = load float, ptr addrspace(5) %f1
store float %foo3, ptr addrspace(1) @pv		store float %foo3, ptr addrspace(1) @pv
ret void		ret void
}		}

define amdgpu_vs void @promote_memcpy_aggr() #0 {		define amdgpu_vs void @promote_memcpy_aggr() #0 {
; CHECK-LABEL: @promote_memcpy_aggr(		; CHECK-LABEL: @promote_memcpy_aggr(
; CHECK-NEXT: [[F1:%.*]] = alloca [5 x float], align 4, addrspace(5)
; CHECK-NEXT: store [5 x float] zeroinitializer, ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[FOO2:%.*]] = getelementptr [5 x float], ptr addrspace(5) [[F1]], i32 0, i32 3
; CHECK-NEXT: [[TMP1:%.*]] = load <5 x float>, ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[TMP2:%.*]] = insertelement <5 x float> [[TMP1]], float 2.000000e+00, i64 3
; CHECK-NEXT: store <5 x float> [[TMP2]], ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[FOO3:%.*]] = getelementptr [[BLOCK3:%.*]], ptr addrspace(1) @block3, i32 0, i32 0		; CHECK-NEXT: [[FOO3:%.*]] = getelementptr [[BLOCK3:%.*]], ptr addrspace(1) @block3, i32 0, i32 0
; CHECK-NEXT: [[FOO4:%.*]] = load i32, ptr addrspace(1) [[FOO3]], align 4		; CHECK-NEXT: [[FOO4:%.*]] = load i32, ptr addrspace(1) [[FOO3]], align 4
; CHECK-NEXT: [[FOO5:%.*]] = getelementptr [5 x float], ptr addrspace(5) [[F1]], i32 0, i32 [[FOO4]]		; CHECK-NEXT: [[TMP1:%.*]] = insertelement <5 x float> <float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float 2.000000e+00, float 0.000000e+00>, float 3.000000e+00, i32 [[FOO4]]
; CHECK-NEXT: [[TMP3:%.*]] = load <5 x float>, ptr addrspace(5) [[F1]], align 4		; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <5 x float> [[TMP1]], <5 x float> poison, <5 x i32> <i32 3, i32 4, i32 2, i32 3, i32 4>
; CHECK-NEXT: [[TMP4:%.*]] = insertelement <5 x float> [[TMP3]], float 3.000000e+00, i32 [[FOO4]]		; CHECK-NEXT: [[TMP3:%.*]] = extractelement <5 x float> [[TMP2]], i32 0
; CHECK-NEXT: store <5 x float> [[TMP4]], ptr addrspace(5) [[F1]], align 4		; CHECK-NEXT: store float [[TMP3]], ptr addrspace(1) @pv, align 4
; CHECK-NEXT: [[TMP5:%.*]] = load <5 x float>, ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <5 x float> [[TMP5]], <5 x float> poison, <5 x i32> <i32 3, i32 4, i32 2, i32 3, i32 4>
; CHECK-NEXT: store <5 x float> [[TMP6]], ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[TMP7:%.*]] = load <5 x float>, ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[TMP8:%.*]] = extractelement <5 x float> [[TMP7]], i32 0
; CHECK-NEXT: store float [[TMP8]], ptr addrspace(1) @pv, align 4
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
%f1 = alloca [5 x float], addrspace(5)		%f1 = alloca [5 x float], addrspace(5)
store [5 x float] zeroinitializer, ptr addrspace(5) %f1		store [5 x float] zeroinitializer, ptr addrspace(5) %f1

%foo2 = getelementptr [5 x float], ptr addrspace(5) %f1, i32 0, i32 3		%foo2 = getelementptr [5 x float], ptr addrspace(5) %f1, i32 0, i32 3
store float 2.0, ptr addrspace(5) %foo2		store float 2.0, ptr addrspace(5) %foo2

%foo3 = getelementptr %Block3, ptr addrspace(1) @block3, i32 0, i32 0		%foo3 = getelementptr %Block3, ptr addrspace(1) @block3, i32 0, i32 0
%foo4 = load i32, ptr addrspace(1) %foo3		%foo4 = load i32, ptr addrspace(1) %foo3
%foo5 = getelementptr [5 x float], ptr addrspace(5) %f1, i32 0, i32 %foo4		%foo5 = getelementptr [5 x float], ptr addrspace(5) %f1, i32 0, i32 %foo4
store float 3.0, ptr addrspace(5) %foo5		store float 3.0, ptr addrspace(5) %foo5

call void @llvm.memcpy.p5i8.p5i8.i32(ptr addrspace(5) align 4 %f1, ptr addrspace(5) align 4 %foo2, i32 8, i1 false)		call void @llvm.memcpy.p5i8.p5i8.i32(ptr addrspace(5) align 4 %f1, ptr addrspace(5) align 4 %foo2, i32 8, i1 false)
%foo6 = load float, ptr addrspace(5) %f1		%foo6 = load float, ptr addrspace(5) %f1
store float %foo6, ptr addrspace(1) @pv		store float %foo6, ptr addrspace(1) @pv
ret void		ret void
}		}

define amdgpu_vs void @promote_memcpy_identity_aggr() #0 {		define amdgpu_vs void @promote_memcpy_identity_aggr() #0 {
; CHECK-LABEL: @promote_memcpy_identity_aggr(		; CHECK-LABEL: @promote_memcpy_identity_aggr(
; CHECK-NEXT: [[F1:%.*]] = alloca [5 x float], align 4, addrspace(5)		; CHECK-NEXT: store float 0.000000e+00, ptr addrspace(1) @pv, align 4
; CHECK-NEXT: store [5 x float] zeroinitializer, ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[FOO1:%.*]] = getelementptr [5 x float], ptr addrspace(5) [[F1]], i32 0, i32 1
; CHECK-NEXT: [[TMP1:%.*]] = load <5 x float>, ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[TMP2:%.*]] = insertelement <5 x float> [[TMP1]], float 1.000000e+00, i64 1
; CHECK-NEXT: store <5 x float> [[TMP2]], ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[FOO2:%.*]] = getelementptr [5 x float], ptr addrspace(5) [[F1]], i32 0, i32 3
; CHECK-NEXT: [[TMP3:%.*]] = load <5 x float>, ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[TMP4:%.*]] = insertelement <5 x float> [[TMP3]], float 2.000000e+00, i64 3
; CHECK-NEXT: store <5 x float> [[TMP4]], ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[TMP5:%.*]] = load <5 x float>, ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <5 x float> [[TMP5]], <5 x float> poison, <5 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4>
; CHECK-NEXT: store <5 x float> [[TMP6]], ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[TMP7:%.*]] = load <5 x float>, ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[TMP8:%.*]] = extractelement <5 x float> [[TMP7]], i32 0
; CHECK-NEXT: store float [[TMP8]], ptr addrspace(1) @pv, align 4
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
%f1 = alloca [5 x float], addrspace(5)		%f1 = alloca [5 x float], addrspace(5)
store [5 x float] zeroinitializer, ptr addrspace(5) %f1		store [5 x float] zeroinitializer, ptr addrspace(5) %f1
%foo1 = getelementptr [5 x float], ptr addrspace(5) %f1, i32 0, i32 1		%foo1 = getelementptr [5 x float], ptr addrspace(5) %f1, i32 0, i32 1
store float 1.0, ptr addrspace(5) %foo1		store float 1.0, ptr addrspace(5) %foo1
%foo2 = getelementptr [5 x float], ptr addrspace(5) %f1, i32 0, i32 3		%foo2 = getelementptr [5 x float], ptr addrspace(5) %f1, i32 0, i32 3
store float 2.0, ptr addrspace(5) %foo2		store float 2.0, ptr addrspace(5) %foo2
call void @llvm.memcpy.p5i8.p5i8.i32(ptr addrspace(5) align 4 %f1, ptr addrspace(5) align 4 %f1, i32 20, i1 false)		call void @llvm.memcpy.p5i8.p5i8.i32(ptr addrspace(5) align 4 %f1, ptr addrspace(5) align 4 %f1, i32 20, i1 false)
%foo3 = load float, ptr addrspace(5) %f1		%foo3 = load float, ptr addrspace(5) %f1
store float %foo3, ptr addrspace(1) @pv		store float %foo3, ptr addrspace(1) @pv
ret void		ret void
}		}

; TODO: promote alloca even there is a memcpy between different alloca		; TODO: promote alloca even there is a memcpy between different alloca
define amdgpu_vs void @promote_memcpy_two_aggrs() #0 {		define amdgpu_vs void @promote_memcpy_two_aggrs() #0 {
; CHECK-LABEL: @promote_memcpy_two_aggrs(		; CHECK-LABEL: @promote_memcpy_two_aggrs(
; CHECK-NEXT: [[F1:%.*]] = alloca [5 x float], align 4, addrspace(5)		; CHECK-NEXT: [[F1:%.*]] = alloca [5 x float], align 4, addrspace(5)
; CHECK-NEXT: [[F2:%.*]] = alloca [5 x float], align 4, addrspace(5)		; CHECK-NEXT: [[F2:%.*]] = alloca [5 x float], align 4, addrspace(5)
; CHECK-NEXT: store [5 x float] zeroinitializer, ptr addrspace(5) [[F1]], align 4		; CHECK-NEXT: [[DOTFCA_0_GEP1:%.*]] = getelementptr inbounds [5 x float], ptr addrspace(5) [[F1]], i32 0, i32 0
; CHECK-NEXT: store [5 x float] zeroinitializer, ptr addrspace(5) [[F2]], align 4		; CHECK-NEXT: store float 0.000000e+00, ptr addrspace(5) [[DOTFCA_0_GEP1]], align 4
		; CHECK-NEXT: [[DOTFCA_1_GEP2:%.*]] = getelementptr inbounds [5 x float], ptr addrspace(5) [[F1]], i32 0, i32 1
		; CHECK-NEXT: store float 0.000000e+00, ptr addrspace(5) [[DOTFCA_1_GEP2]], align 4
		; CHECK-NEXT: [[DOTFCA_2_GEP3:%.*]] = getelementptr inbounds [5 x float], ptr addrspace(5) [[F1]], i32 0, i32 2
		; CHECK-NEXT: store float 0.000000e+00, ptr addrspace(5) [[DOTFCA_2_GEP3]], align 4
		; CHECK-NEXT: [[DOTFCA_3_GEP4:%.*]] = getelementptr inbounds [5 x float], ptr addrspace(5) [[F1]], i32 0, i32 3
		; CHECK-NEXT: store float 0.000000e+00, ptr addrspace(5) [[DOTFCA_3_GEP4]], align 4
		; CHECK-NEXT: [[DOTFCA_4_GEP5:%.*]] = getelementptr inbounds [5 x float], ptr addrspace(5) [[F1]], i32 0, i32 4
		; CHECK-NEXT: store float 0.000000e+00, ptr addrspace(5) [[DOTFCA_4_GEP5]], align 4
		; CHECK-NEXT: [[DOTFCA_0_GEP:%.*]] = getelementptr inbounds [5 x float], ptr addrspace(5) [[F2]], i32 0, i32 0
		; CHECK-NEXT: store float 0.000000e+00, ptr addrspace(5) [[DOTFCA_0_GEP]], align 4
		; CHECK-NEXT: [[DOTFCA_1_GEP:%.*]] = getelementptr inbounds [5 x float], ptr addrspace(5) [[F2]], i32 0, i32 1
		; CHECK-NEXT: store float 0.000000e+00, ptr addrspace(5) [[DOTFCA_1_GEP]], align 4
		; CHECK-NEXT: [[DOTFCA_2_GEP:%.*]] = getelementptr inbounds [5 x float], ptr addrspace(5) [[F2]], i32 0, i32 2
		; CHECK-NEXT: store float 0.000000e+00, ptr addrspace(5) [[DOTFCA_2_GEP]], align 4
		; CHECK-NEXT: [[DOTFCA_3_GEP:%.*]] = getelementptr inbounds [5 x float], ptr addrspace(5) [[F2]], i32 0, i32 3
		; CHECK-NEXT: store float 0.000000e+00, ptr addrspace(5) [[DOTFCA_3_GEP]], align 4
		; CHECK-NEXT: [[DOTFCA_4_GEP:%.*]] = getelementptr inbounds [5 x float], ptr addrspace(5) [[F2]], i32 0, i32 4
		; CHECK-NEXT: store float 0.000000e+00, ptr addrspace(5) [[DOTFCA_4_GEP]], align 4
; CHECK-NEXT: [[FOO3:%.*]] = getelementptr [[BLOCK3:%.*]], ptr addrspace(1) @block3, i32 0, i32 0		; CHECK-NEXT: [[FOO3:%.*]] = getelementptr [[BLOCK3:%.*]], ptr addrspace(1) @block3, i32 0, i32 0
; CHECK-NEXT: [[FOO4:%.*]] = load i32, ptr addrspace(1) [[FOO3]], align 4		; CHECK-NEXT: [[FOO4:%.*]] = load i32, ptr addrspace(1) [[FOO3]], align 4
; CHECK-NEXT: [[FOO5:%.*]] = getelementptr [5 x float], ptr addrspace(5) [[F1]], i32 0, i32 [[FOO4]]		; CHECK-NEXT: [[FOO5:%.*]] = getelementptr [5 x float], ptr addrspace(5) [[F1]], i32 0, i32 [[FOO4]]
; CHECK-NEXT: store float 3.000000e+00, ptr addrspace(5) [[FOO5]], align 4		; CHECK-NEXT: store float 3.000000e+00, ptr addrspace(5) [[FOO5]], align 4
; CHECK-NEXT: call void @llvm.memcpy.p5.p5.i32(ptr addrspace(5) align 4 [[F2]], ptr addrspace(5) align 4 [[F1]], i32 8, i1 false)		; CHECK-NEXT: call void @llvm.memcpy.p5.p5.i32(ptr addrspace(5) align 4 [[F2]], ptr addrspace(5) align 4 [[F1]], i32 8, i1 false)
; CHECK-NEXT: [[FOO6:%.*]] = getelementptr [5 x float], ptr addrspace(5) [[F2]], i32 0, i32 [[FOO4]]		; CHECK-NEXT: [[FOO6:%.*]] = getelementptr [5 x float], ptr addrspace(5) [[F2]], i32 0, i32 [[FOO4]]
; CHECK-NEXT: [[FOO7:%.*]] = load float, ptr addrspace(5) [[FOO6]], align 4		; CHECK-NEXT: [[FOO7:%.*]] = load float, ptr addrspace(5) [[FOO6]], align 4
; CHECK-NEXT: store float [[FOO7]], ptr addrspace(1) @pv, align 4		; CHECK-NEXT: store float [[FOO7]], ptr addrspace(1) @pv, align 4
; ... (17 unchanged lines collapsed in the diff view) ...
store float %foo7, ptr addrspace(1) @pv		store float %foo7, ptr addrspace(1) @pv
ret void		ret void
}		}

; TODO: promote alloca even there is a memcpy between the alloca and other memory space.		; TODO: promote alloca even there is a memcpy between the alloca and other memory space.
define amdgpu_vs void @promote_memcpy_p1p5_aggr(ptr addrspace(1) inreg %src) #0 {		define amdgpu_vs void @promote_memcpy_p1p5_aggr(ptr addrspace(1) inreg %src) #0 {
; CHECK-LABEL: @promote_memcpy_p1p5_aggr(		; CHECK-LABEL: @promote_memcpy_p1p5_aggr(
; CHECK-NEXT: [[F1:%.*]] = alloca [5 x float], align 4, addrspace(5)		; CHECK-NEXT: [[F1:%.*]] = alloca [5 x float], align 4, addrspace(5)
; CHECK-NEXT: store [5 x float] zeroinitializer, ptr addrspace(5) [[F1]], align 4		; CHECK-NEXT: [[DOTFCA_0_GEP:%.*]] = getelementptr inbounds [5 x float], ptr addrspace(5) [[F1]], i32 0, i32 0
		; CHECK-NEXT: store float 0.000000e+00, ptr addrspace(5) [[DOTFCA_0_GEP]], align 4
		; CHECK-NEXT: [[DOTFCA_1_GEP:%.*]] = getelementptr inbounds [5 x float], ptr addrspace(5) [[F1]], i32 0, i32 1
		; CHECK-NEXT: store float 0.000000e+00, ptr addrspace(5) [[DOTFCA_1_GEP]], align 4
		; CHECK-NEXT: [[DOTFCA_2_GEP:%.*]] = getelementptr inbounds [5 x float], ptr addrspace(5) [[F1]], i32 0, i32 2
		; CHECK-NEXT: store float 0.000000e+00, ptr addrspace(5) [[DOTFCA_2_GEP]], align 4
		; CHECK-NEXT: [[DOTFCA_3_GEP:%.*]] = getelementptr inbounds [5 x float], ptr addrspace(5) [[F1]], i32 0, i32 3
		; CHECK-NEXT: store float 0.000000e+00, ptr addrspace(5) [[DOTFCA_3_GEP]], align 4
		; CHECK-NEXT: [[DOTFCA_4_GEP:%.*]] = getelementptr inbounds [5 x float], ptr addrspace(5) [[F1]], i32 0, i32 4
		; CHECK-NEXT: store float 0.000000e+00, ptr addrspace(5) [[DOTFCA_4_GEP]], align 4
; CHECK-NEXT: [[FOO3:%.*]] = getelementptr [[BLOCK3:%.*]], ptr addrspace(1) @block3, i32 0, i32 0		; CHECK-NEXT: [[FOO3:%.*]] = getelementptr [[BLOCK3:%.*]], ptr addrspace(1) @block3, i32 0, i32 0
; CHECK-NEXT: [[FOO4:%.*]] = load i32, ptr addrspace(1) [[FOO3]], align 4		; CHECK-NEXT: [[FOO4:%.*]] = load i32, ptr addrspace(1) [[FOO3]], align 4
; CHECK-NEXT: [[FOO5:%.*]] = getelementptr [5 x float], ptr addrspace(5) [[F1]], i32 0, i32 [[FOO4]]		; CHECK-NEXT: [[FOO5:%.*]] = getelementptr [5 x float], ptr addrspace(5) [[F1]], i32 0, i32 [[FOO4]]
; CHECK-NEXT: store float 3.000000e+00, ptr addrspace(5) [[FOO5]], align 4		; CHECK-NEXT: store float 3.000000e+00, ptr addrspace(5) [[FOO5]], align 4
; CHECK-NEXT: call void @llvm.memcpy.p1.p5.i32(ptr addrspace(1) align 4 @pv, ptr addrspace(5) align 4 [[F1]], i32 8, i1 false)		; CHECK-NEXT: call void @llvm.memcpy.p1.p5.i32(ptr addrspace(1) align 4 @pv, ptr addrspace(5) align 4 [[F1]], i32 8, i1 false)
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
%f1 = alloca [5 x float], addrspace(5)		%f1 = alloca [5 x float], addrspace(5)
store [5 x float] zeroinitializer, ptr addrspace(5) %f1		store [5 x float] zeroinitializer, ptr addrspace(5) %f1

%foo3 = getelementptr %Block3, ptr addrspace(1) @block3, i32 0, i32 0		%foo3 = getelementptr %Block3, ptr addrspace(1) @block3, i32 0, i32 0
%foo4 = load i32, ptr addrspace(1) %foo3		%foo4 = load i32, ptr addrspace(1) %foo3
%foo5 = getelementptr [5 x float], ptr addrspace(5) %f1, i32 0, i32 %foo4		%foo5 = getelementptr [5 x float], ptr addrspace(5) %f1, i32 0, i32 %foo4
store float 3.0, ptr addrspace(5) %foo5		store float 3.0, ptr addrspace(5) %foo5

call void @llvm.memcpy.p1i8.p5i8.i32(ptr addrspace(1) align 4 @pv, ptr addrspace(5) align 4 %f1, i32 8, i1 false)		call void @llvm.memcpy.p1i8.p5i8.i32(ptr addrspace(1) align 4 @pv, ptr addrspace(5) align 4 %f1, i32 8, i1 false)
ret void		ret void
}		}

define amdgpu_vs void @promote_memcpy_inline_aggr() #0 {		define amdgpu_vs void @promote_memcpy_inline_aggr() #0 {
; CHECK-LABEL: @promote_memcpy_inline_aggr(		; CHECK-LABEL: @promote_memcpy_inline_aggr(
; CHECK-NEXT: [[F1:%.*]] = alloca [5 x float], align 4, addrspace(5)
; CHECK-NEXT: store [5 x float] zeroinitializer, ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[FOO2:%.*]] = getelementptr [5 x float], ptr addrspace(5) [[F1]], i32 0, i32 3
; CHECK-NEXT: [[FOO3:%.*]] = getelementptr [[BLOCK3:%.*]], ptr addrspace(1) @block3, i32 0, i32 0		; CHECK-NEXT: [[FOO3:%.*]] = getelementptr [[BLOCK3:%.*]], ptr addrspace(1) @block3, i32 0, i32 0
; CHECK-NEXT: [[FOO4:%.*]] = load i32, ptr addrspace(1) [[FOO3]], align 4		; CHECK-NEXT: [[FOO4:%.*]] = load i32, ptr addrspace(1) [[FOO3]], align 4
; CHECK-NEXT: [[FOO5:%.*]] = getelementptr [5 x float], ptr addrspace(5) [[F1]], i32 0, i32 [[FOO4]]		; CHECK-NEXT: [[TMP1:%.*]] = insertelement <5 x float> zeroinitializer, float 3.000000e+00, i32 [[FOO4]]
; CHECK-NEXT: [[TMP1:%.*]] = load <5 x float>, ptr addrspace(5) [[F1]], align 4		; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <5 x float> [[TMP1]], <5 x float> poison, <5 x i32> <i32 3, i32 4, i32 2, i32 3, i32 4>
; CHECK-NEXT: [[TMP2:%.*]] = insertelement <5 x float> [[TMP1]], float 3.000000e+00, i32 [[FOO4]]		; CHECK-NEXT: [[TMP3:%.*]] = extractelement <5 x float> [[TMP2]], i32 0
; CHECK-NEXT: store <5 x float> [[TMP2]], ptr addrspace(5) [[F1]], align 4		; CHECK-NEXT: store float [[TMP3]], ptr addrspace(1) @pv, align 4
; CHECK-NEXT: [[TMP3:%.*]] = load <5 x float>, ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <5 x float> [[TMP3]], <5 x float> poison, <5 x i32> <i32 3, i32 4, i32 2, i32 3, i32 4>
; CHECK-NEXT: store <5 x float> [[TMP4]], ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[TMP5:%.*]] = load <5 x float>, ptr addrspace(5) [[F1]], align 4
; CHECK-NEXT: [[TMP6:%.*]] = extractelement <5 x float> [[TMP5]], i32 0
; CHECK-NEXT: store float [[TMP6]], ptr addrspace(1) @pv, align 4
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
%f1 = alloca [5 x float], addrspace(5)		%f1 = alloca [5 x float], addrspace(5)
store [5 x float] zeroinitializer, ptr addrspace(5) %f1		store [5 x float] zeroinitializer, ptr addrspace(5) %f1

%foo2 = getelementptr [5 x float], ptr addrspace(5) %f1, i32 0, i32 3		%foo2 = getelementptr [5 x float], ptr addrspace(5) %f1, i32 0, i32 3
%foo3 = getelementptr %Block3, ptr addrspace(1) @block3, i32 0, i32 0		%foo3 = getelementptr %Block3, ptr addrspace(1) @block3, i32 0, i32 0
%foo4 = load i32, ptr addrspace(1) %foo3		%foo4 = load i32, ptr addrspace(1) %foo3
; ... (11 unchanged lines collapsed in the diff view) ...
declare void @llvm.memcpy.inline.p5i8.p5i8.i32(ptr addrspace(5) nocapture writeonly, ptr addrspace(5) nocapture readonly, i32, i1 immarg)		declare void @llvm.memcpy.inline.p5i8.p5i8.i32(ptr addrspace(5) nocapture writeonly, ptr addrspace(5) nocapture readonly, i32, i1 immarg)
declare void @llvm.memmove.p5i8.p5i8.i32(ptr addrspace(5) nocapture writeonly, ptr addrspace(5) nocapture readonly, i32, i1 immarg)		declare void @llvm.memmove.p5i8.p5i8.i32(ptr addrspace(5) nocapture writeonly, ptr addrspace(5) nocapture readonly, i32, i1 immarg)

@tmp_g = external addrspace(1) global { [4 x double], <2 x double>, <3 x double>, <4 x double> }		@tmp_g = external addrspace(1) global { [4 x double], <2 x double>, <3 x double>, <4 x double> }
@frag_color = external addrspace(1) global <4 x float>		@frag_color = external addrspace(1) global <4 x float>

define amdgpu_ps void @promote_double_aggr() #0 {		define amdgpu_ps void @promote_double_aggr() #0 {
; CHECK-LABEL: @promote_double_aggr(		; CHECK-LABEL: @promote_double_aggr(
; CHECK-NEXT: [[S:%.*]] = alloca [2 x double], align 8, addrspace(5)
; CHECK-NEXT: [[FOO:%.*]] = getelementptr { [4 x double], <2 x double>, <3 x double>, <4 x double> }, ptr addrspace(1) @tmp_g, i32 0, i32 0, i32 0		; CHECK-NEXT: [[FOO:%.*]] = getelementptr { [4 x double], <2 x double>, <3 x double>, <4 x double> }, ptr addrspace(1) @tmp_g, i32 0, i32 0, i32 0
; CHECK-NEXT: [[FOO1:%.*]] = load double, ptr addrspace(1) [[FOO]], align 8		; CHECK-NEXT: [[FOO1:%.*]] = load double, ptr addrspace(1) [[FOO]], align 8
; CHECK-NEXT: [[FOO2:%.*]] = getelementptr { [4 x double], <2 x double>, <3 x double>, <4 x double> }, ptr addrspace(1) @tmp_g, i32 0, i32 0, i32 1		; CHECK-NEXT: [[FOO2:%.*]] = getelementptr { [4 x double], <2 x double>, <3 x double>, <4 x double> }, ptr addrspace(1) @tmp_g, i32 0, i32 0, i32 1
; CHECK-NEXT: [[FOO3:%.*]] = load double, ptr addrspace(1) [[FOO2]], align 8		; CHECK-NEXT: [[FOO3:%.*]] = load double, ptr addrspace(1) [[FOO2]], align 8
; CHECK-NEXT: [[FOO4:%.*]] = insertvalue [2 x double] undef, double [[FOO1]], 0		; CHECK-NEXT: [[FOO4:%.*]] = insertvalue [2 x double] undef, double [[FOO1]], 0
; CHECK-NEXT: [[FOO5:%.*]] = insertvalue [2 x double] [[FOO4]], double [[FOO3]], 1		; CHECK-NEXT: [[FOO5:%.*]] = insertvalue [2 x double] [[FOO4]], double [[FOO3]], 1
; CHECK-NEXT: store [2 x double] [[FOO5]], ptr addrspace(5) [[S]], align 8		; CHECK-NEXT: [[FOO5_FCA_0_EXTRACT:%.*]] = extractvalue [2 x double] [[FOO5]], 0
; CHECK-NEXT: [[FOO6:%.*]] = getelementptr [2 x double], ptr addrspace(5) [[S]], i32 0, i32 1		; CHECK-NEXT: [[FOO5_FCA_1_EXTRACT:%.*]] = extractvalue [2 x double] [[FOO5]], 1
; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, ptr addrspace(5) [[S]], align 8		; CHECK-NEXT: [[FOO10:%.*]] = fadd double [[FOO5_FCA_1_EXTRACT]], [[FOO5_FCA_1_EXTRACT]]
; CHECK-NEXT: [[TMP2:%.*]] = extractelement <2 x double> [[TMP1]], i64 1		; CHECK-NEXT: [[FOO16:%.*]] = fadd double [[FOO10]], [[FOO5_FCA_1_EXTRACT]]
; CHECK-NEXT: [[FOO8:%.*]] = getelementptr [2 x double], ptr addrspace(5) [[S]], i32 0, i32 1
; CHECK-NEXT: [[TMP3:%.*]] = load <2 x double>, ptr addrspace(5) [[S]], align 8
; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x double> [[TMP3]], i64 1
; CHECK-NEXT: [[FOO10:%.*]] = fadd double [[TMP2]], [[TMP4]]
; CHECK-NEXT: [[TMP5:%.*]] = load <2 x double>, ptr addrspace(5) [[S]], align 8
; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> [[TMP5]], double [[FOO10]], i32 0
; CHECK-NEXT: store <2 x double> [[TMP6]], ptr addrspace(5) [[S]], align 8
; CHECK-NEXT: [[TMP7:%.*]] = load <2 x double>, ptr addrspace(5) [[S]], align 8
; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x double> [[TMP7]], i32 0
; CHECK-NEXT: [[FOO14:%.*]] = getelementptr [2 x double], ptr addrspace(5) [[S]], i32 0, i32 1
; CHECK-NEXT: [[TMP9:%.*]] = load <2 x double>, ptr addrspace(5) [[S]], align 8
; CHECK-NEXT: [[TMP10:%.*]] = extractelement <2 x double> [[TMP9]], i64 1
; CHECK-NEXT: [[FOO16:%.*]] = fadd double [[TMP8]], [[TMP10]]
; CHECK-NEXT: [[FOO17:%.*]] = fptrunc double [[FOO16]] to float		; CHECK-NEXT: [[FOO17:%.*]] = fptrunc double [[FOO16]] to float
; CHECK-NEXT: [[FOO18:%.*]] = insertelement <4 x float> undef, float [[FOO17]], i32 0		; CHECK-NEXT: [[FOO18:%.*]] = insertelement <4 x float> undef, float [[FOO17]], i32 0
; CHECK-NEXT: [[FOO19:%.*]] = insertelement <4 x float> [[FOO18]], float [[FOO17]], i32 1		; CHECK-NEXT: [[FOO19:%.*]] = insertelement <4 x float> [[FOO18]], float [[FOO17]], i32 1
; CHECK-NEXT: [[FOO20:%.*]] = insertelement <4 x float> [[FOO19]], float [[FOO17]], i32 2		; CHECK-NEXT: [[FOO20:%.*]] = insertelement <4 x float> [[FOO19]], float [[FOO17]], i32 2
; CHECK-NEXT: [[FOO21:%.*]] = insertelement <4 x float> [[FOO20]], float [[FOO17]], i32 3		; CHECK-NEXT: [[FOO21:%.*]] = insertelement <4 x float> [[FOO20]], float [[FOO17]], i32 3
; CHECK-NEXT: store <4 x float> [[FOO21]], ptr addrspace(1) @frag_color, align 16		; CHECK-NEXT: store <4 x float> [[FOO21]], ptr addrspace(1) @frag_color, align 16
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
Show All 23 Lines	;
store <4 x float> %foo21, ptr addrspace(1) @frag_color		store <4 x float> %foo21, ptr addrspace(1) @frag_color
ret void		ret void
}		}

; Don't crash on a type that isn't a valid vector element.		; Don't crash on a type that isn't a valid vector element.
define amdgpu_kernel void @alloca_struct() #0 {		define amdgpu_kernel void @alloca_struct() #0 {
; CHECK-LABEL: @alloca_struct(		; CHECK-LABEL: @alloca_struct(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[TMP0:%.*]] = call noalias nonnull dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.dispatch.ptr()
; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i32, ptr addrspace(4) [[TMP0]], i64 1
; CHECK-NEXT: [[TMP2:%.*]] = load i32, ptr addrspace(4) [[TMP1]], align 4, !invariant.load !0
; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i32, ptr addrspace(4) [[TMP0]], i64 2
; CHECK-NEXT: [[TMP4:%.*]] = load i32, ptr addrspace(4) [[TMP3]], align 4, !range [[RNG1:![0-9]+]], !invariant.load !0
; CHECK-NEXT: [[TMP5:%.*]] = lshr i32 [[TMP2]], 16
; CHECK-NEXT: [[TMP6:%.*]] = call i32 @llvm.amdgcn.workitem.id.x(), !range [[RNG2:![0-9]+]]
; CHECK-NEXT: [[TMP7:%.*]] = call i32 @llvm.amdgcn.workitem.id.y(), !range [[RNG2]]
; CHECK-NEXT: [[TMP8:%.*]] = call i32 @llvm.amdgcn.workitem.id.z(), !range [[RNG2]]
; CHECK-NEXT: [[TMP9:%.*]] = mul nuw nsw i32 [[TMP5]], [[TMP4]]
; CHECK-NEXT: [[TMP10:%.*]] = mul i32 [[TMP9]], [[TMP6]]
; CHECK-NEXT: [[TMP11:%.*]] = mul nuw nsw i32 [[TMP7]], [[TMP4]]
; CHECK-NEXT: [[TMP12:%.*]] = add i32 [[TMP10]], [[TMP11]]
; CHECK-NEXT: [[TMP13:%.*]] = add i32 [[TMP12]], [[TMP8]]
; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds [1024 x [2 x %struct]], ptr addrspace(3) @alloca_struct.alloca, i32 0, i32 [[TMP13]]
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
entry:		entry:
%alloca = alloca [2 x %struct], align 4, addrspace(5)		%alloca = alloca [2 x %struct], align 4, addrspace(5)
ret void		ret void
}		}

llvm/test/CodeGen/AMDGPU/promote-alloca-globals.ll

	; RUN: opt -data-layout=A5 -S -mtriple=amdgcn-unknown-unknown -passes=amdgpu-promote-alloca < %s | FileCheck -check-prefix=IR %s			; RUN: opt -data-layout=A5 -S -mtriple=amdgcn-unknown-unknown -passes=amdgpu-promote-alloca < %s | FileCheck -check-prefix=IR %s
	; RUN: llc -march=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=ASM %s			; RUN: llc -march=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=ASM %s


	@global_array0 = internal unnamed_addr addrspace(3) global [750 x [10 x i32]] undef, align 4			@global_array0 = internal unnamed_addr addrspace(3) global [750 x [10 x i32]] undef, align 4
	@global_array1 = internal unnamed_addr addrspace(3) global [750 x [10 x i32]] undef, align 4			@global_array1 = internal unnamed_addr addrspace(3) global [750 x [10 x i32]] undef, align 4

	; IR-LABEL: define amdgpu_kernel void @promote_alloca_size_256(ptr addrspace(1) nocapture %out, ptr addrspace(1) nocapture %in) {			; IR-LABEL: define amdgpu_kernel void @promote_alloca_size_256(ptr addrspace(1) nocapture %out, ptr addrspace(1) nocapture %in) {
	; IR: alloca [10 x i32]			; IR-NOT: alloca [10 x i32]
	; ASM-LABEL: {{^}}promote_alloca_size_256:			; ASM-LABEL: {{^}}promote_alloca_size_256:
	; ASM: .amdgpu_lds llvm.amdgcn.kernel.promote_alloca_size_256.lds, 60000, 16			; ASM: .amdgpu_lds llvm.amdgcn.kernel.promote_alloca_size_256.lds, 60000, 16
	; ASM-NOT: .amdgpu_lds			; ASM-NOT: .amdgpu_lds

	define amdgpu_kernel void @promote_alloca_size_256(ptr addrspace(1) nocapture %out, ptr addrspace(1) nocapture %in) {			define amdgpu_kernel void @promote_alloca_size_256(ptr addrspace(1) nocapture %out, ptr addrspace(1) nocapture %in) {
	entry:			entry:
	%stack = alloca [10 x i32], align 4, addrspace(5)			%stack = alloca [10 x i32], align 4, addrspace(5)
	%tmp = load i32, ptr addrspace(1) %in, align 4			%tmp = load i32, ptr addrspace(1) %in, align 4

llvm/test/CodeGen/AMDGPU/promote-alloca-loadstores.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 2
				; RUN: opt -S -mtriple=amdgcn-unknown-amdhsa -passes=amdgpu-promote-alloca < %s | FileCheck %s

				target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5"

				define amdgpu_kernel void @test_overwrite(i64 %val, i1 %cond) {
				; CHECK-LABEL: define amdgpu_kernel void @test_overwrite
				; CHECK-SAME: (i64 [[VAL:%.*]], i1 [[COND:%.*]]) {
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 [[COND]], label [[LOOP:%.*]], label [[END:%.*]]
				; CHECK: loop:
				; CHECK-NEXT: [[PROMOTEALLOCA:%.*]] = phi <3 x i64> [ [[TMP2:%.*]], [[LOOP]] ], [ <i64 43, i64 undef, i64 undef>, [[ENTRY:%.*]] ]
				; CHECK-NEXT: [[TMP0:%.*]] = extractelement <3 x i64> [[PROMOTEALLOCA]], i32 0
				; CHECK-NEXT: [[TMP1:%.*]] = insertelement <3 x i64> [[PROMOTEALLOCA]], i64 68, i32 0
				; CHECK-NEXT: [[TMP2]] = insertelement <3 x i64> [[TMP1]], i64 32, i32 0
				; CHECK-NEXT: [[LOOP_CC:%.*]] = icmp ne i64 [[TMP0]], 68
				; CHECK-NEXT: br i1 [[LOOP_CC]], label [[LOOP]], label [[END]]
				; CHECK: end:
				; CHECK-NEXT: [[PROMOTEALLOCA1:%.*]] = phi <3 x i64> [ [[TMP2]], [[LOOP]] ], [ <i64 43, i64 undef, i64 undef>, [[ENTRY]] ]
				; CHECK-NEXT: [[TMP3:%.*]] = extractelement <3 x i64> [[PROMOTEALLOCA1]], i32 0
				; CHECK-NEXT: ret void
				;
				entry:
				%stack = alloca [3 x i64], align 4, addrspace(5)
				store i64 43, ptr addrspace(5) %stack
				br i1 %cond, label %loop, label %end

				loop:
				%load.0 = load i64, ptr addrspace(5) %stack
				store i64 68, ptr addrspace(5) %stack
				%load.1 = load i64, ptr addrspace(5) %stack
				store i64 32, ptr addrspace(5) %stack
				%loop.cc = icmp ne i64 %load.0, %load.1
				br i1 %loop.cc, label %loop, label %end

				end:
				%reload = load i64, ptr addrspace(5) %stack
				ret void
				}

				define amdgpu_kernel void @test_no_overwrite(i64 %val, i1 %cond) {
				; CHECK-LABEL: define amdgpu_kernel void @test_no_overwrite
				; CHECK-SAME: (i64 [[VAL:%.*]], i1 [[COND:%.*]]) {
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 [[COND]], label [[LOOP:%.*]], label [[END:%.*]]
				; CHECK: loop:
				; CHECK-NEXT: [[PROMOTEALLOCA:%.*]] = phi <3 x i64> [ [[TMP1:%.*]], [[LOOP]] ], [ <i64 43, i64 undef, i64 undef>, [[ENTRY:%.*]] ]
				; CHECK-NEXT: [[TMP0:%.*]] = extractelement <3 x i64> [[PROMOTEALLOCA]], i32 0
				; CHECK-NEXT: [[TMP1]] = insertelement <3 x i64> [[PROMOTEALLOCA]], i64 32, i32 1
				; CHECK-NEXT: [[LOOP_CC:%.*]] = icmp ne i64 [[TMP0]], 32
				; CHECK-NEXT: br i1 [[LOOP_CC]], label [[LOOP]], label [[END]]
				; CHECK: end:
				; CHECK-NEXT: [[PROMOTEALLOCA1:%.*]] = phi <3 x i64> [ [[TMP1]], [[LOOP]] ], [ <i64 43, i64 undef, i64 undef>, [[ENTRY]] ]
				; CHECK-NEXT: [[TMP2:%.*]] = extractelement <3 x i64> [[PROMOTEALLOCA1]], i32 0
				; CHECK-NEXT: [[TMP3:%.*]] = extractelement <3 x i64> [[PROMOTEALLOCA1]], i32 1
				; CHECK-NEXT: ret void
				;
				entry:
				%stack = alloca [3 x i64], align 4, addrspace(5)
				%stack.1 = getelementptr inbounds i64, ptr addrspace(5) %stack, i32 1
				store i64 43, ptr addrspace(5) %stack
				br i1 %cond, label %loop, label %end

				loop:
				%load = load i64, ptr addrspace(5) %stack
				store i64 32, ptr addrspace(5) %stack.1
				%loop.cc = icmp ne i64 %load, 32
				br i1 %loop.cc, label %loop, label %end

				end:
				%reload = load i64, ptr addrspace(5) %stack
				%reload.1 = load i64, ptr addrspace(5) %stack.1
				ret void
				}

				define ptr @alloca_load_store_ptr64_full_ivec(ptr %arg) {
				; CHECK-LABEL: define ptr @alloca_load_store_ptr64_full_ivec
				; CHECK-SAME: (ptr [[ARG:%.*]]) {
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = ptrtoint ptr [[ARG]] to i64
				; CHECK-NEXT: [[TMP1:%.*]] = bitcast i64 [[TMP0]] to <8 x i8>
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast <8 x i8> [[TMP1]] to i64
				; CHECK-NEXT: [[TMP3:%.*]] = inttoptr i64 [[TMP2]] to ptr
				; CHECK-NEXT: ret ptr [[TMP3]]
				;
				entry:
				%alloca = alloca [8 x i8], align 8, addrspace(5)
				store ptr %arg, ptr addrspace(5) %alloca, align 8
				%tmp = load ptr, ptr addrspace(5) %alloca, align 8
				ret ptr %tmp
				arsenm: Add some load/store vector-of-pointer cases. Also mix different pointer sizes.
				}

				define ptr addrspace(3) @alloca_load_store_ptr32_full_ivec(ptr addrspace(3) %arg) {
				; CHECK-LABEL: define ptr addrspace(3) @alloca_load_store_ptr32_full_ivec
				; CHECK-SAME: (ptr addrspace(3) [[ARG:%.*]]) {
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = ptrtoint ptr addrspace(3) [[ARG]] to i32
				; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32 [[TMP0]] to <4 x i8>
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast <4 x i8> [[TMP1]] to i32
				; CHECK-NEXT: [[TMP3:%.*]] = inttoptr i32 [[TMP2]] to ptr addrspace(3)
				; CHECK-NEXT: ret ptr addrspace(3) [[TMP3]]
				;
				entry:
				%alloca = alloca [4 x i8], align 8, addrspace(5)
				store ptr addrspace(3) %arg, ptr addrspace(5) %alloca, align 8
				%tmp = load ptr addrspace(3), ptr addrspace(5) %alloca, align 8
				ret ptr addrspace(3) %tmp
				}

				define <4 x ptr addrspace(3)> @alloca_load_store_ptr_mixed_full_ptrvec(<2 x ptr> %arg) {
				; CHECK-LABEL: define <4 x ptr addrspace(3)> @alloca_load_store_ptr_mixed_full_ptrvec
				; CHECK-SAME: (<2 x ptr> [[ARG:%.*]]) {
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = ptrtoint <2 x ptr> [[ARG]] to <2 x i64>
				; CHECK-NEXT: [[TMP1:%.*]] = bitcast <2 x i64> [[TMP0]] to <4 x i32>
				; CHECK-NEXT: [[TMP2:%.*]] = inttoptr <4 x i32> [[TMP1]] to <4 x ptr addrspace(3)>
				; CHECK-NEXT: ret <4 x ptr addrspace(3)> [[TMP2]]
				;
				entry:
				%alloca = alloca [4 x i32], align 8, addrspace(5)
				store <2 x ptr> %arg, ptr addrspace(5) %alloca, align 8
				%tmp = load <4 x ptr addrspace(3)>, ptr addrspace(5) %alloca, align 8
				ret <4 x ptr addrspace(3)> %tmp
				}

				; Currently rejected due to the store not being cast-able.
				; TODO: We should probably be able to vectorize this
				define void @alloca_load_store_ptr_mixed_ptrvec(<2 x ptr addrspace(3)> %arg) {
				arsenm: There was a recent bug filed that amounts to not handling this (it didn't use pointers, just different-sized vectors).
				Pierre-vh: Yes, I just saw it. I'd rather fix this in a separate patch; this patch is already quite large, and if I do too much in it I'm afraid it'll make potential issues harder to track down. I think we just need to use something other than `isBitOrNoopPointerCastable`. It's too limited because it doesn't take into account that we can use an intermediate cast (like the cast to int for ptr -> vec).
				; CHECK-LABEL: define void @alloca_load_store_ptr_mixed_ptrvec
				; CHECK-SAME: (<2 x ptr addrspace(3)> [[ARG:%.*]]) {
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[ALLOCA:%.*]] = alloca [8 x i32], align 8, addrspace(5)
				; CHECK-NEXT: store <2 x ptr addrspace(3)> [[ARG]], ptr addrspace(5) [[ALLOCA]], align 8
				; CHECK-NEXT: [[TMP:%.*]] = load <2 x ptr addrspace(3)>, ptr addrspace(5) [[ALLOCA]], align 8
				; CHECK-NEXT: [[TMP_FULL:%.*]] = load <4 x ptr addrspace(3)>, ptr addrspace(5) [[ALLOCA]], align 8
				; CHECK-NEXT: ret void
				;
				entry:
				%alloca = alloca [8 x i32], align 8, addrspace(5)
				store <2 x ptr addrspace(3)> %arg, ptr addrspace(5) %alloca, align 8
				%tmp = load <2 x ptr addrspace(3)>, ptr addrspace(5) %alloca, align 8
				%tmp.full = load <4 x ptr addrspace(3)>, ptr addrspace(5) %alloca, align 8
				ret void
				}

				; Will not vectorize because we're accessing a 64 bit vector with a 32 bits pointer.
				define ptr addrspace(3) @alloca_load_store_ptr_mixed_full_ivec(ptr addrspace(3) %arg) {
				; CHECK-LABEL: define ptr addrspace(3) @alloca_load_store_ptr_mixed_full_ivec
				; CHECK-SAME: (ptr addrspace(3) [[ARG:%.*]]) {
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[ALLOCA:%.*]] = alloca [8 x i8], align 8, addrspace(5)
				; CHECK-NEXT: store ptr addrspace(3) [[ARG]], ptr addrspace(5) [[ALLOCA]], align 8
				; CHECK-NEXT: [[TMP:%.*]] = load ptr addrspace(3), ptr addrspace(5) [[ALLOCA]], align 8
				; CHECK-NEXT: ret ptr addrspace(3) [[TMP]]
				;
				entry:
				%alloca = alloca [8 x i8], align 8, addrspace(5)
				store ptr addrspace(3) %arg, ptr addrspace(5) %alloca, align 8
				%tmp = load ptr addrspace(3), ptr addrspace(5) %alloca, align 8
				ret ptr addrspace(3) %tmp
				}
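
The `*_full_ivec` cases in this file differ only in how the access width compares with the alloca width. The sketch below is a rough Python model of that size rule as the test comments describe it; it is not the pass's code. The pointer widths are read off the `target datalayout` string in this test file, and `full_access_matches` is a hypothetical name used only for illustration.

```python
# Pointer widths in bits per address space, taken from the test's datalayout
# string (p:64:64, p1:64:64, p2:32:32, p3:32:32, p4:64:64, p5:32:32, p6:32:32).
POINTER_BITS = {0: 64, 1: 64, 2: 32, 3: 32, 4: 64, 5: 32, 6: 32}


def full_access_matches(num_elems: int, elem_bits: int, access_bits: int) -> bool:
    """Assumed rule: a whole-alloca access must be exactly as wide as the alloca."""
    return num_elems * elem_bits == access_bits


# alloca [8 x i8] accessed as `ptr` (64 bits): promoted in the test.
print(full_access_matches(8, 8, POINTER_BITS[0]))  # True
# alloca [4 x i8] accessed as `ptr addrspace(3)` (32 bits): promoted.
print(full_access_matches(4, 8, POINTER_BITS[3]))  # True
# alloca [8 x i8] accessed as `ptr addrspace(3)`: left as an alloca.
print(full_access_matches(8, 8, POINTER_BITS[3]))  # False
```

Matching sizes is only part of the picture: per the inline discussion on this file, whether the types can be cast to each other (`isBitOrNoopPointerCastable`) also decides whether an access is accepted.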

llvm/test/CodeGen/AMDGPU/promote-alloca-memset.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -S -mtriple=amdgcn-unknown-amdhsa -passes=amdgpu-promote-alloca,sroa < %s | FileCheck %s			; RUN: opt -S -mtriple=amdgcn-unknown-amdhsa -passes=amdgpu-promote-alloca < %s | FileCheck %s

	; Checks that memsets don't block PromoteAlloca.			; Checks that memsets don't block PromoteAlloca.

	; Note: memsets are just updated with the new type size. They are not eliminated which means
	; the original alloca also stay. This puts a bit more load on SROA.
	; If PromoteAlloca is moved to SSAUpdater, we could just entirely replace the memsets with
	; e.g. ConstantAggregate.

	define amdgpu_kernel void @memset_all_zero(i64 %val) {			define amdgpu_kernel void @memset_all_zero(i64 %val) {
	; CHECK-LABEL: @memset_all_zero(			; CHECK-LABEL: @memset_all_zero(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = insertelement <6 x i64> zeroinitializer, i64 [[VAL:%.*]], i32 0			; CHECK-NEXT: [[TMP0:%.*]] = insertelement <6 x i64> zeroinitializer, i64 [[VAL:%.*]], i32 0
	; CHECK-NEXT: [[TMP1:%.*]] = extractelement <6 x i64> [[TMP0]], i32 0			; CHECK-NEXT: [[TMP1:%.*]] = insertelement <6 x i64> [[TMP0]], i64 [[VAL]], i64 1
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <6 x i64> [[TMP0]], i64 [[VAL]], i64 1
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%stack = alloca [6 x i64], align 4, addrspace(5)			%stack = alloca [6 x i64], align 4, addrspace(5)
	call void @llvm.memset.p5.i64(ptr addrspace(5) %stack, i8 0, i64 48, i1 false)			call void @llvm.memset.p5.i64(ptr addrspace(5) %stack, i8 0, i64 48, i1 false)
	store i64 %val, ptr addrspace(5) %stack			store i64 %val, ptr addrspace(5) %stack
	%reload = load i64, ptr addrspace(5) %stack			%reload = load i64, ptr addrspace(5) %stack
	%stack.1 = getelementptr [6 x i64], ptr addrspace(5) %stack, i64 0, i64 1			%stack.1 = getelementptr [6 x i64], ptr addrspace(5) %stack, i64 0, i64 1
	store i64 %val, ptr addrspace(5) %stack.1			store i64 %val, ptr addrspace(5) %stack.1
	ret void			ret void
	}			}

	define amdgpu_kernel void @memset_all_5(i64 %val) {			define amdgpu_kernel void @memset_all_5(i64 %val) {
	; CHECK-LABEL: @memset_all_5(			; CHECK-LABEL: @memset_all_5(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = insertelement <4 x i64> <i64 361700864190383365, i64 361700864190383365, i64 361700864190383365, i64 361700864190383365>, i64 [[VAL:%.*]], i32 0			; CHECK-NEXT: [[TMP0:%.*]] = insertelement <4 x i64> <i64 361700864190383365, i64 361700864190383365, i64 361700864190383365, i64 361700864190383365>, i64 [[VAL:%.*]], i32 0
	; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x i64> [[TMP0]], i32 0			; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x i64> [[TMP0]], i64 [[VAL]], i64 1
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <4 x i64> [[TMP0]], i64 [[VAL]], i64 1
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%stack = alloca [4 x i64], align 4, addrspace(5)			%stack = alloca [4 x i64], align 4, addrspace(5)
	call void @llvm.memset.p5.i64(ptr addrspace(5) %stack, i8 5, i64 32, i1 false)			call void @llvm.memset.p5.i64(ptr addrspace(5) %stack, i8 5, i64 32, i1 false)
	store i64 %val, ptr addrspace(5) %stack			store i64 %val, ptr addrspace(5) %stack
	%reload = load i64, ptr addrspace(5) %stack			%reload = load i64, ptr addrspace(5) %stack
	%stack.1 = getelementptr [6 x i64], ptr addrspace(5) %stack, i64 0, i64 1			%stack.1 = getelementptr [6 x i64], ptr addrspace(5) %stack, i64 0, i64 1
	store i64 %val, ptr addrspace(5) %stack.1			store i64 %val, ptr addrspace(5) %stack.1
	ret void			ret void
	}			}
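
The constant `361700864190383365` in the `memset_all_5` checks is the memset byte `5` repeated across all eight bytes of an `i64` (`0x0505050505050505`); in the same way, a memset of `0` yields `zeroinitializer`. A minimal Python sketch of that arithmetic follows. `splat_byte` is a hypothetical helper name chosen for illustration, not something taken from the pass.

```python
def splat_byte(byte: int, elem_bits: int = 64) -> int:
    """Repeat one byte across an integer of elem_bits bits, as memset would."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("memset value must fit in one byte")
    if elem_bits % 8 != 0:
        raise ValueError("element width must be a whole number of bytes")
    # Byte order does not matter here because every byte is identical.
    return int.from_bytes(bytes([byte]) * (elem_bits // 8), "little")


print(hex(splat_byte(5)))  # 0x505050505050505
print(splat_byte(5))       # 361700864190383365
print(splat_byte(0))       # 0
```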

	define amdgpu_kernel void @memset_volatile_nopromote(i64 %val) {			define amdgpu_kernel void @memset_volatile_nopromote(i64 %val) {
	; CHECK-LABEL: @memset_volatile_nopromote(			; CHECK-LABEL: @memset_volatile_nopromote(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[STACK_SROA_0:%.*]] = alloca i64, align 8, addrspace(5)			; CHECK-NEXT: [[STACK:%.*]] = alloca [4 x i64], align 4, addrspace(5)
	; CHECK-NEXT: [[STACK_SROA_2:%.*]] = alloca [3 x i64], align 8, addrspace(5)			; CHECK-NEXT: call void @llvm.memset.p5.i64(ptr addrspace(5) [[STACK]], i8 0, i64 32, i1 true)
	; CHECK-NEXT: call void @llvm.memset.p5.i64(ptr addrspace(5) align 8 [[STACK_SROA_0]], i8 0, i64 8, i1 true)			; CHECK-NEXT: store i64 [[VAL:%.*]], ptr addrspace(5) [[STACK]], align 4
	; CHECK-NEXT: call void @llvm.memset.p5.i64(ptr addrspace(5) align 8 [[STACK_SROA_2]], i8 0, i64 24, i1 true)
	; CHECK-NEXT: store i64 [[VAL:%.*]], ptr addrspace(5) [[STACK_SROA_0]], align 8
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%stack = alloca [4 x i64], align 4, addrspace(5)			%stack = alloca [4 x i64], align 4, addrspace(5)
	call void @llvm.memset.p5.i64(ptr addrspace(5) %stack, i8 0, i64 32, i1 true)			call void @llvm.memset.p5.i64(ptr addrspace(5) %stack, i8 0, i64 32, i1 true)
	store i64 %val, ptr addrspace(5) %stack			store i64 %val, ptr addrspace(5) %stack
	ret void			ret void
	}			}

	define amdgpu_kernel void @memset_badsize_nopromote(i64 %val) {			define amdgpu_kernel void @memset_badsize_nopromote(i64 %val) {
	; CHECK-LABEL: @memset_badsize_nopromote(			; CHECK-LABEL: @memset_badsize_nopromote(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[STACK_SROA_0:%.*]] = alloca i64, align 8, addrspace(5)			; CHECK-NEXT: [[STACK:%.*]] = alloca [4 x i64], align 4, addrspace(5)
	; CHECK-NEXT: [[STACK_SROA_2:%.*]] = alloca [23 x i8], align 4, addrspace(5)			; CHECK-NEXT: call void @llvm.memset.p5.i64(ptr addrspace(5) [[STACK]], i8 0, i64 31, i1 true)
	; CHECK-NEXT: call void @llvm.memset.p5.i64(ptr addrspace(5) align 8 [[STACK_SROA_0]], i8 0, i64 8, i1 true)			; CHECK-NEXT: store i64 [[VAL:%.*]], ptr addrspace(5) [[STACK]], align 4
	; CHECK-NEXT: call void @llvm.memset.p5.i64(ptr addrspace(5) align 4 [[STACK_SROA_2]], i8 0, i64 23, i1 true)
	; CHECK-NEXT: store i64 [[VAL:%.*]], ptr addrspace(5) [[STACK_SROA_0]], align 8
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%stack = alloca [4 x i64], align 4, addrspace(5)			%stack = alloca [4 x i64], align 4, addrspace(5)
	call void @llvm.memset.p5.i64(ptr addrspace(5) %stack, i8 0, i64 31, i1 true)			call void @llvm.memset.p5.i64(ptr addrspace(5) %stack, i8 0, i64 31, i1 true)
	store i64 %val, ptr addrspace(5) %stack			store i64 %val, ptr addrspace(5) %stack
	ret void			ret void
	}			}

	define amdgpu_kernel void @memset_offset_ptr_nopromote(i64 %val) {			define amdgpu_kernel void @memset_offset_ptr_nopromote(i64 %val) {
	; CHECK-LABEL: @memset_offset_ptr_nopromote(			; CHECK-LABEL: @memset_offset_ptr_nopromote(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[STACK_SROA_1:%.*]] = alloca [3 x i64], align 8, addrspace(5)			; CHECK-NEXT: [[STACK:%.*]] = alloca [4 x i64], align 4, addrspace(5)
	; CHECK-NEXT: call void @llvm.memset.p5.i64(ptr addrspace(5) align 8 [[STACK_SROA_1]], i8 0, i64 24, i1 true)			; CHECK-NEXT: [[GEP:%.*]] = getelementptr [4 x i64], ptr addrspace(5) [[STACK]], i64 0, i64 1
				; CHECK-NEXT: call void @llvm.memset.p5.i64(ptr addrspace(5) [[GEP]], i8 0, i64 24, i1 true)
				; CHECK-NEXT: store i64 [[VAL:%.*]], ptr addrspace(5) [[STACK]], align 4
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%stack = alloca [4 x i64], align 4, addrspace(5)			%stack = alloca [4 x i64], align 4, addrspace(5)
	%gep = getelementptr [4 x i64], ptr addrspace(5) %stack, i64 0, i64 1			%gep = getelementptr [4 x i64], ptr addrspace(5) %stack, i64 0, i64 1
	call void @llvm.memset.p5.i64(ptr addrspace(5) %gep, i8 0, i64 24, i1 true)			call void @llvm.memset.p5.i64(ptr addrspace(5) %gep, i8 0, i64 24, i1 true)
	store i64 %val, ptr addrspace(5) %stack			store i64 %val, ptr addrspace(5) %stack
	ret void			ret void
	}			}

	declare void @llvm.memset.p5.i64(ptr addrspace(5) nocapture writeonly, i8, i64, i1 immarg)			declare void @llvm.memset.p5.i64(ptr addrspace(5) nocapture writeonly, i8, i64, i1 immarg)

llvm/test/CodeGen/AMDGPU/promote-alloca-pointer-array.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -S -mtriple=amdgcn-- -mcpu=fiji -data-layout=A5 -passes=amdgpu-promote-alloca < %s | FileCheck -check-prefix=OPT %s			; RUN: opt -S -mtriple=amdgcn-- -mcpu=fiji -data-layout=A5 -passes=amdgpu-promote-alloca < %s | FileCheck -check-prefix=OPT %s

	define i64 @test_pointer_array(i64 %v) {			define i64 @test_pointer_array(i64 %v) {
	; OPT-LABEL: @test_pointer_array(			; OPT-LABEL: @test_pointer_array(
	; OPT-NEXT: entry:			; OPT-NEXT: entry:
	; OPT-NEXT: [[A:%.*]] = alloca [3 x ptr], align 16, addrspace(5)			; OPT-NEXT: [[TMP0:%.*]] = inttoptr i64 [[V:%.*]] to ptr
	; OPT-NEXT: [[TMP0:%.*]] = load <3 x ptr>, ptr addrspace(5) [[A]], align 16			; OPT-NEXT: [[TMP1:%.*]] = insertelement <3 x ptr> undef, ptr [[TMP0]], i32 0
	; OPT-NEXT: [[TMP1:%.*]] = inttoptr i64 [[V:%.*]] to ptr			; OPT-NEXT: [[TMP2:%.*]] = ptrtoint ptr [[TMP0]] to i64
	; OPT-NEXT: [[TMP2:%.*]] = insertelement <3 x ptr> [[TMP0]], ptr [[TMP1]], i32 0			; OPT-NEXT: ret i64 [[TMP2]]
	; OPT-NEXT: store <3 x ptr> [[TMP2]], ptr addrspace(5) [[A]], align 16
	; OPT-NEXT: [[TMP3:%.*]] = load <3 x ptr>, ptr addrspace(5) [[A]], align 16
	; OPT-NEXT: [[TMP4:%.*]] = extractelement <3 x ptr> [[TMP3]], i32 0
	; OPT-NEXT: [[TMP5:%.*]] = ptrtoint ptr [[TMP4]] to i64
	; OPT-NEXT: ret i64 [[TMP5]]
	;			;
	entry:			entry:
	%a = alloca [3 x ptr], align 16, addrspace(5)			%a = alloca [3 x ptr], align 16, addrspace(5)
	store i64 %v, ptr addrspace(5) %a, align 16			store i64 %v, ptr addrspace(5) %a, align 16
	%ld = load i64, ptr addrspace(5) %a, align 16			%ld = load i64, ptr addrspace(5) %a, align 16
	ret i64 %ld			ret i64 %ld
	}			}

llvm/test/CodeGen/AMDGPU/promote-alloca-vector-to-vector.ll

	; RUN: llc -march=amdgcn -mcpu=fiji -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefix=GCN %s			; RUN: llc -march=amdgcn -mcpu=fiji -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefix=GCN %s
	; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefix=GCN %s			; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefix=GCN %s
	; RUN: opt -S -mtriple=amdgcn-- -data-layout=A5 -mcpu=fiji -passes=amdgpu-promote-alloca < %s | FileCheck -check-prefix=OPT %s			; RUN: opt -S -mtriple=amdgcn-- -data-layout=A5 -mcpu=fiji -passes=sroa,amdgpu-promote-alloca < %s | FileCheck -check-prefix=OPT %s

	; GCN-LABEL: {{^}}float4_alloca_store4:			; GCN-LABEL: {{^}}float4_alloca_store4:
	; OPT-LABEL: define amdgpu_kernel void @float4_alloca_store4			; OPT-LABEL: define amdgpu_kernel void @float4_alloca_store4

	; GCN-NOT: buffer_			; GCN-NOT: buffer_
	; GCN: v_cndmask_b32			; GCN: v_cndmask_b32
	; GCN: v_cndmask_b32			; GCN: v_cndmask_b32
	; GCN: v_cndmask_b32_e32 [[RES:v[0-9]+]], 4.0,			; GCN: v_cndmask_b32_e32 [[RES:v[0-9]+]], 4.0,
	; GCN: store_dword v{{.+}}, [[RES]]			; GCN: store_dword v{{.+}}, [[RES]]

	; OPT: %gep = getelementptr inbounds <4 x float>, ptr addrspace(5) %alloca, i32 0, i32 %sel2			; OPT: %0 = extractelement <4 x float> <float 1.000000e+00, float 2.000000e+00, float 3.000000e+00, float 4.000000e+00>, i32 %sel2
	; OPT: store <4 x float> <float 1.000000e+00, float 2.000000e+00, float 3.000000e+00, float 4.000000e+00>, ptr addrspace(5) %alloca, align 4			; OPT: store float %0, ptr addrspace(1) %out, align 4
	; OPT: %0 = load <4 x float>, ptr addrspace(5) %alloca
	; OPT: %1 = extractelement <4 x float> %0, i32 %sel2
	; OPT: store float %1, ptr addrspace(1) %out, align 4

	define amdgpu_kernel void @float4_alloca_store4(ptr addrspace(1) %out, ptr addrspace(3) %dummy_lds) {			define amdgpu_kernel void @float4_alloca_store4(ptr addrspace(1) %out, ptr addrspace(3) %dummy_lds) {
	entry:			entry:
	%alloca = alloca <4 x float>, align 16, addrspace(5)			%alloca = alloca <4 x float>, align 16, addrspace(5)
	%x = tail call i32 @llvm.amdgcn.workitem.id.x()			%x = tail call i32 @llvm.amdgcn.workitem.id.x()
	%y = tail call i32 @llvm.amdgcn.workitem.id.y()			%y = tail call i32 @llvm.amdgcn.workitem.id.y()
	%c1 = icmp uge i32 %x, 3			%c1 = icmp uge i32 %x, 3
	%c2 = icmp uge i32 %y, 3			%c2 = icmp uge i32 %y, 3
	; GCN-NOT: v_cmp_			; GCN-NOT: v_cmp_
	; GCN-NOT: v_cndmask_			; GCN-NOT: v_cndmask_
	; GCN: v_mov_b32_e32 [[ONE:v[0-9]+]], 1.0			; GCN: v_mov_b32_e32 [[ONE:v[0-9]+]], 1.0
	; GCN: v_mov_b32_e32 v{{[0-9]+}}, [[ONE]]			; GCN: v_mov_b32_e32 v{{[0-9]+}}, [[ONE]]
	; GCN: v_mov_b32_e32 v{{[0-9]+}}, [[ONE]]			; GCN: v_mov_b32_e32 v{{[0-9]+}}, [[ONE]]
	; GCN: v_mov_b32_e32 v{{[0-9]+}}, [[ONE]]			; GCN: v_mov_b32_e32 v{{[0-9]+}}, [[ONE]]
	; GCN: store_dwordx4 v{{.+}},			; GCN: store_dwordx4 v{{.+}},

	; OPT: %gep = getelementptr inbounds <4 x float>, ptr addrspace(5) %alloca, i32 0, i32 %sel2			; OPT: %0 = insertelement <4 x float> undef, float 1.000000e+00, i32 %sel2
	; OPT: %0 = load <4 x float>, ptr addrspace(5) %alloca			; OPT: store <4 x float> %0, ptr addrspace(1) %out, align 4
	; OPT: %1 = insertelement <4 x float> %0, float 1.000000e+00, i32 %sel2
	; OPT: store <4 x float> %1, ptr addrspace(5) %alloca
	; OPT: %load = load <4 x float>, ptr addrspace(5) %alloca, align 4
	; OPT: store <4 x float> %load, ptr addrspace(1) %out, align 4

	define amdgpu_kernel void @float4_alloca_load4(ptr addrspace(1) %out, ptr addrspace(3) %dummy_lds) {			define amdgpu_kernel void @float4_alloca_load4(ptr addrspace(1) %out, ptr addrspace(3) %dummy_lds) {
	entry:			entry:
	%alloca = alloca <4 x float>, align 16, addrspace(5)			%alloca = alloca <4 x float>, align 16, addrspace(5)
	%x = tail call i32 @llvm.amdgcn.workitem.id.x()			%x = tail call i32 @llvm.amdgcn.workitem.id.x()
	%y = tail call i32 @llvm.amdgcn.workitem.id.y()			%y = tail call i32 @llvm.amdgcn.workitem.id.y()
	%c1 = icmp uge i32 %x, 3			%c1 = icmp uge i32 %x, 3
	%c2 = icmp uge i32 %y, 3			%c2 = icmp uge i32 %y, 3
	; GCN-LABEL: {{^}}half4_alloca_store4:			; GCN-LABEL: {{^}}half4_alloca_store4:
	; OPT-LABEL: define amdgpu_kernel void @half4_alloca_store4			; OPT-LABEL: define amdgpu_kernel void @half4_alloca_store4

	; GCN-NOT: buffer_			; GCN-NOT: buffer_
	; GCN-DAG: s_mov_b32 s[[SH:[0-9]+]], 0x44004200			; GCN-DAG: s_mov_b32 s[[SH:[0-9]+]], 0x44004200
	; GCN-DAG: s_mov_b32 s[[SL:[0-9]+]], 0x40003c00			; GCN-DAG: s_mov_b32 s[[SL:[0-9]+]], 0x40003c00
	; GCN: v_lshrrev_b64 v[{{[0-9:]+}}], v{{[0-9]+}}, s[[[SL]]:[[SH]]]			; GCN: v_lshrrev_b64 v[{{[0-9:]+}}], v{{[0-9]+}}, s[[[SL]]:[[SH]]]

	; OPT: %gep = getelementptr inbounds <4 x half>, ptr addrspace(5) %alloca, i32 0, i32 %sel2			; OPT: %0 = extractelement <4 x half> <half 0xH3C00, half 0xH4000, half 0xH4200, half 0xH4400>, i32 %sel2
	; OPT: store <4 x half> <half 0xH3C00, half 0xH4000, half 0xH4200, half 0xH4400>, ptr addrspace(5) %alloca, align 2			; OPT: store half %0, ptr addrspace(1) %out, align 2
	; OPT: %0 = load <4 x half>, ptr addrspace(5) %alloca
	; OPT: %1 = extractelement <4 x half> %0, i32 %sel2
	; OPT: store half %1, ptr addrspace(1) %out, align 2

	define amdgpu_kernel void @half4_alloca_store4(ptr addrspace(1) %out, ptr addrspace(3) %dummy_lds) {			define amdgpu_kernel void @half4_alloca_store4(ptr addrspace(1) %out, ptr addrspace(3) %dummy_lds) {
	entry:			entry:
	%alloca = alloca <4 x half>, align 16, addrspace(5)			%alloca = alloca <4 x half>, align 16, addrspace(5)
	%x = tail call i32 @llvm.amdgcn.workitem.id.x()			%x = tail call i32 @llvm.amdgcn.workitem.id.x()
	%y = tail call i32 @llvm.amdgcn.workitem.id.y()			%y = tail call i32 @llvm.amdgcn.workitem.id.y()
	%c1 = icmp uge i32 %x, 3			%c1 = icmp uge i32 %x, 3
	%c2 = icmp uge i32 %y, 3			%c2 = icmp uge i32 %y, 3
	%sel1 = select i1 %c1, i32 1, i32 2			%sel1 = select i1 %c1, i32 1, i32 2
	%sel2 = select i1 %c2, i32 0, i32 %sel1			%sel2 = select i1 %c2, i32 0, i32 %sel1
	%gep = getelementptr inbounds <4 x half>, ptr addrspace(5) %alloca, i32 0, i32 %sel2			%gep = getelementptr inbounds <4 x half>, ptr addrspace(5) %alloca, i32 0, i32 %sel2
	store <4 x half> <half 1.0, half 2.0, half 3.0, half 4.0>, ptr addrspace(5) %alloca, align 2			store <4 x half> <half 1.0, half 2.0, half 3.0, half 4.0>, ptr addrspace(5) %alloca, align 2
	%load = load half, ptr addrspace(5) %gep, align 2			%load = load half, ptr addrspace(5) %gep, align 2
	store half %load, ptr addrspace(1) %out, align 2			store half %load, ptr addrspace(1) %out, align 2
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}half4_alloca_load4:			; GCN-LABEL: {{^}}half4_alloca_load4:
	; OPT-LABEL: define amdgpu_kernel void @half4_alloca_load4			; OPT-LABEL: define amdgpu_kernel void @half4_alloca_load4

	; GCN-NOT: buffer_			; GCN-NOT: buffer_
	; GCN: s_mov_b64 s[{{[0-9:]+}}], 0xffff			; GCN: s_mov_b64 s[{{[0-9:]+}}], 0xffff

	; OPT: %gep = getelementptr inbounds <4 x half>, ptr addrspace(5) %alloca, i32 0, i32 %sel2			; OPT: %0 = insertelement <4 x half> undef, half 0xH3C00, i32 %sel2
	; OPT: %0 = load <4 x half>, ptr addrspace(5) %alloca			; OPT: store <4 x half> %0, ptr addrspace(1) %out, align 2
	; OPT: %1 = insertelement <4 x half> %0, half 0xH3C00, i32 %sel2
	; OPT: store <4 x half> %1, ptr addrspace(5) %alloca
	; OPT: %load = load <4 x half>, ptr addrspace(5) %alloca, align 2
	; OPT: store <4 x half> %load, ptr addrspace(1) %out, align 2

	define amdgpu_kernel void @half4_alloca_load4(ptr addrspace(1) %out, ptr addrspace(3) %dummy_lds) {			define amdgpu_kernel void @half4_alloca_load4(ptr addrspace(1) %out, ptr addrspace(3) %dummy_lds) {
	entry:			entry:
	%alloca = alloca <4 x half>, align 16, addrspace(5)			%alloca = alloca <4 x half>, align 16, addrspace(5)
	%x = tail call i32 @llvm.amdgcn.workitem.id.x()			%x = tail call i32 @llvm.amdgcn.workitem.id.x()
	%y = tail call i32 @llvm.amdgcn.workitem.id.y()			%y = tail call i32 @llvm.amdgcn.workitem.id.y()
	%c1 = icmp uge i32 %x, 3			%c1 = icmp uge i32 %x, 3
	%c2 = icmp uge i32 %y, 3			%c2 = icmp uge i32 %y, 3
	[9 unchanged lines hidden]
	; GCN-LABEL: {{^}}short4_alloca_store4:			; GCN-LABEL: {{^}}short4_alloca_store4:
	; OPT-LABEL: define amdgpu_kernel void @short4_alloca_store4			; OPT-LABEL: define amdgpu_kernel void @short4_alloca_store4

	; GCN-NOT: buffer_			; GCN-NOT: buffer_
	; GCN-DAG: s_mov_b32 s[[SH:[0-9]+]], 0x40003			; GCN-DAG: s_mov_b32 s[[SH:[0-9]+]], 0x40003
	; GCN-DAG: s_mov_b32 s[[SL:[0-9]+]], 0x20001			; GCN-DAG: s_mov_b32 s[[SL:[0-9]+]], 0x20001
	; GCN: v_lshrrev_b64 v[{{[0-9:]+}}], v{{[0-9]+}}, s[[[SL]]:[[SH]]]			; GCN: v_lshrrev_b64 v[{{[0-9:]+}}], v{{[0-9]+}}, s[[[SL]]:[[SH]]]

	; OPT: %gep = getelementptr inbounds <4 x i16>, ptr addrspace(5) %alloca, i32 0, i32 %sel2			; OPT: %0 = extractelement <4 x i16> <i16 1, i16 2, i16 3, i16 4>, i32 %sel2
	; OPT: store <4 x i16> <i16 1, i16 2, i16 3, i16 4>, ptr addrspace(5) %alloca, align 2			; OPT: store i16 %0, ptr addrspace(1) %out, align 2
	; OPT: %0 = load <4 x i16>, ptr addrspace(5) %alloca
	; OPT: %1 = extractelement <4 x i16> %0, i32 %sel2
	; OPT: store i16 %1, ptr addrspace(1) %out, align 2

	define amdgpu_kernel void @short4_alloca_store4(ptr addrspace(1) %out, ptr addrspace(3) %dummy_lds) {			define amdgpu_kernel void @short4_alloca_store4(ptr addrspace(1) %out, ptr addrspace(3) %dummy_lds) {
	entry:			entry:
	%alloca = alloca <4 x i16>, align 16, addrspace(5)			%alloca = alloca <4 x i16>, align 16, addrspace(5)
	%x = tail call i32 @llvm.amdgcn.workitem.id.x()			%x = tail call i32 @llvm.amdgcn.workitem.id.x()
	%y = tail call i32 @llvm.amdgcn.workitem.id.y()			%y = tail call i32 @llvm.amdgcn.workitem.id.y()
	%c1 = icmp uge i32 %x, 3			%c1 = icmp uge i32 %x, 3
	%c2 = icmp uge i32 %y, 3			%c2 = icmp uge i32 %y, 3
	%sel1 = select i1 %c1, i32 1, i32 2			%sel1 = select i1 %c1, i32 1, i32 2
	%sel2 = select i1 %c2, i32 0, i32 %sel1			%sel2 = select i1 %c2, i32 0, i32 %sel1
	%gep = getelementptr inbounds <4 x i16>, ptr addrspace(5) %alloca, i32 0, i32 %sel2			%gep = getelementptr inbounds <4 x i16>, ptr addrspace(5) %alloca, i32 0, i32 %sel2
	store <4 x i16> <i16 1, i16 2, i16 3, i16 4>, ptr addrspace(5) %alloca, align 2			store <4 x i16> <i16 1, i16 2, i16 3, i16 4>, ptr addrspace(5) %alloca, align 2
	%load = load i16, ptr addrspace(5) %gep, align 2			%load = load i16, ptr addrspace(5) %gep, align 2
	store i16 %load, ptr addrspace(1) %out, align 2			store i16 %load, ptr addrspace(1) %out, align 2
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}short4_alloca_load4:			; GCN-LABEL: {{^}}short4_alloca_load4:
	; OPT-LABEL: define amdgpu_kernel void @short4_alloca_load4			; OPT-LABEL: define amdgpu_kernel void @short4_alloca_load4

	; GCN-NOT: buffer_			; GCN-NOT: buffer_
	; GCN: s_mov_b64 s[{{[0-9:]+}}], 0xffff			; GCN: s_mov_b64 s[{{[0-9:]+}}], 0xffff

	; OPT: %gep = getelementptr inbounds <4 x i16>, ptr addrspace(5) %alloca, i32 0, i32 %sel2			; OPT: %0 = insertelement <4 x i16> undef, i16 1, i32 %sel2
	; OPT: %0 = load <4 x i16>, ptr addrspace(5) %alloca			; OPT: store <4 x i16> %0, ptr addrspace(1) %out, align 2
	; OPT: %1 = insertelement <4 x i16> %0, i16 1, i32 %sel2
	; OPT: store <4 x i16> %1, ptr addrspace(5) %alloca
	; OPT: %load = load <4 x i16>, ptr addrspace(5) %alloca, align 2
	; OPT: store <4 x i16> %load, ptr addrspace(1) %out, align 2

	define amdgpu_kernel void @short4_alloca_load4(ptr addrspace(1) %out, ptr addrspace(3) %dummy_lds) {			define amdgpu_kernel void @short4_alloca_load4(ptr addrspace(1) %out, ptr addrspace(3) %dummy_lds) {
	entry:			entry:
	%alloca = alloca <4 x i16>, align 16, addrspace(5)			%alloca = alloca <4 x i16>, align 16, addrspace(5)
	%x = tail call i32 @llvm.amdgcn.workitem.id.x()			%x = tail call i32 @llvm.amdgcn.workitem.id.x()
	%y = tail call i32 @llvm.amdgcn.workitem.id.y()			%y = tail call i32 @llvm.amdgcn.workitem.id.y()
	%c1 = icmp uge i32 %x, 3			%c1 = icmp uge i32 %x, 3
	%c2 = icmp uge i32 %y, 3			%c2 = icmp uge i32 %y, 3
	%sel1 = select i1 %c1, i32 1, i32 2			%sel1 = select i1 %c1, i32 1, i32 2
	%sel2 = select i1 %c2, i32 0, i32 %sel1			%sel2 = select i1 %c2, i32 0, i32 %sel1
	%gep = getelementptr inbounds <4 x i16>, ptr addrspace(5) %alloca, i32 0, i32 %sel2			%gep = getelementptr inbounds <4 x i16>, ptr addrspace(5) %alloca, i32 0, i32 %sel2
	store i16 1, ptr addrspace(5) %gep, align 4			store i16 1, ptr addrspace(5) %gep, align 4
	%load = load <4 x i16>, ptr addrspace(5) %alloca, align 2			%load = load <4 x i16>, ptr addrspace(5) %alloca, align 2
	store <4 x i16> %load, ptr addrspace(1) %out, align 2			store <4 x i16> %load, ptr addrspace(1) %out, align 2
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}ptr_alloca_bitcast:			; GCN-LABEL: {{^}}ptr_alloca_bitcast:
	; OPT-LABEL: define i64 @ptr_alloca_bitcast			; OPT-LABEL: define i64 @ptr_alloca_bitcast

	; GCN-NOT: buffer_			; GCN-NOT: buffer_
	; GCN: v_mov_b32_e32 v1, 0			; GCN: v_mov_b32_e32 v1, 0

	; OPT: %private_iptr = alloca <2 x i32>, align 8, addrspace(5)			; OPT: ret i64 undef
	; OPT: %tmp1 = load i64, ptr addrspace(5) %private_iptr, align 8

	define i64 @ptr_alloca_bitcast() {			define i64 @ptr_alloca_bitcast() {
	entry:			entry:
	%private_iptr = alloca <2 x i32>, align 8, addrspace(5)			%private_iptr = alloca <2 x i32>, align 8, addrspace(5)
	%tmp1 = load i64, ptr addrspace(5) %private_iptr, align 8			%tmp1 = load i64, ptr addrspace(5) %private_iptr, align 8
	ret i64 %tmp1			ret i64 %tmp1
	}			}

	declare i32 @llvm.amdgcn.workitem.id.x()			declare i32 @llvm.amdgcn.workitem.id.x()
	declare i32 @llvm.amdgcn.workitem.id.y()			declare i32 @llvm.amdgcn.workitem.id.y()
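
The updated OPT checks above expect the store/GEP/load sequence on the alloca to collapse into a single `extractelement` or `insertelement` on an SSA vector. A rough way to see why the two forms agree is the Python model below. It is only a sketch: the function names are hypothetical, `None` stands in for `undef` lanes, and nothing here reflects how the pass itself is implemented.

```python
UNDEF = None  # stand-in for an undef vector lane (modelling assumption)


def sel2(x, y):
    """Mirror %c1/%c2/%sel1/%sel2 from the tests; the result is 0, 1 or 2."""
    sel1 = 1 if x >= 3 else 2      # select i1 %c1, i32 1, i32 2
    return 0 if y >= 3 else sel1   # select i1 %c2, i32 0, i32 %sel1


def store4_via_memory(x, y):
    """short4_alloca_store4 before promotion: store a vector, load one lane."""
    alloca = [UNDEF] * 4
    alloca[:] = [1, 2, 3, 4]       # store <4 x i16> <1, 2, 3, 4>, %alloca
    return alloca[sel2(x, y)]      # load i16 through %gep


def store4_promoted(x, y):
    """After promotion: extractelement on the constant vector."""
    return (1, 2, 3, 4)[sel2(x, y)]


def load4_via_memory(x, y):
    """short4_alloca_load4 before promotion: store one lane, load the vector."""
    alloca = [UNDEF] * 4
    alloca[sel2(x, y)] = 1         # store i16 1 through %gep
    return tuple(alloca)           # load <4 x i16> from %alloca


def load4_promoted(x, y):
    """After promotion: insertelement into an undef vector."""
    vec = [UNDEF] * 4
    vec[sel2(x, y)] = 1
    return tuple(vec)
```

In this model both pairs of functions return the same value for every work-item id, which is the property the new `extractelement` and `insertelement` check lines rely on.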

llvm/test/CodeGen/AMDGPU/sroa-before-unroll.ll

	; RUN: opt -mtriple=amdgcn-- -O1 -S < %s | FileCheck %s --check-prefixes=FUNC,LOOP			; RUN: opt -mtriple=amdgcn-- -O1 -S < %s | FileCheck %s --check-prefixes=FUNC,LOOP
	; RUN: opt -mtriple=amdgcn-- -passes='default<O1>' -S < %s | FileCheck %s --check-prefixes=FUNC,LOOP			; RUN: opt -mtriple=amdgcn-- -passes='default<O1>' -S < %s | FileCheck %s --check-prefixes=FUNC,LOOP
	; RUN: opt -mtriple=amdgcn-- -O1 -S -disable-promote-alloca-to-vector < %s | FileCheck %s --check-prefixes=FUNC,FULL-UNROLL			; RUN: opt -mtriple=amdgcn-- -O1 -S -disable-promote-alloca-to-vector < %s | FileCheck %s --check-prefixes=FUNC,FULL-UNROLL
	; RUN: opt -mtriple=amdgcn-- -passes='default<O1>' -S -disable-promote-alloca-to-vector < %s | FileCheck %s --check-prefixes=FUNC,FULL-UNROLL			; RUN: opt -mtriple=amdgcn-- -passes='default<O1>' -S -disable-promote-alloca-to-vector < %s | FileCheck %s --check-prefixes=FUNC,FULL-UNROLL

	target datalayout = "A5"			target datalayout = "A5"

	; This test contains a simple loop that initializes an array declared in			; This test contains a simple loop that initializes an array declared in
	; private memory. This loop would be fully unrolled if we could not SROA			; private memory. This loop would be fully unrolled if we could not SROA
	; the alloca. Check that we successfully eliminate it before the unroll,			; the alloca. Check that we successfully eliminate it before the unroll,
	; so that we do not need to fully unroll it.			; so that we do not need to fully unroll it.

	; FUNC-LABEL: @private_memory			; FUNC-LABEL: @private_memory
	; LOOP-NOT: alloca			; LOOP-NOT: = alloca
	; LOOP: loop.header:			; LOOP: loop.header:
	; LOOP: br i1 %{{[^,]+}}, label %exit, label %loop.header			; LOOP: br i1 %{{[^,]+}}, label %exit, label %loop.header

	; FULL-UNROLL: alloca			; FULL-UNROLL: alloca
	; FULL-UNROLL-COUNT-256: store i32 {{[0-9]+}}, ptr addrspace(5)			; FULL-UNROLL-COUNT-256: store i32 {{[0-9]+}}, ptr addrspace(5)
	; FULL-UNROLL-NOT: br			; FULL-UNROLL-NOT: br

	; FUNC: store i32 %{{[^,]+}}, ptr addrspace(1) %out			; FUNC: store i32 %{{[^,]+}}, ptr addrspace(1) %out
	[27 unchanged lines hidden]

llvm/test/CodeGen/AMDGPU/vector-alloca-bitcast.ll

; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=tonga -mattr=-promote-alloca -verify-machineinstrs < %s | FileCheck -enable-var-scope --check-prefixes=GCN,GCN-ALLOCA %s		; RUN: opt -S -mtriple=amdgcn- -passes=sroa %s -o %t.sroa.ll
; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=tonga -mattr=+promote-alloca -verify-machineinstrs < %s | FileCheck -enable-var-scope --check-prefixes=GCN,GCN-PROMOTE %s		; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=tonga -mattr=-promote-alloca -verify-machineinstrs < %t.sroa.ll | FileCheck -enable-var-scope --check-prefixes=GCN,GCN-ALLOCA %s
; RUN: opt -S -mtriple=amdgcn-- -passes='amdgpu-promote-alloca,sroa,instcombine' < %s | FileCheck -check-prefix=OPT %s		; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=tonga -mattr=+promote-alloca -verify-machineinstrs < %t.sroa.ll | FileCheck -enable-var-scope --check-prefixes=GCN,GCN-PROMOTE %s
		; RUN: opt -S -mtriple=amdgcn-- -passes='sroa,amdgpu-promote-alloca,instcombine' < %s | FileCheck -check-prefix=OPT %s

target datalayout = "A5"		target datalayout = "A5"

; OPT-LABEL: @vector_read_alloca_bitcast(		; OPT-LABEL: @vector_read_alloca_bitcast(
; OPT-NOT: alloca		; OPT-NOT: alloca
; OPT: %0 = extractelement <4 x i32> <i32 0, i32 1, i32 2, i32 3>, i32 %index		; OPT: %0 = extractelement <4 x i32> <i32 0, i32 1, i32 2, i32 3>, i32 %index
; OPT-NEXT: store i32 %0, ptr addrspace(1) %out, align 4		; OPT-NEXT: store i32 %0, ptr addrspace(1) %out, align 4

[58 unchanged lines hidden]	entry:
%tmp3 = load i32, ptr addrspace(5) %tmp2		%tmp3 = load i32, ptr addrspace(5) %tmp2
store i32 %tmp3, ptr addrspace(1) %out		store i32 %tmp3, ptr addrspace(1) %out
ret void		ret void
}		}

; OPT-LABEL: @vector_write_read_bitcast_to_float(		; OPT-LABEL: @vector_write_read_bitcast_to_float(
; OPT-NOT: alloca		; OPT-NOT: alloca
; OPT: bb2:		; OPT: bb2:
; OPT: %tmp.sroa.0.0 = phi <6 x float> [ undef, %bb ], [ %0, %bb2 ]		; OPT: %promotealloca = phi <6 x float> [ undef, %bb ], [ %0, %bb2 ]
; OPT: %0 = insertelement <6 x float> %tmp.sroa.0.0, float %tmp72, i32 %tmp10		; OPT: %0 = insertelement <6 x float> %promotealloca, float %tmp71, i32 %tmp10
; OPT: .preheader:		; OPT: .preheader:
; OPT: %bc = bitcast <6 x float> %0 to <6 x i32>		; OPT: %bc = bitcast <6 x float> %0 to <6 x i32>
; OPT: %1 = extractelement <6 x i32> %bc, i32 %tmp20		; OPT: %1 = extractelement <6 x i32> %bc, i32 %tmp20

; GCN-LABEL: {{^}}vector_write_read_bitcast_to_float:		; GCN-LABEL: {{^}}vector_write_read_bitcast_to_float:
; GCN-ALLOCA: buffer_store_dword		; GCN-ALLOCA: buffer_store_dword

; GCN-PROMOTE-COUNT-6: v_cmp_eq_u16		; GCN-PROMOTE: v_cmp_eq_u16
; GCN-PROMOTE-COUNT-6: v_cndmask		; GCN-PROMOTE: v_cndmask

; GCN: s_cbranch		; GCN: s_cbranch

; GCN-ALLOCA: buffer_load_dword		; GCN-ALLOCA: buffer_load_dword

; GCN-PROMOTE: v_cmp_eq_u16
; GCN-PROMOTE: v_cndmask
; GCN-PROMOTE: v_cmp_eq_u16
; GCN-PROMOTE: v_cndmask
; GCN-PROMOTE: v_cmp_eq_u16
; GCN-PROMOTE: v_cndmask
; GCN-PROMOTE: v_cmp_eq_u16
; GCN-PROMOTE: v_cndmask
; GCN-PROMOTE: v_cmp_eq_u16
; GCN-PROMOTE: v_cndmask

; GCN-PROMOTE: ScratchSize: 0		; GCN-PROMOTE: ScratchSize: 0

define amdgpu_kernel void @vector_write_read_bitcast_to_float(ptr addrspace(1) %arg) {		define amdgpu_kernel void @vector_write_read_bitcast_to_float(ptr addrspace(1) %arg) {
bb:		bb:
%tmp = alloca [6 x float], align 4, addrspace(5)		%tmp = alloca [6 x float], align 4, addrspace(5)
call void @llvm.lifetime.start.p5(i64 24, ptr addrspace(5) %tmp) #2		call void @llvm.lifetime.start.p5(i64 24, ptr addrspace(5) %tmp) #2
br label %bb2		br label %bb2

[29 unchanged lines hidden]	.preheader: ; preds = %.preheader, %bb2
%tmp27 = add nuw nsw i32 %tmp16, 1		%tmp27 = add nuw nsw i32 %tmp16, 1
%tmp28 = icmp eq i32 %tmp27, 1000		%tmp28 = icmp eq i32 %tmp27, 1000
br i1 %tmp28, label %bb15, label %.preheader		br i1 %tmp28, label %bb15, label %.preheader
}		}

; OPT-LABEL: @vector_write_read_bitcast_to_double(		; OPT-LABEL: @vector_write_read_bitcast_to_double(
; OPT-NOT: alloca		; OPT-NOT: alloca
; OPT: bb2:		; OPT: bb2:
; OPT: %tmp.sroa.0.0 = phi <6 x double> [ undef, %bb ], [ %0, %bb2 ]		; OPT: %promotealloca = phi <6 x double> [ undef, %bb ], [ %0, %bb2 ]
; OPT: %0 = insertelement <6 x double> %tmp.sroa.0.0, double %tmp72, i32 %tmp10		; OPT: %0 = insertelement <6 x double> %promotealloca, double %tmp71, i32 %tmp10
; OPT: .preheader:		; OPT: .preheader:
; OPT: %bc = bitcast <6 x double> %0 to <6 x i64>		; OPT: %bc = bitcast <6 x double> %0 to <6 x i64>
; OPT: %1 = extractelement <6 x i64> %bc, i32 %tmp20		; OPT: %1 = extractelement <6 x i64> %bc, i32 %tmp20

; GCN-LABEL: {{^}}vector_write_read_bitcast_to_double:		; GCN-LABEL: {{^}}vector_write_read_bitcast_to_double:

; GCN-ALLOCA-COUNT-2: buffer_store_dword		; GCN-ALLOCA-COUNT-2: buffer_store_dword
; GCN-PROMOTE-COUNT-2: v_movreld_b32_e32		; GCN-PROMOTE-COUNT-2: v_movreld_b32_e32
[43 unchanged lines hidden]	.preheader: ; preds = %.preheader, %bb2
%tmp27 = add nuw nsw i32 %tmp16, 1		%tmp27 = add nuw nsw i32 %tmp16, 1
%tmp28 = icmp eq i32 %tmp27, 1000		%tmp28 = icmp eq i32 %tmp27, 1000
br i1 %tmp28, label %bb15, label %.preheader		br i1 %tmp28, label %bb15, label %.preheader
}		}

; OPT-LABEL: @vector_write_read_bitcast_to_i64(		; OPT-LABEL: @vector_write_read_bitcast_to_i64(
; OPT-NOT: alloca		; OPT-NOT: alloca
; OPT: bb2:		; OPT: bb2:
; OPT: %tmp.sroa.0.0 = phi <6 x i64> [ undef, %bb ], [ %0, %bb2 ]		; OPT: %promotealloca = phi <6 x i64> [ undef, %bb ], [ %0, %bb2 ]
; OPT: %0 = insertelement <6 x i64> %tmp.sroa.0.0, i64 %tmp6, i32 %tmp9		; OPT: %0 = insertelement <6 x i64> %promotealloca, i64 %tmp6, i32 %tmp9
; OPT: .preheader:		; OPT: .preheader:
; OPT: %1 = extractelement <6 x i64> %0, i32 %tmp18		; OPT: %1 = extractelement <6 x i64> %0, i32 %tmp18

; GCN-LABEL: {{^}}vector_write_read_bitcast_to_i64:		; GCN-LABEL: {{^}}vector_write_read_bitcast_to_i64:

; GCN-ALLOCA-COUNT-2: buffer_store_dword		; GCN-ALLOCA-COUNT-2: buffer_store_dword
; GCN-PROMOTE-COUNT-2: v_movreld_b32_e32		; GCN-PROMOTE-COUNT-2: v_movreld_b32_e32

[46 unchanged lines hidden]

; TODO: llvm.assume can be ignored		; TODO: llvm.assume can be ignored

; OPT-LABEL: @vector_read_alloca_bitcast_assume(		; OPT-LABEL: @vector_read_alloca_bitcast_assume(
; OPT: %0 = extractelement <4 x i32> <i32 0, i32 1, i32 2, i32 3>, i32 %index		; OPT: %0 = extractelement <4 x i32> <i32 0, i32 1, i32 2, i32 3>, i32 %index
; OPT: store i32 %0, ptr addrspace(1) %out, align 4		; OPT: store i32 %0, ptr addrspace(1) %out, align 4

; GCN-LABEL: {{^}}vector_read_alloca_bitcast_assume:		; GCN-LABEL: {{^}}vector_read_alloca_bitcast_assume:
; GCN-COUNT-4: buffer_store_dword		; GCN-COUNT: buffer_store_dword

define amdgpu_kernel void @vector_read_alloca_bitcast_assume(ptr addrspace(1) %out, i32 %index) {		define amdgpu_kernel void @vector_read_alloca_bitcast_assume(ptr addrspace(1) %out, i32 %index) {
entry:		entry:
%tmp = alloca [4 x i32], addrspace(5)		%tmp = alloca [4 x i32], addrspace(5)
%cmp = icmp ne ptr addrspace(5) %tmp, null		%cmp = icmp ne ptr addrspace(5) %tmp, null
call void @llvm.assume(i1 %cmp)		call void @llvm.assume(i1 %cmp)
%y = getelementptr [4 x i32], ptr addrspace(5) %tmp, i32 0, i32 1		%y = getelementptr [4 x i32], ptr addrspace(5) %tmp, i32 0, i32 1
%z = getelementptr [4 x i32], ptr addrspace(5) %tmp, i32 0, i32 2		%z = getelementptr [4 x i32], ptr addrspace(5) %tmp, i32 0, i32 2
[154 unchanged lines hidden]
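
The loop tests in this file now expect a `%promotealloca` phi that starts as `undef` on entry from `%bb` and carries the updated vector around the `%bb2` loop. The Python model below sketches why threading one vector value through the loop matches the original stores into the alloca. It is an illustration under simplifying assumptions: the function names are hypothetical, `None` stands in for `undef`, the bitcast in `.preheader` is left out, and it does not use or reproduce LLVM's SSAUpdater.

```python
UNDEF = None  # stand-in for an undef vector lane (modelling assumption)


def loop_via_memory(values, indices, read_index, size=6):
    """Original form: each iteration stores one element into the alloca."""
    slots = [UNDEF] * size                 # %tmp = alloca [6 x float]
    for value, index in zip(values, indices):
        slots[index] = value               # store through a dynamic GEP
    return slots[read_index]               # load through a dynamic GEP


def loop_via_phi(values, indices, read_index, size=6):
    """Promoted form: one SSA vector is threaded through the loop."""
    vec = (UNDEF,) * size                  # phi incoming value from %bb
    for value, index in zip(values, indices):
        # Each iteration builds a new vector (insertelement); the phi
        # receives it on the back edge from %bb2.
        vec = vec[:index] + (value,) + vec[index + 1:]
    return vec[read_index]                 # extractelement in .preheader
```

Both functions return the same element for any in-range index sequence, including lanes that were never written, which stay `UNDEF` in either form.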


Revision Contents

Diff 543824

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp

llvm/test/CodeGen/AMDGPU/fix-frame-reg-in-custom-csr-spills.ll

llvm/test/CodeGen/AMDGPU/promote-alloca-array-aggregate.ll

llvm/test/CodeGen/AMDGPU/promote-alloca-globals.ll

llvm/test/CodeGen/AMDGPU/promote-alloca-loadstores.ll

llvm/test/CodeGen/AMDGPU/promote-alloca-memset.ll

llvm/test/CodeGen/AMDGPU/promote-alloca-pointer-array.ll

llvm/test/CodeGen/AMDGPU/promote-alloca-vector-to-vector.ll

llvm/test/CodeGen/AMDGPU/sroa-before-unroll.ll

llvm/test/CodeGen/AMDGPU/vector-alloca-bitcast.ll
